
URN Syntax

Tagged URN grammar, normalization, canonicalization, and CAP direction defaults

Definition

Let U be the set of all normalized Tagged URNs.

A Tagged URN has the structure:

prefix:key1=value1;key2=value2;...

Each element $u \in U$ has:

  • A prefix (e.g., media, cap)
  • Zero or more tags as key-value pairs
  • A well-defined canonical representation

Grammar

tagged-urn  ::= prefix ":" tags?
prefix      ::= identifier
tags        ::= tag (";" tag)*
tag         ::= key ("=" value)?
key         ::= identifier
value       ::= unquoted-value | quoted-value
identifier  ::= [a-z][a-z0-9-]*

Examples:

media:                           # Identity (no tags)
media:pdf;bytes                  # Two tags with implicit * values
cap:in=media:pdf;op=extract;out=media:object
cap:in="media:pdf;bytes";op=extract;out="media:object"
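
The `identifier` rule is small enough to check by hand. The following is a minimal sketch of such a check; the function name `is_identifier` is hypothetical and not part of any documented API.

```rust
/// Returns true if `s` matches the grammar rule `identifier ::= [a-z][a-z0-9-]*`.
/// Hypothetical helper for illustration only.
fn is_identifier(s: &str) -> bool {
    let mut chars = s.chars();
    // The first character must be a lowercase ASCII letter.
    match chars.next() {
        Some(c) if c.is_ascii_lowercase() => {}
        _ => return false,
    }
    // Remaining characters may be lowercase letters, digits, or '-'.
    chars.all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '-')
}
```

Under this rule `media` and `key1` are identifiers, while an empty string, `1key`, and `Media` are not. The check applies to the normalized form, after keys and prefixes have been lowercased.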

Normalization

All Tagged URNs are normalized on parse. The parser is a state machine with 6 states (ExpectingKey, InKey, ExpectingValue, InUnquotedValue, InQuotedValue, InQuotedValueEscape).
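
To make the state machine concrete, here is a simplified sketch that parses only the tag portion (the text after `prefix:`). Only the six state names come from the text above. The transitions, the `after_quote` flag, the `parse_tags` and `commit` names, and the use of plain strings as errors are all assumptions made for illustration. Character validation (whitespace, numeric keys) is omitted.

```rust
use std::collections::BTreeMap;

/// The six parser states named in the text.
#[derive(Clone, Copy)]
enum State {
    ExpectingKey,
    InKey,
    ExpectingValue,
    InUnquotedValue,
    InQuotedValue,
    InQuotedValueEscape,
}

/// Stores the finished tag and clears both buffers; rejects repeated keys.
fn commit(
    tags: &mut BTreeMap<String, String>,
    key: &mut String,
    value: &mut String,
) -> Result<(), String> {
    let k = std::mem::take(key);
    let v = std::mem::take(value);
    if tags.contains_key(&k) {
        return Err(format!("DuplicateKey: {}", k));
    }
    tags.insert(k, v);
    Ok(())
}

/// Parses the tag portion of a URN (hypothetical, simplified transitions).
fn parse_tags(input: &str) -> Result<BTreeMap<String, String>, String> {
    use State::*;
    let mut tags = BTreeMap::new();
    let (mut key, mut value) = (String::new(), String::new());
    let mut state = ExpectingKey;
    // Assumption: a closing quote must be followed by ';' or end of input.
    let mut after_quote = false;

    for c in input.chars() {
        state = match (state, c) {
            (ExpectingKey, ';') if after_quote => { after_quote = false; ExpectingKey }
            (ExpectingKey, _) if after_quote => return Err("expected ';' after quote".into()),
            (ExpectingKey, ';') | (ExpectingKey, '=') => return Err("empty key".into()),
            (ExpectingKey, _) => { key.push(c.to_ascii_lowercase()); InKey }
            (InKey, '=') => ExpectingValue,
            // A tag without '=' gets the implicit value "*".
            (InKey, ';') => { value.push('*'); commit(&mut tags, &mut key, &mut value)?; ExpectingKey }
            (InKey, _) => { key.push(c.to_ascii_lowercase()); InKey }
            (ExpectingValue, '"') => InQuotedValue,
            (ExpectingValue, ';') => return Err("EmptyValue".into()),
            (ExpectingValue, _) => { value.push(c.to_ascii_lowercase()); InUnquotedValue }
            (InUnquotedValue, ';') => { commit(&mut tags, &mut key, &mut value)?; ExpectingKey }
            (InUnquotedValue, _) => { value.push(c.to_ascii_lowercase()); InUnquotedValue }
            (InQuotedValue, '\\') => InQuotedValueEscape,
            (InQuotedValue, '"') => { commit(&mut tags, &mut key, &mut value)?; after_quote = true; ExpectingKey }
            // Quoted values keep their case.
            (InQuotedValue, _) => { value.push(c); InQuotedValue }
            (InQuotedValueEscape, '"') | (InQuotedValueEscape, '\\') => { value.push(c); InQuotedValue }
            (InQuotedValueEscape, _) => return Err("InvalidEscape".into()),
        };
    }

    // End of input: finish whatever tag was in progress.
    match state {
        ExpectingKey => {}
        InKey => { value.push('*'); commit(&mut tags, &mut key, &mut value)?; }
        ExpectingValue => return Err("EmptyValue".into()),
        InUnquotedValue => commit(&mut tags, &mut key, &mut value)?,
        InQuotedValue | InQuotedValueEscape => return Err("UnclosedQuote".into()),
    }
    Ok(tags)
}
```

In this sketch, `parse_tags("Op=Extract;Format=pdf")` yields the map `{format: "extract"...}` style result with lowercased keys and values (`format → pdf`, `op → extract`), and `parse_tags("pdf;bytes")` maps both keys to `*`. The real parser may differ in how it treats trailing separators and characters after a closing quote.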

Case Normalization

Component        Rule
Prefix           Lowercase
Keys             Lowercase
Unquoted values  Lowercase
Quoted values    Preserve case

Input:  CAP:Op=Extract;Format=pdf
Output: cap:format=pdf;op=extract

Input:  cap:key="UPPER"
Output: cap:key="UPPER"

Value-less Tags

A tag without = is treated as having value *:

Input:  media:pdf;bytes
Parsed: media:pdf=*;bytes=*

When serializing, * values serialize back to value-less form:

Internal: {pdf: "*", bytes: "*"}
Output:   media:bytes;pdf

Tag Ordering

Tags are stored in a BTreeMap and serialized in alphabetical key order:

Input:  media:pdf;bytes
Output: media:bytes;pdf      # 'bytes' before 'pdf' alphabetically
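
The ordering and the value-less form both fall out of iterating a `BTreeMap`, which yields keys in sorted order. A minimal sketch of serialization, assuming a hypothetical `serialize` function and ignoring values that would need quoting:

```rust
use std::collections::BTreeMap;

/// Serializes a prefix and its tags. BTreeMap iteration is already sorted by key,
/// so no explicit sort is needed. Hypothetical helper; quoting is not handled here.
fn serialize(prefix: &str, tags: &BTreeMap<String, String>) -> String {
    let parts: Vec<String> = tags
        .iter()
        .map(|(k, v)| {
            if v.as_str() == "*" {
                k.clone() // "*" values serialize back to the value-less form
            } else {
                format!("{}={}", k, v)
            }
        })
        .collect();
    format!("{}:{}", prefix, parts.join(";"))
}
```

With tags `pdf=*` and `bytes=*` inserted in that order, this sketch produces `media:bytes;pdf`. An empty map produces the identity form `media:`.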

Quoting Rules

A value requires quoting if it contains: ;, =, ", \, space, or uppercase characters.

Inside quotes:

  • " is escaped as \"
  • \ is escaped as \\
  • All other characters are literal

Only these two escape sequences are supported.
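
The quoting rule and the two escapes can be sketched as follows. The function names `needs_quoting`, `quote`, and `serialize_value` are hypothetical, and the character list is taken directly from the rule stated above.

```rust
/// True if the value contains ';', '=', '"', '\', a space, or an uppercase character.
fn needs_quoting(value: &str) -> bool {
    value
        .chars()
        .any(|c| matches!(c, ';' | '=' | '"' | '\\' | ' ') || c.is_uppercase())
}

/// Wraps the value in quotes, escaping only '"' and '\'; all other characters are literal.
fn quote(value: &str) -> String {
    let mut out = String::with_capacity(value.len() + 2);
    out.push('"');
    for c in value.chars() {
        if c == '"' || c == '\\' {
            out.push('\\');
        }
        out.push(c);
    }
    out.push('"');
    out
}

/// Emits the value as-is when it is safe, quoted otherwise.
fn serialize_value(value: &str) -> String {
    if needs_quoting(value) {
        quote(value)
    } else {
        value.to_string()
    }
}
```

Here `serialize_value("pdf")` returns `pdf` unchanged, while `serialize_value("UPPER")` returns `"UPPER"` with surrounding quotes so that the case survives a later parse.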

CAP Direction Tags: Parser vs Canonical Form

CapUrn::from_string requires the cap: prefix but does not require explicit in/out tags in input text. Missing or wildcard direction tags are normalized to media:.

cap:                       -> cap:in=media:;out=media:
cap:in                     -> cap:in=media:;out=media:
cap:out                    -> cap:in=media:;out=media:
cap:in=media:text;out      -> cap:in="media:text";out=media:
cap:in=*;out=*             -> cap:in=media:;out=media:

Canonical strings always include both in and out. Direction tag values are canonicalized through MediaUrn parse and re-serialize, which ensures consistent tag ordering within the quoted media URN value.
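
The defaulting step can be sketched on an already-parsed tag map. This is an illustration only: `apply_direction_defaults` is a hypothetical name, and the sketch leaves out the MediaUrn parse and re-serialize step that canonicalizes explicit direction values.

```rust
use std::collections::BTreeMap;

/// Fills in missing or wildcard `in`/`out` tags with `media:`.
/// Hypothetical helper; explicit values are left untouched.
fn apply_direction_defaults(tags: &mut BTreeMap<String, String>) {
    for key in ["in", "out"] {
        // A missing tag is treated like the wildcard "*".
        let entry = tags.entry(key.to_string()).or_insert_with(|| "*".to_string());
        if entry.as_str() == "*" {
            *entry = "media:".to_string();
        }
    }
}
```

Applied to an empty map, the sketch yields `in=media:` and `out=media:`. Applied to `in=media:text` with a value-less `out`, it keeps `in` and sets `out` to `media:`.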

Special Values

Tagged URNs support four special value forms:

Value      Name           Meaning
*          Must-have-any  Key must be present with any value
?          Unspecified    No constraint on this key
!          Must-not-have  Key must be absent
(missing)  No constraint  Same as ? when used as pattern

  • * (must-have-any): When matching, the instance MUST have this key with some value. The specific value doesn’t matter, but the key must exist.
  • ? (unspecified): No constraint. The pattern doesn’t care whether the instance has this key or what value it has.
  • ! (must-not-have): When matching, the instance MUST NOT have this key. If the instance has this key with any value, matching fails.
  • (missing): On the pattern side, a missing key means “no constraint” (same as ?). On the instance side, a missing key means the key is absent.

Wildcard Truth Table

This table defines matching between an instance and a pattern for a single key:

Instance ↓ \ Pattern →  (missing)  K=?  K=!  K=*  K=v
(missing)               ✓          ✓    ✓    ✗    ✗
K=?                     ✓          ✓    ✓    ✓    ✓
K=!                     ✓          ✓    ✓    ✗    ✗
K=*                     ✓          ✓    ✗    ✓    ✓
K=v                     ✓          ✓    ✗    ✓    v=v only

Reading the Table

  • Row: What the instance has for key K
  • Column: What the pattern requires for key K
  • Cell: ✓ = match, ✗ = no match

Key Rules

  1. Pattern missing or ?: Always matches (no constraint)
  2. Pattern !: Matches only if instance is missing or !
  3. Pattern *: Matches only if instance has a value (not missing, not !)
  4. Pattern v: Matches only if instance has exactly v, or *, which accepts any value
  5. Instance ?: Always matches, regardless of the pattern (instance doesn’t constrain)

Examples

# Pattern: media:pdf    Instance: media:pdf;bytes
# Pattern has: pdf=*    Instance has: pdf=*, bytes=*
# For key 'pdf': Instance=*, Pattern=* → ✓
# For key 'bytes': Instance=*, Pattern=(missing) → ✓
# Result: MATCH

# Pattern: media:pdf;audio=!    Instance: media:pdf;audio=mp3
# For key 'pdf': Instance=*, Pattern=* → ✓
# For key 'audio': Instance=mp3, Pattern=! → ✗
# Result: NO MATCH (instance has audio, pattern forbids it)
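
The key rules and the two examples above can be sketched in code. This is a minimal illustration, not the library's matcher: the names `key_matches`, `urn_matches`, and `tags` are hypothetical, `None` stands for a missing key, both URNs are assumed to share the same prefix, and an instance value of `?` is assumed to take precedence over the other rules.

```rust
use std::collections::BTreeMap;

/// Single-key match. `None` means the key is missing on that side.
fn key_matches(instance: Option<&str>, pattern: Option<&str>) -> bool {
    match (instance, pattern) {
        (_, None) | (_, Some("?")) => true,                 // rule 1: no constraint
        (Some("?"), _) => true,                             // rule 5: instance ? always matches
        (None, Some("!")) | (Some("!"), Some("!")) => true, // rule 2: key absent, as required
        (_, Some("!")) => false,                            // rule 2: key present but forbidden
        (None, _) | (Some("!"), _) => false,                // rules 3-4: instance has no value
        (Some(_), Some("*")) => true,                       // rule 3: any value will do
        (Some("*"), Some(_)) => true,                       // rule 4: instance * accepts any v
        (Some(i), Some(p)) => i == p,                       // rule 4: exact value
    }
}

/// Whole-URN match: every key the pattern mentions must match.
/// Keys missing from the pattern never constrain the instance.
fn urn_matches(instance: &BTreeMap<String, String>, pattern: &BTreeMap<String, String>) -> bool {
    pattern
        .iter()
        .all(|(k, p)| key_matches(instance.get(k).map(String::as_str), Some(p.as_str())))
}

/// Builds a tag map from literal pairs (helper for trying the examples).
fn tags(pairs: &[(&str, &str)]) -> BTreeMap<String, String> {
    pairs.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect()
}
```

For the first example, `urn_matches(&tags(&[("pdf", "*"), ("bytes", "*")]), &tags(&[("pdf", "*")]))` is true. For the second, an instance with `audio=mp3` against a pattern with `audio=!` is false.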

Parse Errors

Condition                      Error
Missing prefix                 MissingPrefix
Empty prefix                   EmptyPrefix
Duplicate key                  DuplicateKey
Numeric key                    NumericKey
Empty value after =            EmptyValue
Unclosed quote                 UnclosedQuote
Invalid escape sequence        InvalidEscape
Whitespace in input            Parse error (not trimmed)
Invalid CAP prefix (not cap:)  CapUrnError::MissingCapPrefix
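
A few of these conditions can be sketched as simple checks. The variant names come from the table above, but the `ParseError` enum shape, the `split_prefix` and `check_key` functions, and the exact conditions that trigger each error (for example, treating a missing `:` as `MissingPrefix`) are assumptions for illustration.

```rust
/// A subset of the error names from the table; the enum shape is hypothetical.
#[derive(Debug, PartialEq)]
enum ParseError {
    MissingPrefix,
    EmptyPrefix,
    NumericKey,
}

/// Splits `prefix:rest` at the first ':'.
/// Assumption: no ':' at all means MissingPrefix, a leading ':' means EmptyPrefix.
fn split_prefix(input: &str) -> Result<(&str, &str), ParseError> {
    match input.split_once(':') {
        None => Err(ParseError::MissingPrefix),
        Some(("", _)) => Err(ParseError::EmptyPrefix),
        Some((prefix, rest)) => Ok((prefix, rest)),
    }
}

/// Assumption: a key made up entirely of digits is a numeric key.
fn check_key(key: &str) -> Result<(), ParseError> {
    if !key.is_empty() && key.chars().all(|c| c.is_ascii_digit()) {
        Err(ParseError::NumericKey)
    } else {
        Ok(())
    }
}
```

Because the split happens at the first `:`, an input such as `cap:in=media:pdf` yields the prefix `cap` and leaves `in=media:pdf` for the tag parser.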

The Base Relation $\preceq$

Tagged URNs form a partial order under the relation:

$$ a \preceq b \;\text{iff}\; a \text{ is at least as specific as } b $$

This relation is:

  • Reflexive: $\forall u \in U,\; u \preceq u$
  • Transitive: $(a \preceq b \land b \preceq c) \implies a \preceq c$
  • Antisymmetric: $(a \preceq b \land b \preceq a) \implies a \equiv b$

The identity URN prefix: (no tags), for example media:, sits at the top of the order. It accepts any URN with the same prefix.
