URN Syntax
Tagged URN grammar, normalization, canonicalization, and CAP direction defaults
Definition
Let U be the set of all normalized Tagged URNs.
A Tagged URN has the structure:
prefix:key1=value1;key2=value2;...
Each element $u \in U$ has:
- A prefix (e.g., `media`, `cap`)
- Zero or more tags as key-value pairs
- A well-defined canonical representation
Grammar
tagged-urn ::= prefix ":" tags?
prefix ::= identifier
tags ::= tag (";" tag)*
tag ::= key ("=" value)?
key ::= identifier
value ::= unquoted-value | quoted-value
identifier ::= [a-z][a-z0-9-]*
Examples:
media: # Identity (no tags)
media:pdf;bytes # Two tags with implicit * values
cap:in=media:pdf;op=extract;out=media:object
cap:in="media:pdf;bytes";op=extract;out="media:object"
Normalization
All Tagged URNs are normalized on parse. The parser is a state machine with 6 states (ExpectingKey, InKey, ExpectingValue, InUnquotedValue, InQuotedValue, InQuotedValueEscape).
Case Normalization
| Component | Rule |
|---|---|
| Prefix | Lowercase |
| Keys | Lowercase |
| Unquoted values | Lowercase |
| Quoted values | Preserve case |
Input: CAP:Op=Extract;Format=pdf
Output: cap:format=pdf;op=extract
Input: cap:key="UPPER"
Output: cap:key="UPPER"
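The case rules above can be sketched for a single tag as follows. This is a minimal illustration, not the reference parser: `normalize_tag` is a hypothetical helper, it assumes the tag has already been split off at `;`, and it assumes ASCII lowercasing is sufficient.

```rust
/// Hypothetical helper (not part of tagged-urn-rs): applies the case rules
/// to one tag that has already been split off the URN.
/// Returns the normalized key and, if present, the normalized raw value.
fn normalize_tag(tag: &str) -> (String, Option<String>) {
    match tag.split_once('=') {
        // Value-less tag: only the key is lowercased.
        None => (tag.to_ascii_lowercase(), None),
        Some((key, value)) => {
            let key = key.to_ascii_lowercase();
            let value = if value.starts_with('"') {
                // Quoted value: case is preserved verbatim.
                value.to_string()
            } else {
                // Unquoted value: lowercased (ASCII lowercasing is an assumption).
                value.to_ascii_lowercase()
            };
            (key, Some(value))
        }
    }
}
```

Under these assumptions, `Op=Extract` becomes the pair (`op`, `extract`), while `key="UPPER"` keeps its quoted value unchanged.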
Value-less Tags
A tag without = is treated as having value *:
Input: media:pdf;bytes
Parsed: media:pdf=*;bytes=*
When serializing, * values serialize back to value-less form:
Internal: {bytes: "*", pdf: "*"}
Output: media:bytes;pdf
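The round trip between the value-less form and the `*` value can be sketched as a pair of functions. Both `parse_tag` and `serialize_tag` are hypothetical names chosen for illustration; the sketch handles unquoted tags only and does no validation.

```rust
/// Hypothetical helper: splits one tag into (key, value).
/// A tag without `=` is given the value `*`.
fn parse_tag(tag: &str) -> (String, String) {
    match tag.split_once('=') {
        Some((key, value)) => (key.to_string(), value.to_string()),
        None => (tag.to_string(), "*".to_string()),
    }
}

/// Hypothetical helper: serializes one tag.
/// A `*` value is written back in value-less form.
fn serialize_tag(key: &str, value: &str) -> String {
    if value == "*" {
        key.to_string()
    } else {
        format!("{}={}", key, value)
    }
}
```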
Tag Ordering
Tags are stored in a BTreeMap and serialized in alphabetical key order:
Input: media:pdf;bytes
Output: media:bytes;pdf # 'bytes' before 'pdf' alphabetically
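Because a `BTreeMap` iterates in key order, canonical ordering falls out of the storage choice. The sketch below combines lowercasing, value-less tags, and sorted serialization for unquoted input only. The name `canonicalize` is hypothetical, and the sketch deliberately skips quoting, duplicate detection, and the other parse errors.

```rust
use std::collections::BTreeMap;

/// Hypothetical sketch: canonicalizes an unquoted-only Tagged URN.
/// Returns None when the input has no `:` separator.
fn canonicalize(input: &str) -> Option<String> {
    let (prefix, rest) = input.split_once(':')?;
    let mut tags: BTreeMap<String, String> = BTreeMap::new();
    for tag in rest.split(';').filter(|t| !t.is_empty()) {
        // A tag without `=` gets the value `*`.
        let (key, value) = tag.split_once('=').unwrap_or((tag, "*"));
        // Note: a repeated key silently overwrites here; the real parser rejects it.
        tags.insert(key.to_ascii_lowercase(), value.to_ascii_lowercase());
    }
    // BTreeMap iteration is sorted by key, which yields alphabetical order.
    let body: Vec<String> = tags
        .iter()
        .map(|(k, v)| {
            if v.as_str() == "*" {
                k.clone()
            } else {
                format!("{}={}", k, v)
            }
        })
        .collect();
    Some(format!("{}:{}", prefix.to_ascii_lowercase(), body.join(";")))
}
```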
Quoting Rules
A value requires quoting if it contains: ;, =, ", \, space, or uppercase characters.
Inside quotes:
- `"` is escaped as `\"`
- `\` is escaped as `\\`
- All other characters are literal
Only these two escape sequences are supported.
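The quoting decision and the two escapes can be sketched as follows. `needs_quoting` and `serialize_value` are hypothetical names, and the sketch assumes the value is non-empty and already in its internal (unescaped) form.

```rust
/// Hypothetical helper: true if `value` contains a character that forces quoting
/// (`;`, `=`, `"`, `\`, space, or an uppercase character).
fn needs_quoting(value: &str) -> bool {
    value
        .chars()
        .any(|c| matches!(c, ';' | '=' | '"' | '\\' | ' ') || c.is_uppercase())
}

/// Hypothetical helper: writes a value, quoting and escaping only when required.
fn serialize_value(value: &str) -> String {
    if !needs_quoting(value) {
        return value.to_string();
    }
    let mut out = String::with_capacity(value.len() + 2);
    out.push('"');
    for c in value.chars() {
        // Only `"` and `\` are escaped; everything else is literal.
        if c == '"' || c == '\\' {
            out.push('\\');
        }
        out.push(c);
    }
    out.push('"');
    out
}
```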
CAP Direction Tags: Parser vs Canonical Form
CapUrn::from_string requires the cap: prefix but does not require explicit in/out tags in input text. Missing or wildcard direction tags are normalized to media:.
cap: -> cap:in=media:;out=media:
cap:in -> cap:in=media:;out=media:
cap:out -> cap:in=media:;out=media:
cap:in=media:text;out -> cap:in="media:text";out=media:
cap:in=*;out=* -> cap:in=media:;out=media:
Canonical strings always include both in and out. Direction tag values are canonicalized through MediaUrn parse and re-serialize, which ensures consistent tag ordering within the quoted media URN value.
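The defaulting step can be sketched on an already-parsed tag map. This is an illustration under assumptions, not the capdag code: `apply_direction_defaults` is a hypothetical name, and the sketch omits the MediaUrn parse and re-serialize pass that the real canonicalization applies to direction values.

```rust
use std::collections::BTreeMap;

/// Hypothetical sketch: ensures `in` and `out` are present and replaces a
/// missing or wildcard (`*`) direction with `media:`.
fn apply_direction_defaults(tags: &mut BTreeMap<String, String>) {
    for dir in ["in", "out"] {
        // A missing direction tag is treated the same as a wildcard.
        let value = tags
            .entry(dir.to_string())
            .or_insert_with(|| "*".to_string());
        if value.as_str() == "*" {
            *value = "media:".to_string();
        }
    }
}
```

Explicit direction values such as `media:text` pass through unchanged in this sketch.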
Special Values
Tagged URNs support four special value forms:
| Value | Name | Meaning |
|---|---|---|
| `*` | Must-have-any | Key must be present with any value |
| `?` | Unspecified | No constraint on this key |
| `!` | Must-not-have | Key must be absent |
| (missing) | No constraint | Same as `?` when used as pattern |
- `*` (must-have-any): When matching, the instance MUST have this key with some value. The specific value doesn’t matter, but the key must exist.
- `?` (unspecified): No constraint. The pattern doesn’t care whether the instance has this key or what value it has.
- `!` (must-not-have): When matching, the instance MUST NOT have this key. If the instance has this key with any value, matching fails.
- (missing): On the pattern side, a missing key means “no constraint” (same as `?`). On the instance side, a missing key means the key is absent.
Wildcard Truth Table
This table defines matching between an instance and a pattern for a single key:
| Instance ↓ \ Pattern → | (missing) | K=? | K=! | K=* | K=v |
|---|---|---|---|---|---|
| (missing) | ✓ | ✓ | ✓ | ✗ | ✗ |
| K=? | ✓ | ✓ | ✓ | ✓ | ✓ |
| K=! | ✓ | ✓ | ✓ | ✗ | ✗ |
| K=* | ✓ | ✓ | ✗ | ✓ | ✓ |
| K=v | ✓ | ✓ | ✗ | ✓ | v=v only |
Reading the Table
- Row: What the instance has for key K
- Column: What the pattern requires for key K
- Cell: ✓ = match, ✗ = no match
Key Rules
- Pattern missing or `?`: Always matches (no constraint)
- Pattern `!`: Matches only if instance is missing or `!`
- Pattern `*`: Matches only if instance has a value (not missing, not `!`)
- Pattern `v`: Matches only if instance has exactly `v` (or `*`, which accepts any)
- Instance `?`: Always matches (instance doesn’t constrain)
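The truth table and key rules can be encoded as a single `match`. The `Slot` enum and `key_matches` function are hypothetical names introduced for this sketch; they are not taken from the reference implementation.

```rust
/// What one side (instance or pattern) holds for a key. Hypothetical type.
#[derive(Debug)]
enum Slot<'a> {
    Missing,
    Unspecified,    // K=?
    MustNotHave,    // K=!
    Any,            // K=*
    Value(&'a str), // K=v
}

/// Single-key match, following the rows and columns of the truth table.
fn key_matches(instance: &Slot, pattern: &Slot) -> bool {
    use Slot::*;
    match (instance, pattern) {
        // Pattern missing or `?`: no constraint.
        (_, Missing) | (_, Unspecified) => true,
        // Instance `?`: the instance does not constrain.
        (Unspecified, _) => true,
        // Pattern `!`: instance must be missing or `!`.
        (Missing, MustNotHave) | (MustNotHave, MustNotHave) => true,
        (_, MustNotHave) => false,
        // Pattern `*` or `v`: a missing or `!` instance cannot match.
        (Missing, _) | (MustNotHave, _) => false,
        // Instance `*` accepts pattern `*` or any `v`.
        (Any, _) => true,
        // Instance `v` satisfies pattern `*`.
        (Value(_), Any) => true,
        // Both concrete: values must be equal.
        (Value(a), Value(b)) => a == b,
    }
}
```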
Examples
# Pattern: media:pdf Instance: media:pdf;bytes
# Pattern has: pdf=* Instance has: pdf=*, bytes=*
# For key 'pdf': Instance=*, Pattern=* → ✓
# For key 'bytes': Instance=*, Pattern=(missing) → ✓
# Result: MATCH
# Pattern: media:pdf;audio=!     Instance: media:pdf;audio=mp3
# For key 'pdf': Instance=*, Pattern=* → ✓
# For key 'audio': Instance=mp3, Pattern=! → ✗
# Result: NO MATCH (instance has audio, pattern forbids it)
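Whole-URN matching applies the single-key rule to every key. The sketch below works on normalized tag maps and assumes both URNs share the same prefix. `Tags`, `key_matches`, `urn_matches`, and `tags` are hypothetical names, and the special values are represented as the strings `*`, `?`, and `!`.

```rust
use std::collections::BTreeMap;

type Tags = BTreeMap<String, String>;

/// Single-key check on normalized values; `None` means the key is missing.
fn key_matches(instance: Option<&str>, pattern: Option<&str>) -> bool {
    match (instance, pattern) {
        // No constraint from the pattern, or an unconstraining instance.
        (_, None) | (_, Some("?")) | (Some("?"), _) => true,
        // Pattern `!`: instance must be missing or `!`.
        (i, Some("!")) => matches!(i, None | Some("!")),
        // Pattern `*` or `v`: a missing or `!` instance fails.
        (None, _) | (Some("!"), _) => false,
        // A wildcard on either side accepts any concrete value.
        (Some("*"), _) | (_, Some("*")) => true,
        (Some(i), Some(p)) => i == p,
    }
}

/// The instance matches if every key named by the pattern passes.
/// Keys that only the instance has face a missing pattern, which always matches.
fn urn_matches(instance: &Tags, pattern: &Tags) -> bool {
    pattern
        .iter()
        .all(|(k, p)| key_matches(instance.get(k).map(String::as_str), Some(p.as_str())))
}

/// Convenience constructor for examples: builds a tag map from (key, value) pairs.
fn tags(pairs: &[(&str, &str)]) -> Tags {
    pairs.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect()
}
```

With these helpers, the two examples above correspond to calling `urn_matches` with the instance map first and the pattern map second.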
Parse Errors
| Condition | Error |
|---|---|
| Missing prefix | MissingPrefix |
| Empty prefix | EmptyPrefix |
| Duplicate key | DuplicateKey |
| Numeric key | NumericKey |
| Empty value after `=` | EmptyValue |
| Unclosed quote | UnclosedQuote |
| Invalid escape sequence | InvalidEscape |
| Whitespace in input | Parse error (not trimmed) |
| Invalid CAP prefix (not `cap:`) | CapUrnError::MissingCapPrefix |
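A few of these conditions can be sketched as a validation pass over unquoted input. The variant names follow the table, but the `ParseError` enum shape, the `validate` function, and the exact trigger for each condition are assumptions. In particular, a "numeric key" is taken here to mean a key made only of ASCII digits. Quote and escape errors are left out.

```rust
use std::collections::BTreeSet;

/// Hypothetical error type; variant names mirror the table above.
#[derive(Debug, PartialEq)]
enum ParseError {
    MissingPrefix,
    EmptyPrefix,
    DuplicateKey,
    NumericKey,
    EmptyValue,
}

/// Hypothetical sketch: checks an unquoted-only URN for a subset of the errors.
fn validate(input: &str) -> Result<(), ParseError> {
    // No `:` at all is treated as a missing prefix (assumption).
    let (prefix, rest) = input.split_once(':').ok_or(ParseError::MissingPrefix)?;
    if prefix.is_empty() {
        return Err(ParseError::EmptyPrefix);
    }
    let mut seen = BTreeSet::new();
    for tag in rest.split(';').filter(|t| !t.is_empty()) {
        let (key, value) = match tag.split_once('=') {
            Some((k, v)) => (k, Some(v)),
            None => (tag, None),
        };
        // Keys are compared after lowercasing, as in normalization.
        let key = key.to_ascii_lowercase();
        if !key.is_empty() && key.chars().all(|c| c.is_ascii_digit()) {
            return Err(ParseError::NumericKey);
        }
        // `key=` with nothing after the `=` is an empty value.
        if value == Some("") {
            return Err(ParseError::EmptyValue);
        }
        if !seen.insert(key) {
            return Err(ParseError::DuplicateKey);
        }
    }
    Ok(())
}
```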
The Base Relation $\preceq$
Tagged URNs form a partial order under the relation $\preceq$, where $a \preceq b$ means that $b$, read as a pattern, accepts $a$.
This relation is:
- Reflexive: $\forall u \in U,\; u \preceq u$
- Transitive: $(a \preceq b \land b \preceq c) \implies a \preceq c$
- Antisymmetric: $(a \preceq b \land b \preceq a) \implies a \equiv b$
The identity URN, `prefix:` with no tags, sits at the top of the order: it accepts any URN with the same prefix.
References
- `tagged-urn-rs/src/tagged_urn.rs`: Reference implementation (Rust)
- `capdag/src/urn/cap_urn.rs`: Cap URN parsing and direction defaulting