ribbit/TOKENIZER_DESIGN.md


# HopDown Tokenizer Design
## Problem
The regex-based inline parser and serializer can't reliably distinguish
structural delimiters from literal text characters. This causes:
- `toMarkdown` escaping bugs (over-escaping inside inline tags, under-escaping
in text nodes)
- Round-trip failures (`toHTML(toMarkdown(html)) !== html`)
- Fragile interactions between features (underscore normalization + strikethrough,
HTML passthrough + escaping)
## Invariants
1. `toHTML` satisfies GFM spec rules 1-15
2. `toMarkdown` always emits the canonical form
3. `toHTML(toMarkdown(html)) === html` (single-pass round-trip)
## Architecture
### Token types
```
text — literal characters, will be escaped during serialization
delimiter — structural marker (**, *, ~~, `, etc.)
html — raw HTML tag passthrough
break — hard line break (<br>)
```
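The four kinds above suggest a discriminated union. A minimal sketch of the shape (field names here are illustrative, not taken from `types.ts`):

```typescript
// Hypothetical Token union; the "kind" discriminant mirrors the four
// token types listed above. Field names are illustrative.
type Token =
  | { kind: "text"; value: string }       // literal characters, escaped on output
  | { kind: "delimiter"; value: string }  // structural marker: **, *, ~~, `, ...
  | { kind: "html"; value: string }       // raw HTML passthrough
  | { kind: "break" };                    // hard line break (<br>)

// Everything except text is emitted verbatim by the serializer.
function isStructural(t: Token): boolean {
  return t.kind !== "text";
}
```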
### Inline tokenizer (markdown → tokens)
Scans left-to-right, character by character. Maintains a stack of open
delimiters. Produces a flat token stream:
```
Input: "hello **bold *nested*** end"
Tokens: [text "hello "] [open **] [text "bold "] [open *] [text "nested"] [close *] [close **] [text " end"]
```
The tokenizer handles:
- Backslash escapes: `\*` → text token containing `*`
- Entity resolution: `&amp;` → text token containing `&`
- Flanking rules: only emit delimiter tokens when flanking conditions are met
- Code spans: `` ` `` opens a code span that consumes everything until the matching `` ` ``
- Links: `[text](url)` parsed as a unit
- Autolinks: `<url>` and bare URLs
- Hard line breaks: trailing spaces or `\` before newline
- HTML tags: `<span>` etc. passed through as html tokens
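The scan loop can be sketched for two of the cases above, backslash escapes and `*`/`~` delimiter runs (flanking checks, code spans, links, entities, and HTML are elided; all names are illustrative):

```typescript
type Tok =
  | { kind: "text"; value: string }
  | { kind: "delimiter"; value: string };

// Minimal left-to-right scan: accumulates literal characters into text
// tokens and emits runs of * or ~ as delimiter tokens.
function tokenizeInline(src: string): Tok[] {
  const toks: Tok[] = [];
  let text = "";
  const flush = () => {
    if (text) { toks.push({ kind: "text", value: text }); text = ""; }
  };
  let i = 0;
  while (i < src.length) {
    const ch = src[i];
    if (ch === "\\" && i + 1 < src.length) {
      text += src[i + 1];              // \* becomes a literal * in a text token
      i += 2;
    } else if (ch === "*" || ch === "~") {
      flush();
      let j = i;
      while (j < src.length && src[j] === ch) j++;
      toks.push({ kind: "delimiter", value: src.slice(i, j) });
      i = j;
    } else {
      text += ch;
      i++;
    }
  }
  flush();
  return toks;
}
```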
### Inline parser (tokens → HTML)
Walks the token stream and matches open/close delimiter pairs using a
stack. Produces HTML string. Handles:
- Delimiter pairing with precedence (*** before ** before *)
- The multiple-of-3 rule: delimiter runs that can both open and close cannot pair when their combined length is a multiple of 3, unless both lengths are
- Nesting validation (no em inside em, no links inside links)
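The stack-based pairing can be sketched for `*` and `**` only, ignoring flanking and the multiple-of-3 rule (names and the tag table are illustrative):

```typescript
type Tok =
  | { kind: "text"; value: string }
  | { kind: "delimiter"; value: string };

// Hypothetical delimiter → tag mapping for this sketch.
const TAGS: Record<string, [open: string, close: string]> = {
  "*": ["<em>", "</em>"],
  "**": ["<strong>", "</strong>"],
};

// A delimiter matching the top of the stack closes it; otherwise it opens.
function toHTMLInline(toks: Tok[]): string {
  const stack: string[] = [];
  let out = "";
  for (const t of toks) {
    if (t.kind === "text") { out += t.value; continue; }
    if (stack[stack.length - 1] === t.value) {
      stack.pop();
      out += TAGS[t.value][1];
    } else {
      stack.push(t.value);
      out += TAGS[t.value][0];
    }
  }
  return out;
}
```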
### Serializer (DOM → tokens → markdown)
Walks the DOM tree. For each node:
- Text nodes → text tokens (the serializer knows these need escaping)
- Element nodes → look up the tag, emit delimiter tokens + recurse into children
- Unknown elements → recurse into children
Then the token stream is serialized to a string:
- Delimiter tokens → emitted verbatim (they're structural)
- Text tokens → characters that would be misinterpreted as delimiters are
backslash-escaped. The serializer knows exactly which characters are
dangerous because it knows what delimiters exist.
- HTML tokens → emitted verbatim
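The final token → string pass can be sketched as follows; the exact escape set is an assumption, not taken from the source:

```typescript
type Tok =
  | { kind: "text"; value: string }
  | { kind: "delimiter"; value: string }
  | { kind: "html"; value: string };

// Characters that would parse as delimiters; illustrative set.
const DANGEROUS = /[*_~`\\\[\]]/g;

// Delimiter and html tokens pass through verbatim; only text tokens
// get backslash-escaping.
function serializeTokens(toks: Tok[]): string {
  return toks
    .map(t =>
      t.kind === "text" ? t.value.replace(DANGEROUS, m => "\\" + m) : t.value
    )
    .join("");
}
```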
### Why this solves the round-trip problem
The key insight: delimiter tokens and text tokens are different types.
When serializing `<strong>hello *world*</strong>`, the output is:
```
[delim **] [text "hello "] [delim *] [text "world"] [delim *] [delim **]
```
The `*` around "world" are delimiter tokens (from the nested `<em>`).
If instead the text contained a literal `*`:
```
<strong>hello * world</strong>
```
The output would be:
```
[delim **] [text "hello * world"] [delim **]
```
The `*` is a text token. During serialization, the text token scanner
sees `*` and escapes it to `\*` because `*` is a known delimiter character.
The delimiter tokens are never escaped. No ambiguity.
## Files
- `types.ts` — Token type, updated Tag interface
- `tokenizer.ts` — Inline tokenizer (markdown → tokens)
- `serializer.ts` — DOM → tokens → markdown string
- `hopdown.ts` — Orchestrator (block parsing, delegates inline to tokenizer)
- `tags.ts` — Tag definitions (simplified: no more regex patterns)
## Migration
The Tag interface changes:
- `pattern` field removed (tokenizer handles delimiter matching)
- `toMarkdown` returns Token[] instead of string
- `match` stays the same (block-level matching is already clean)
- `toHTML` stays the same
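The interface change above can be sketched as a before/after pair (field signatures beyond those listed are illustrative):

```typescript
// Hypothetical Token shape for this sketch.
interface Token { kind: string; value: string }

// Before: regex-driven.
interface TagV1 {
  pattern: RegExp;                    // removed in the migration
  match(block: string): boolean;
  toHTML(node: unknown): string;
  toMarkdown(node: unknown): string;  // returned a string
}

// After: tokenizer-driven.
interface TagV2 {
  match(block: string): boolean;      // unchanged
  toHTML(node: unknown): string;      // unchanged
  toMarkdown(node: unknown): Token[]; // now returns tokens
}

// Example conforming tag (illustrative).
const em: TagV2 = {
  match: s => s.startsWith("*"),
  toHTML: () => "<em></em>",
  toMarkdown: () => [{ kind: "delimiter", value: "*" }],
};
```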
The HopDown public API stays the same:
- `toHTML(markdown)` — unchanged
- `toMarkdown(html)` — unchanged
- `findCompletePair`, `findUnmatchedOpener` — reimplemented on tokenizer
- `getTagForElement`, `getEditableSelector` — unchanged