# HopDown Tokenizer Design

## Problem

The regex-based inline parser and serializer can't reliably distinguish
structural delimiters from literal text characters. This causes:
- `toMarkdown` escaping bugs (over-escaping inside inline tags, under-escaping
  in text nodes)
- Round-trip failures (`toHTML(toMarkdown(html)) !== html`)
- Fragile interactions between features (underscore normalization + strikethrough,
  HTML passthrough + escaping)

## Invariants

1. `toHTML` satisfies GFM spec rules 1-15
2. `toMarkdown` always emits the canonical form
3. `toHTML(toMarkdown(html)) === html` (single-pass round-trip)
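
Invariant 3 can be checked mechanically. The helper below is a minimal sketch, not part of HopDown: `checkRoundTrip`, `Converter`, and `RoundTripResult` are hypothetical names, and the two converters are passed in as parameters so the sketch assumes nothing about how `toHTML` and `toMarkdown` are exported.

```typescript
// Hypothetical helper, not part of HopDown.
type Converter = (input: string) => string;

interface RoundTripResult {
  ok: boolean; // true when the HTML survives the round trip unchanged
  markdown: string; // intermediate markdown produced by toMarkdown
  actual: string; // HTML produced by converting that markdown back
}

function checkRoundTrip(
  toHTML: Converter,
  toMarkdown: Converter,
  html: string,
): RoundTripResult {
  const markdown = toMarkdown(html);
  const actual = toHTML(markdown);
  return { ok: actual === html, markdown, actual };
}
```

Returning the intermediate markdown alongside the verdict makes a failure easier to attribute to one direction or the other.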

## Architecture

### Token types

```
text:      literal characters, will be escaped during serialization
delimiter: structural marker (**, *, ~~, `, etc.)
html:      raw HTML tag passthrough
break:     hard line break (<br>)
```
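
One way to model these four token types is a discriminated union. This is a sketch under assumptions: the field names (`type`, `value`, `marker`, `role`) and the `describe` helper are illustrative and are not taken from `types.ts`.

```typescript
// Hypothetical shapes: field names are assumptions, not the real types.ts.
type Token =
  | { type: "text"; value: string }
  | { type: "delimiter"; marker: string; role: "open" | "close" }
  | { type: "html"; value: string }
  | { type: "break" };

// Exhaustive switch: the compiler reports any token type left unhandled.
function describe(token: Token): string {
  switch (token.type) {
    case "text":
      return `[text ${JSON.stringify(token.value)}]`;
    case "delimiter":
      return `[${token.role} ${token.marker}]`;
    case "html":
      return `[html ${token.value}]`;
    case "break":
      return "[break]";
  }
}
```

Because `text` and `delimiter` are distinct variants, code that escapes text cannot accidentally be applied to a structural marker.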

### Inline tokenizer (markdown → tokens)

Scans left to right, character by character. Maintains a stack of open
delimiters. Produces a flat token stream:

```
Input:  "hello **bold *nested*** end"
Tokens: [text "hello "] [open **] [text "bold "] [open *] [text "nested"] [close *] [close **] [text " end"]
```

The tokenizer handles:
- Backslash escapes: `\*` → text token containing `*`
- Entity resolution: `&amp;` → text token containing `&`
- Flanking rules: only emit delimiter tokens when flanking conditions are met
- Code spans: `` ` `` opens a code span that consumes everything until the matching `` ` ``
- Links: `[text](url)` parsed as a unit
- Autolinks: `<url>` and bare URLs
- Hard line breaks: two or more trailing spaces or `\` before a newline
- HTML tags: `<span>` etc. passed through as html tokens
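
The scanning loop can be sketched for a small subset of the cases above. The code below is illustrative only: it handles backslash escapes and `*` runs, approximates the flanking rules with a whitespace check (the punctuation conditions are omitted), and skips entities, code spans, links, autolinks, line breaks, and HTML. `tokenizeInline` and the token field names are assumptions, not the actual `tokenizer.ts`.

```typescript
// Sketch only: backslash escapes and `*` runs, nothing else.
type Token =
  | { type: "text"; value: string }
  | { type: "delimiter"; marker: string; role: "open" | "close" };

const ASCII_PUNCT = /^[!-\/:-@\[-`{-~]$/;
// Start and end of input ("" from charAt) count as whitespace.
const isSpace = (ch: string): boolean => ch === "" || /\s/.test(ch);

function tokenizeInline(input: string): Token[] {
  const tokens: Token[] = [];
  // Stack of unmatched openers, with their position in `tokens`.
  const open: { index: number; marker: string }[] = [];
  let text = "";
  const flush = (): void => {
    if (text !== "") tokens.push({ type: "text", value: text });
    text = "";
  };

  let i = 0;
  while (i < input.length) {
    const ch = input.charAt(i);
    if (ch === "\\" && ASCII_PUNCT.test(input.charAt(i + 1))) {
      text += input.charAt(i + 1); // escaped character becomes literal text
      i += 2;
    } else if (ch === "*") {
      let end = i;
      while (input.charAt(end) === "*") end++;
      let n = end - i; // length of the delimiter run
      const canOpen = !isSpace(input.charAt(end)); // simplified left-flanking
      const canClose = !isSpace(input.charAt(i - 1)); // simplified right-flanking
      // First close as many open delimiters as this run can pay for.
      while (canClose && n > 0) {
        const top = open[open.length - 1];
        if (top === undefined || top.marker.length > n) break;
        open.pop();
        flush();
        tokens.push({ type: "delimiter", marker: top.marker, role: "close" });
        n -= top.marker.length;
      }
      // Then open new delimiters with whatever is left of the run.
      while (canOpen && n > 0) {
        const marker = n >= 2 ? "**" : "*";
        flush();
        open.push({ index: tokens.length, marker });
        tokens.push({ type: "delimiter", marker, role: "open" });
        n -= marker.length;
      }
      text += "*".repeat(n); // non-flanking leftovers stay literal
      i = end;
    } else {
      text += ch;
      i++;
    }
  }
  flush();
  // Openers that never found a closer are demoted to literal text.
  for (const { index, marker } of open) {
    tokens[index] = { type: "text", value: marker };
  }
  return tokens;
}
```

On the input `"hello **bold *nested*** end"` this sketch produces the same eight tokens as the example above: the closing `***` run is split into `close *` followed by `close **` by popping the stack.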

### Inline parser (tokens → HTML)

Walks the token stream and matches open/close delimiter pairs using a
stack. Produces an HTML string. Handles:
- Delimiter pairing with precedence (`***` before `**` before `*`)
- Multiple-of-3 rule
- Nesting validation (no em inside em, no links inside links)
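
The stack-based pairing can be sketched as follows. This is a simplified illustration, not the actual parser: it pairs a close token only with the innermost opener of the same marker, degrades unmatched delimiters to literal text, and leaves out precedence, the multiple-of-3 rule, and nesting validation. `renderInline`, `escapeHtml`, and the marker-to-tag table are hypothetical.

```typescript
type Token =
  | { type: "text"; value: string }
  | { type: "delimiter"; marker: string; role: "open" | "close" }
  | { type: "html"; value: string }
  | { type: "break" };

// Assumed marker-to-element mapping for this sketch.
const TAGS = new Map<string, string>([
  ["**", "strong"],
  ["*", "em"],
  ["~~", "del"],
]);

function escapeHtml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

interface Frame {
  marker: string; // opener that started this frame ("" for the root)
  out: string; // HTML collected since that opener
}

function renderInline(tokens: Token[]): string {
  const parents: Frame[] = [];
  let current: Frame = { marker: "", out: "" };
  for (const token of tokens) {
    switch (token.type) {
      case "text":
        current.out += escapeHtml(token.value);
        break;
      case "html":
        current.out += token.value; // raw passthrough
        break;
      case "break":
        current.out += "<br>";
        break;
      case "delimiter": {
        if (token.role === "open") {
          parents.push(current);
          current = { marker: token.marker, out: "" };
          break;
        }
        const parent = parents[parents.length - 1];
        const tag = TAGS.get(token.marker);
        if (parent !== undefined && tag !== undefined && current.marker === token.marker) {
          // Matched pair: wrap the frame's output and return to the parent.
          parents.pop();
          parent.out += `<${tag}>${current.out}</${tag}>`;
          current = parent;
        } else {
          current.out += escapeHtml(token.marker); // unmatched closer
        }
        break;
      }
    }
  }
  // Unmatched openers fall back to literal text.
  let parent = parents.pop();
  while (parent !== undefined) {
    parent.out += escapeHtml(current.marker) + current.out;
    current = parent;
    parent = parents.pop();
  }
  return current.out;
}
```

Fed the token stream from the tokenizer example, this sketch returns `hello <strong>bold <em>nested</em></strong> end`.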

### Serializer (DOM → tokens → markdown)

Walks the DOM tree. For each node:
- Text nodes → text tokens (the serializer knows these need escaping)
- Element nodes → look up the tag, emit delimiter tokens + recurse into children
- Unknown elements → recurse into children

Then the token stream is serialized to a string:
- Delimiter tokens → emitted verbatim (they're structural)
- Text tokens → characters that would be misinterpreted as delimiters are
  backslash-escaped. The serializer knows exactly which characters are
  dangerous because it knows what delimiters exist.
- HTML tokens → emitted verbatim
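
The second step, turning tokens into a string, can be sketched as below. The names `serializeTokens` and `escapeText`, the exact set of escaped characters, and the choice of backslash plus newline for `break` tokens are all assumptions made for illustration. The real escape set would be derived from the delimiters the tag table defines.

```typescript
type Token =
  | { type: "text"; value: string }
  | { type: "delimiter"; marker: string }
  | { type: "html"; value: string }
  | { type: "break" };

// Assumed set of characters that could be read as inline syntax.
const DANGEROUS = /[\\`*_~\[\]<&]/g;

function escapeText(value: string): string {
  return value.replace(DANGEROUS, (ch) => "\\" + ch);
}

function serializeTokens(tokens: Token[]): string {
  return tokens
    .map((token): string => {
      switch (token.type) {
        case "text":
          return escapeText(token.value); // only text is ever escaped
        case "delimiter":
          return token.marker; // structural, emitted verbatim
        case "html":
          return token.value; // raw passthrough
        case "break":
          return "\\\n"; // assumed canonical hard break
      }
    })
    .join("");
}
```

With this split, `[delim **] [text "hello * world"] [delim **]` serializes to `**hello \* world**`, while a stream whose inner `*` markers are delimiter tokens serializes to `**hello *world***` with no escaping.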

### Why this solves the round-trip problem

The key insight: delimiter tokens and text tokens are different types.
When serializing `<strong>hello <em>world</em></strong>`, the token stream is:

```
[delim **] [text "hello "] [delim *] [text "world"] [delim *] [delim **]
```

The `*` around "world" are delimiter tokens (from the nested `<em>`).
If instead the text contained a literal `*`:

```
<strong>hello * world</strong>
```

The token stream would be:

```
[delim **] [text "hello * world"] [delim **]
```

The `*` is part of a text token. During serialization, the text token scanner
sees `*` and escapes it to `\*` because `*` is a known delimiter character,
so the emitted markdown is `**hello \* world**`.
The delimiter tokens are never escaped. No ambiguity.

## Files

- `types.ts`: Token type, updated Tag interface
- `tokenizer.ts`: Inline tokenizer (markdown → tokens)
- `serializer.ts`: DOM → tokens → markdown string
- `hopdown.ts`: Orchestrator (block parsing, delegates inline to tokenizer)
- `tags.ts`: Tag definitions (simplified: no more regex patterns)

## Migration

The Tag interface changes:
- `pattern` field removed (tokenizer handles delimiter matching)
- `toMarkdown` returns `Token[]` instead of `string`
- `match` stays the same (block-level matching is already clean)
- `toHTML` stays the same

The HopDown public API stays the same:
- `toHTML(markdown)`: unchanged
- `toMarkdown(html)`: unchanged
- `findCompletePair`, `findUnmatchedOpener`: reimplemented on top of the tokenizer
- `getTagForElement`, `getEditableSelector`: unchanged
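
To make the `toMarkdown` change in the Tag interface concrete, here is a hedged sketch of a tag definition after migration. The real `Tag` interface is not shown in this document, so everything here is hypothetical: `InlineTag`, its `selector` field, and the `toMarkdown` parameter are invented for illustration, and `match` and `toHTML` are left out because their signatures are unknown.

```typescript
type Token =
  | { type: "text"; value: string }
  | { type: "delimiter"; marker: string };

// Hypothetical, reduced stand-in for the migrated Tag interface.
interface InlineTag {
  selector: string; // assumed: element name this tag serializes
  // After migration: returns tokens instead of a markdown string,
  // and no `pattern` field is needed.
  toMarkdown(children: Token[]): Token[];
}

const strong: InlineTag = {
  selector: "strong",
  toMarkdown: (children) => [
    { type: "delimiter", marker: "**" },
    ...children,
    { type: "delimiter", marker: "**" },
  ],
};
```

A tag written this way never concatenates or escapes strings itself. It only states which tokens are structural, and escaping is left to the serializer.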