I hope everyone had a happy Thanksgiving! We certainly did, which is not surprising given how much we have to be thankful for.
The remainder of this post is likely to be of interest mostly to programmers and folks who want to know how things work inside. Later on, as promised, I’ll have some posts dealing with changes to the markup itself. If you came here looking for those, please try again in a day or two; this post isn’t it.
I took the whole of Thanksgiving week off, and spent a good deal of it rewriting Notebook’s markup parser. Before I could think about extending the markup syntax I needed to re-familiarize myself with the order in which everything is parsed; there are also some long-standing bugs that can’t be fixed without a major rewrite; and on top of that, the output of the existing parser is not nearly as useful as it could be.
I’ll explain that by way of an example. Consider the following markup:
= Important Stuff =

This is a page with a link to some [important stuff]
that you really ought to read!
Now, there are two reasons to parse this page: to render it beautifully for display to the user, and to transform the page in some way. As an example of the latter, if the page named “important stuff” is renamed “not-so-important stuff” we want to update all of the links to it, including the one in the snippet of markup shown here.
For the purposes of rendering, all we really need is the semantic content in the page. But for transformations, such as renaming links, we really want to be able to recreate the page just as it was, with the exception of the changes we meant to make. To do this, we need access to the syntactic content as well. The existing parsed form does a bad job of this: it mingles the semantic and syntactic content in such a way that any transformational code needs to understand the semantics in order to do its job.
Let me explain. Here’s the parsed form of the snippet shown above; it consists of a list of tags and values. The tag indicates the kind of thing found in the input, and the value is intended to provide enough information to render it, and to recreate the input text:
H {1 {Important Stuff}}
NL {\n\n}
P {: 0}
TXT {This is a page with a link to some }
LINK {important stuff}
TXT {\nthat you really ought to read!\n}
/P {}
The “H” tag is produced by the header; its value tells us that it’s a level 1 header, and that the header string is “Important Stuff”. Similarly, the “LINK” tag tells us that there’s a page link, and the link text is “important stuff”. The “NL” tag indicates that there was a blank line following the header, and that the blank line was indicated by two newline characters.
Note the difference between the “NL” tag on the one hand and the “H” and “LINK” tags on the other. The value of the “NL” tag is precisely the text that was read from the input. The value of the “H” and “LINK” tags includes only a portion of the input text. And herein lies the problem. Our algorithm to rename changed links must parse this text so that it can find the link that needs to be changed, and then put the input text back together again. In order to do this, the algorithm must understand the semantics and syntax of every tag, i.e., it must know that “H” is a header and what the “H” tag’s value looks like, and what the markup for headers is, even though headers are irrelevant to the goal that it’s trying to accomplish.
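To make the problem concrete, here is a minimal Python sketch (not Notebook's actual code) of what a link-renaming pass over the old parsed form has to look like. The tag list is transcribed from the listing above; the function name `rebuild_old` and the rule that a level-N header is written with N equals signs on each side are assumptions made for illustration.

```python
# Old-style parse result as (tag, value) pairs, transcribed from the listing above.
OLD_TAGS = [
    ("H", (1, "Important Stuff")),
    ("NL", "\n\n"),
    ("P", (":", 0)),
    ("TXT", "This is a page with a link to some "),
    ("LINK", "important stuff"),
    ("TXT", "\nthat you really ought to read!\n"),
    ("/P", None),
]


def rebuild_old(tags, old_name=None, new_name=None):
    """Recreate the markup from old-style tags, optionally renaming one link."""
    out = []
    for tag, value in tags:
        if tag == "H":
            # The renamer has to know header syntax just to put the header back.
            level, text = value
            marker = "=" * level  # ASSUMPTION: level N means N equals signs
            out.append(f"{marker} {text} {marker}")
        elif tag in ("NL", "TXT"):
            out.append(value)  # these values are the literal input text
        elif tag == "LINK":
            name = new_name if value == old_name else value
            out.append(f"[{name}]")  # brackets must be re-added by hand
        elif tag in ("P", "/P"):
            pass  # logical markers; no input text to restore
        else:
            raise ValueError(f"unknown tag: {tag}")
    return "".join(out)
```

Every branch other than the "LINK" one exists only to undo the parser's work, and any new tag type would need a branch of its own.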
Here’s the new parsed form:
H {0 20} {level 1 text {Important Stuff}}
P {} {code : indent 0}
TXT {21 55} {}
LBLINK {56 56} {}
LINK {57 71} {}
RBLINK {72 72} {}
TXT {73 104} {}
/P {} {}
Note that each tag now has two values instead of one. The first value is a pair of numbers representing the indices of the matched text within the input string. In some cases, the tag represents a logical point within the input, such as the beginning or end of a paragraph, and doesn’t actually match any characters; in this case, the pair is empty, “{}”. The second value contains any additional semantic information needed to render or transform the specific content. The “H” tag, for example, indicates what the header level is and what the header text is.
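As a rough illustration of that structure, the new parsed form might be modeled in Python as below. This is a sketch under assumptions, not the parser's real representation: the `Tag` type, the `matched_text` helper, and the use of a dict for the second value are all hypothetical. The index pairs are treated as inclusive on both ends, which is consistent with "LBLINK {56 56}" matching the single "[" character.

```python
from typing import NamedTuple, Optional, Tuple


class Tag(NamedTuple):
    name: str
    span: Optional[Tuple[int, int]]  # inclusive (first, last); None stands for "{}"
    info: dict                       # extra semantic data, possibly empty


# The sample page, with the blank line after the header written out explicitly.
SOURCE = (
    "= Important Stuff =\n"
    "\n"
    "This is a page with a link to some [important stuff]\n"
    "that you really ought to read!\n"
)

# Transcribed from the new parsed form shown above.
TAGS = [
    Tag("H", (0, 20), {"level": 1, "text": "Important Stuff"}),
    Tag("P", None, {"code": ":", "indent": 0}),
    Tag("TXT", (21, 55), {}),
    Tag("LBLINK", (56, 56), {}),
    Tag("LINK", (57, 71), {}),
    Tag("RBLINK", (72, 72), {}),
    Tag("TXT", (73, 104), {}),
    Tag("/P", None, {}),
]


def matched_text(source: str, tag: Tag) -> str:
    """Return the input text a tag matched; empty for purely logical tags."""
    if tag.span is None:
        return ""
    first, last = tag.span
    return source[first:last + 1]  # +1 because the last index is inclusive
```

With this model, `matched_text(SOURCE, TAGS[4])` yields the link text and `matched_text(SOURCE, TAGS[0])` yields the header line together with the blank line that follows it.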
Because each tag indicates the span of input text it matches, and since every character of the input text must be matched in order, we can rebuild the input string simply by iterating over the tags and looking at the first value for each, without any concern for the type of each tag. Thus, to rename links the algorithm now looks like this:
For each tag,
    If the tag is not "LINK",
        Use the indices to copy the input text to the output.
    Otherwise, if the link text references the renamed page,
        Copy the new page name to the output.
    Otherwise,
        Copy the old link text to the output unchanged.
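The steps above could be sketched in Python roughly as follows. This is an illustration rather than the real implementation: the function name `rename_links`, the tuple layout, and the exact string comparison used to decide whether a link "references the renamed page" are assumptions, and the index pairs are taken to be inclusive.

```python
# The sample page, with the blank line after the header written out explicitly.
SOURCE = (
    "= Important Stuff =\n"
    "\n"
    "This is a page with a link to some [important stuff]\n"
    "that you really ought to read!\n"
)

# (tag name, inclusive span or None, extra info), from the new parsed form.
TAGS = [
    ("H", (0, 20), {"level": 1, "text": "Important Stuff"}),
    ("P", None, {"code": ":", "indent": 0}),
    ("TXT", (21, 55), {}),
    ("LBLINK", (56, 56), {}),
    ("LINK", (57, 71), {}),
    ("RBLINK", (72, 72), {}),
    ("TXT", (73, 104), {}),
    ("/P", None, {}),
]


def rename_links(source, tags, old_name, new_name):
    """Rebuild the page from tag spans, swapping in new_name for matching links."""
    out = []
    for name, span, _info in tags:
        if span is None:
            continue  # logical marker such as P or /P; it matched no characters
        first, last = span
        text = source[first:last + 1]  # last index is inclusive
        if name == "LINK" and text == old_name:  # ASSUMPTION: exact match
            out.append(new_name)
        else:
            out.append(text)  # every other tag is copied through untouched
    return "".join(out)
```

Calling `rename_links(SOURCE, TAGS, "important stuff", "not-so-important stuff")` returns the page with only the link text changed, and a rename that matches no link returns the input exactly as it was.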
Note that there’s no need for this code to understand anything about headers, or indeed about any tag but “LINK”. The new parser output is much more useful, and I think it will serve me well in general, though possibly with a little tweaking.