Wiki Parser Thoughts

Gabrys wrote on 24 Mar 2010 09:02

Hello,

First of all, we have that old bug (http://feedback.wikidot.com/bug:16): the "wrong" order of the symbol replacement and include resolution phases, which, as explained in its comments, cannot simply be fixed.

We also have a problem with processing modules: [[code]] and [[html]] cannot be used inside module bodies, and module bodies cannot be nested.

We also have another problem, which is far more generic in its nature. Imagine the following code:

+ Head

[!-- Commented out.
[[include weird-stuff]] 
--]

Some content

Presumably weird-stuff used to be included on that page, and we decided to remove it temporarily. Well, that almost works. Imagine the weird-stuff page has the following source:

[[toc]]
And maybe other weird things --]

And many more below ;-)

What!!! --] inside of that page? OK, how does the original page work now:

+ Head

[!-- Commented out.
[[toc]]
And maybe other weird things --]

And many more below ;-)
--]

Some content

And accidentally you have something from page weird-stuff included on the original page. That sucks (unless you do it on purpose).
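To make the failure concrete, here is a minimal Python sketch of an include-first pipeline. It is only an illustration: the PAGES dictionary, the function names and the two regular expressions are assumptions made for this example, not the actual Wikidot code.

```python
import re

# Hypothetical page store; the content mirrors the weird-stuff example above.
PAGES = {
    "weird-stuff": "[[toc]]\nAnd maybe other weird things --]\n\nAnd many more below ;-)",
}

def resolve_includes(source: str) -> str:
    """Paste included pages in as plain text, as an include-first parser does."""
    return re.sub(r"\[\[include ([\w-]+)\]\]", lambda m: PAGES[m.group(1)], source)

def strip_comments(source: str) -> str:
    """Remove [!-- ... --] blocks; the first --] found ends the comment."""
    return re.sub(r"\[!--.*?--\]", "", source, flags=re.DOTALL)

page = "+ Head\n\n[!-- Commented out.\n[[include weird-stuff]]\n--]\n\nSome content"

# Includes first, comments second: part of weird-stuff escapes the comment.
leaked = strip_comments(resolve_includes(page))
print(leaked)
```

The comment ends at the --] that came from the included page, so the rest of weird-stuff and the original --] both end up in the output.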

OK, next example. Here's the _template page:

title: **%%title%%**

What to expect? Title in bold? Probably, but, look at this crazy page title:

A** -- A with two asterisks

What a stupid title. OK, what do we get after applying the _template:

title: **A** -- A with two asterisks**

This renders A in bold with the following text after that: — A with two asterisks**. Shit, no good!
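The same effect can be reproduced in a few lines of Python. This is a sketch under assumptions: render_bold is a simplified stand-in for the bold rule, and apply_template stands for the symbol replacement phase; neither is taken from the real parser.

```python
import re

def apply_template(template: str, title: str) -> str:
    # Symbol replacement runs first and is a plain text substitution.
    return template.replace("%%title%%", title)

def render_bold(text: str) -> str:
    # Simplified bold rule: **x** becomes <b>x</b>, within a single line only.
    return re.sub(r"\*\*(.+?)\*\*", r"<b>\1</b>", text)

substituted = apply_template("title: **%%title%%**", "A** -- A with two asterisks")
html = render_bold(substituted)
print(html)  # title: <b>A</b> -- A with two asterisks**
```

The bold rule cannot tell which asterisks came from the template and which came from the page title, so it pairs the first two it finds.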

You get the point, right? The current parser handles the includes first and then applies the wiki-syntax rules. So even if things look fine in _template, a malicious page can simply break it. Look at this:

_template:

[[div style="background: red"]]
%%content%%
[[/div]]

We want all the content on a red background, don't we? OK, page source:

Some text.

I use a lot of divs, and accidentally, here's one div closure left:
[[/div]]

Then I have another paragraphs and stuff.

What do we get:

[[div style="background: red"]]
Some text.

I use a lot of divs, and accidentally, here's one div closure left:
[[/div]]

Then I have another paragraphs and stuff.
[[/div]]

The parser closes the first [[div]] with the first [[/div]], which unfortunately is not the right [[/div]].
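A small Python sketch of that mismatch, assuming a naive matcher that pairs an opening [[div]] with the first [[/div]] that follows it (first_div_body is a hypothetical helper written for this example):

```python
TEMPLATE = '[[div style="background: red"]]\n%%content%%\n[[/div]]'

PAGE = (
    "Some text.\n\n"
    "I use a lot of divs, and accidentally, here's one div closure left:\n"
    "[[/div]]\n\n"
    "Then I have another paragraphs and stuff."
)

def first_div_body(source: str) -> str:
    """Return what a naive parser treats as the body of the first [[div ...]]."""
    start = source.index("]]", source.index("[[div")) + 2
    end = source.index("[[/div]]", start)  # first closing tag wins
    return source[start:end]

merged = TEMPLATE.replace("%%content%%", PAGE)
body = first_div_body(merged)
print(body)
```

Everything after the page's stray [[/div]] falls outside the red div, and the [[/div]] from _template is left without a partner.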

How can it work then?

After those examples it becomes clear that there are a few scopes of wiki parsing that the current parser doesn't know about:

  • _template
  • page
  • includes
  • includes in includes

There are probably even more intermediate ones.

How can we improve the parser:

First, let's define a set of "rules" to match against text, to recognize tags. For example:

  • **something**: bold, but only when something is in-line
  • //something//: italic, but only when something is in-line
  • __something__: underline, but only when something is in-line
  • [[div]] new-line something new-line [[/div]]: make a div block
  • [text1 space text]: link to text1, where text1 can't contain a space and text can't contain "]", "[" or a newline
  • [[module <name> <parameters>]] newline body newline [[/module]]

You get the point, it's basically all the Wikidot syntax. Really all? No. What not to list there:

  • includes
  • symbol replacement (when in _template or ListPages module)

So basically the things that create those "scopes" or "phases".

The general rule in parsing would be:

  1. Do something
  2. Apply rules
  3. Do something
  4. Apply rules
  5. Do something
  6. Apply rules

What if we get, for example, "%%content%% is cool" as the result of symbol replacement for %%content%%? We need to know NOT to parse that any more. So we need a notion of marking things as "to-parse" and "parsed". This means that when we get the page source, we need to mark all of it as "to-parse". How to do it? We can do it, for example, this way:

<wiki>
wiki syntax after converting to XML-string:
* all & converted to &amp;
* all < converted to &lt;
* all > converted to &gt;
</wiki>
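As a rough Python sketch of this marking step (mark_to_parse is a name invented here; escape comes from the standard library and converts exactly &, < and >):

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

def mark_to_parse(source: str) -> str:
    """Wrap raw wiki source in <wiki>, escaping &, < and > so it is valid XML text."""
    return "<wiki>" + escape(source) + "</wiki>"

source = "[[div]] a < b & **c** > d"
marked = mark_to_parse(source)
print(marked)  # <wiki>[[div]] a &lt; b &amp; **c** &gt; d</wiki>

# Any XML parser gives the original wiki text back unchanged.
recovered = ET.fromstring(marked).text
```

Because the source is escaped, a literal <wiki> typed by a user into a page cannot be confused with the parser's own marker.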

Then we know that everything inside <wiki> is still to be parsed. We can then inject XML tags into it, and we know that those are already parsed. For example:

<wiki>
**%%title%%**

%%content%%
</wiki>

This can be parsed this way:

<b><wiki>%%title%%</wiki></b>

<wiki>%%content%%</wiki>

And then "%%content%%" becomes "%%content%% is cool" and %%title%% becomes "Some random** title":

<b>Some random** title</b>

%%content%% is cool

There is no "<wiki>" around those, so we know this is the final version, not to be parsed any more.

The actual process would be a bit more complicated, but this is the general idea.
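The idea can be sketched in Python with a flat list of segments instead of real XML. Everything here is an assumption made for illustration: the Seg class, the rewrite helper and the two patterns are not part of any existing parser, and a real implementation would also escape the replaced values.

```python
import re
from dataclasses import dataclass

@dataclass
class Seg:
    text: str
    to_parse: bool  # True: still inside <wiki>; False: final output

BOLD = re.compile(r"\*\*(.+?)\*\*")
SYMBOL = re.compile(r"%%(\w+)%%")

def rewrite(segs, pattern, on_match):
    """Apply one rule to to-parse segments only; parsed segments pass through."""
    out = []
    for seg in segs:
        if not seg.to_parse:
            out.append(seg)
            continue
        pos = 0
        for m in pattern.finditer(seg.text):
            if m.start() > pos:
                out.append(Seg(seg.text[pos:m.start()], True))
            out.extend(on_match(m))
            pos = m.end()
        if pos < len(seg.text):
            out.append(Seg(seg.text[pos:], True))
    return out

def bold(m):
    # The tags are final; the text between them is still to be parsed.
    return [Seg("<b>", False), Seg(m.group(1), True), Seg("</b>", False)]

def symbols(values):
    # Replaced values are marked as parsed, so no rule touches them again.
    return lambda m: [Seg(values[m.group(1)], False)]

values = {"title": "Some random** title", "content": "%%content%% is cool"}

segs = [Seg("**%%title%%**\n\n%%content%%", True)]
segs = rewrite(segs, BOLD, bold)
segs = rewrite(segs, SYMBOL, symbols(values))
segs = rewrite(segs, BOLD, bold)               # a second rule pass is harmless
segs = rewrite(segs, SYMBOL, symbols(values))  # and so is a second replacement
result = "".join(s.text for s in segs)
print(result)
```

The asterisks inside the title and the %%content%% inside the content survive untouched, because they only ever appear in segments marked as parsed.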

Let's look at the includes:

<wiki>
[[div style="background: red"]]

+ Hello

[[include some-page]] 

++ Hello 2

Hello guys!

[[/div]]
</wiki>

Let's apply rules:

<div style="background: red">

<h1>Hello</h1>

<wiki>[[include some-page]]</wiki>

<h2>Hello 2</h2>

Hello guys!
</div>

So we have that page already parsed. Now let's resolve the includes (assume the included page contains a malicious [[/div]]):

<div style="background: red">

<h1>Hello</h1>

<wiki>
Include page

[[/div]]
</wiki>

<h2>Hello 2</h2>

Hello guys!
</div>

And now, after another rule-applying, nothing breaks!

<div style="background: red">

<h1>Hello</h1>

Include page

[[/div]]

<h2>Hello 2</h2>

Hello guys!
</div>

Notice the [[/div]] is not wrapped in <wiki> tags, so it won't break any <div>!
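Here is a rough Python sketch of that order of operations: rules are applied to the page's own markup first, and each include is then rendered as a separate scope. The PAGES dictionary, the regular expressions and the function names are assumptions for this example; headings are ignored, nested divs are not handled, and include cycles are not detected.

```python
import re

# Hypothetical included page with a stray closing tag.
PAGES = {"some-page": "Include page\n\n[[/div]]"}

DIV = re.compile(r"\[\[div([^\]]*)\]\]\n(.*?)\n\[\[/div\]\]", re.DOTALL)
INCLUDE = re.compile(r"\[\[include ([\w-]+)\]\]")

def apply_rules(scope_text: str) -> str:
    """Apply the div rule within one scope; unmatched markers stay literal text."""
    return DIV.sub(lambda m: f"<div{m.group(1)}>\n{m.group(2)}\n</div>", scope_text)

def render(source: str) -> str:
    # 1. The rules only see the markup that belongs to this scope.
    parsed = apply_rules(source)
    # 2. Each include is rendered as its own scope and spliced in as final
    #    output; re.sub does not rescan the text it inserts.
    return INCLUDE.sub(lambda m: render(PAGES[m.group(1)]), parsed)

page = (
    '[[div style="background: red"]]\n\n'
    "+ Hello\n\n"
    "[[include some-page]]\n\n"
    "++ Hello 2\n\n"
    "Hello guys!\n\n"
    "[[/div]]"
)
out = render(page)
print(out)
```

The [[/div]] from some-page has no opening partner inside its own scope, so it is emitted as literal text and the outer div still wraps "Hello guys!".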

But how about using [[include]]s to start and stop things:

[[include start-list]] 

Item 1

[[include list-separator]] 

Item 2

[[include list-separator]] 

Item 3

[[include list-separator]] 

Item 4

[[include end-list]]

start-list source code:

[[table style="some styling"]]
[[row style="some styling"]]
[[cell style="some styling"]]
Header 1
[[/cell]]
[[cell style="some styling"]]
Header 2
[[/cell]]
[[/row]]
[[row]]
[[cell style="some other styling"]]
Item:
[[/cell]]
[[cell style="item-styling"]]

list-separator:

[[/cell]]
[[/row]]
[[row]]
[[cell style="some other styling"]]
Item:
[[/cell]]
[[cell style="item-styling"]]

end-list:

[[/cell]]
[[/row]]
[[/table]]

OK, a word of explanation. Some people use includes to mimic sophisticated tags that they would like to have. That's quite OK, but they need to ensure that the includes are properly designed and that the start and end includes are used in the proper way.

So let's parse that:

<wiki>
[[include start-list]] 

Item 1

[[include list-separator]] 

Item 2

[[include list-separator]] 

Item 3

[[include list-separator]] 

Item 4

[[include end-list]] 
</wiki>

Parsing gives (almost) the same thing, since there are no tags in there:

<wiki>
[[include start-list]] 
</wiki>

Item 1

<wiki>
[[include list-separator]] 
</wiki>

Item 2

<wiki>
[[include list-separator]] 
</wiki>

Item 3

<wiki>
[[include list-separator]] 
</wiki>

Item 4

<wiki>
[[include end-list]] 
</wiki>

Now include resolution:

<wiki>
[[table style="some styling"]]
[[row style="some styling"]]
[[cell style="some styling"]]
Header 1
[[/cell]]
[[cell style="some styling"]]
Header 2
[[/cell]]
[[/row]]
[[row]]
[[cell style="some other styling"]]
Item:
[[/cell]]
[[cell style="item-styling"]]

</wiki>

Item 1

<wiki>
[[/cell]]
[[/row]]
[[row]]
[[cell style="some other styling"]]
Item:
[[/cell]]
[[cell style="item-styling"]]
</wiki>

Item 2

<wiki>
[[/cell]]
[[/row]]
[[row]]
[[cell style="some other styling"]]
Item:
[[/cell]]
[[cell style="item-styling"]]
</wiki>

Item 3

<wiki>
[[/cell]]
[[/row]]
[[row]]
[[cell style="some other styling"]]
Item:
[[/cell]]
[[cell style="item-styling"]]
</wiki>

Item 4

<wiki>
[[/cell]]
[[/row]]
[[/table]]
</wiki>

Apply rules:

<table style="some styling">
<tr style="some styling">
<td style="some styling">
Header 1
</td>
<td style="some styling">
Header 2
</td>
</tr>
<tr>
<td style="some other styling">
Item:
</td>
<td style="item-styling">

Item 1

</td>
</tr>
<tr>
<td style="some other styling">
Item:
</td>
<td style="item-styling">

Item 2

</td>
</tr>
<tr>
<td style="some other styling">
Item:
</td>
<td style="item-styling">

Item 3

</td>
</tr>
<tr>
<td style="some other styling">
Item:
</td>
<td style="item-styling">

Item 4

</td>
</tr>
</table>

Nice, it works. But we needed to parse things that start and stop in different <wiki> sections. That is a thing to remember when writing the parser.
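One way to handle tags that open in one <wiki> section and close in a later one is to share a single stack of open tags across all sections that are pending in the same pass. The Python sketch below assumes exactly that; the segment representation and the names TAG, HTML and render are invented for the example, and tag attributes are dropped for brevity.

```python
import re

TAG = re.compile(r"\[\[(/?)(table|row|cell)[^\]]*\]\]")
HTML = {"table": "table", "row": "tr", "cell": "td"}

def render(segments):
    """segments is a list of (text, to_parse) pairs processed in one pass.

    A tag may open in one to-parse segment and close in a later one,
    because the stack of open tags is shared across the whole list.
    """
    out, stack = [], []
    for text, to_parse in segments:
        if not to_parse:
            out.append(text)  # already final, copied through untouched
            continue
        pos = 0
        for m in TAG.finditer(text):
            out.append(text[pos:m.start()])
            closing, name = m.group(1) == "/", m.group(2)
            if not closing:
                stack.append(name)
                out.append(f"<{HTML[name]}>")
            elif stack and stack[-1] == name:
                stack.pop()
                out.append(f"</{HTML[name]}>")
            else:
                out.append(m.group(0))  # unmatched close stays literal
            pos = m.end()
        out.append(text[pos:])
    return "".join(out), stack

segments = [
    ("[[table]]\n[[row]]\n[[cell]]", True),       # from start-list
    ("Item 1", False),                            # already-parsed page text
    ("[[/cell]]\n[[/row]]\n[[/table]]", True),    # from end-list
]
html, still_open = render(segments)
print(html)
```

A non-empty stack at the end would mean that some include opened a tag and nothing closed it, which the parser could report instead of producing broken HTML.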

Let's see another thing.

[[module ListPages author="@URL"]]
* %%title_linked%% by [/pagename/author/%%author%% %%author%%]
[[/module]]

This lists pages by everyone, or by the author passed in the URL. In the ListPages body we want to use %%author%% as part of the link. Assume the author is "James Kanjo".

This is how it works now, after symbol replacement:

[[module ListPages author="@URL"]]
* %%title%% by [/pagename/author/James Kanjo James Kanjo]
[[/module]]

And then a link to /pagename/author/James is created with the text "Kanjo James Kanjo", which is not good.

Let's see how it could work in the proposed parser:

<wiki>
[[module ListPages author="@URL"]]
* %%title%% by [/pagename/author/%%author%% %%author%%]
[[/module]]
</wiki>

Then:

<module name="ListPages">
  <param name="author"><wiki>@URL</wiki></param>
  <body-as-wiki>
* %%title%% by [/pagename/author/%%author%% %%author%%]
  </body-as-wiki>
</module>

(Indentation added only for clarity.)

Then let's go into the ListPages module's parsing pass for one particular page, with %%author%% = James Kanjo and %%title%% = Great page.

First take <body-as-wiki> and convert it to <wiki>:

<wiki>
* %%title%% by [/pagename/author/%%author%% %%author%%]
</wiki>

Then apply rules:

<list-item>
<wiki>%%title%%</wiki> by
<link>
  <to>/pagename/author/<wiki>%%author%%</wiki></to>
  <text><wiki>%%author%%</wiki></text>
</link>
</list-item>

Symbol replacement:

<list-item>Great page by
<link>
  <to>/pagename/author/James Kanjo</to>
  <text>James Kanjo</text>
</link>
</list-item>

And the final (simple) conversion to HTML:

<li>
Great page by <a href="/pagename/author/James Kanjo">James Kanjo</a>
</li>

Great, isn't it?
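The difference between the two orders can be checked with a short Python sketch. The LINK and SYMBOL patterns and the function names are assumptions for this example; as in the HTML above, the space in the link target is left as it is rather than URL-encoded.

```python
import re
from html import escape

LINK = re.compile(r"\[(\S+) ([^\[\]\n]+)\]")
SYMBOL = re.compile(r"%%(\w+)%%")

def fill(text: str, values: dict) -> str:
    """Plain symbol replacement on one piece of text."""
    return SYMBOL.sub(lambda m: values[m.group(1)], text)

def current_order(body: str, values: dict):
    """Symbols first, link rule second: the author's space splits the link."""
    m = LINK.search(fill(body, values))
    return m.group(1), m.group(2)

def proposed_order(body: str, values: dict) -> str:
    """Link rule first; symbols are then filled inside each field separately."""
    out, pos = [], 0
    for m in LINK.finditer(body):
        out.append(escape(fill(body[pos:m.start()], values)))
        to = fill(m.group(1), values)
        text = fill(m.group(2), values)
        out.append(f'<a href="{escape(to)}">{escape(text)}</a>')
        pos = m.end()
    out.append(escape(fill(body[pos:], values)))
    return "<li>" + "".join(out) + "</li>"

body = "%%title%% by [/pagename/author/%%author%% %%author%%]"
values = {"title": "Great page", "author": "James Kanjo"}

broken = current_order(body, values)
fixed = proposed_order(body, values)
print(broken)
print(fixed)
```

In the proposed order the link's target and text are fixed before any value is substituted, so a space in the author name can no longer move the boundary between them.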

Who wants to check if bug:16 will still be a bug in this scenario? :)

Summary

So, to cut a long story short, this kind of parser takes (more or less) the existing parsing rules, adds a notion of "what's parsed" and "what's to parse", and lets us decide in more detail in what order the different "magic" things happen, like include resolution, symbol replacement and other stuff. Moreover, we can apply the parsing rules many times without fear of destroying anything (as we have already marked the parsed parts as parsed).

Note that the examples above are heavily simplified, to give you the idea; they are not meant to be proper design test cases. For example, the new-lines problem (one converts to <br/>, two convert to <p>…</p>) was completely ignored in this text.

Please read this carefully, try to understand it, and look for any problems there could be with this parser, because it is now my favorite potential solution for bug:16, so we might decide to implement it.


Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License