Gabrys wrote on 24 Mar 2010 09:02
Hello,
First of all, we have that old bug (http://feedback.wikidot.com/bug:16): the "wrong" order of the symbol replacement and include resolution phases, which, as explained in the comments there, cannot be fixed simply.
We also have a problem with processing modules: [[code]] and [[html]] cannot be used inside module bodies, and module bodies cannot be nested.
We also have another problem, which is far more generic in its nature. Imagine the following code:
+ Head
[!-- Commented out.
[[include weird-stuff]]
--]
Some content
Presumably weird-stuff used to be included on that page, but we decided to remove it temporarily. Well, that almost works. Now imagine the weird-stuff page has the following source:
[[toc]]
And maybe other weird things --]
And many more below ;-)
What!!! --] inside of that page? OK, how does the original page work now:
+ Head
[!-- Commented out.
[[toc]]
And maybe other weird things --]
And many more below ;-)
--]
Some content
And accidentally you have something from page weird-stuff included on the original page. That sucks (unless you do it on purpose).
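The leak can be reproduced with a few lines of Python. This is only a minimal sketch of the include-first order, not the real Wikidot code: the `PAGES` dictionary, `resolve_includes` and `strip_comments` are made-up names, and the regular expressions are simplified assumptions about the syntax.

```python
import re

# Hypothetical page store standing in for the wiki database.
PAGES = {
    "weird-stuff": "[[toc]]\nAnd maybe other weird things --]\nAnd many more below ;-)",
}

def resolve_includes(source: str) -> str:
    """Paste included pages in as plain text (the include-first order)."""
    return re.sub(r"\[\[include ([\w-]+)\]\]", lambda m: PAGES[m.group(1)], source)

def strip_comments(source: str) -> str:
    """Remove [!-- ... --] comments; the first --] found closes the comment."""
    return re.sub(r"\[!--.*?--\]", "", source, flags=re.DOTALL)

page = "+ Head\n[!-- Commented out.\n[[include weird-stuff]]\n--]\nSome content"
result = strip_comments(resolve_includes(page))
print(result)
```

The comment is closed by the `--]` that came from the included page, so the tail of weird-stuff and the original `--]` both end up in the visible output.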
OK, next example. Here's the _template page:
title: **%%title%%**
What to expect? Title in bold? Probably, but, look at this crazy page title:
A** -- A with two asterisks
What a stupid title. OK, what do we get after applying the _template:
title: **A** -- A with two asterisks**
This renders A in bold with the following text after that: — A with two asterisks**. Shit, no good!
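Here is the same thing as a small Python sketch, assuming the template is filled in by plain text replacement before the bold rule runs. The helper names `apply_template` and `render_bold` are invented for this illustration, and the bold regex is a simplification.

```python
import re

def apply_template(template: str, symbols: dict) -> str:
    """Naive symbol replacement: plain text substitution, done before parsing."""
    for name, value in symbols.items():
        template = template.replace(f"%%{name}%%", value)
    return template

def render_bold(text: str) -> str:
    """Simplified bold rule: **something** on a single line becomes <b>."""
    return re.sub(r"\*\*([^\n]+?)\*\*", r"<b>\1</b>", text)

template = "title: **%%title%%**"
title = "A** -- A with two asterisks"

expanded = apply_template(template, {"title": title})
html = render_bold(expanded)
print(expanded)
print(html)
```

Only the `A` ends up inside `<b>`, and the trailing `**` stays in the output as literal text.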
You get the point, right? The current parser handles the includes first and then applies the wiki-syntax rules. So even if things look fine in the _template, a malicious page can simply break it. Look at this:
_template:
[[div style="background: red"]]
%%content%%
[[/div]]
We want all the content on a red background, don't we? OK, page source:
Some text.
I use a lot of divs, and accidentally, here's one div closure left:
[[/div]]
Then I have another paragraphs and stuff.
What do we get:
[[div style="background: red"]]
Some text.
I use a lot of divs, and accidentally, here's one div closure left:
[[/div]]
Then I have another paragraphs and stuff.
[[/div]]
The parser closes the first [[div]] with the first [[/div]], which unfortunately is not the right [[/div]].
How can it work then?
After those examples it becomes clear that there are a few scopes of wiki parsing that the current parser doesn't know about:
- _template
- page
- includes
- includes in includes
- …
There are probably even more intermediate ones.
How can we improve the parser?
First, let's define a set of "rules" to match against text, to recognize tags. For example:
- **something**: bold, but only when something is in-line
- //something//: italic, but only when something is in-line
- __something__: underline, but only when something is in-line
- [[div]] new-line something new-line [[/div]] — make a div block
- [text1 space text] — link to text1, but text1 can't contain space, text can't contain "]", "[" or newline
- [[module <name> <parameters>]] newline body newline [[/module]]
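A few of these rules could be written as a table of regular expressions. This is a rough sketch only: the `INLINE_RULES` table and `apply_inline_rules` are hypothetical names, the patterns are my own approximation of "in-line" (no newline inside the match), and real rules would need more care (for example `//` inside a URL).

```python
import re

# Each rule: (pattern, replacement). [^\n] keeps a match on one line ("in-line").
INLINE_RULES = [
    (re.compile(r"\*\*([^\n]+?)\*\*"), r"<b>\1</b>"),
    (re.compile(r"//([^\n]+?)//"), r"<i>\1</i>"),
    (re.compile(r"__([^\n]+?)__"), r"<u>\1</u>"),
    # [target text]: target has no whitespace; text has no "[", "]" or newline.
    (re.compile(r"\[([^\s\[\]]+) ([^\[\]\n]+)\]"), r'<a href="\1">\2</a>'),
]

def apply_inline_rules(text: str) -> str:
    """Apply every rule in order to a piece of to-parse text."""
    for pattern, replacement in INLINE_RULES:
        text = pattern.sub(replacement, text)
    return text

print(apply_inline_rules("**bold** and //italic// and __under__"))
print(apply_inline_rules("[/page Some text]"))
print(apply_inline_rules("**not\nbold**"))
```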
You get the point, it's basically all the Wikidot syntax. Really all? No. What not to list there:
- includes
- symbol replacement (when in _template or ListPages module)
So basically things, that make those "scopes" or "phases".
The general rule in parsing would be:
- Do something
- Apply rules
- Do something
- Apply rules
- Do something
- Apply rules
- …
What if we get for example "%%content%% is cool" as a result of symbol replacement for %%content%%? We need to know to NOT parse that anymore. So we need a notion of marking things as "to-parse" and "parsed". This means, when we get the page source, we need to mark it all with "to-parse". How to do it? We can do it for example this way:
<wiki>
wiki syntax after converting to XML-string:
* all & converted to &amp;
* all < converted to &lt;
* all > converted to &gt;
</wiki>
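The wrapping step could look roughly like this in Python. `wrap_as_wiki` is a made-up name; `escape` from the standard `xml.sax.saxutils` module performs exactly the three conversions listed above.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def wrap_as_wiki(source: str) -> str:
    """Mark a whole page source as "to-parse" by wrapping it in <wiki>."""
    # escape() converts & to &amp;, < to &lt; and > to &gt;
    return "<wiki>" + escape(source) + "</wiki>"

source = "a < b && c > d"
wrapped = wrap_as_wiki(source)
print(wrapped)

# The result is well-formed XML, and the original text can be read back.
assert ET.fromstring(wrapped).text == source
```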
Then we know — everything inside <wiki> is to be parsed. Then we can inject some XML tags into that, and we know, they are already parsed. For example:
<wiki>
**%%title%%**
%%content%%
</wiki>
This can be parsed this way:
<b><wiki>%%title%%</wiki></b>
<wiki>%%content%%</wiki>
And then "%%content%%" becomes "%%content%% is cool" and %%title%% becomes "Some random** title":
<b>Some random** title</b>
%%content%% is cool
No "<wiki>" around those, so we know it's final version, not to parse anymore.
The actual process would be a bit more complicated, but this is the general idea.
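To make the "to-parse" versus "parsed" distinction concrete, here is a small Python sketch. It models the document as a list of segments instead of an XML string, which is my own simplification; the `Wiki` and `Final` classes and `replace_symbols` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Wiki:
    """Text that still has to be parsed (inside <wiki>)."""
    text: str

@dataclass
class Final:
    """Text or markup that must never be parsed again."""
    text: str

def replace_symbols(nodes, symbols):
    """Turn a Wiki node holding exactly one %%symbol%% into Final text."""
    out = []
    for node in nodes:
        if isinstance(node, Wiki) and node.text.startswith("%%") and node.text.endswith("%%"):
            name = node.text[2:-2]
            if name in symbols:
                out.append(Final(symbols[name]))  # the value is final, not re-parsed
                continue
        out.append(node)
    return out

# State after the first rule pass of the template example.
nodes = [Final("<b>"), Wiki("%%title%%"), Final("</b>\n"), Wiki("%%content%%")]
symbols = {"title": "Some random** title", "content": "%%content%% is cool"}

done = replace_symbols(nodes, symbols)
output = "".join(node.text for node in done)
print(output)
```

Running `replace_symbols` a second time changes nothing, because no `Wiki` segments are left: the `**` in the title and the `%%content%%` inside the content are just final text.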
Let's look at the includes:
<wiki>
[[div style="background: red"]]
+ Hello
[[include some-page]]
++ Hello 2
Hello guys!
[[/div]]
</wiki>
Let's apply rules:
<div style="background: red">
<h1>Hello</h1>
<wiki>[[include some-page]]</wiki>
<h2>Hello 2</h2>
Hello guys!
</div>
So we have that page already parsed, now let's resolve includes: (assume the include contains malicious [[/div]]).
<div style="background: red">
<h1>Hello</h1>
<wiki>
Include page
[[/div]]
</wiki>
<h2>Hello 2</h2>
Hello guys!
</div>
And now, after another rule-applying, nothing breaks!
<div style="background: red">
<h1>Hello</h1>
Include page
[[/div]]
<h2>Hello 2</h2>
Hello guys!
</div>
Notice the [[/div]] is not wrapped in <wiki> tags, so it won't break any <div>!
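The scoping effect can be sketched in Python. Note that this is a recursive shortcut of my own, not the multi-pass `<wiki>` process described above: each include is parsed as a separate scope, so a closing tag can only match an opening tag from the same page. `PAGES` and `parse_scope` are hypothetical, only a bare `[[div]]` without attributes is handled, and the sketch deliberately ignores includes that open a tag in one page and close it in another.

```python
import re
from html import escape

# Hypothetical page store; some-page contains the malicious [[/div]].
PAGES = {"some-page": "Include page\n[[/div]]"}

TOKEN = re.compile(r"\[\[div\]\]|\[\[/div\]\]|\[\[include ([\w-]+)\]\]")

def parse_scope(source: str) -> str:
    """Parse one scope; tags pair up only with tags from the same scope."""
    out, open_divs, pos = [], 0, 0
    for m in TOKEN.finditer(source):
        out.append(escape(source[pos:m.start()]))
        pos = m.end()
        token = m.group(0)
        if token == "[[div]]":
            open_divs += 1
            out.append("<div>")
        elif token == "[[/div]]":
            if open_divs:
                open_divs -= 1
                out.append("</div>")
            else:
                out.append(escape(token))  # stray closer stays plain text
        else:
            out.append(parse_scope(PAGES[m.group(1)]))  # include = new scope
    out.append(escape(source[pos:]))
    out.append("</div>" * open_divs)  # close whatever this scope left open
    return "".join(out)

page = "[[div]]\nHello\n[[include some-page]]\nHello 2\n[[/div]]"
html_out = parse_scope(page)
print(html_out)
```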
But how about using [[include]]s to start and stop things:
[[include start-list]]
Item 1
[[include list-separator]]
Item 2
[[include list-separator]]
Item 3
[[include list-separator]]
Item 4
[[include end-list]]
start-list source code:
[[table style="some styling"]]
[[row style="some styling"]]
[[cell style="some styling"]]
Header 1
[[/cell]]
[[cell style="some styling"]]
Header 2
[[/cell]]
[[/row]]
[[row]]
[[cell style="some other styling"]]
Item:
[[/cell]]
[[cell style="item-styling"]]
list-separator:
[[/cell]]
[[/row]]
[[row]]
[[cell style="some other styling"]]
Item:
[[/cell]]
[[cell style="item-styling"]]
end-list:
[[/cell]]
[[/row]]
[[/table]]
OK, a word of explanation. Some people use includes to mimic sophisticated tags they would like to have. That's quite OK, but they need to ensure that the includes are properly designed and that the start and end includes are used in the proper way.
So let's parse that:
<wiki>
[[include start-list]]
Item 1
[[include list-separator]]
Item 2
[[include list-separator]]
Item 3
[[include list-separator]]
Item 4
[[include end-list]]
</wiki>
Parsing gives (almost) the same (no tags in there):
<wiki>
[[include start-list]]
</wiki>
Item 1
<wiki>
[[include list-separator]]
</wiki>
Item 2
<wiki>
[[include list-separator]]
</wiki>
Item 3
<wiki>
[[include list-separator]]
</wiki>
Item 4
<wiki>
[[include end-list]]
</wiki>
Now include resolution:
<wiki>
[[table style="some styling"]]
[[row style="some styling"]]
[[cell style="some styling"]]
Header 1
[[/cell]]
[[cell style="some styling"]]
Header 2
[[/cell]]
[[/row]]
[[row]]
[[cell style="some other styling"]]
Item:
[[/cell]]
[[cell style="item-styling"]]
</wiki>
Item 1
<wiki>
[[/cell]]
[[/row]]
[[row]]
[[cell style="some other styling"]]
Item:
[[/cell]]
[[cell style="item-styling"]]
</wiki>
Item 2
<wiki>
[[/cell]]
[[/row]]
[[row]]
[[cell style="some other styling"]]
Item:
[[/cell]]
[[cell style="item-styling"]]
</wiki>
Item 3
<wiki>
[[/cell]]
[[/row]]
[[row]]
[[cell style="some other styling"]]
Item:
[[/cell]]
[[cell style="item-styling"]]
</wiki>
Item 4
<wiki>
[[/cell]]
[[/row]]
[[/table]]
</wiki>
Apply rules:
<table style="some styling">
<tr style="some styling">
<td style="some styling">
Header 1
</td>
<td style="some styling">
Header 2
</td>
</tr>
<tr>
<td style="some other styling">
Item:
</td>
<td style="item-styling">
Item 1
</td>
</tr>
<tr>
<td style="some other styling">
Item:
</td>
<td style="item-styling">
Item 2
</td>
</tr>
<tr>
<td style="some other styling">
Item:
</td>
<td style="item-styling">
Item 3
</td>
</tr>
<tr>
<td style="some other styling">
Item:
</td>
<td style="item-styling">
Item 4
</td>
</tr>
</table>
Nice, it works. But we needed to parse things that start and stop in different <wiki> sections. OK, a thing to remember when writing the parser.
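One way to allow tags to pair up across `<wiki>` sections is to keep a single stack of open tags while walking over all segments, and to copy already-parsed segments through untouched. The Python sketch below assumes that approach; the segment classes and `apply_rules` are hypothetical, and the style attributes are left out to keep it short.

```python
import re
from dataclasses import dataclass

@dataclass
class Wiki:
    text: str   # still to be parsed

@dataclass
class Final:
    text: str   # already parsed, copied through as-is

TAG = re.compile(r"\[\[(/?)(table|row|cell)\]\]")
HTML = {"table": "table", "row": "tr", "cell": "td"}

def apply_rules(segments) -> str:
    out, stack = [], []  # one stack shared by all Wiki segments
    for seg in segments:
        if isinstance(seg, Final):
            out.append(seg.text)
            continue
        pos = 0
        for m in TAG.finditer(seg.text):
            out.append(seg.text[pos:m.start()])
            pos = m.end()
            closing, name = m.group(1) == "/", m.group(2)
            if not closing:
                stack.append(name)
                out.append(f"<{HTML[name]}>")
            elif stack and stack[-1] == name:
                stack.pop()
                out.append(f"</{HTML[name]}>")
            else:
                out.append(m.group(0))  # unmatched closer stays literal
        out.append(seg.text[pos:])
    if stack:
        raise ValueError(f"unclosed tags: {stack}")
    return "".join(out)

segments = [
    Wiki("[[table]][[row]][[cell]]"),
    Final("Item 1"),
    Wiki("[[/cell]][[/row]][[row]][[cell]]"),
    Final("Item 2 [[/cell]]"),   # tag-like text in parsed content is ignored
    Wiki("[[/cell]][[/row]][[/table]]"),
]
table_html = apply_rules(segments)
print(table_html)
```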
Let's see another thing.
[[module ListPages author="@URL"]]
* %%title_linked%% by [/pagename/author/%%author%% %%author%%]
[[/module]]
This lists pages by everyone, or by the author passed in the URL. In ListPages we want to pass %%author%% as part of the link. Assume the author is "James Kanjo".
Currently it works like this after symbol replacement:
[[module ListPages author="@URL"]]
* %%title%% by [/pagename/author/James Kanjo James Kanjo]
[[/module]]
And then the link to /pagename/author/James is created with text "Kanjo James Kanjo". Which is not good.
Let's see how it could work in the proposed parser:
<wiki>
[[module ListPages author="@URL"]]
* %%title%% by [/pagename/author/%%author%% %%author%%]
[[/module]]
</wiki>
Then:
<module name="ListPages">
<param name="author"><wiki>@URL</wiki></param>
<body-as-wiki>
* %%title%% by [/pagename/author/%%author%% %%author%%]
</body-as-wiki>
</module>
(Indentation added only for clarity.)
Then let's go into ListPages module parsing pass for one particular page with %%author%% = James Kanjo and %%title%% = Great page.
First get <body-as-wiki> and convert it to <wiki>:
<wiki>
* %%title%% by [/pagename/author/%%author%% %%author%%]
</wiki>
Then apply rules:
<list-item>
<wiki>%%title%%</wiki> by
<link>
<to>/pagename/author/<wiki>%%author%%</wiki></to>
<text><wiki>%%author%%</wiki></text>
</link>
</list-item>
Symbol replacement:
<list-item>Great page by
<link>
<to>/pagename/author/James Kanjo</to>
<text>James Kanjo</text>
</link>
</list-item>
And the final (simple) conversion to HTML:
<li>
Great page by <a href="/pagename/author/James Kanjo">James Kanjo</a>
</li>
Great, isn't it?
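The difference between the two orders can be shown in a few lines of Python. This is a sketch under assumptions: `substitute`, `LINK`, `Link` and `render_link` are invented names, the link regex is my approximation of the link rule, and the href is left unencoded to match the output above (a real implementation would probably percent-encode the space).

```python
import re
from dataclasses import dataclass
from html import escape

def substitute(text: str, symbols: dict) -> str:
    """Single-pass replacement, so replaced values are never scanned again."""
    return re.sub(r"%%(\w+)%%", lambda m: symbols.get(m.group(1), m.group(0)), text)

LINK = re.compile(r"\[([^\s\[\]]+) ([^\[\]\n]+)\]")

@dataclass
class Link:
    to: str     # may still contain %%symbols%%
    text: str   # may still contain %%symbols%%

def render_link(link: Link, symbols: dict) -> str:
    href = substitute(link.to, symbols)
    text = substitute(link.text, symbols)
    return f'<a href="{escape(href)}">{escape(text)}</a>'

symbols = {"author": "James Kanjo"}
source = "[/pagename/author/%%author%% %%author%%]"

# Current order: replace symbols first, then parse the link.
naive = LINK.match(substitute(source, symbols))
print(naive.group(1), "|", naive.group(2))

# Proposed order: parse the link first, then replace symbols in each part.
m = LINK.match(source)
good = render_link(Link(to=m.group(1), text=m.group(2)), symbols)
print(good)
```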
Who wants to check if bug:16 will still be a bug in this scenario? :)
Summary
To cut a long story short, this kind of parser takes (more or less) the existing parsing rules, adds a notion of "what's parsed" and "what's to parse", and lets us decide in more detail the order in which the different "magic" things happen, such as include resolution and symbol replacement. Moreover, we can apply the parsing rules many times without fear of destroying anything, because the parsed parts are already marked as parsed.
Note that the examples above are heavily simplified to convey the idea; they are not proper design test cases. For example, the newline problem (one newline converts to <br/>, two convert to <p>…</p>) was completely ignored in this text.
Please read this carefully, try to understand and find any problem there could be with this parser, because the parser is now my favorite potential solution for bug:16, so we might decide to implement it.
Include with variables processing | By Gabrys | 4 Comments | 24 Mar 2010 13:53 |
Great examples — and the post was easy to read and understand. It seems like it might do the trick.
The best person to wait for is James. He created the chatroom and the calendar (with help, of course… but he knows how most of it works) — so he'd be the best person to tell you if one of the more complicated Wikidot apps is going to have any negative side-effects as a result of this change in how the parser works.
If this is going to allow nesting of modules as well I will be one very, very happy Wikidot user :D
Not sure how I can possibly recommend Wikidot to any more people than I already do... but I'll try. Every single group project I'm a part of, every semester, I encourage our group to sign up to a wikidot site and we collaborate on there. It's great! :)
~ Leiger - Wikidot Community Admin - Volunteer
Wikidot: Official Documentation | Wikidot Discord server | NEW: Wikiroo, backup tool (in development)
Yes, this one should be solved automatically.
It will definitely help making nesting modules, but not yet enable them :-). But one step closer to this is nice as well, isn't it?
Piotr Gabryjeluk
visit my blog
Definitely.
~ Leiger - Wikidot Community Admin - Volunteer
Great design. We could get rid of our current renderer/parser. Could the new parser also cover the Standard Template Properties design, or something similar?
Bartłomiej Bąkowski @ Wikidot Inc.
';.;' TeRq (Write PM)
Okay… I'm not sure that this will fix bug:16.
I suspect you would need to create a special XML tag for includes, so that the <wiki> would work within one.
For example:
Which would convert to:
It's hard to say when we don't know how include variables are processed. Is this how they're processed?
Hi James,
Indeed, I was thinking about an internal <include> tag for handling includes, just like you suggested (I left it out to keep things simple), which will cover include variables as well. (In fact we also need <template-symbol>, <listpages-symbol> and many more internal tags.)
Let me think for a moment and produce an explanation of how include with variables could work :-), I'll post it as a sub-thread or something similar ;-).
Piotr Gabryjeluk
Okay, here's what would happen:
Here's where the problem is about to happen:
I will post a “solution” in a separate post (which would solve the infamous bug:16)
With includes, and I have stated this before, we need to change the process order FROM THIS:
TO THIS:
HOWEVER: We need a way to cater for variables already processed, so that we don't allow “nested variables”.
SOLUTION: How about a <wiki-novars> tag. This will perform the same steps of the <wiki> tag, but it will not process detected variables.
Here's what would happen with the <wiki-novars> tag:
The final output is correct!!!
As I understand it, <wiki>…</wiki> is a parser-internal tag: the parser creates it while processing content, to mark what it needs to parse later.
Bartłomiej Bąkowski @ Wikidot Inc.
Sounds very promising. The process-thinking behind it is mostly over my head, but I'm looking forward to working with the product of all this analysis.
This seems like such a cool project — certainly involves a lot of dedication. What's happening these days? Any updates you can share? Thanks.