From the server’s perspective, client-side JavaScript is always faster, but because of the bandwidth cost (often several dozen kilobytes of overhead per user) we want to think about server-side parsing. The fact is, CPU time is relatively cheap.

So let’s get a picture of what we are looking at. According to “Wikimedia in figures”, Wikipedia processes more than 2,500 page views per second, and receives an average of three edits per second.

That’s pretty cool, so how fast do we need to be? I’d like to make a confession here. I actually tried to make a Creole parser in C++ as a PHP extension before benchmarking PHP’s regular expressions. I even “unrolled the loop,” so to speak, but guess what? Unbelievably, the C++ extension was slower than the PHP version! Over 100,000 runs, the C++ parser executed in about 0.6 seconds, while the PHP parser executed in about 0.2 seconds; that’s three times faster. This surprised and shocked me. After a few days’ thought I discovered a few ways I could optimize the C++ code, but I never achieved a significant performance breakthrough. I might at some point try the regex implementation described in “Regular Expression Matching Can Be Simple And Fast”, but on the surface I must admit I am a little confused by it.
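To give a flavor of the numbers above, here is a minimal sketch of the kind of micro-benchmark I mean. creole_to_html() here is a toy stand-in with a few illustrative Creole rules, not the actual parser I tested, and the sample markup and run count are arbitrary:

<?php
// A toy Creole-to-HTML converter: one preg_replace pass over a rule table.
// This is an illustrative stand-in, not a complete Creole implementation.
function creole_to_html(string $markup): string {
    $rules = [
        '/\*\*(.+?)\*\*/s' => '<strong>$1</strong>', // **bold**
        '/\/\/(.+?)\/\//s' => '<em>$1</em>',         // //italic//
        '/^= (.+)$/m'      => '<h1>$1</h1>',         // = heading
    ];
    return preg_replace(array_keys($rules), array_values($rules), $markup);
}

$sample = "= Title\nSome **bold** and //italic// text.\n";
$runs   = 100000;

$start = microtime(true);
for ($i = 0; $i < $runs; $i++) {
    creole_to_html($sample);
}
printf("%d runs in %.3f seconds\n", $runs, microtime(true) - $start);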

I’d also like to point out that it took me about eight hours of brain-ripping study to figure out how to write a C++ extension. The instructions are like a game: some of what you read works and some doesn’t, there are bugs, and you need to put it all in a box and shake it up and… and… I caught myself wondering if it was all really worth the trouble.

The short answer is: no, it isn’t. If we can parse 10,000 to 100,000 pages per second server side using PHP regexes on a tiny DigitalOcean starter droplet, then against Wikipedia’s 2,500 page views per second we could theoretically handle its rendering on a pageview-by-pageview basis. So, on the surface, no, we do not need anything faster than PHP regex for parsing wiki markup.

But what if we wanted to be little punks and make a few optimizations? Certainly a faster regex library written in C (see “Simple and Fast” above) would tuck us in nicely. But what if we wanted to think outside the box and do something that didn’t require brain surgery on the internals of the Zend PHP engine?

First, let me ask you: how much faster is doing something three times per second instead of 2,500 times per second? Why, it’s over 800 times faster, of course! So why don’t we just render to HTML on edit? It’s absurdly simple, isn’t it? If a page has dynamic content or includes, this can be flagged in its metadata, but most pages don’t require it. Even if we assume half of all pages contain dynamic content, we’re still looking at a doubling of throughput, a 100% increase in speed.
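Here is a minimal sketch of what the write path of render-on-edit could look like, assuming a hypothetical pages table with markup, html and is_dynamic columns (the schema and function names are mine for illustration, not from any real wiki engine):

<?php
// Render on edit: parse once at save time, store markup and HTML together.
// $pdo is an open PDO connection; creole_to_html() is the parser from the
// benchmark sketch above. Assumed schema: pages(id, markup, html, is_dynamic).
function save_page(PDO $pdo, int $id, string $markup, bool $isDynamic): void {
    // Dynamic pages get rendered per view, so we skip the stored HTML.
    $html = $isDynamic ? '' : creole_to_html($markup);
    $stmt = $pdo->prepare(
        'UPDATE pages SET markup = ?, html = ?, is_dynamic = ? WHERE id = ?'
    );
    $stmt->execute([$markup, $html, (int) $isDynamic, $id]);
}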

Another trick is piping the rendering: material is written into a pipe, say to a file, and a separate program detects this and processes the file for rendering. This is mainly to keep the server from blowing up when the number of renders per second exceeds what the machine can handle, but with a trick like rendering on edit, who needs it? Another way to look at this is caching: the user is served an existing HTML file from a cache directory somewhere, which is even faster than a normal render, and the page gets pulled and re-rendered once the existing copy is too old (and has been deleted). These are all ways to increase the performance of your pages, but again, nothing is as fast as rendering on edit. Just save the markup and HTML into the same table and use one or the other depending on a “dynamic page content” flag.
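And the matching read path, continuing the same hypothetical schema: serve the stored HTML directly, and only run the parser when the page is flagged dynamic.

<?php
// On view: one cheap SELECT; the parser only runs for dynamic pages.
function view_page(PDO $pdo, int $id): string {
    $stmt = $pdo->prepare('SELECT markup, html, is_dynamic FROM pages WHERE id = ?');
    $stmt->execute([$id]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC);
    return $row['is_dynamic'] ? creole_to_html($row['markup']) : $row['html'];
}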

One final thought. Since markup is usually much more compact than HTML, and since you can store headers, boilerplate and so forth in JavaScript, having a site rendered entirely in JavaScript would offset the large initial bandwidth drain over time. Say your boilerplate is 300 bytes per page and each user visits 10 pages per visit: even if your JavaScript weighed in at 300 KB, after about a hundred visits it would have paid for itself. This is in addition to being able to send markup to the browser, which is usually much more compact than HTML. <strong> vs. “**”, anyone?
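Spelling the arithmetic out: 300 bytes of boilerplate × 10 pages = 3 KB saved per visit, and 300 KB ÷ 3 KB per visit = 100 visits to break even, before even counting the markup-versus-HTML savings.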

So what’s faster? PHP is extremely fast, but you should benchmark it. And when it’s not fast, the error is usually in the algorithm or the assumptions, not in PHP itself.

If you find yourself needing to write a C/C++ extension for speed, first try benchmarking and reworking the assumptions in your algorithms. If that doesn’t work, then try JavaScript. Only use C/C++ if you need absolute code security (and frankly, a wiki’s rendering engine is not its primary asset, or it probably isn’t a popular wiki).

By Serena
