As you should know by now, we have been working hard on speeding up noblogs lately, focusing in particular on page load time for our users. As I described in a preceding post, we created a sharded setup for noblogs, which allowed us to radically reduce page generation times. We decided the next step would be to implement a noblogs-wide CDN. In this post I’ll explain what a CDN is, why we decided to build our own, and how we technically did it.
What is a CDN, and why is it a good thing
Page generation is only part of the load time, though: each request for a noblogs page also requires your browser to download several additional files, including CSS stylesheets and JavaScript source files. Those files are the same for every blog on noblogs, but if you visit two different blogs your browser will first download
http://foo.noblogs.org/wp-includes/prototype.js
and then
http://bar.noblogs.org/wp-includes/prototype.js
The same file is downloaded twice because the URL is different, so the browser cache cannot be used.
If the URL were the same for both blogs, e.g.
http://some-domain.tld/wp-includes/prototype.js
the second time your browser would be able to use its cached data to avoid transferring the file content, or even to avoid performing the request at all. A detailed explanation of how the HTTP cache works is well beyond the scope of this post, and I think the best way to understand it is to read the fucking manual. In any case, this reduces the page load time and the number of bytes exchanged between the server and the client. Domains like this, used for hosting static files that are commonly used by web applications, are becoming more and more widespread: these are the so-called Content Delivery Networks (CDNs). All the big internet players like Google, Akamai, Facebook, AOL and so on have their own CDN for commonly used files, and they encourage everybody to use them. But this is not an option if you want to guarantee the privacy of your users.
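To make the caching behaviour described above a bit more concrete, here is a minimal sketch of a conditional request done by hand with curl. The function names extract_etag and revalidate are made up for this example, and it assumes the server sends an ETag header at all; a real browser does considerably more than this.

```shell
# Read raw HTTP response headers on stdin and print the ETag value, if any.
extract_etag() {
  tr -d '\r' | awk 'tolower($0) ~ /^etag:/ { sub(/^[^:]*:[ \t]*/, ""); print; exit }'
}

# Fetch the headers for a URL, then ask again with If-None-Match,
# the way a browser revalidates a cached copy.
# Prints the HTTP status code of the second request.
revalidate() {
  url=$1
  etag=$(curl -sI "$url" | extract_etag)
  [ -n "$etag" ] || { echo "no ETag returned for $url" >&2; return 1; }
  curl -s -o /dev/null -w '%{http_code}\n' -H "If-None-Match: $etag" "$url"
}
```

Calling revalidate on the URL of a static file should print 304 when the server honours the validator, meaning the file body was not transferred again. A cache entry is keyed on the full URL, which is why the foo and bar copies of prototype.js above can never share one.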
Why public CDNs are evil
If every web page you visit today uses the same CDN, the owner of that CDN can potentially learn your entire browsing history. With each request, your browser sends a whole lot of information that may allow the owner of the CDN to identify you, together with the HTTP Referer header, which shows the URL of the page that originated the request. Thus we may say, without exaggeration, that CDNs are huge tracking systems, disguised as a convenient way to speed up page load times and to reduce bandwidth usage. This is, by the way, the same thing that the “Facebook Like” and “Google +1” buttons do. Some self-promotion: if you want to know more about web tracking technologies, just read my slides for the 2010 Hackmeeting.
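To see how little effort the tracking described above takes, here is a rough sketch of what anyone holding a CDN's access logs could do. It assumes the logs are in the common "combined" format (client address first, Referer as the second quoted field); the function name pages_by_client is invented for this example.

```shell
# Read access-log lines in "combined" format on stdin and print one
# "client-address referring-page" pair per line, without duplicates.
# Requests with no Referer (logged as "-") are skipped.
pages_by_client() {
  awk -F'"' '{ split($1, a, " "); if ($4 != "" && $4 != "-") print a[1], $4 }' | sort -u
}
```

Fed with the log of a server hosting prototype.js, this lists for every client address the pages that pulled the file in, for instance both http://foo.noblogs.org/ and http://bar.noblogs.org/ for a reader who visited the two blogs.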
Setting up the A/I CDN
We registered a dedicated domain for our CDN, ai-cdn.net, so that no request to the CDN will contain cookies from the various A/I subsites. This is useful for several reasons, including reducing the number of bytes transferred between your browser and the CDN. The setup of the CDN is very simple and basically deals with the HTTP cache headers. One thing you must do if you want to distribute your CDN across multiple servers is make sure that the cache headers sent by all servers are the same. In particular, you want the Last-Modified and ETag headers to be the same across your whole cluster; since these two headers typically depend on the last modification time of the file on the filesystem, you should touch(1) the files to the same timestamp on all servers. Which timestamp to choose? Well, a sensible choice is the time of the last commit that changed each file; if you are using git, a single bash one-liner will do the job, something along the lines of:
# Set each tracked file's mtime to the date of the last commit that changed it.
cd "${ROOT}/www" && git ls-tree -r --name-only HEAD \
| xargs -I @ sh -c \
'touch -m -d "$(git log --pretty=format:"%ci" -n1 -- "@")" "@"'
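After touching the files it is worth checking that every server really sends the same validators. The following is only a sketch of how one might do that: the function names validators and check_consistency are invented here, and it assumes each server can be reached directly by hostname over plain HTTP.

```shell
# Read raw HTTP response headers on stdin and print only the ETag and
# Last-Modified lines, with the header name lower-cased, sorted by name.
validators() {
  tr -d '\r' | awk -F: '
    { name = tolower($1) }
    name == "etag" || name == "last-modified" {
      sub(/^[^:]*:[ \t]*/, "")
      print name ": " $0
    }' | sort
}

# Usage: check_consistency /path/to/file host1 host2 ...
# Returns non-zero as soon as a host sends validators that differ from
# those of the first host.
check_consistency() {
  path=$1; shift
  ref=""
  for host in "$@"; do
    cur=$(curl -sI "http://$host$path" | validators)
    if [ -z "$ref" ]; then
      ref=$cur
    elif [ "$cur" != "$ref" ]; then
      echo "validators differ on $host" >&2
      return 1
    fi
  done
}
```

Comparing the headers rather than the files' mtimes checks what the browser actually sees, which also catches servers that compute the ETag from something other than the modification time.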
Once we had done this, noblogs was ready to use its CDN via the OSSDL CDN Off Linker plugin. A few tweaks here and there (mainly to the frontend webserver configuration, such as setting the Expires header to sensible values) and all blogs hosted here are now faster to load and consume less of your bandwidth.
Cool.