Archive for the ‘scaling’ Category

Half-Asynch / Half-Synch Processing Model Added to Thrift

Sunday, June 10th, 2007

Synchronous and asynchronous processing have different strengths and weaknesses. Asynchronous processing is often confusing to system developers, but it scales really well. Synchronous processing (i.e. multi-threaded) is easy to add to a traditional program but is often resource intensive.

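A good analogy for asynch vs synch programming is coding against a SAX XML parser vs a DOM parser… DOM is easy to code against but heavyweight, since it loads the whole document into memory. SAX is more complicated to code against, since you have to handle events as they stream past, but it’s way faster and less resource intensive.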
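To make the analogy concrete, here is a minimal sketch using Python’s standard library. The sample document and the function names (`count_items_dom`, `count_items_sax`, `ItemCounter`) are made up for illustration; both functions count `<item>` elements, one by building the whole tree first and one by reacting to events as they arrive.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document, just for illustration.
DOC = b"<feed><item>a</item><item>b</item><item>c</item></feed>"


def count_items_dom(data: bytes) -> int:
    """DOM style: parse the whole document into a tree, then query it."""
    tree = minidom.parseString(data)
    return len(tree.getElementsByTagName("item"))


class ItemCounter(xml.sax.ContentHandler):
    """SAX style: no tree is built, we only see events as they stream by."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        # Called once for every opening tag in the document.
        if name == "item":
            self.count += 1


def count_items_sax(data: bytes) -> int:
    handler = ItemCounter()
    xml.sax.parseString(data, handler)
    return handler.count
```

The DOM version is two lines of logic, while the SAX version needs a handler class that keeps its own state. That is roughly the same trade you make between threaded and event-driven servers.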

Web services need to support many simultaneous clients and also perform complicated backend processing.

A great backend component we’ve mentioned before is memcached. It uses asynchronous processing, so it can support tens of thousands of connections, and it is extremely fast because its actual work is just storing and retrieving data from an in-memory hash table.

But sometimes you need to handle processing-intensive requests like searching a large index or image processing… doing this with an asynchronous processing model would be tricky and ineffective. At the same time, reading and writing the request over a socket using a synchronous processing model (one thread per request) would be a waste (imagine 200 56k clients connecting to you at the same time, that means 200 threads). The best solution is to perform the network IO using asynchronous processing and the request processing using multiple threads (synchronous processing).
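Here is a rough sketch of that split, written in Python with `asyncio` and a thread pool rather than with Thrift’s C++ toolkit, so treat it as an illustration of the pattern and not as Thrift’s implementation. The names `process_request`, `offload`, `handle_client`, and `serve`, the line-based protocol, and the port number are all assumptions made up for the example.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def process_request(payload: bytes) -> bytes:
    """Stand-in for blocking, heavy work (index search, image processing)."""
    return payload.upper()


async def offload(pool: ThreadPoolExecutor, payload: bytes) -> bytes:
    """Synchronous half: run the blocking work on a pool thread."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(pool, process_request, payload)


async def handle_client(reader, writer, pool):
    """Asynchronous half: a slow client costs a coroutine, not a thread."""
    try:
        # One request per line; an empty read means the client hung up.
        while line := await reader.readline():
            writer.write(await offload(pool, line))
            await writer.drain()
    finally:
        writer.close()
        await writer.wait_closed()


async def serve(host="127.0.0.1", port=9090, workers=8):
    """Accept connections on the event loop, process on `workers` threads."""
    pool = ThreadPoolExecutor(max_workers=workers)

    async def on_connect(reader, writer):
        await handle_client(reader, writer, pool)

    server = await asyncio.start_server(on_connect, host, port)
    async with server:
        await server.serve_forever()
```

Running `asyncio.run(serve())` starts the server. The thread count is fixed by `workers` no matter how many sockets are open, which is the whole point. One caveat for this Python version: in CPython the threads only run in parallel when the work releases the GIL, so for pure-Python number crunching a `ProcessPoolExecutor` is the usual substitute.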

The ACE toolkit defined this design pattern years ago, and I’ve used it quite effectively in the past. However, ACE is a very heavyweight, C++-only library and its learning curve is pretty steep. This is why we are using Thrift instead, since it’s interoperable with many languages and contains a lightweight C++ toolkit.

We are building some pretty cool web services using Thrift, and we recently worked with Facebook to add Half-Synch/Half-Asynch support to its C++ toolkit. As a result we can now support thousands of long-lived connections while processing requests in a large thread pool. Best of both worlds!

Written by jake

is scaling easy? it can be.

Thursday, May 31st, 2007

The Ruby on Rails folk say scaling is easy, and they are correct, but there are many different components to scale. Scaling web servers horizontally using reverse proxy tools like Pound or Perlbal, and caching with memcached and Varnish, will get you pretty far, but real web applications need to scale their content and not just their service. Take Flickr for example: how do you scale millions of photos? Or Twitter: you can’t put all those tweets in a single database. That’s why the key to real scaling is picturing your service a year down the road: what decisions can you make now to make it easier to scale later on (if you are lucky enough to need to)?

The most surefire way to scale your content is to make it easier to federate or partition your data. That means following these simple rules:

1. Keep away from sequential primary keys in your database. Use UUIDs: they can be generated anywhere, with a negligible chance of collision. That makes it easier to move to a multi-master database model if you have to, or to split your data into partitions based on a hash of the UUID.

2. Don’t use stored procs (ever!). Thankfully most of us are used to not having stored procs in MySQL so this isn’t a big deal, but if you split your database into smaller pieces you can’t use a traditional stored proc to search across them all. Not to mention it’s bad to put business logic in your model layer.

3. Think about using specialized search tools, like Lucene, for searching across specific types of data. Related to the rule above: searching across your data is hard once you split it into pieces, but tools like Lucene make it easy to create small meta-indexes of your data, which can easily fit a lot more info than a big InnoDB table.

4. Don’t store binary data in your db. Unless you like pain, you should never store things like images in a database; just store the path to the file. I’ve found it easy to take the MD5 of the image and use that as the file name, since you can then partition your images evenly across many directories (and eventually disks). Or just use Amazon S3 :)

5. Finally, only store what you need. Scaling becomes much harder when you build a lot of complexity and normalization into your data model. Keep it simple, stupid. People don’t like complicated apps, believe me, I know :)
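Rules 1 and 4 are easy to sketch in a few lines of Python. This is only an illustration under made-up assumptions: the shard count, the `/data/images` root, the two-level directory layout, and the function names (`new_record_id`, `shard_for`, `image_path`) are all hypothetical, not something from a particular setup.

```python
import hashlib
import uuid
from pathlib import PurePosixPath

NUM_SHARDS = 16  # hypothetical number of database partitions


def new_record_id() -> uuid.UUID:
    """Rule 1: random UUID keys can be generated on any app server."""
    return uuid.uuid4()


def shard_for(record_id: uuid.UUID, num_shards: int = NUM_SHARDS) -> int:
    """Pick a partition by hashing the UUID (same ID, same shard, always)."""
    digest = hashlib.md5(record_id.bytes).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def image_path(image_bytes: bytes, root: str = "/data/images") -> PurePosixPath:
    """Rule 4: name the file after its MD5 and fan out by the leading hex digits."""
    name = hashlib.md5(image_bytes).hexdigest()
    # Two directory levels of two hex digits each gives 65,536 buckets.
    return PurePosixPath(root) / name[:2] / name[2:4] / name
```

Only the string returned by `image_path` would go in the database. One thing to keep in mind with a plain modulo like this: changing `NUM_SHARDS` later moves most keys to a different shard, so it is worth picking a generous number up front or looking into consistent hashing.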

There are some great talks about scaling data here.

Written by jake