Archive for the ‘facebook’ Category

Announcing: Thrudb – Document Oriented Database Services

November 4th, 2007

There has been a lot of talk recently about how traditional relational databases no longer fit the bill for web development. This is certainly a bit over the top, since every site I’ve ever built or seen built uses an RDBMS. But I think the point is that not a lot has changed in the world of data storage since the 70’s. SQL, DDL and referential integrity are ideas that all came before the onset of the web. Databases are really just big spreadsheets, but is that the best storage structure for web data?

A new breed of databases and data services has emerged in recent years to address this. The first product I came across was an XML database with XQuery, but this system was built to offer everything a regular database offers PLUS a bunch of new features like on the fly indexing of any field. The problem with this kind of approach is that it ends up complicating the API. Not to mention XML and performance don’t really fit together. I’m a big believer in simple/fast software components that can be put together to create powerful/fast systems. Google is the best known example of this. Their systems are built to be massively parallel, so much so that there was no way an RDBMS would work. Instead they first built the Google File System, which splits their data into 64MB chunks and spreads it across thousands of machines, making at least 3 copies of any chunk for redundancy. Then they use techniques like MapReduce to create indexes of these documents, split the indexes into shards and spread those across their network too. Finally they have services running on these machines that coordinate searches across the index shards, return the matching document ids and fetch those documents from the document store.
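To make the index-then-search flow above a bit more concrete, here is a minimal single-process sketch. It is not Google's implementation: the toy corpus, the shard count and every function name are assumptions made up for illustration, and the "map" and "reduce" steps are plain Python functions rather than a distributed job.

```python
from collections import defaultdict

# Hypothetical toy corpus: document id -> text.
DOCS = {
    1: "thrift makes services simple",
    2: "simple services scale",
    3: "databases scale slowly",
}
NUM_SHARDS = 2  # assumed shard count, chosen only for illustration


def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per distinct word in a document."""
    return [(term, doc_id) for term in set(text.split())]


def reduce_phase(pairs):
    """Group the emitted pairs into term -> sorted list of doc ids."""
    index = defaultdict(set)
    for term, doc_id in pairs:
        index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def build_shards(docs, num_shards):
    """Assign each document to a shard by id, then index each shard on its own."""
    shards = []
    for shard_no in range(num_shards):
        pairs = []
        for doc_id, text in docs.items():
            if doc_id % num_shards == shard_no:
                pairs.extend(map_phase(doc_id, text))
        shards.append(reduce_phase(pairs))
    return shards


def search(shards, term, docs):
    """Ask every shard for the term, merge the ids, then fetch the documents."""
    doc_ids = sorted(i for shard in shards for i in shard.get(term, []))
    return [(i, docs[i]) for i in doc_ids]


shards = build_shards(DOCS, NUM_SHARDS)
print(search(shards, "scale", DOCS))
```

In a real deployment each shard would live on a different machine and the search step would query them in parallel; here they are just entries in a list.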

They have also built a system called BigTable, a column-oriented database that splits a table into columns rather than rows, making it much simpler to distribute and parallelize.
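The difference between a row layout and a column layout can be shown in a few lines. This is only the general idea, not BigTable's actual storage format; the sample listings and the `to_columns` helper are hypothetical.

```python
# Row-oriented: each record is stored together.
rows = [
    {"id": 1, "title": "bike", "price": 50},
    {"id": 2, "title": "lamp", "price": 15},
    {"id": 3, "title": "desk", "price": 80},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list per column.

    Assumes every row has the same fields, so positions line up.
    """
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# Column-oriented: each field is stored together and could live on its own machine.
columns = to_columns(rows)

# A query that needs one field only touches that column.
total_price = sum(columns["price"])
print(total_price)
```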

So why are these systems any better than a relational database? Well for one thing they make it much easier to scale horizontally, meaning you can slap on another box to the network and increase your database capacity. This is exactly how webservers scale but anyone who has tried to scale their website will tell you it’s never as easy to scale your database as it is your webservers, since traditional databases are inherently monolithic.

Another benefit is that your data structures can be sparsely populated and linked across any number of facets in these systems. The story of del.icio.us or flickr trying to scale using tagging and mysql is a great read because it illustrates the problem you run into when using fixed schemas to hold dynamic/fluid data that wants to be searched, mashed-up, split up and grouped any which way.

Ok, so how do I as a developer address this… Isn’t it obvious? Build a solution from open source components!

I never would have attempted this if it weren’t for Facebook’s Thrift project. It provides much of what I needed to get this off the ground, specifically the ability to build services that can communicate with almost any language. They used it internally to build much of their infrastructure, like search and the Facebook platform itself. Thrift on the surface looks like a stripped down version of CORBA. You define structures and services in an IDL and use its code compiler to generate object definitions and a client/server interface. But Thrift offers so much more. Most importantly, it gives you the ability to transmit your objects over any protocol (binary, xml, json) as well as over any transport (tcp socket, http, file). Another big benefit of Thrift is that you can adjust your structure definitions over time while keeping backwards compatibility with your previous definitions. BINGO. This is a big deal because one of the big reasons I keep using databases like mysql is so I can adjust my schema as I find bottlenecks or bugs. In fact Google has built a very similar system to Thrift, which is how they store data on GFS, using compressed serialized objects they call protocol buffers.
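The backwards compatibility point is easier to see with a toy example. The sketch below is not Thrift's wire format or API; it only illustrates the underlying idea of tagging each field with a stable numeric id, so that a reader skips ids it does not know and fills in defaults for ids that are missing. The two schema versions and the `encode`/`decode` helpers are made up for illustration, and JSON stands in for a real protocol.

```python
import copy
import json

# Hypothetical schema versions: field id -> (field name, default value).
SCHEMA_V1 = {1: ("title", ""), 2: ("price", 0)}
SCHEMA_V2 = {1: ("title", ""), 2: ("price", 0), 3: ("tags", [])}


def encode(record, schema):
    """Serialize a record keyed by field id instead of field name."""
    ids_by_name = {name: field_id for field_id, (name, _) in schema.items()}
    return json.dumps({str(ids_by_name[name]): value for name, value in record.items()})


def decode(payload, schema):
    """Read a payload with the given schema, tolerating other versions."""
    # Start from defaults so fields missing from old data are still present.
    record = {name: copy.deepcopy(default) for name, default in schema.values()}
    for field_id_text, value in json.loads(payload).items():
        field_id = int(field_id_text)
        if field_id in schema:  # unknown ids from newer writers are skipped
            record[schema[field_id][0]] = value
    return record


old_payload = encode({"title": "bike", "price": 50}, SCHEMA_V1)
new_payload = encode({"title": "lamp", "price": 15, "tags": ["home"]}, SCHEMA_V2)

print(decode(old_payload, SCHEMA_V2))  # new reader, old data
print(decode(new_payload, SCHEMA_V1))  # old reader, new data
```

The key design choice is that field ids never change meaning once assigned, which is what lets stored data outlive the definition it was written with.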

Ok, so I had a development platform, Thrift, now just add a few months of late night coding and a little Memcached, Spread, CLucene and Brackup and I ended up with…

Thrudb is a set of simple services built on top of Facebook’s Thrift framework that provides indexing and document storage services for building and scaling websites. Its purpose is to offer web developers flexible, fast and easy-to-use services which can enhance or replace traditional data storage and access layers.

Thrudb Features:

  • Client libraries for most languages
  • Multi-master replication
  • Incremental backups and redo logging
  • Multiple storage backends (S3 included)
  • Built for horizontal scalability
  • Simple and powerful search api (Lucene)

Thrudb solves a lot of problems for me. Biggest of all is, now with Thrudb, I can use Amazon EC2 as a stable server farm since my backend database writes directly to S3. In fact, I’ve successfully moved Junkdepot from a traditional hosting facility using a mysql database to multiple EC2 instances using Thrudb in a week.

check it out: http://thrudb.googlecode.com

I’m not saying Thrudb is complete and production ready, but I do think it’s pretty reliable and simple to try out. I’m hoping you the reader can help make it better with your testing, coding and insight…

jake database, ec2, facebook, thrift, thrudb, web

Thrift now available in Perl

July 30th, 2007

Facebook’s thrift dev team has been pretty busy recently with the facebook platform and all. But I guess things are now under control, because they just announced the Thrift SVN repository with a number of patches applied to the latest revision, including my Perl implementation. We’ve been using thrift for a number of months now and I’m really happy with its performance and service oriented approach… Thrift is really going to change web development for the better. Soon after the release of our next project, we will be releasing a number of powerful services built on Thrift.

Keep watching…

jake facebook, thrift

Junkdepot on Facebook

July 9th, 2007

This past weekend we integrated the facebook platform into junkdepot, or is it the other way around? Overall it was quite simple to do. The ability to add a huge social network into your app was too tempting to ignore! When you add the app into your profile it will place your recent listings into your profile and when you use the app from facebook it shows our google mashup with options to filter the map based on your friends, groups, affiliations, or everyone together.

Facebook has a great classifieds service but doesn’t let people from the outside view the listings. When you buy or sell something, your reach is only as wide as your social network. With Junkdepot you can reach everyone. We submit your listings to the big classified services like: google base, hi5, vast and edgeio (not to mention your profile). You can choose to search listings from your friends, classmates, groups or everyone else in our interactive junk map.

Don’t worry, we use Rapleaf reputations to keep things ethical and of course it’s free as in beer. Try it out!

jake facebook, junkdepot, platform

Half-Asynch / Half-Synch Processing Model Added to Thrift

June 10th, 2007

Synchronous processing and asynchronous processing have different strengths and weaknesses. Asynchronous processing is often confusing to system developers, but it scales really well. Synchronous processing (i.e. multi-threaded) is easy to add into a traditional program but is often resource intensive.

A good analogy of Asynch vs Synch programming is writing a SAX XML parser vs a DOM parser… DOM is easy to code but heavyweight. SAX is more complicated to code but way faster and less resource intensive.

When it comes to web services they need to be able to support many simultaneous clients and perform complicated backend processing.

A great backend component we’ve mentioned before is memcached. It uses asynchronous processing so it can support tens of thousands of connections, and it is extremely fast because its actual work is to store and retrieve data from an in-memory hash table.

But sometimes you need to perform processing intensive requests like searching a large index or image processing… Doing this with an asynchronous processing model would be tricky and ineffective. At the same time, reading and writing the request over a socket using a synchronous processing model (one thread per request) would be a waste (imagine 200 56k clients connecting to you at the same time, that means 200 threads). The best solution is to perform the network IO using asynchronous processing and the request processing using multiple threads (synchronous processing).
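Here is a small sketch of that split, written with Python's standard `asyncio` and `concurrent.futures` modules rather than Thrift's c++ toolkit. The event loop owns all socket reads and writes, and each request body is handed to a thread pool. The `expensive_request` function, the pool size and the line-based protocol are assumptions chosen to keep the example short.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def expensive_request(payload: str) -> str:
    """Stand-in for blocking or CPU-heavy work such as an index search."""
    return payload.upper()


async def handle_client(reader, writer, pool):
    """Async half: socket IO happens on the event loop, not on a thread."""
    loop = asyncio.get_running_loop()
    line = await reader.readline()
    # Sync half: run the request on a worker thread and await its result.
    result = await loop.run_in_executor(pool, expensive_request, line.decode().strip())
    writer.write((result + "\n").encode())
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def ask(port: int, message: str) -> str:
    """Tiny test client: send one line, read one line back."""
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write((message + "\n").encode())
    await writer.drain()
    reply = await reader.readline()
    writer.close()
    await writer.wait_closed()
    return reply.decode().strip()


async def main():
    with ThreadPoolExecutor(max_workers=4) as pool:  # assumed pool size

        async def on_connect(reader, writer):
            await handle_client(reader, writer, pool)

        # Port 0 lets the operating system pick a free local port.
        server = await asyncio.start_server(on_connect, "127.0.0.1", 0)
        port = server.sockets[0].getsockname()[1]
        async with server:
            return await asyncio.gather(*(ask(port, f"req{i}") for i in range(5)))


replies = asyncio.run(main())
print(replies)
```

A slow client only occupies a cheap connection on the event loop while its bytes trickle in; a worker thread is used just for the time the request itself takes.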

The ACE toolkit defined this design pattern years ago and I’ve used it quite effectively in the past. However, ACE is a very heavyweight, C++-only library and its learning curve is pretty steep. This is why we are using thrift instead, since it’s interoperable with many languages and contains a lightweight c++ toolkit.

We are building some pretty cool web services using Thrift, and we recently worked with facebook to add Half-Synch/Half-Asynch support to its c++ toolkit. As a result we can now support thousands of long lived connections while processing requests in a large thread pool. Best of both worlds!

jake c++, coding, facebook, programming, scaling, thrift, web service

Facebook’s Thrifty

May 27th, 2007

When I first read about the facebook platform last week I have to admit I wasn’t all that excited. Sure it’s great to see facebook open its doors to 3rd parties, but I’m not itching to code the next social slideshow widget. What’s got me all fired up, however, is something else facebook launched a couple of months ago which largely went unnoticed, called Thrift. It’s what they built their platform with.

Thrift is essentially a framework for building web services that can be accessed by most languages. It’s similar to CORBA in that you define an interface file and thrift will generate stubs implementing that interface in any of its supported languages.

  • C++
  • PHP
  • Python
  • Java
  • Ruby
  • And recently Perl (thanks to us)

Once you have these stubs you can build the service backend in the above language of your choice and access it from any of the other languages.

This brings us to rule #1 of programming. Knowing when to use the right tool for the right job.

You would (hopefully) never write a web frontend layer in c++ when php was built specifically for that purpose. You would also (hopefully) never build a search engine backend in php since it will be difficult to maximize performance while minimizing memory and cpu time.

So with thrift you can build the search engine in c++ and access it via php (which is what facebook does).

The default transport mechanism is a compact binary format but it’s fully extensible to any format (xml, json).

But that’s just the beginning. You can do some really cool things with Thrift, like log all messages to file and play them back (instant redo logs), or version your data structures so you can still keep backwards compatibility with older stored data.
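The redo log idea can be sketched without Thrift at all. In the toy below every write message is appended to a file before it is applied, so the same file can later be replayed to rebuild the state. The `LoggedStore` class, its message shape and the JSON-lines log format are all hypothetical; Thrift itself would write its own serialized messages to a file transport.

```python
import json
import os
import tempfile


class LoggedStore:
    """A key/value store that logs every write message before applying it."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}

    def apply(self, message):
        """Apply one message to the in-memory state."""
        if message["op"] == "put":
            self.data[message["key"]] = message["value"]
        elif message["op"] == "delete":
            self.data.pop(message["key"], None)

    def send(self, message):
        """Append the message to the log file, then apply it."""
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(message) + "\n")
        self.apply(message)

    @classmethod
    def replay(cls, log_path):
        """Rebuild a store by playing the logged messages back in order."""
        store = cls(log_path)
        with open(log_path, encoding="utf-8") as log:
            for line in log:
                store.apply(json.loads(line))
        return store


log_path = os.path.join(tempfile.mkdtemp(), "messages.log")
live = LoggedStore(log_path)
live.send({"op": "put", "key": "a", "value": 1})
live.send({"op": "put", "key": "b", "value": 2})
live.send({"op": "delete", "key": "a"})

recovered = LoggedStore.replay(log_path)
print(recovered.data)
```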

Thrift comes with some top notch c++ code to quickly build scalable backend c++ services. We are using thrift today as the search backend for junkdepot.com.

I hope to start a series of articles on how to use Thrift as an alternative to standard LAMP. I really think thrift will become the backbone of next generation web services. We will also be releasing some of the services we’ve built with thrift as open source projects soon.

Feel free to ask questions.

-Jake

jake TR Site, coding, facebook, thrift