Archive for the ‘thrift’ Category

Thrift moving to Apache

Thursday, January 24th, 2008

Facebook is looking to move Thrift to the Apache Incubator. They have submitted a proposal to Apache and are awaiting approval. This is pretty much a shoo-in, since Thrift really has no dependencies besides Boost. Plus the Hadoop team is very keen on integrating Thrift into HBase and probably Hadoop as a whole…

Looks like I’m on their list of initial committers, which would be a nice plus, as I will continue contributing features and target languages as Thrudb develops.

Written by jake

Video: Thrift Technical Discussion

Monday, January 21st, 2008

Mark Slee and David Reiss from Facebook explain the design and implementation details of the Thrift project.

One interesting point of this talk… Mark mentions a soon-to-be open sourced service called “Scribe” that is based on their news feed architecture and performs what sounds like a distributed work queue for processing Thrift message logs… sweet.


This talk was given at Seneca College on Oct 25, 2007 as part of the Free Software & Open Source Symposium.


Announcing Thruqueue: Persistent message queue for Thrudb

Friday, December 28th, 2007

I’ve just checked in a new Thrudb service that I’ve been working on for the past few days called Thruqueue. I’m sure you can guess by the name that it’s yet another message queue service. But this one has some great features that I think make it stand out.

No hard limits - Create as many queues as you like, send messages as large as you like, send as many messages as you like.

Persistent queues - Under the hood Thruqueue exploits Thrift’s powerful redo logging capabilities, so queues are really managed logs, one log per queue. At specified intervals the logs are pruned to reclaim disk space. Because the queues live on disk, the memory profile of Thruqueue stays small: only a few items from each queue live in memory at any given time.

Unique Queues - I’ve also added the ability to create unique queues which essentially means no duplicate messages can exist in the queue at once.

Fast! - I’ve done almost no performance optimization, but my initial tests look very promising: in 1 second I can write then read ~1200 small messages.

Thrift - Want a client in your favorite language? Just run: thrift -favlanguage Thruqueue.thrift
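To make the "queues are really managed logs" idea concrete, here is a minimal sketch in Python. It is not Thruqueue's code and does not use Thrift's logging: the `LogQueue` class, its parameters, and the one-message-per-line file layout are all assumptions made for illustration. It assumes messages contain no newlines, and it keeps the read offset and the uniqueness set in memory only, so a real service would also have to persist those.

```python
import os
import tempfile


class LogQueue:
    """A queue persisted as an append-only log (hypothetical sketch, not Thruqueue's code)."""

    def __init__(self, path, unique=False, prune_every=100):
        self.path = path
        self.unique = unique            # reject messages already waiting in the queue
        self.prune_every = prune_every  # rewrite the log after this many dequeues
        self.read_offset = 0            # byte offset of the next unread message
        self.dequeued = 0
        self.pending = set()            # only filled when unique=True
        open(path, "ab").close()        # create the log if it does not exist

    def enqueue(self, message: str) -> bool:
        """Append one message to the log; returns False for a rejected duplicate."""
        if self.unique:
            if message in self.pending:
                return False
            self.pending.add(message)
        with open(self.path, "ab") as log:
            log.write(message.encode("utf-8") + b"\n")
        return True

    def dequeue(self):
        """Read the next message from disk, or return None if the queue is empty."""
        with open(self.path, "rb") as log:
            log.seek(self.read_offset)
            line = log.readline()
            if not line:
                return None
            self.read_offset = log.tell()
        message = line.rstrip(b"\n").decode("utf-8")
        self.pending.discard(message)
        self.dequeued += 1
        if self.dequeued % self.prune_every == 0:
            self._prune()
        return message

    def _prune(self):
        """Drop consumed messages from the log to reclaim disk space."""
        with open(self.path, "rb") as log:
            log.seek(self.read_offset)
            remaining = log.read()
        tmp = self.path + ".tmp"
        with open(tmp, "wb") as out:
            out.write(remaining)
        os.replace(tmp, self.path)      # atomically swap in the shorter log
        self.read_offset = 0


if __name__ == "__main__":
    queue = LogQueue(os.path.join(tempfile.mkdtemp(), "jobs.log"), unique=True)
    queue.enqueue("resize photo 17")
    queue.enqueue("resize photo 17")    # rejected: already waiting in the queue
    print(queue.dequeue())
```

Nothing but an offset and (for unique queues) a set of pending messages is held in memory; the messages themselves stay in the log until they are read.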

What’s missing:

Replication - Do you really need this? I could hook this puppy up to Spread but I’m not sure I see the benefit.

Redundancy - Throxy? TBD

I know that there are certainly a lot of message queues out there, but all of them are either non-persistent, cost money, or require an underlying RDBMS. Let me know what you think.


Thrudb Tutorials

Tuesday, December 11th, 2007

I’ve been adding a Thrudb tutorial to show how to use Thrudb in a number of popular languages (thanks to Thrift).

The bookmarks tutorial loads an export file from del.icio.us, searches it and removes it.

This should be enough to get you going with thrudb and thrift.

Implemented in Ruby, PHP, Perl and Java.

Enjoy!


Working with thrift structures

Friday, November 23rd, 2007

While developing Thrudb I’ve been using Thrift a lot and have a couple of tricks to share.

The first trick is how to do simple reflection. Thrift lets you do things like serialize a structure to a binary string and store it on disk. The problem is that Thrift doesn’t store the structure’s definition along with it, since this information would bloat the message and frankly goes against the design of Thrift, which allows loose structure definitions (see section 4 of the Thrift whitepaper).

To get around this we need to encode the type of structure we have as a field in the struct itself.

Let’s start with an example: say I want to store a mixed list of Email and RSS articles in a file for backup purposes, or better yet in Thrudb.
Here’s our Thrift definition file:

# this is a thrift definition

enum ObjectType {
    UNKNOWN     = 0,
    EMAIL       = 1,
    RSS_ARTICLE = 2
}

struct SimpleObject {
    100:ObjectType type = UNKNOWN
}

struct Email {
    1:string subject,
    2:string to_address,
    3:string from_address,
    4:i32    date,
    5:string body,
    100:ObjectType type = EMAIL
}

struct RssArticle {
    1:string uri,
    2:string title,
    3:string body,
    4:i32    date,
    100:ObjectType type = RSS_ARTICLE
}

So what we did here is set field 100 to be the struct type, then assigned it a default enumeration key from the list of possible types, so when a struct is instantiated its type is automatically set. This information is included when a struct is serialized to disk, so when we read the message back we can use our dummy struct “SimpleObject” to check its type. The SimpleObject struct will ignore all the other fields in the message, only loading field 100 (the enum key). Now we know which structure to allocate.

Here’s a pseudo example of this in action:

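    // Pseudocode: each constructor stands in for "deserialize into this struct".
    function load_object($serialized_object) {

        // Read only field 100 (the type tag); every other field is skipped.
        $type_obj = new SimpleObject( $serialized_object );

        switch ($type_obj->type) {

            case EMAIL:
                return new Email( $serialized_object );

            case RSS_ARTICLE:
                return new RssArticle( $serialized_object );

            default:
                print "Unknown type!";
                return null;
        }
    }

    $object = load_object( get_random_serialized_object() );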
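The same trick can be tried end to end without a Thrift install. The sketch below is Python and uses a toy wire format (field id, length, value) that I made up for illustration; it is NOT Thrift's real binary encoding, and `encode`, `decode`, `peek_type` and `load` are hypothetical helpers. What it does share with Thrift is the important property: a reader can skip any field it does not know about, so a reader that only knows field 100 can still pull the type tag out of any struct.

```python
import struct

UNKNOWN, EMAIL, RSS_ARTICLE = 0, 1, 2
TYPE_FIELD = 100  # same field id as in the Thrift definition above


def encode(fields: dict) -> bytes:
    """Toy wire format: (field id, length, value) records. Not Thrift's real encoding."""
    out = b""
    for field_id, value in fields.items():
        out += struct.pack(">HI", field_id, len(value)) + value
    return out


def decode(data: bytes, wanted: set) -> dict:
    """Keep only the fields listed in `wanted`, skipping every other record."""
    found, pos = {}, 0
    while pos < len(data):
        field_id, length = struct.unpack_from(">HI", data, pos)
        pos += 6  # 2-byte field id + 4-byte length
        if field_id in wanted:
            found[field_id] = data[pos:pos + length]
        pos += length
    return found


def peek_type(data: bytes) -> int:
    """Play the role of SimpleObject: read field 100 and nothing else."""
    tag = decode(data, {TYPE_FIELD}).get(TYPE_FIELD)
    return tag[0] if tag else UNKNOWN


def load(data: bytes):
    """Dispatch on the type tag, then decode with the matching field list."""
    kind = peek_type(data)
    if kind == EMAIL:
        return "Email", decode(data, {1, 2, 3, 4, 5, TYPE_FIELD})
    if kind == RSS_ARTICLE:
        return "RssArticle", decode(data, {1, 2, 3, 4, TYPE_FIELD})
    raise ValueError("Unknown type!")


if __name__ == "__main__":
    message = encode({1: b"hello", 5: b"see you at 5", TYPE_FIELD: bytes([EMAIL])})
    print(load(message))
```

A message written without field 100 simply comes back as UNKNOWN, which mirrors the default in the SimpleObject struct.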

Announcing: Thrudb - Document Oriented Database Services

Sunday, November 4th, 2007

There has been a lot of talk recently about how traditional relational databases no longer fit the bill for web development. This is certainly a bit over the top, since every site I’ve ever built or seen built uses an RDBMS. But I think the point is that not a lot has changed in the world of data storage since the ’70s. SQL, DDL and referential integrity are ideas that all came before the onset of the web. Databases are just big spreadsheets really, but is that the best storage structure for web data?

A new breed of databases and data services has emerged in recent years to address this. The first product I came across was an XMLDB with XQuery, but this system was built to offer everything a regular database offers PLUS a bunch of new features like on-the-fly indexing of any field. The problem with this kind of approach is that it ends up complicating the API. Not to mention XML and performance don’t really fit together.

I’m a big believer in simple/fast software components that can be put together to create powerful/fast systems. Google is the best known example of this. Their systems are built to be massively parallel, so much so that there was no way an RDBMS would work. Instead they first built the Google File System, which splits their data into 64MB chunks and spreads them across thousands of machines, making at least 3 copies of every chunk for redundancy. Then they use techniques like MapReduce to create indexes of these documents, split them into index shards and spread those across their network too. Finally they have services running on these machines that coordinate searches across the index shards, return the document ids and fetch the documents from the document store.

They have also built a system called BigTable, which is a Column Oriented Database that splits a table into columns rather than rows, making it much simpler to distribute and parallelize.

So why are these systems any better than a relational database? Well for one thing they make it much easier to scale horizontally, meaning you can slap on another box to the network and increase your database capacity. This is exactly how webservers scale but anyone who has tried to scale their website will tell you it’s never as easy to scale your database as it is your webservers, since traditional databases are inherently monolithic.

Another benefit is that your data structures can be sparsely populated and linked across any number of facets in these systems. The story of del.icio.us or flickr trying to scale using tagging and mysql is a great read because it illustrates the problem you run into when using fixed schemas to hold dynamic/fluid data that wants to be searched, mashed-up, split up and grouped any which way.

Ok, so how do I as a developer address this… Isn’t it obvious? Build a solution from open source components!

I never would have attempted this if it weren’t for Facebook’s Thrift project. It provides much of what I needed to get this off the ground, specifically the ability to build services that can communicate with almost any language. They used it internally to build much of their infrastructure, like search and the Facebook platform itself. Thrift on the surface looks like a stripped down version of CORBA. You define structures and services in an IDL and use its code compiler to generate object definitions and a client/server interface. But Thrift offers so much more. Most importantly, the ability to transmit your objects over any protocol (binary, xml, json) as well as over any transport (tcp socket, http, file). Another big benefit of Thrift is that you can adjust your structure definitions over time while keeping backwards compatibility with your previous definition. BINGO. This is a big deal because one of the big reasons I keep using databases like mysql is so I can adjust my schema as I find bottlenecks or bugs. In fact Google has built a very similar system to Thrift, which is how they store data on GFS, using compressed serialized objects they call protocol buffers.

Ok, so I had a development platform, Thrift, now just add a few months of late night coding and a little Memcached, Spread, CLucene and Brackup and I ended up with…

Thrudb is a set of simple services built on top of Facebook’s Thrift framework that provides indexing and document storage services for building and scaling websites. Its purpose is to offer web developers flexible, fast and easy-to-use services which can enhance or replace traditional data storage and access layers.

Thrudb Features:

  • Client libraries for most languages
  • Multi-master replication
  • Incremental backups and redo logging
  • Multiple storage backends (S3 included)
  • Built for horizontal scalability
  • Simple and powerful search api (Lucene)

Thrudb solves a lot of problems for me. Biggest of all is, now with Thrudb, I can use Amazon EC2 as a stable server farm since my backend database writes directly to S3. In fact, I’ve successfully moved Junkdepot from a traditional hosting facility using a mysql database to multiple EC2 instances using Thrudb in a week.

Check it out: http://thrudb.googlecode.com

I’m not saying Thrudb is complete and production ready, but I do think it’s pretty reliable and simple to try out. I’m hoping you the reader can help make it better with your testing, coding and insight…


Thrift now available in Perl

Monday, July 30th, 2007

Facebook’s Thrift dev team has been pretty busy recently with the Facebook platform and all. But I guess things are now under control, because they just announced the Thrift SVN repository with a number of patches applied to the latest revision, including my Perl implementation. We’ve been using Thrift for a number of months now and I’m really happy with its performance and service oriented approach… Thrift is really going to change web development for the better. We will be releasing a number of powerful services built on Thrift soon after the release of our next project.

Keep watching…


Half-Asynch / Half-Synch Processing Model Added to Thrift

Sunday, June 10th, 2007

Synchronous processing and asynchronous processing have different strengths and weaknesses. Asynchronous processing is often confusing to system developers, but it scales really well. Synchronous processing (i.e. multi-threaded) is easy to add into a traditional program but is often resource intensive.

A good analogy for asynch vs synch programming is writing a SAX XML parser vs a DOM parser… DOM is easy to code but heavyweight. SAX is more complicated to code but way faster and less resource intensive.

Web services need to be able to support many simultaneous clients and perform complicated backend processing.

A great backend component we’ve mentioned before is memcached. It uses asynchronous processing so it can support tens of thousands of connections, and it is extremely fast because its actual work is just to store and retrieve data from an in-memory hash table.

But sometimes you need to perform processing-intensive requests like searching a large index or image processing… doing this with an asynchronous processing model would be tricky and ineffective. At the same time, reading and writing the request over a socket using a synchronous processing model (one thread per request) would be a waste (imagine 200 56k clients connecting to you at the same time; that means 200 threads). The best solution is to perform the network IO using asynchronous processing and the request processing using multiple threads (synchronous processing).

The ACE toolkit defined this design pattern years ago and I’ve used it quite effectively in the past. However, ACE is a very heavyweight, C++-only library and its learning curve is pretty steep. This is why we are using Thrift instead, since it’s interoperable with many languages and contains a lightweight C++ toolkit.

We are building some pretty cool web services using Thrift, and we recently worked with Facebook to add Half-Synch/Half-Asynch support to its C++ toolkit. As a result we can now support thousands of long lived connections while processing requests in a large thread pool. Best of both worlds!
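For readers who want to see the shape of the pattern, here is a minimal sketch in Python rather than in Thrift's C++ toolkit. It is an illustration of Half-Synch/Half-Asynch in general, not the code that went into Thrift: `process_request`, `dispatch`, the line-per-request framing, the pool size of 8 and port 9090 are all assumptions. The event loop owns every socket, and only complete requests are handed to the thread pool.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Synchronous half: a fixed pool of worker threads for the expensive requests.
WORKERS = ThreadPoolExecutor(max_workers=8)


def process_request(payload: bytes) -> bytes:
    """Stand-in for heavy work such as searching a large index (hypothetical)."""
    return payload.upper()


async def dispatch(payload: bytes) -> bytes:
    """Hand one complete request to the thread pool without blocking the event loop."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(WORKERS, process_request, payload)


async def handle_client(reader: asyncio.StreamReader,
                        writer: asyncio.StreamWriter) -> None:
    """Asynchronous half: a slow client costs one coroutine, not one thread."""
    try:
        while line := await reader.readline():  # one request per line
            writer.write(await dispatch(line))
            await writer.drain()
    finally:
        writer.close()
        await writer.wait_closed()


async def serve(host: str = "127.0.0.1", port: int = 9090) -> None:
    """Accept connections forever; all socket IO stays on the event loop."""
    server = await asyncio.start_server(handle_client, host, port)
    async with server:
        await server.serve_forever()


if __name__ == "__main__":
    asyncio.run(serve())
```

With this split, 200 slow clients tie up 200 cheap coroutines while the number of threads stays fixed at the pool size.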


Facebook’s Thrifty

Sunday, May 27th, 2007

When I first read about the Facebook platform last week I have to admit I wasn’t all that excited. Sure it’s great to see Facebook open its doors to 3rd parties, but I’m not itching to code the next social slideshow widget. What’s got me all fired up, however, is something else Facebook launched a couple of months ago which largely went unnoticed, called Thrift. It’s what they built their platform with.

Thrift is essentially a framework for building web services that can be accessed by most languages. It’s similar to CORBA in that you define an interface file and Thrift will generate stubs implementing that interface in any of its supported languages.

  • C++
  • PHP
  • Python
  • Java
  • Ruby
  • And recently Perl (thanks to us)

Once you have these stubs you can build the service backend in the above language of your choice and access it from any of the other languages.

This brings us to rule #1 of programming: know when to use the right tool for the job.

You would (hopefully) never write a web frontend layer in c++ when php was built specifically for that purpose. You would also (hopefully) never build a search engine backend in php since it will be difficult to maximize performance while minimizing memory and cpu time.

So with thrift you can build the search engine in c++ and access it via php (which is what facebook does).

The default transport mechanism is a compact binary format, but it’s fully extensible to any format (xml, json).

But that’s just the beginning. You can do some really cool things with Thrift, like log all messages to a file and play them back (instant redo logs), or version your data structures so you can still keep backwards compatibility with older stored data.

Thrift comes with some top notch c++ code to quickly build scalable backend c++ services. We are using thrift today as the search backend for junkdepot.com.

I hope to start a series of articles on how to use Thrift as an alternative to standard LAMP. I really think Thrift will become the backbone of next generation web services. We will also be releasing some of the services we’ve built with Thrift as open source projects soon.

Feel free to ask questions.

-Jake
