
Announcing: Thrudb - Document Oriented Database Services

November 4th, 2007

There has been a lot of talk recently about how traditional relational databases no longer fit the bill for web development. That is certainly a bit over the top, since every site I’ve ever built or seen built uses an RDBMS. But I think the point is that not a lot has changed in the world of data storage since the 1970s. SQL, DDL and referential integrity are ideas that all came before the onset of the web. Databases are really just big spreadsheets, but is that the best storage structure for web data?

A new breed of databases and data services has emerged in recent years to address this. The first product I came across was an XML database with XQuery, but this system was built to offer everything a regular database offers PLUS a bunch of new features like on-the-fly indexing of any field. The problem with this kind of approach is that it ends up complicating the API. Not to mention that XML and performance don’t really fit together.

I’m a big believer in simple, fast software components that can be put together to create powerful, fast systems. Google is the best-known example of this. Their systems are built to be massively parallel, so much so that there was no way an RDBMS would work. Instead they first built the Google File System, which splits their data into 64MB chunks and spreads it across thousands of machines, making at least 3 copies of every chunk for redundancy. Then they use techniques like MapReduce to create indexes of these documents, split the indexes into shards and spread those across their network too. Finally, they have services running on these machines that coordinate searches across the index shards, collect the matching document ids and fetch the documents from the document store.
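To make the index-shard idea concrete, here is a minimal Python sketch of building one small inverted index per shard and then searching all of them. It is a toy under loud assumptions: the names `build_shards` and `search`, the modulo placement of documents and the whitespace tokenizer are all made up for illustration, and none of it reflects how Google's systems are actually implemented.

```python
from collections import defaultdict

def build_shards(docs, num_shards):
    """Spread documents across shards and build one inverted index per shard."""
    shards = [defaultdict(set) for _ in range(num_shards)]
    for doc_id, text in docs.items():
        # Assumption: simple modulo placement decides which shard owns a document.
        shard = shards[doc_id % num_shards]
        for word in text.lower().split():
            shard[word].add(doc_id)
    return shards

def search(shards, word):
    """Ask every shard for matches, then merge the document ids."""
    hits = set()
    for shard in shards:
        hits |= shard.get(word.lower(), set())
    return sorted(hits)

# A stand-in "document store": id -> document text.
docs = {
    1: "thrift makes services easy",
    2: "services scale horizontally",
    3: "lucene indexes documents",
    4: "documents live in a document store",
}

shards = build_shards(docs, num_shards=2)
ids = search(shards, "documents")
results = [docs[i] for i in ids]  # fetch the matches from the document store
```

Each shard only knows about its own documents, so adding machines means adding shards, and a search is just a fan-out followed by a merge.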

They have also built a system called BigTable, a column-oriented database that splits a table into columns rather than rows, making it much simpler to distribute and parallelize.
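The difference between the two layouts is easy to see in a few lines of Python. This is only a sketch of the general rows-versus-columns idea, not of BigTable itself; the sample data and the helper `to_columns` are invented for illustration, and the helper assumes every row carries the same keys.

```python
# Row-oriented: each record keeps all of its fields together.
rows = [
    {"id": 1, "name": "alice", "visits": 10},
    {"id": 2, "name": "bob", "visits": 25},
    {"id": 3, "name": "carol", "visits": 7},
]

def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns

# Column-oriented: each column is its own list and could live on its own machine.
columns = to_columns(rows)

# An aggregate over one column never has to touch the other columns.
total_visits = sum(columns["visits"])
```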

So why are these systems any better than a relational database? Well for one thing they make it much easier to scale horizontally, meaning you can slap on another box to the network and increase your database capacity. This is exactly how webservers scale but anyone who has tried to scale their website will tell you it’s never as easy to scale your database as it is your webservers, since traditional databases are inherently monolithic.

Another benefit is that your data structures can be sparsely populated and linked across any number of facets in these systems. The story of del.icio.us or Flickr trying to scale tagging on top of MySQL is a great read because it illustrates the problem you run into when using fixed schemas to hold dynamic, fluid data that wants to be searched, mashed up, split up and grouped any which way.
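As a rough illustration of sparse, faceted data, here is a small Python sketch in which every document carries only the fields it actually has and can be grouped by any of them. The bookmark records, the field names and the `group_by` helper are all hypothetical; a fixed-schema table would need a column (or a join table) for each of these facets up front.

```python
from collections import defaultdict

# Sparse documents: no two records need to share the same set of fields.
bookmarks = [
    {"id": 1, "user": "ann", "tags": ["thrift", "rpc"]},
    {"id": 2, "user": "bob", "tags": ["rpc"], "notes": "read later"},
    {"id": 3, "user": "ann"},  # no tags and no notes at all
]

def group_by(docs, field):
    """Group document ids by any field; documents lacking the field are skipped."""
    groups = defaultdict(list)
    for doc in docs:
        if field not in doc:
            continue
        value = doc[field]
        # A multi-valued field such as tags puts the document in several groups.
        values = value if isinstance(value, list) else [value]
        for v in values:
            groups[v].append(doc["id"])
    return dict(groups)

by_tag = group_by(bookmarks, "tags")
by_user = group_by(bookmarks, "user")
by_notes = group_by(bookmarks, "notes")
```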

Ok, so how do I as a developer address this… Isn’t it obvious? Build a solution from open source components!

I never would have attempted this if it weren’t for Facebook’s Thrift project. It provides much of what I needed to get this off the ground, specifically the ability to build services that can communicate with almost any language. They used it internally to build much of their infrastructure, like search and the Facebook platform itself. On the surface, Thrift looks like a stripped-down version of CORBA. You define structures and services in an IDL and use its compiler to generate object definitions and a client/server interface. But Thrift offers so much more. Most importantly, it gives you the ability to transmit your objects over any protocol, be it binary, XML or JSON, as well as over any transport (TCP socket, HTTP, file). Another big benefit of Thrift is that you can adjust your structure definitions over time while keeping backwards compatibility with your previous definitions. BINGO. This is a big deal because one of the big reasons I keep using databases like MySQL is so I can adjust my schema as I find bottlenecks or bugs. In fact, Google has built a very similar system to Thrift, which is how they store data on GFS: compressed serialized objects they call protocol buffers.
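The backwards-compatibility trick rests on each field in a Thrift struct carrying a numeric id, so a reader can skip ids it does not know and fall back to defaults for ids that are missing. The Python sketch below imitates that idea with plain dictionaries. It is not Thrift's API or wire format: `SCHEMA_V1`, `SCHEMA_V2`, `encode` and `decode` are hypothetical names invented for this example.

```python
import copy

# Hypothetical schemas: field id -> (field name, default value).
SCHEMA_V1 = {1: ("title", ""), 2: ("url", "")}
SCHEMA_V2 = {1: ("title", ""), 2: ("url", ""), 3: ("tags", [])}  # adds field 3

def encode(record, schema):
    """Serialize a record as {field_id: value}, keyed by id rather than by name."""
    ids = {name: fid for fid, (name, _default) in schema.items()}
    return {ids[name]: value for name, value in record.items()}

def decode(payload, schema):
    """Read a payload with the reader's schema.

    Unknown field ids are skipped; missing fields get the schema default.
    """
    record = {name: copy.deepcopy(default) for name, default in schema.values()}
    for fid, value in payload.items():
        if fid in schema:
            record[schema[fid][0]] = value
    return record

# Old writer, new reader: the new field falls back to its default.
old_payload = encode({"title": "Thrudb", "url": "http://thrudb.googlecode.com"}, SCHEMA_V1)
read_by_new = decode(old_payload, SCHEMA_V2)

# New writer, old reader: the unknown field id 3 is silently ignored.
new_payload = encode({"title": "t", "url": "u", "tags": ["db"]}, SCHEMA_V2)
read_by_old = decode(new_payload, SCHEMA_V1)
```

Because the data is keyed by id, fields can be renamed or added without rewriting what is already stored, as long as existing ids are never reused for something else.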

Ok, so I had a development platform in Thrift. Now just add a few months of late-night coding and a little Memcached, Spread, CLucene and Brackup, and I ended up with…

Thrudb is a set of simple services built on top of Facebook’s Thrift framework that provides indexing and document storage services for building and scaling websites. Its purpose is to offer web developers flexible, fast and easy-to-use services which can enhance or replace traditional data storage and access layers.

Thrudb Features:

  • Client libraries for most languages
  • Multi-master replication
  • Incremental backups and redo logging
  • Multiple storage backends (S3 included)
  • Built for horizontal scalability
  • Simple and powerful search API (Lucene)

Thrudb solves a lot of problems for me. Biggest of all: with Thrudb I can now use Amazon EC2 as a stable server farm, since my backend database writes directly to S3. In fact, I’ve successfully moved Junkdepot from a traditional hosting facility using a MySQL database to multiple EC2 instances using Thrudb in a week.

Check it out: http://thrudb.googlecode.com

I’m not saying Thrudb is complete and production ready, but I do think it’s pretty reliable and simple to try out. I’m hoping you, the reader, can help make it better with your testing, coding and insight…

jake database, ec2, facebook, thrift, thrudb, web
