Archive for the ‘thrudb’ Category

Great Thrudb Write-up

Friday, December 28th, 2007

The folks over at AideRSS have been big supporters of thrudb since the week it was launched (almost 2 months ago).

They have recently released two public Amazon EC2 AMIs with thrudb pre-installed so folks can get started quickly.

And now their CTO, Ilya Grigorik, has written up a great article on thrudoc, the document storage service in thrudb.

I’m really happy to have such great support from these guys.  I hope thrudb can help improve their already great service!

Written by jake

Amazon SimpleDB : Super! but too simple?

Saturday, December 15th, 2007

Finally a big company gets it: schema-less, document-oriented databases are the wave of the future.

With yesterday’s announcement of Amazon SimpleDB, the new way of storing and querying data that so many of us have been working toward has finally hit the mainstream. I believe this kind of technology is a game-changer since it allows simple, flexible storage and retrieval of multi-faceted data (which describes most data on the web). That being said, there appear to be a number of issues with the beta release that will hopefully be ironed out in the months to come.
Here’s what we know so far:

- REST and SOAP APIs

- Domains represent a collection of documents, similar to S3 Buckets

- “items” or documents can contain up to 256 key-value pairs (called attributes)

- Multiple attributes with the same name allowed e.g. (type=flag, color=red, color=white, color=blue)

- Create, remove or update items and item attributes

- Attribute values limited to 1024 characters

- Very simple query language for searching domains, i.e. ( =, !=, <, >, <=, >=, STARTS-WITH, AND, OR, NOT, INTERSECTION and UNION )

- No free text search capabilities

- Query time limited to 5 seconds, error thrown if query takes any longer.

- Query results can be limited and paged (total possible results are not returned)

- No sorting capabilities

- Eventual consistency model used for writes. This means that if you update a document and immediately query it, you may still get the un-updated document.

- Pay as you go based on storage and query utilization.
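
To make the item model above concrete, here is a minimal in-memory sketch in Python. It is not the SimpleDB API: the `Item` class and its `put`, `matches` and `pair_count` methods are hypothetical names, the limits simply mirror the numbers listed above, and the sketch assumes that repeating an identical name/value pair does not create a second copy.

```python
from collections import defaultdict

MAX_PAIRS = 256         # key-value pairs per item, per the list above
MAX_VALUE_CHARS = 1024  # characters per attribute value, per the list above


class Item:
    """Toy model of a schema-less item; not the real SimpleDB client."""

    def __init__(self, name):
        self.name = name
        self.attrs = defaultdict(set)  # attribute name -> set of values

    def pair_count(self):
        return sum(len(values) for values in self.attrs.values())

    def put(self, key, value):
        if len(value) > MAX_VALUE_CHARS:
            raise ValueError("attribute value longer than 1024 characters")
        if value not in self.attrs[key] and self.pair_count() >= MAX_PAIRS:
            raise ValueError("item already holds 256 key-value pairs")
        self.attrs[key].add(value)

    def matches(self, key, value):
        # An item matches if ANY of its values for the key equals value.
        return value in self.attrs.get(key, set())


# The flag example from the list: one attribute name, several values.
flag = Item("flag")
flag.put("type", "flag")
for color in ("red", "white", "blue"):
    flag.put("color", color)

print(flag.matches("color", "white"))  # True
print(flag.pair_count())               # 4
```

Because values are compared as plain strings, range operators such as < and > would compare lexicographically in a model like this, so numbers would need padding to sort sensibly.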

My big concerns here are the lack of sorting and free-text search, the eventual consistency model and the attribute size limit. I would use this service for things like tag search, user preference storage and other non-critical metadata, but I’m not sure it would be useful or reliable enough to store things like a username, encrypted password and email info.

One great thing that Amazon put into their intro doc, which parallels the thrudb design, was the following:

Developers can run their applications in Amazon EC2 and store their data objects in Amazon S3. Amazon SimpleDB can then be used to query the object metadata from within the application in Amazon EC2 and return pointers to the objects stored in Amazon S3.

This is exactly the way ThruDB’s thrudoc and thrucene services are intended to work together. However, since thrucene is built on Lucene, it offers atomic writes, no hard limits, free-text search and sorting :)
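
The pattern in the quote (query the metadata, get back pointers, then fetch the objects) can be sketched with two plain Python containers standing in for the two services. The names `object_store`, `metadata_index`, `store` and `find` are illustrative only; they are not thrudoc, thrucene, SimpleDB or S3 calls.

```python
import uuid

object_store = {}    # stands in for the document store: key -> raw bytes
metadata_index = []  # stands in for the index: (key, metadata dict) pairs


def store(blob, **metadata):
    """Save the blob under a fresh key and index its metadata separately."""
    key = uuid.uuid4().hex
    object_store[key] = blob
    metadata_index.append((key, metadata))
    return key


def find(**criteria):
    """Return keys (pointers) of objects whose metadata matches all criteria."""
    return [
        key
        for key, meta in metadata_index
        if all(meta.get(k) == v for k, v in criteria.items())
    ]


store(b"<jpeg bytes>", kind="photo", owner="jake")
store(b"<html bytes>", kind="bookmark", owner="jake")

# The query touches only the index; the objects are fetched afterwards by key.
for key in find(kind="photo", owner="jake"):
    print(object_store[key])
```

Keeping the two halves separate is the point of the design: the index can stay small and searchable while the bulky objects live wherever storage is cheapest.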

I am excited to get my hands on SimpleDB and I will definitely use it sometimes as an alternative search interface to thrucene. However, I think the Amazon engineers had to compromise on too many things in order to provide a ubiquitous database for everyone. I’m sure they will address a number of the limitations in the months to come. Either way, it’s an exciting time for us data storage geeks :)

Now if only Google would release BigTable

Written by jake

Working with thrift structures

Friday, November 23rd, 2007

While developing thrudb I’ve been using thrift a lot and have a couple tricks to share.

The first trick is how to do simple reflection. Thrift lets you do things like serialize a structure to a binary string and store it on disk. The problem is that thrift doesn’t store the structure’s definition along with it, since this information would bloat the message and, frankly, goes against the design of thrift, which allows loose structure definitions (see section 4 of the thrift whitepaper).

To get around this we need to encode the type of structure we have as a field in the struct itself.

Let’s start with an example: say I want to store a mixed list of Email and RSS articles in a file for backup purposes or, better yet, in thrudb.
Here’s our thrift definition file:

#this is a thrift definition

enum ObjectType {
    UNKNOWN     = 0,
    EMAIL       = 1,
    RSS_ARTICLE = 2
}

struct SimpleObject {
    100:ObjectType type = UNKNOWN
}

struct Email {
    1:string       subject,
    2:string       to_address,
    3:string       from_address,
    4:i32          date,
    5:string       body,
    100:ObjectType type = EMAIL
}

struct RssArticle {
    1:string       uri,
    2:string       title,
    3:string       body,
    4:i32          date,
    100:ObjectType type = RSS_ARTICLE
}

So what we did here is set field 100 to be the struct type, then assigned it a default enumeration key from the list of possible types, so when a struct is instantiated its type is automatically set. This information is included when a struct is serialized to disk, so when we read the message back we can use our dummy struct “SimpleObject” to check its type. The SimpleObject struct will ignore all the other fields in the message, only loading field 100 (the enum key). Now we know which structure to allocate.

Here’s a pseudocode example of this in action:

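    // Pseudocode: the constructors stand in for thrift deserialization.
    function load_object($serialized_object) {

        // Reads only field 100 (the type); every other field is skipped.
        $type_obj = new SimpleObject( $serialized_object );

        switch($type_obj->type){

         case EMAIL:
               return new Email( $serialized_object );

         case RSS_ARTICLE:
               return new RssArticle( $serialized_object );

         default:
               print "Unknown type!";
               return null;
        }
    }

    $object = load_object( get_random_serialized_object() );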
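
For readers who want something they can run, here is a small Python sketch of the same type-field trick. It does not use the thrift library or the thrift wire format: the `(id, length, value)` record layout and the helpers `encode`, `decode`, `peek_type` and `load` are assumptions made up for this illustration. Only the idea is the same, namely that a reader can skip every field except the one holding the type.

```python
import struct

TYPE_FIELD = 100  # same field id the structs above reserve for the type
UNKNOWN, EMAIL, RSS_ARTICLE = 0, 1, 2


def encode(fields):
    """Encode {field_id: bytes} as repeated (id, length, value) records."""
    out = b""
    for field_id, value in fields.items():
        out += struct.pack(">HI", field_id, len(value)) + value
    return out


def decode(data, wanted=None):
    """Decode records, skipping any field id not in `wanted` (None = keep all)."""
    fields, pos = {}, 0
    while pos < len(data):
        field_id, length = struct.unpack_from(">HI", data, pos)
        pos += 6  # 2-byte id plus 4-byte length
        if wanted is None or field_id in wanted:
            fields[field_id] = data[pos:pos + length]
        pos += length
    return fields


def peek_type(data):
    """Play the role of SimpleObject: read only field 100."""
    raw = decode(data, wanted={TYPE_FIELD}).get(TYPE_FIELD)
    return raw[0] if raw else UNKNOWN


def load(data):
    """Check the type first, then decode the full record for that type."""
    kind = peek_type(data)
    if kind == EMAIL:
        return ("Email", decode(data))
    if kind == RSS_ARTICLE:
        return ("RssArticle", decode(data))
    raise ValueError("Unknown type!")


email = encode({1: b"Hello", 2: b"you@example.com", TYPE_FIELD: bytes([EMAIL])})
article = encode({1: b"http://example.com/feed", TYPE_FIELD: bytes([RSS_ARTICLE])})

print(load(email)[0])    # Email
print(load(article)[0])  # RssArticle
```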
Written by jake

Announcing: Thrudb - Document Oriented Database Services

Sunday, November 4th, 2007

There has been a lot of talk recently about how traditional relational databases no longer fit the bill for web development. This is certainly a bit over the top since every site I’ve ever built or seen built uses an RDBMS. But I think the point is that not a lot has changed in the world of data storage since the 70s. SQL, DDL and referential integrity are ideas that all came before the onset of the web. Databases are really just big spreadsheets, but is that the best storage structure for web data?

A new breed of databases and data services has emerged in recent years to address this. The first product I came across was an XML database with XQuery, but this system was built to offer everything a regular database offers PLUS a bunch of new features like on-the-fly indexing of any field. The problem with this kind of approach is that it ends up complicating the API. Not to mention XML and performance don’t really fit together.

I’m a big believer in simple/fast software components that can be put together to create powerful/fast systems. Google is the best known example of this. Their systems are built to be massively parallel, so much so that there was no way an RDBMS would work. Instead they first built the Google File System, which splits their data into 64MB chunks and spreads it across thousands of machines, making at least 3 copies of any chunk for redundancy. Then they use techniques like MapReduce to create indexes of these documents, split the indexes into shards and spread those across their network too. Finally they have services running on these machines that coordinate searches across the index shards, return the matching document ids and fetch those documents from the document store.
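
The search flow described above (ask every index shard, merge the document ids, then fetch from the document store) can be sketched in a few lines of Python. This is a toy on a single machine, not Google’s design: the shard count, the modulo placement rule and the names `add_document` and `search` are all assumptions for illustration, and a real deployment would query the shards in parallel over the network.

```python
from collections import defaultdict

NUM_SHARDS = 3
document_store = {}  # doc id -> text
index_shards = [defaultdict(set) for _ in range(NUM_SHARDS)]  # word -> doc ids


def add_document(doc_id, text):
    """Store the document, then index its words in the shard that owns it."""
    document_store[doc_id] = text
    shard = index_shards[doc_id % NUM_SHARDS]
    for word in text.lower().split():
        shard[word].add(doc_id)


def search(word):
    """Ask every shard for matching ids, merge them, then fetch the documents."""
    doc_ids = set()
    for shard in index_shards:
        doc_ids |= shard.get(word.lower(), set())
    return [document_store[doc_id] for doc_id in sorted(doc_ids)]


for doc_id, text in enumerate(["thrift services", "lucene search", "search services"]):
    add_document(doc_id, text)

print(search("services"))  # ['thrift services', 'search services']
```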

They have also built a system called BigTable, a column-oriented database that splits a table into columns rather than rows, making it much simpler to distribute and parallelize.

So why are these systems any better than a relational database? Well for one thing they make it much easier to scale horizontally, meaning you can slap on another box to the network and increase your database capacity. This is exactly how webservers scale but anyone who has tried to scale their website will tell you it’s never as easy to scale your database as it is your webservers, since traditional databases are inherently monolithic.

Another benefit is that your data structures can be sparsely populated and linked across any number of facets in these systems. The story of del.icio.us or Flickr trying to scale using tagging and MySQL is a great read because it illustrates the problem you run into when using fixed schemas to hold dynamic/fluid data that wants to be searched, mashed up, split up and grouped any which way.

Ok, so how do I as a developer address this… Isn’t it obvious? Build a solution from open source components!

I never would have attempted this if it weren’t for Facebook’s Thrift project. It provides much of what I needed to get this off the ground, specifically the ability to build services that can communicate with almost any language. Facebook used it internally to build much of their infrastructure, like search and the Facebook platform itself. Thrift on the surface looks like a stripped-down version of CORBA. You define structures and services in an IDL and use its code compiler to generate object definitions and a client/server interface. But Thrift offers so much more. Most importantly, it gives you the ability to transmit your objects over any protocol (binary, XML, JSON) as well as over any transport (TCP socket, HTTP, file). Another big benefit of Thrift is that you can adjust your structure definitions over time while keeping backwards compatibility with your previous definitions. BINGO. This is a big deal because one of the big reasons I keep using databases like MySQL is so I can adjust my schema as I find bottlenecks or bugs. In fact, Google has built a very similar system to Thrift, which is how they store data on GFS: compressed serialized objects they call protocol buffers.
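
The backwards-compatibility point is easiest to see with numbered fields. The Python sketch below is not Thrift’s generated code or wire format: JSON and the `SCHEMA_V1`/`SCHEMA_V2` dictionaries are stand-ins chosen for brevity, and `write` and `read` are hypothetical helpers. The idea it illustrates is that a reader fills in defaults for fields it does not find and skips field ids it does not know.

```python
import json

# Field id -> (name, default). V2 adds field 3; ids 1 and 2 keep their meaning.
SCHEMA_V1 = {1: ("uri", ""), 2: ("title", "")}
SCHEMA_V2 = {1: ("uri", ""), 2: ("title", ""), 3: ("author", "unknown")}


def write(schema, **values):
    """Serialize by field id, so data does not depend on field names."""
    by_name = {name: field_id for field_id, (name, _) in schema.items()}
    return json.dumps({str(by_name[name]): value for name, value in values.items()})


def read(schema, payload):
    """Fill defaults for missing fields and silently skip unknown field ids."""
    record = {name: default for name, default in schema.values()}
    for field_id, value in json.loads(payload).items():
        if int(field_id) in schema:
            record[schema[int(field_id)][0]] = value
    return record


old_data = write(SCHEMA_V1, uri="http://example.com/1", title="First")
new_data = write(SCHEMA_V2, uri="http://example.com/2", title="Second", author="jake")

print(read(SCHEMA_V2, old_data))  # new reader, old data: author falls back to "unknown"
print(read(SCHEMA_V1, new_data))  # old reader, new data: author is ignored
```

The rule that makes this work is never to reuse or renumber a field id once data has been written with it.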

Ok, so I had a development platform, Thrift, now just add a few months of late night coding and a little Memcached, Spread, CLucene and Brackup and I ended up with…

Thrudb is a set of simple services built on top of Facebook’s Thrift framework that provides indexing and document storage services for building and scaling websites. Its purpose is to offer web developers flexible, fast and easy-to-use services which can enhance or replace traditional data storage and access layers.

Thrudb Features:

  • Client libraries for most languages
  • Multi-master replication
  • Incremental backups and redo logging
  • Multiple storage backends (S3 included)
  • Built for horizontal scalability
  • Simple and powerful search api (Lucene)

Thrudb solves a lot of problems for me. Biggest of all is, now with Thrudb, I can use Amazon EC2 as a stable server farm since my backend database writes directly to S3. In fact, I’ve successfully moved Junkdepot from a traditional hosting facility using a mysql database to multiple EC2 instances using Thrudb in a week.

check it out: http://thrudb.googlecode.com

I’m not saying Thrudb is complete and production ready, but I do think it’s pretty reliable and simple to try out. I’m hoping you, the reader, can help make it better with your testing, coding and insight…

Written by jake