Archive for the ‘programming’ Category

Amazon SimpleDB : Super! but too simple?

December 15th, 2007

Finally a big company gets it: schema-less, document-oriented databases are the wave of the future.

With the announcement yesterday of Amazon SimpleDB, a new way of storing and querying data has finally hit the mainstream so many of us have been trying to reach. I believe this kind of technology is a game-changer since it allows simple, flexible storage and retrieval of multi-faceted data (which describes most data on the web). That being said, there appear to be a number of issues with the beta release that will hopefully be ironed out in the months to come.
Here’s what we know so far:

- REST and SOAP APIs

- Domains represent a collection of documents, similar to S3 Buckets

- “Items” or documents can contain up to 256 key-value pairs (called attributes)

- Multiple attributes with the same name allowed e.g. (type=flag, color=red, color=white, color=blue)

- Create, remove or update items and item attributes

- Attribute values limited to 1024 characters

- Very simple query language for searching domains, i.e. ( =, !=, <, >, <=, >=, STARTS-WITH, AND, OR, NOT, INTERSECTION and UNION )

- No free text search capabilities

- Query time limited to 5 seconds; an error is thrown if a query takes any longer.

- Query results can be limited and paged (total possible results are not returned)

- No sorting capabilities

- Eventual consistency model used for writes. This means if you update a document and instantly query it, you may get the old, un-updated document back.

- Pay as you go based on storage and query utilization.
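To make the attribute limits above concrete, here is a minimal sketch in Python of what an item looks like as a data structure. This is a toy in-memory model, not the SimpleDB API: the `Item` class and its method names are made up for illustration, and treating a repeated name-value pair as a no-op is my assumption.

```python
MAX_PAIRS_PER_ITEM = 256  # "up to 256 key-value pairs" from the list above
MAX_VALUE_LENGTH = 1024   # "attribute values limited to 1024 characters"


class Item:
    """Toy stand-in for a SimpleDB item (hypothetical, not the real API)."""

    def __init__(self):
        # attribute name -> set of values, so one name can carry many values
        self.attributes = {}

    def pair_count(self):
        """Count name-value pairs, which is what the 256 limit applies to."""
        return sum(len(values) for values in self.attributes.values())

    def put(self, name, value):
        if len(value) > MAX_VALUE_LENGTH:
            raise ValueError("attribute value exceeds 1024 characters")
        if value in self.attributes.get(name, ()):
            return  # assumption: storing the same pair twice changes nothing
        if self.pair_count() >= MAX_PAIRS_PER_ITEM:
            raise ValueError("item already holds 256 key-value pairs")
        self.attributes.setdefault(name, set()).add(value)


# The flag example from the list: one "type" and three "color" values.
flag = Item()
for name, value in [("type", "flag"), ("color", "red"),
                    ("color", "white"), ("color", "blue")]:
    flag.put(name, value)
```

Note that the three colors count as three pairs toward the limit, so a heavily multi-valued item fills up faster than the attribute names alone suggest.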

My big concerns here are the lack of sorting and free text search, the eventual consistency model and the attribute size limit. I would use this service for things like tag search, user preference storage and other non-critical metadata, but I’m not sure it would be useful or reliable enough to store things like a username, encrypted password and email info.

One great thing that Amazon put into their intro doc, which parallels the thrudb design, was the following:

Developers can run their applications in Amazon EC2 and store their data objects in Amazon S3. Amazon SimpleDB can then be used to query the object metadata from within the application in Amazon EC2 and return pointers to the objects stored in Amazon S3.

This is exactly the way ThruDB’s thrudoc and thrucene services are intended to work together. However, since thrucene is built on lucene, they offer atomic writes, no hard limits, free text search and sorting :)

I am excited to get my hands on SimpleDB and I will definitely use it sometimes as an alternative search interface to thrucene. However, I think the Amazon engineers had to compromise on too many things in order to provide a ubiquitous database for everyone. I’m sure they will address a number of the limitations in the months to come. Either way, it’s an exciting time for us data storage geeks :)

Now if only Google would release BigTable

jake amazon, database, programming, thrudb

Working with thrift structures

November 23rd, 2007

While developing thrudb I’ve been using thrift a lot and have a couple of tricks to share.

The first trick is how to do simple reflection. Thrift lets you do things like serialize a structure to a binary string and store it on disk. The problem is that thrift doesn’t store the structure’s definition along with it, since this information would bloat the message and frankly goes against the design of thrift, which allows loose structure definitions (see section 4 of the thrift whitepaper).

To get around this we need to encode the type of structure we have as a field in the struct itself.

Let’s start with an example: say I want to store a mixed list of Email and RSS articles in a file for backup purposes, or better yet in thrudb.
Here’s our thrift definition file:

#this is a thrift definition

enum ObjectType {
    UNKNOWN     = 0,
    EMAIL       = 1,
    RSS_ARTICLE = 2
}

struct SimpleObject {
    100:ObjectType type = UNKNOWN
}

struct Email {
    1:string subject,
    2:string to_address,
    3:string from_address,
    4:i32    date,
    5:string body,
    100:ObjectType type = EMAIL
}

struct RssArticle {
    1:string uri,
    2:string title,
    3:string body,
    4:i32    date,
    100:ObjectType type = RSS_ARTICLE
}

So what we did here is give every struct a field with ID 100 that holds the struct type, and assign it a default value from the enumeration, so when a struct is instantiated its type is set automatically. This information is included when a struct is serialized to disk, so when we read the message back we can use our dummy struct “SimpleObject” to check its type. The SimpleObject struct will ignore all the other fields in the message, only loading field 100 (the enum value). Now we know which structure to allocate.

Here’s a pseudocode example of this in action:

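    // Pseudocode: "new Foo( $bytes )" stands in for "deserialize $bytes into a Foo".
    function load_object( $serialized_object ) {

        // Reads only field 100 (the type tag); every other field is ignored.
        $type_obj = new SimpleObject( $serialized_object );

        switch ($type_obj->type) {
            case EMAIL:
                return new Email( $serialized_object );
            case RSS_ARTICLE:
                return new RssArticle( $serialized_object );
            default:
                print "Unknown type!";
                return null;
        }
    }

    $object = load_object( get_random_serialized_object() );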
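For anyone who wants to play with the idea without a thrift install, here is a rough runnable sketch of the same type-tag dispatch in Python. Big caveat: it does not use thrift at all. JSON keyed by field ID stands in for the binary encoding, and `serialize`, `peek_type`, `load` and the loader functions are names I made up for the illustration.

```python
import json

UNKNOWN, EMAIL, RSS_ARTICLE = 0, 1, 2
TYPE_FIELD = "100"  # field ID 100 carries the type tag, as in the definition above


def serialize(fields):
    """Toy stand-in for thrift serialization: {field ID: value} as JSON."""
    return json.dumps(fields)


def peek_type(blob):
    """Plays the role of SimpleObject: look only at field 100.

    Real thrift skips the unknown fields while reading; this toy version
    parses everything and then discards what it does not need.
    """
    return json.loads(blob).get(TYPE_FIELD, UNKNOWN)


def load_email(blob):
    fields = json.loads(blob)
    return {"kind": "email", "subject": fields["1"], "to_address": fields["2"]}


def load_rss_article(blob):
    fields = json.loads(blob)
    return {"kind": "rss", "uri": fields["1"], "title": fields["2"]}


LOADERS = {EMAIL: load_email, RSS_ARTICLE: load_rss_article}


def load(blob):
    """Check the tag first, then deserialize into the matching structure."""
    loader = LOADERS.get(peek_type(blob))
    if loader is None:
        raise ValueError("Unknown type!")
    return loader(blob)


# A mixed backup list, like the Email / RSS example above.
backup = [
    serialize({"1": "Hello", "2": "bob@example.com", "100": EMAIL}),
    serialize({"1": "http://example.com/feed/1", "2": "First post",
               "100": RSS_ARTICLE}),
]
objects = [load(blob) for blob in backup]
```

A dictionary of loaders replaces the switch statement here, which keeps the dispatch in one place when more struct types get added.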

jake programming, thrift, thrudb

Javascript Arrays vs Object Literal

June 28th, 2007

Recently, I’ve been learning to use Javascript object literals for holding similar sets of data as opposed to using plain arrays. They are much more manageable and flexible than arrays, and I think even easier to read. Below is a variable holding form validation data as an array of object literals. I can simply loop through these just like I would any array: I can output the values, check against them and even call a function or set an event listener. With a flat array of values I would have to remember what each position means, whereas here every value has a name. The fact that I can make references to functions is really cool and allows a pretty flexible and powerful system. Next time you have to work with arrays, consider object literals.

// checkEmpty is a validation function defined elsewhere on the page
var emptyValues = [
    {name:'firstname', id:'firstname', ce:checkEmpty, eid:'first_error', defVal:'First'},
    {name:'lastname', id:'lastname', ce:checkEmpty, eid:'last_error', defVal:'Last'},
    {name:'phone', id:'phone', ce:checkEmpty, eid:'phone_error', defVal:'Phone'},
    {name:'message', id:'message', ce:checkEmpty, eid:'message_error', defVal:'Message'},
    {name:'name', id:'name', ce:checkEmpty, eid:'name_error', defVal:'Full Name'},
    {name:'email', id:'email', ce:checkEmpty, eid:'error_email', defVal:'Email'}
];

Rich JavaScript, programming, web

Half-Asynch / Half-Synch Processing Model Added to Thrift

June 10th, 2007

Synchronous processing and asynchronous processing have different strengths and weaknesses. Asynchronous processing is often confusing to system developers, but it scales really well. Synchronous processing (i.e. multi-threaded) is easy to add into a traditional program but is often resource intensive.

A good analogy for asynch vs. synch programming is writing a SAX XML parser vs. a DOM parser: DOM is easy to code but heavyweight, while SAX is more complicated to code but way faster and less resource intensive.

Web services need to be able to support many simultaneous clients and perform complicated backend processing.

A great backend component we’ve mentioned before is memcached. It uses asynchronous processing so it can support tens of thousands of connections, and it is extremely fast because its actual work is just to store and retrieve data from an in-memory hash table.

But sometimes you need to perform processing-intensive requests like searching a large index or image processing. Doing this with an asynchronous processing model would be tricky and ineffective. At the same time, reading and writing the request over a socket using a synchronous processing model (one thread per request) would be a waste (imagine 200 clients on 56k modems connecting to you at the same time: that means 200 threads). The best solution is to perform the network IO using asynchronous processing and the request processing using multiple threads (synchronous processing).
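Here is a small sketch of that split in Python, using asyncio for the asynchronous half and a thread pool for the synchronous half. It is not the Thrift or ACE implementation, just the shape of the pattern. The slow clients are simulated with `asyncio.sleep` rather than real sockets, and `process_request`, `handle_client` and `serve` are names invented for the example.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor


def process_request(request):
    """Synchronous half: blocking, processing-heavy work (a placeholder here).

    Returns the result plus the name of the worker thread that produced it.
    """
    return request.upper(), threading.current_thread().name


async def handle_client(client_id, pool):
    """Asynchronous half: waiting on a slow client never holds a thread."""
    loop = asyncio.get_running_loop()
    await asyncio.sleep(0.01)  # stands in for a request trickling in slowly
    request = f"request-{client_id}"
    # Hand the complete request to the thread pool and wait without blocking.
    result = await loop.run_in_executor(pool, process_request, request)
    await asyncio.sleep(0.01)  # stands in for slowly writing the response
    return result


async def serve(n_clients, n_workers):
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return await asyncio.gather(
            *(handle_client(i, pool) for i in range(n_clients)))


# 200 slow clients, but only 8 threads doing the actual processing.
results = asyncio.run(serve(200, 8))
worker_threads = {thread_name for _, thread_name in results}
```

The number of connections and the number of threads are now independent knobs: the first is bounded by what the event loop can juggle, the second by how much processing the machine can do at once.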

The ACE toolkit defined this design pattern years ago and I’ve used it quite effectively in the past. However, ACE is a very heavyweight, C++-only library and its learning curve is pretty steep. This is why we are using thrift instead, since it’s interoperable with many languages and contains a lightweight C++ toolkit.

We are building some pretty cool web services using Thrift, and we recently worked with Facebook to implement Half-Synch/Half-Asynch support in its C++ toolkit. As a result we can now support thousands of long-lived connections while processing requests in a large thread pool. Best of both worlds!

jake c++, coding, facebook, programming, scaling, thrift, web service

is scaling easy? it can be.

May 31st, 2007

The Ruby on Rails folk say scaling is easy, and they are correct, but there are many different components to scale. Scaling web servers horizontally using reverse proxy tools like pound or perlbal and caching with memcached and varnish will get you pretty far, but real web applications need to scale their content and not just their service. Take Flickr for example: how do you scale millions of photos? Or Twitter: you can’t put all those tweets in a single database. That’s why the key to real scaling is considering your service a year down the road: what considerations can you make now to make it easier to scale later on (if you are lucky enough to need to)?

The most surefire way to scale your content is to make it easier to federate or partition your data. That means following these simple rules:

1. Keep away from sequential primary keys in your database. Use UUIDs: they can be generated globally from anywhere with practically no chance of collision. This way you can more easily move to a multi-master database model if you have to, or split your data into partitioned chunks based on hashing the UUID.

2. Don’t use stored procs (ever!). Thankfully most of us are used to not having stored procs in mysql so this isn’t a big deal, but if you split up your database into smaller pieces you can’t use a traditional stored proc to search across them all, not to mention it’s bad to put business logic in your model layer.

3. Think about using special search tools, like lucene, for searching across specific types of data. Related to the above rule, searching across your data is hard when you split it up into pieces, but tools like lucene make it easy to create small meta-indexes of your data which can easily fit a lot more info than a big innodb table.

4. Don’t store binary data in your db. Unless you like pain, you should never store things like images in a database. Just store the path to the file. I’ve found it easy to take the MD5 of the image and use that as the name, since you can then partition your images evenly across many directories (and eventually disks). Or just use amazon s3 :)

5. Finally, only store what you need. Scaling becomes much harder when you build a lot of complexity and normalization into your data model. Keep it simple, stupid. People don’t like complicated apps, believe me I know :)
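Rules 1 and 4 are easy to show in a few lines. Below is a minimal Python sketch, assuming 16 database partitions and a two-level directory fan-out for images. Those numbers, the `/images` root and the function names are all made up for the example, so adjust to taste.

```python
import hashlib
import uuid

NUM_SHARDS = 16  # hypothetical number of database partitions


def new_key():
    """Rule 1: a random UUID can be minted on any server, no coordination."""
    return str(uuid.uuid4())


def shard_for(key, num_shards=NUM_SHARDS):
    """Pick a partition by hashing the key.

    The same key always lands on the same shard, and keys spread roughly
    evenly because the hash output is close to uniform.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def image_path(image_bytes, root="/images"):
    """Rule 4: name the file after its MD5 and fan out over directories.

    The first two pairs of hex characters pick the directories, which
    gives 256 x 256 buckets that fill up evenly.
    """
    name = hashlib.md5(image_bytes).hexdigest()
    return f"{root}/{name[:2]}/{name[2:4]}/{name}"


key = new_key()
shard = shard_for(key)
path = image_path(b"")
```

One thing to keep in mind with the simple modulo approach: changing the number of shards later moves most keys to a different shard, so pick the count with some headroom.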

There are some great talks about scaling data here.

jake programming, scaling, web