Introduction

This documentation project is the result of work by Shane R. Spencer and is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.

Many of the techniques used in this documentation are not solely associated with Shane R. Spencer and for the most part are the intellectual property of the nameless horde known as humanity. If you feel as though you might like some attribution please make yourself known.

Licensing

As previously mentioned, this work is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.

This license lets others remix, tweak, and build upon this project even for commercial purposes, as long as they credit the author and license their new creations under the identical terms. This license is often compared to “copyleft” free and open source software licenses. All new works based on this project will carry the same license, so any derivatives will also allow commercial use. This is the license used by Wikipedia, and is recommended for materials that would benefit from incorporating content from Wikipedia and similarly licensed projects.

Background

Key-Value stores (KV) have existed for a good long time and have subsequently served humanity and its more technically inclined people very well.

The history of KV-based storage methods is well beyond the scope of this document. Briefly, however, the idea of fast access to stored values relating to a named key is fundamental to how modern databases work. More recently we’ve all seen an increase in fun new storage techniques associated with the NoSQL genre which, simply put, breaks away from the relational database structure typically associated with database management systems that directly support the SQL standards for describing a query and processing results.

Relational database solutions typically do not concentrate on large amounts of data per “row”, hence the name of this document. Hopefully some benchmarks, storage comparisons, and network analysis will help developers understand the crazy method I’ve been calling “Big Fat Happy Data”!

These methods are first being used in the project Informadiko, which is currently based around TornadoWeb, MongoDB and Xapian. Mongrel2 and Brubeck are also being evaluated for the project.

Reference Technology

This document focuses on using MongoDB and Redis as KV stores. MongoDB has a very rich query parser that helps the developer keep things simple while still using a KV-based solution. Redis does a great job at keeping data at the ready, handling atomic operations, and allowing blazing fast access to key data. This is of course very general, since both projects have a huge list of pros and cons that the Internet is more than happy to point out. However, the developers of both solutions are technodweebs with a good nose for what to avoid when developing reliable database solutions.

MongoDB
Created by 10gen as a feature rich KV store that can be seen as a document store. Documents are stored in collections that can be mapped to database servers in a myriad of ways. The most common and simple setup involves a single database server hosting one or more collections simultaneously.
Redis
Created by Salvatore Sanfilippo and currently sponsored by VMware. Redis is known as an in-memory KV store; however, data is also written to disk for persistence.
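
To make the difference between the two styles of access concrete, here is a minimal sketch in plain Python. It does not talk to MongoDB or Redis at all: a dict stands in for a KV store and a list of dicts stands in for a document collection. The key names, the sample documents, and the `find` helper are all made up for illustration.

```python
# A plain dict standing in for a KV store: one named key, one value.
kv_store = {}
kv_store["user:1:visits"] = 0
# Redis offers atomic commands for this kind of update (INCR, for example);
# here it is just an ordinary in-process increment.
kv_store["user:1:visits"] += 1

# A list of dicts standing in for a collection of documents.
documents = [
    {"_id": 1, "username": "alice", "hat_color": "red"},
    {"_id": 2, "username": "bob", "hat_color": "blue"},
]


def find(collection, **criteria):
    """Return every document whose fields equal all of the given criteria.

    Hypothetical helper: a toy stand-in for a document store query.
    """
    return [
        doc
        for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


# KV access: you must already know the key.
visits = kv_store["user:1:visits"]

# Document access: you can ask about the contents of the value.
red_hats = find(documents, hat_color="red")
```

With a pure KV store you fetch a value by the key you already know. A document store still keys each document, but also lets you query on what is inside it.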

Other technologies used in this document to help comparatively describe the data layout, as well as offer some fun benchmarks, include the popular relational database management system (RDBMS) MySQL and the object-relational mapper (ORM) used by Django. These two products may appear to get the shaft a bit; however, they should be considered very valuable projects. Both have inspired other projects and have been part of the foundation for a very high percentage of websites, custom applications, and large enterprise scale solutions. Maybe not Django so much on the last two since it’s a web framework; however, the ORM itself has powered many ideas completely unrelated to web interfaces.

MySQL
Wildly popular as an Open Source database solution. Allows for SQL query syntax and multiple procedural languages. Very mature.
Django
A very useful tool that lives up to its slogan ‘The Web framework for perfectionists with deadlines’. The object-relational mapper used by Django is inspired by and in turn inspires many other projects with a similar goal. It is well written, lazy (that’s a good thing), and makes good use of both RDBMS and NoSQL database backends. To use Django with MongoDB it is recommended to use ‘Django MongoDB Engine’ and its associated prerequisites, which allow the Django ORM to operate cleanly on top of many NoSQL-like backends.

Reference Specification

The initial reference specification used for the data is for an arbitrary information storage and retrieval system you would use for collecting forms data or storing searchable information for later queries. The project this was developed around originally started out as Django+MySQL then moved to Django+MongoDB and is currently using Tornado+MongoDB and the techniques described in this document.

The specification (not schema) is defined as follows and referenced in the next topic:

Account Database
Stores information about an account

Collection Database
Stores information about a collection of data

- Has account references
  - Typically only associated with one account
  - If associated with more than one account then a through table is used

Criteria Database
Stores information about fields available to a specific collection

- Has a single account reference
  - Even though this is for a specific collection, it may be smart indexing to reference the account as the first key to help support spreading the database and keeping account information isolated to a specific area
- Has collection references
  - Typically only associated with one collection
  - If associated with more than one collection then a through table is used

User Database
Stores global user information (username, password, email, hat color)

- Has account references
  - Associated with multiple accounts

User Account Profile
Since we associate with multiple accounts we need information about timezone preferences when rendering this user’s content.

- Has a single account reference
- Has a single user reference

User Collection Profile
Since we associate with multiple accounts we would also like to provide per-collection preferences for things like time zones, whether this collection is bookmarked, and other flags a user can have against a specific collection

- Has a single account reference
- Has a single collection reference
- Has a single user reference
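
As a rough illustration only, the specification above could be sketched as Python dataclasses. This is not the schema used by the project (that is covered in the next topic). Every field name other than the references listed above is hypothetical, and using a plain list of ids in place of a through table is an assumption made to keep the sketch short.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Account:
    id: str
    name: str  # hypothetical field


@dataclass
class Collection:
    id: str
    name: str  # hypothetical field
    # Usually one account; a list stands in for the through table.
    account_ids: List[str] = field(default_factory=list)


@dataclass
class Criteria:
    id: str
    account_id: str  # single account reference
    label: str  # hypothetical field
    # Usually one collection; a list stands in for the through table.
    collection_ids: List[str] = field(default_factory=list)


@dataclass
class User:
    id: str
    username: str
    email: str
    hat_color: str = "none"
    account_ids: List[str] = field(default_factory=list)  # multiple accounts


@dataclass
class UserAccountProfile:
    account_id: str
    user_id: str
    timezone: str = "UTC"  # hypothetical default


@dataclass
class UserCollectionProfile:
    account_id: str
    collection_id: str
    user_id: str
    timezone: str = "UTC"  # hypothetical default
    bookmarked: bool = False


# Example wiring with made-up ids and values.
account = Account(id="acct-1", name="Example Co")
collection = Collection(id="coll-1", name="Contact forms",
                        account_ids=[account.id])
criteria = Criteria(id="crit-1", account_id=account.id,
                    label="Email address", collection_ids=[collection.id])
user = User(id="user-1", username="alice", email="alice@example.com",
            account_ids=[account.id])
profile = UserCollectionProfile(account_id=account.id,
                                collection_id=collection.id,
                                user_id=user.id, bookmarked=True)

# asdict() gives a plain dict, which is the shape a document store expects.
profile_document = asdict(profile)
```

Each reference is stored as the id of the thing it points to, so any one record can be read without needing the others loaded.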

O.K. good. Now we have a quick and dirty starting point.
