.. Big Fat Happy Data documentation intro file. Who what when where and why. .. include:: refs.rst Introduction ============ This documentation project is the result of work by `Shane R. Spencer`_ and is licensed under the `Creative Commons Attribution-ShareAlike 3.0 Unported License`_. Many of the techniques used in this documentation are not solely associated with `Shane R. Spencer`_ and for the most part are the intellectual property of the nameless horde known as humanity. If you feel as though you might like some attribution please make yourself known. Licensing --------- As previously mentioned.. this work is licensed under the `Creative Commons Attribution-ShareAlike 3.0 Unported License`_. This license lets others remix, tweak, and build upon this project even for commercial purposes, as long as they credit the author and license their new creations under the identical terms. This license is often compared to “copyleft” free and open source software licenses. All new works based on yours will carry the same license, so any derivatives will also allow commercial use. This is the license used by Wikipedia, and is recommended for materials that would benefit from incorporating content from Wikipedia and similarly licensed projects. Background ---------- Key-Value stores (KV) have existed for a good long time and have subsequently served humanity and its more technically inclined people very well. The history and of KV based storage methods is well beyond the scope of this document. Briefly however, the idea of quick and fast access to stored values relating to a named key is fundamental to how modern databases work. More recently we've all seen an increase in fun new storage techniques associated with the NoSQL genre which, simply put, breaks away from the relational database structure typically associated with database management systems that directly support the SQL standards for describing a query and processing results. Relational database solutions typically do not concentrate on large amounts of data per "row", hence the name of this document. Hopefully some benchmarks, storage comparisons, and network analysis will help developers understand the crazy method I've been calling *"Big Fat Happy Data"*! The initial use of these methods is being utilized in the project `Informadiko`_ which is currently based around `TornadoWeb`_, `MongoDB`_ and `Xapian`_. `Mongrel2`_ and `Brubeck`_ are also being evaluated for the project. Reference Technology -------------------- This document focuses on using `MongoDB`_ and `Redis`_ as KV stores. MongoDB has a very rich query parser that helps the developer keep things simple while still using a KV based solution and Redis does a great job at keeping data at the ready, handling atomic operations, and allowing blazing fast access to key data. This is of course very general since both projects have a huge list of pros and cons that the Internet is more than happy to point out. However the developers of both solutions are techodweebs with a good nose for what to avoid when developing reliable database solutions. .. glossary:: `MongoDB`_ Created by `10gen`_ as a feature rich KV store that can be seen as a document store. Documents are stored in collections that can be mapped to database servers in a myriad of ways. The most common and simple setup involves a single database server hosting one or more collections simultaneously. `Redis`_ Created by `Salvatore Sanfilippo`_ and is currently sponsored by `VMWare`_. Redis is an known as an in-memory KV store however data is backed onto the disk for extra persistence. Other technologies used in this document to help comparatively describe the data layout as well as offer some fun benchmarks include the popular relational database management system (RDBMS) `MySQL`_ as well as the object relationship manager (ORM) used by `Django`_. These two products may appear to get the shaft a bit.. however the should be considered very valuable projects. Both have inspired projects as well as been part of the foundation for a very high percentage of websites, custom applications, and large enterprise scale solutions. Maybe not Django so much on the last two since it's a web framework, however the ORM itself has powered many ideas completely unrelated to web interfaces. .. glossary:: `MySQL`_ Wildly popular as an Open Source database solution. Allows for SQL query syntax and multiple procedural languages. Very mature. `Django`_ A very useful tool that lives up to its slogan 'The Web framework for perfectionists with deadlines'. The object relationship manager used by Django is inspired by and in turn inspires many other projects with a similar goal. It is well written, lazy (that's a good thing), and makes good use of both RDBMS and NoSQL database backends. To use Django with MongoDB it is recommended to use 'Django MongoDB Engine'_ and the associated prerequisites to allow the Django ORM to operate cleanly on top of many NoSQL like backends. Reference Specification ----------------------- The initial reference specification used for the data is for an arbitrary information storage and retrieval system you would use for collecting forms data or storing searchable information for later queries. The project this was developed around originally started out as Django+MySQL then moved to Django+MongoDB and is currently using Tornado+MongoDB and the techniques described in this document. The specification (not schema) is defined as follows and referenced in the next topic: **Account Database** Stores information about account **Collection Database** Stores information about a collection of data Has **account** references Typically only associated with one account If associated with +1 collection then a through table is used **Criteria Database** Stores information about fields available to a specific collection Has a single **account** reference Even though this is for a specific collection it may be smart indexing to reference the account as the first key to help support spreading the database and keeping account information isolated to a specific area Has **collection** references Typically only associated with one collection If associated with +1 collection then a through table is used **User Database** Stores global user information (username, password, email, hat color) Has **account** references Associated with multiple accounts **User Account Profile** Since we associate with multiple accounts we need information about timezone preferences when rendering this users content. Has a single **account** reference Has a single **user** reference **User Collection Profile** Since we associate with multiple accounts we also would like to provide per collection preferences for things like time zones, if this collection is bookmarked, and other flags a user can have against a specific collection Has a single **account** reference Has a single **collection** reference Has a single **user** reference O.K. good. Now we have a quick and dirty starting point.