The days of one database system for the entire enterprise are over. Even simple mobile applications now demand more than one database. The good news is that we have entered a golden age of open-source NoSQL databases. Developers have access to great open-source database technologies with robust communities behind them.
The difficulty is knowing which NoSQL database to choose for a particular use case.
"NoSQL: a broad class of data management systems where the data is partitioned across a set of servers, where no server plays a privileged role." --Emin Gün Sirer
Types and examples of NoSQL databases, grouped by data model, include the following:
- Columnar: Accumulo, Cassandra, Druid, HBase, Vertica
- Document: Clusterpoint, Apache CouchDB, Couchbase, DocumentDB, HyperDex, Lotus Notes, MarkLogic, MongoDB, OrientDB, Qizx
- Key-value: CouchDB, Oracle NoSQL Database, Dynamo, FoundationDB, HyperDex, MemcacheDB, Redis, Riak, FairCom c-treeACE, Aerospike, OrientDB, MUMPS
- Graph: Allegro, Neo4J, InfiniteGraph, OrientDB, Virtuoso, Stardog
- Multi-model: OrientDB, FoundationDB, ArangoDB, Alchemy Database, CortexDB
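To make the data models in the list above more concrete, here is a minimal sketch of one record expressed in each of the four main shapes. Plain Python structures stand in for real database engines, and the user record, key names, and the `followed_by` helper are all hypothetical, invented purely for illustration.

```python
# One hypothetical user record expressed in four NoSQL data models.
# Plain Python structures stand in for real database engines.

# Key-value: an opaque value looked up by a single key.
key_value = {"user:42": '{"name": "Ada", "city": "London"}'}

# Document: a nested, self-describing record that can be queried by field.
document = {"_id": 42, "name": "Ada", "address": {"city": "London"}, "tags": ["admin"]}

# Columnar (wide-column): row key -> column family -> column -> value.
columnar = {"42": {"profile": {"name": "Ada"}, "location": {"city": "London"}}}

# Graph: nodes plus labeled edges between them.
nodes = {42: {"name": "Ada"}, 7: {"name": "Grace"}}
edges = [(42, "FOLLOWS", 7)]


def followed_by(user_id):
    """Return the names of the users that user_id follows."""
    return [nodes[dst]["name"] for src, label, dst in edges
            if src == user_id and label == "FOLLOWS"]
```

The same information fits every shape, but each shape makes a different access pattern cheap: a single-key lookup, a field query, a per-column scan, or a relationship traversal.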
Some common characteristics of NoSQL databases:
- Not using the relational model (nor the SQL language)
- Open source
- Designed to run on large clusters
- Based on the needs of 21st century web properties
- No schema, allowing fields to be added to any record without controls
According to the Forum for Innovation, 90% of the world's data was created in the last 2 years.
NoSQL databases are increasingly being used in big data and real-time web applications. NoSQL databases provide a mechanism for storage and retrieval of data that is modeled in different ways than the tabular relations used in relational databases. Such databases have existed since the late 1960s, but did not obtain the "NoSQL" moniker until the surge of popularity in the early twenty-first century, triggered by the needs of modern Web companies like Twitter, Facebook, Google and Amazon.com.
Motivations for this non-relational approach include the requirement to store and access much larger volumes of data more quickly; simpler "horizontal" scaling to clusters of machines, which is a problem for relational databases; the ability to store schema-less data; and more granular control over availability.
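Horizontal scaling usually means spreading keys across several machines. The sketch below shows one simple way this could be done, hashing each key and taking the result modulo the number of servers. The server names and the `shard_for` helper are hypothetical, and no particular database is claimed to work exactly this way.

```python
import hashlib


def shard_for(key: str, servers: list[str]) -> str:
    """Pick a server for a key by hashing the key (simple modulo sharding)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


servers = ["node-a", "node-b", "node-c"]  # hypothetical server names
keys = ("user:1", "user:2", "user:3", "user:4")
placement = {key: shard_for(key, servers) for key in keys}
```

Every client computes the same placement without consulting a coordinator, so no server plays a privileged role. The trade-off of plain modulo sharding is that adding or removing a server moves most keys, which is why production systems often use consistent hashing instead.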
Although Big Data has been the primary driver for NoSQL’s rise, it is not the only reason to use NoSQL databases. Many NoSQL databases are designed to run well on large clusters, which makes them more attractive for large data volumes. But often people select NoSQL to benefit from better performance, easier interactions with their applications and the ability to use schema-less data.
Big Data has been defined by the four “V”s: Volume, Velocity, Variety, and Value. These become a reasonable test to determine whether you should add NoSQL capabilities to your information architecture.
Volume. While volume indicates more data, it is the granular nature of the data that is unique. Big Data requires processing high volumes of low-density data, that is, data of unknown value, such as Twitter data feeds, clicks on a web page and network traffic.
Value. Data has intrinsic value, but it must be discovered. There is a range of quantitative and investigative techniques for deriving value from data, from discovering consumer preferences or sentiment, to making relevant offers by location, to identifying a piece of equipment that is about to fail.
Velocity. The fast rate at which data is received and perhaps acted upon. The highest-velocity data normally streams directly into memory rather than being written to disk.
As an example, e-commerce applications seek to combine mobile device location and personal preferences to make time-sensitive offers. Operationally, mobile application experiences have large user populations, increased network traffic, and the expectation of an immediate response.
Variety. New unstructured and semi-structured data types, such as text, audio, and video require additional processing to derive meaning from the content and the supporting metadata. Frequent or real-time schema changes are an enormous burden for both transactional and analytical environments.
The data structures used by NoSQL databases (e.g. document, key-value or graph) differ from those used by default in relational databases, making some operations much faster in a NoSQL database. The suitability of a given NoSQL database depends on the type of data to be stored, how it will be accessed and whether ACID (Atomicity, Consistency, Isolation, Durability) transactions or joins are required. The data structures used by NoSQL databases are generally considered "more flexible" than relational database tables.
Many NoSQL stores sacrifice consistency, in the sense of the CAP (Consistency, Availability, Partition tolerance) theorem, in favor of availability, partition tolerance, and speed. Most NoSQL stores lack true ACID transactions, although a few recent systems including ArangoDB, CouchDB and OrientDB have made them central to their designs. Instead, most provide "eventual consistency", in which database changes are propagated to all nodes "eventually" (typically within milliseconds), so queries might not return updated data immediately.
The three CAP properties are:
- Consistency (all nodes see the same data at the same time)
- Availability (a guarantee that every request receives a response about whether it succeeded or failed)
- Partition tolerance (the system continues to operate despite arbitrary partitioning due to network failures)
The four ACID properties are:
- Atomicity: all or nothing (of the n actions): commit or rollback
- Consistency: transactions never observe or cause inconsistent data
- Isolation: transactions are not aware of concurrent transactions
- Durability: acknowledged transactions persist in all events
ACID is a set of properties that guarantee that database transactions are processed reliably. In the context of databases, a single logical operation on the data is called a transaction. For example, a transfer of funds from one bank account to another, even involving multiple changes such as crediting one account and debiting another, is a single transaction.
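The funds transfer described above can be sketched with Python's built-in `sqlite3` module. SQLite is a relational engine, used here only because it ships with Python and makes the all-or-nothing behavior easy to see. The `accounts` table, the account names, and the `transfer` helper are assumptions made for this illustration.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move amount from src to dst atomically; roll back on any failure."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # both updates are committed together
failed = transfer(conn, "alice", "bob", 500)  # both updates are rolled back
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
```

The second transfer debits and credits inside the transaction, detects the overdraft, and rolls both changes back, so neither account ever shows a partial result.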
The degree to which a given NoSQL database supports ACID transactions in a manner similar to most SQL databases, or meets the needs of a specific application, varies from system to system.
Not all NoSQL systems live up to the promise of "eventual consistency" and partition tolerance. In experiments with network partitioning, some NoSQL databases exhibited lost writes and other forms of data loss. Fortunately, some NoSQL systems provide mechanisms such as write-ahead logging to avoid data loss.
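As a rough illustration of eventual consistency, the toy model below keeps two in-memory replicas and propagates changes with an explicit sync step. The `Replica` class, the `sync` function, and the version numbers are hypothetical simplifications; real systems use more elaborate replication and conflict-resolution schemes.

```python
class Replica:
    """A toy replica that stores a (version, value) pair per key."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, version):
        # Last-write-wins: keep only the value with the highest version.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def sync(source, target):
    """Propagation pass: offer every entry in source to target."""
    for key, (version, value) in source.data.items():
        target.write(key, value, version)


a, b = Replica(), Replica()
a.write("cart", ["book"], version=1)
sync(a, b)
a.write("cart", ["book", "pen"], version=2)  # accepted by replica a only
stale = b.read("cart")                       # b has not seen version 2 yet
sync(a, b)                                   # propagation catches up
fresh = b.read("cart")
```

Between the second write and the second sync, a read from replica `b` returns the older value. The last-write-wins rule used here also hints at how writes can be lost: if two replicas accept different updates concurrently, the one with the lower version is silently discarded.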
Handling Relational Data in a NoSQL Database
Because most NoSQL databases lack support for JOIN queries, the data and access methods generally need to be designed differently. The three main techniques for handling relational data in a NoSQL database without JOIN support are as follows:
Caching/replication/non-normalized data: Instead of only storing foreign keys, it's common to store actual foreign values along with the model's data. For example, each blog comment might include the username in addition to a user id, thus providing easy access to the username without requiring another lookup. When a username changes, however, it must be changed in many places in the database. Thus this approach works better when reads are much more common than writes.
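A minimal sketch of the blog-comment example, with plain Python structures standing in for database collections. The field names and the `display` and `rename_user` helpers are hypothetical.

```python
users = {1: {"username": "ada"}}

# Each comment stores the username alongside the user id (denormalized copy).
comments = [
    {"id": 10, "user_id": 1, "username": "ada", "text": "Nice post"},
    {"id": 11, "user_id": 1, "username": "ada", "text": "Thanks"},
]


def display(comment):
    """Render a comment without a second lookup into users."""
    return f'{comment["username"]}: {comment["text"]}'


def rename_user(user_id, new_name):
    """Update the user and every denormalized copy; return the copies touched."""
    users[user_id]["username"] = new_name
    touched = 0
    for comment in comments:
        if comment["user_id"] == user_id:
            comment["username"] = new_name
            touched += 1
    return touched
```

Reads stay cheap because `display` never touches `users`, while a rename has to visit every comment the user has written.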
Nesting data: With document databases like MongoDB, it's common to put more data in a smaller number of collections. For example, in a blogging application, one might choose to store comments within the blog post document so that a single retrieval returns all the comments. In this approach a single document contains all the data you need for a specific task.
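A minimal sketch of the nested blog post, using a Python dict in place of a document collection. The document layout and helper names are hypothetical and do not reflect any particular database's API.

```python
# A single document holds the post and all of its comments.
posts = {
    "post-1": {
        "title": "Choosing a NoSQL database",
        "body": "Post text goes here.",
        "comments": [
            {"username": "ada", "text": "Nice post"},
            {"username": "grace", "text": "Agreed"},
        ],
    }
}


def get_post_with_comments(post_id):
    """One lookup returns the post together with every comment."""
    return posts[post_id]


def add_comment(post_id, username, text):
    """Append a comment inside the post document itself."""
    posts[post_id]["comments"].append({"username": username, "text": text})
```

The trade-off is that the document grows with every comment, so this layout suits data that is read together and stays reasonably small.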
Multiple queries: Instead of retrieving all of the data with one query, it's common to do several queries in order to get all of the desired data. NoSQL queries are often much faster than relational database queries, so the cost of having to do additional queries may be acceptable. If an excessive number of queries would be necessary, one of the other two approaches would be more appropriate.
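A minimal sketch of an application-side join built from several lookups. The three dicts stand in for separate collections, and the field names and the `load_post` helper are hypothetical.

```python
posts = {"post-1": {"title": "Choosing a NoSQL database", "comment_ids": [10, 11]}}
comments = {
    10: {"user_id": 1, "text": "Nice post"},
    11: {"user_id": 2, "text": "Agreed"},
}
users = {1: {"username": "ada"}, 2: {"username": "grace"}}


def load_post(post_id):
    """Assemble a post view with three rounds of lookups instead of a JOIN."""
    post = posts[post_id]                                            # query 1
    post_comments = [comments[cid] for cid in post["comment_ids"]]  # query 2
    user_ids = {c["user_id"] for c in post_comments}
    authors = {uid: users[uid] for uid in user_ids}                  # query 3
    return {
        "title": post["title"],
        "comments": [
            {"username": authors[c["user_id"]]["username"], "text": c["text"]}
            for c in post_comments
        ],
    }
```

Batching the comment and user lookups keeps the number of round trips fixed at three, regardless of how many comments the post has.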
There are several more aspects to consider when choosing a NoSQL database, such as durability, availability, consistency, scalability, and security. Your particular use case will determine whether ad hoc queryability is important or whether MapReduce will suffice.
Often the best approach is to use multiple databases, with each application using the database that is most appropriate for its type of data, performance and access requirements.
Virtual machine hosting and “the cloud” have become so cheap and easy that infrastructure is now a largely trivial problem, driven as much by preference as by necessity. To scale your organization's applications today, choosing the right database or databases for each task is far more likely to relieve your true bottleneck.
For more information about choosing and implementing the ideal database systems for your particular use case, get in touch.