Azure CosmosDB – the new dog in the Cloud. Surely our old friend SQL is good enough?

For years, as developers we have defaulted to some form of SQL or relational database management system (RDBMS) to store application data – probably because that was all there was at the time. But now we are entering the age of the cloud and alternatives are beginning to appear. So how does RDBMS fare?

Reasonably well – if we consider SQL Azure (a cloud version of Microsoft SQL Server), it supports encryption both in transit and at rest. It automates much of what a DBA would do, such as table indexing, backups and replication to other instances. We still have to distribute (or normalise) our data across related tables – so nothing has changed there. If we are using C#, we can have Entity Framework in ‘code first’ mode automatically generate the database table structure (schema) from our classes.

So what’s wrong with RDBMS? The simple fact is that it has serious trouble ‘scaling out’, or ‘horizontally scaling’, for truly global data storage – meaning we can’t simply add a second server when we need more writable resources. It can only ‘scale up’, by configuring the cloud service to allocate more processing power or memory to the server. The bottom line is that there is only one writable database, on one server, on one virtual machine – plus a number of read-only copies.

Note: To be fair, there are a few workarounds. One is to have identical writable databases on separate servers, with a choice of processes to keep them in sync. Another is a technique known as ‘sharding’, where there are again multiple databases, but the client application partitions the incoming data using a particular column or field and hard-codes the distribution of records across the databases. Both methods add complexity or custom logic to the client applications that use them.
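To make the sharding idea concrete, here is a minimal sketch of that hard-coded distribution logic living in the client. The shard names and the choice of `customer_id` as the partitioning field are purely illustrative assumptions, not anything a real driver provides:

```python
import zlib

# Hypothetical shard names – in practice these would map to separate
# database connection strings managed by the application.
SHARDS = ["shard-eu", "shard-us", "shard-apac"]

def shard_for(customer_id: str) -> str:
    # Hard-coded distribution: hash the partitioning field to pick a shard.
    # crc32 gives a hash that is stable across program runs.
    return SHARDS[zlib.crc32(customer_id.encode()) % len(SHARDS)]
```

Every read and write now has to go through a routine like this first – which is exactly the kind of custom logic the workaround forces into the client.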

So what is CosmosDB, what is different about it, and under what circumstances can we consider it? CosmosDB is Microsoft’s newer cloud-native database. It comes in different forms; in this article we will discuss its document database form – in the Azure Portal this is known as the SQL API. Note: it also supports other APIs: MongoDB, Cassandra, Table and Gremlin (Graph).

At the heart of the CosmosDB SQL API are JavaScript Object Notation (JSON) text-based documents. These documents store our data in a structure that we define (there is no fixed schema), so in theory every document could be different. Each document has a maximum size of 2MB. Once set up, our CosmosDB instance has a hierarchical structure of the form: <my-account>.documents.azure.com/<my-database>/<my-collection>/<my-first-document.json>

If we now have a data record to store, then instead of distributing it across many relational tables, the complete record can be stored in a single document. When we query for records, CosmosDB simply returns the contents of the JSON documents. Compare this to an RDBMS, which first has to work out how to join up the data from the separate tables before it can return anything. Hence reading data from CosmosDB can have extremely low latency, meaning we can get results at lightning speed. By the way – we don’t have to create indexes, as this is done for us!

Another advantage is that these documents can be copied or replicated to many CosmosDB instances. Each copy can be further edited and replicated back, so no particular document is the master. All databases can be writable, and in theory we can have as many database instances as we like. Hence we can both ‘scale up’ and ‘scale out’.
Additionally, we can tell CosmosDB, at the point of creation, how to partition the data. This means we specify a field in our document as the partition key: documents with the same partition key value are placed on the same server, and other values go onto other servers. CosmosDB also has five consistency levels, compared to the one in most RDBMSs – but we won’t go into that today!
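Conceptually, partitioning is just grouping documents by the value of the chosen key. A tiny sketch, where the `city` field and the sample documents are illustrative assumptions:

```python
from collections import defaultdict

def group_by_partition(docs, key):
    # Documents sharing the same key value end up grouped (co-located)
    # together – a stand-in for what the service does across servers.
    partitions = defaultdict(list)
    for doc in docs:
        partitions[doc[key]].append(doc)
    return dict(partitions)

docs = [
    {"id": "1", "city": "Seattle"},
    {"id": "2", "city": "London"},
    {"id": "3", "city": "Seattle"},
]
partitions = group_by_partition(docs, "city")  # docs 1 and 3 co-located
```

Picking a key with well-spread values matters, because all documents with the same value must live together.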

So we can execute SQL-like queries, or use code, to save documents, return one or more documents, or even return portions (projections) of documents – and through the SDK easily convert between JSON and in-memory objects. But what if we require information that is scattered across many different documents? In this scenario the client application may have to download each document separately and pick out what is needed (joins between documents are not yet supported). So why not avoid this by duplicating the data into a single document instead?
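A projection is easy to picture once the document is deserialised. This sketch uses Python’s standard `json` module rather than the real SDK; the document and field names are illustrative:

```python
import json

# Deserialise a stored JSON document and keep only a subset of its fields –
# roughly what a query such as SELECT c.firstName, c.lastName FROM c returns.
doc_json = '{"id": "1", "firstName": "Thomas", "lastName": "Andersen"}'
doc = json.loads(doc_json)                      # JSON -> in-memory object
projection = {k: doc[k] for k in ("firstName", "lastName")}
```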

We have become so fixated on normalising into separate tables that the idea of duplicating data can make us feel very uncomfortable. And we have done this without a care for how applications might actually read or write the data. Of course, we would make certain we created sufficient table indexes to avoid full table scans!

In CosmosDB we should rethink this – planning our JSON documents so that they are optimised for both the read and the write tasks we want our application to perform.

Let’s look at an example of how we might typically store data relating to a person’s contact details in a four-table relational database.
We can obtain a person record by executing a single query, but it incurs an expensive join across the four tables.


SELECT p.FirstName, p.LastName, a.City, cd.Detail
FROM Person p
JOIN ContactDetail cd ON cd.PersonId = p.Id
JOIN ContactDetailType cdt ON cdt.Id = cd.TypeId
JOIN Address a ON a.PersonId = p.Id

If we then wanted to update a field in each table, we would have to submit an individual UPDATE statement per table – with the initial SELECT, that is five server round trips.

So how does this compare with storing a single person record in CosmosDB?
Here we embed the complete record in one document which we can obtain from a single query. We can then update anything in the record and save it back in a single operation. Job done!

{
    "id": "1",
    "firstName": "Thomas",
    "lastName": "Andersen",
    "addresses": [
        {            
            "line1": "100 Some Street",
            "line2": "Unit 1",
            "city": "Seattle",
            "state": "WA",
            "zip": 98012
        }
    ],
    "contactDetails": [
        {"email: "thomas@andersen.com"},
        {"phone": "+1 555 555-5555", "extension": 5555}
    ] 
}

The big assumption is that the person has a limited, or bounded, number of addresses and contact details. If, however, we also needed to store the date and time of each login, the corresponding embedded list would grow without bound over time, possibly reaching megabytes in size. Every time we wanted to edit a single field we would have to download this large document, find the field, then post the whole thing back. Hence, for performance and code-complexity reasons, it would be wise to move the login details out of this document and into one or more other documents that can be referenced.
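A quick sketch of what that split might look like. The login documents and their field names (`personId`, `at`) are hypothetical – the point is only that each login becomes its own small document carrying a reference back to the person:

```python
# The person document stays small and bounded.
person = {"id": "1", "firstName": "Thomas", "lastName": "Andersen"}

# Each login is its own document referencing the person by id, so recording
# a new login is a small single-document write – not a rewrite of an
# ever-growing person document.
logins = [
    {"id": "l1", "personId": "1", "at": "2019-01-01T09:00:00Z"},
    {"id": "l2", "personId": "1", "at": "2019-01-02T09:05:00Z"},
]

def logins_for(person_id, login_docs):
    # Client-side filter standing in for a query on the personId field.
    return [d for d in login_docs if d["personId"] == person_id]
```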

Let’s look at an example of how we could store a many to many unbounded relationship in SQL and also in CosmosDB.

Author documents:
{"id": "a1", "name": "Thomas Andersen", "books": ["b1, "b2", "b3"]}
{"id": "a2", "name": "William Wakefield", "books": ["b1", "b4"]}

Book documents: 
{"id": "b1", "name": "Azure Cosmos DB 101", "authors": ["a1", "a2"]}
{"id": "b2", "name": "Azure Cosmos DB for RDBMS Users", "authors": ["a1"]}
{"id": "b3", "name": "Learn about Azure Cosmos DB", "authors": ["a1"]}
{"id": "b4", "name": "Deep Dive in to Azure Cosmos DB", "authors": ["a2"]}

Here we have six individual documents with embedded items that reference each other via an id field. This works well for a small number of items, but might not be optimal if the unbounded number of books or authors grew indefinitely.
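Resolving those references is the client’s job. A minimal sketch, where the two dicts stand in for individual documents fetched from the database (one lookup per referenced id, since cross-document joins are not supported):

```python
# In-memory stand-ins for the author and book documents shown above.
authors = {"a1": {"name": "Thomas Andersen", "books": ["b1", "b2", "b3"]}}
books = {
    "b1": {"name": "Azure Cosmos DB 101"},
    "b2": {"name": "Azure Cosmos DB for RDBMS Users"},
    "b3": {"name": "Learn about Azure Cosmos DB"},
}

def books_by(author_id):
    # One read per referenced id – fine while the list is small and bounded.
    return [books[b]["name"] for b in authors[author_id]["books"]]
```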

Note: references between documents are not checked or enforced. Compare that to the enforced primary key – foreign key relationships that maintain integrity in a relational database.

In summary, if we only have bounded relationships, then for many reasons CosmosDB can seriously challenge RDBMS. For unbounded relationships, this can take a bit more thought but can still work very well.

For further reading please click here for the excellent article on which this blog is based.
