Star Schemas, OLAP Queries and NoSQL Databases

Traditional data warehouse design largely revolves around the venerable star schema, in which a fact table containing measures references surrounding dimension tables: for example, an OrderFact table with references to dimensions such as Customer and Product. This design works well for fast OLAP-type queries, whose speed is predicated on the use of SQL JOINs.

When using a NoSQL data store such as MongoDB or DynamoDB, you gain the ability to model less structured data, with the trade-off of doing without a fast join syntax. How does this play out when you are building a microservice that exposes OLAP-type queries?

Let's take the example of a problem domain where you have requests for work, people who make the requests, and things (equipment, services, furniture) that are involved in performing the requests. The traditional data warehouse approach would be to create (at least) three tables:

PeopleDimension
  id
  name
  location

ThingDimension
  id
  name
  location

RequestFact
  person_id
  thing_id
  timestamp
  priority

Replicating this model in a NoSQL database gives you a compact representation of the data, but what if you want to display a dashboard of people and things with the number of requests associated with each?

Person (Requests)    
  Joe (2)                
  Bill (0)            
  Karen (1)                    

Thing (Requests)
  Thing1 (0)
  Thing2 (1)
  Thing3 (1)
  Thing4 (1)

There are a number of approaches that you could take:

1) Create a star schema and use document references in the "Fact" table.

This approach minimizes the storage requirement for the data. However, the trade-off is that you end up performing the joins and aggregation in the application layer:

for person in db.people.find():
    # Get a count of requests for each person (one query per person)
    request_count = db.requests.count_documents({"person_id": person["_id"]})

for thing in db.things.find():
    # Get a count of requests for each thing (one query per thing)
    request_count = db.requests.count_documents({"thing_id": thing["_id"]})
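The per-entity lookups above can be replaced by a single pass over the request collection, tallying counts in memory. A minimal sketch, using in-memory lists as stand-ins for the collections (the collection contents are illustrative):

```python
from collections import Counter

# Illustrative stand-ins for the people and request collections.
people = [
    {"id": 1, "name": "Joe"},
    {"id": 2, "name": "Bill"},
    {"id": 3, "name": "Karen"},
]
requests = [
    {"person_id": 1, "thing_id": 2},
    {"person_id": 1, "thing_id": 3},
    {"person_id": 3, "thing_id": 4},
]

# One pass over the "fact" data builds both tallies.
person_counts = Counter(r["person_id"] for r in requests)
thing_counts = Counter(r["thing_id"] for r in requests)

# Entities with no requests fall back to 0, matching the dashboard.
for person in people:
    print(person["name"], person_counts.get(person["id"], 0))
```

The trade-off is memory: this pulls the whole request collection (or at least its foreign keys) into the application rather than issuing one count query per entity.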

2) De-normalize the data model

This approach minimizes processing overhead somewhat at the expense of storage requirements and/or array mutability.

{
  request_id : 1,
  timestamp : ISODate("2015-02-02"),
  person: {
    name: "Karen",
    location: "HQ"
  },
  thing: {
    name: "Thing3",
    location: "HQ"
  }
}

However, in this case you still have fairly high processing overhead to get the counts of requests per person and per thing. You might be better off with a structure like this:

{
  person_id: 1,
  name: "Karen",
  location: "HQ",
  requests: [
    {
      request_id: 1,
      timestamp: ISODate("2015-02-02")
    }
  ]
}

{
  thing_id: 1,
  name: "Thing3",
  location: "HQ",
  requests: [
    {
      request_id: 1,
      timestamp: ISODate("2015-02-02")
    }
  ]
}

In this case, getting the count of requests is relatively straightforward, but the mutability of the thing and people collections is quite high: you will be adding and removing members of the requests sub-documents in both collections every time a request is created or completed.
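The read/write asymmetry is easy to see in a sketch. Here plain dicts stand in for the two documents, and list operations stand in for the corresponding MongoDB array updates:

```python
# In-memory stand-ins for a person document and a thing document
# with embedded request sub-documents, as in the structure above.
karen = {
    "person_id": 1,
    "name": "Karen",
    "requests": [{"request_id": 1, "timestamp": "2015-02-02"}],
}
thing3 = {
    "thing_id": 1,
    "name": "Thing3",
    "requests": [{"request_id": 1, "timestamp": "2015-02-02"}],
}

# Reading the count is trivial: just the length of the embedded array.
request_count = len(karen["requests"])

# But creating a request means writing into two documents...
new_request = {"request_id": 2, "timestamp": "2015-02-03"}
karen["requests"].append(new_request)
thing3["requests"].append(new_request)

# ...and completing it means removing it from both as well.
karen["requests"] = [r for r in karen["requests"] if r["request_id"] != 2]
thing3["requests"] = [r for r in thing3["requests"] if r["request_id"] != 2]
```

Every request touches two collections, and without multi-document transactions the two writes can be observed out of step.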

3) A Hybrid Approach

A hybrid approach is to store the data in a star schema and then materialize a report collection such as:

[
  {
    report_id: 1,
    report_date: ISODate("2015-02-02"),
report_description: "Thing_Request_Count",
    data: [
        {thing: "Thing1", request_count: 0},
        {thing: "Thing2", request_count: 1},
        {thing: "Thing3", request_count: 1},
        {thing: "Thing4", request_count: 1}
    ]
  },
  {
    report_id: 2,
    report_date: ISODate("2015-02-02"),
report_description: "Person_Request_Count",
    data: [
        {person: "Joe", request_count: 2},
        {person: "Bill", request_count: 0},
{person: "Karen", request_count: 1}
    ]
  }
]

This approach is appropriate when the count of requests per thing and person does not need to be real-time, since the report collection can be materialized on a schedule (e.g. once per hour or once per day). If this type of query needs to be accessible in real time or near real time, then you may be better off using a relational data store for OLAP queries, or pursuing a selective de-normalization and/or application-level aggregation strategy.
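The scheduled materialization job itself is just the application-level aggregation from approach 1, run once and written back as a report document. A sketch with in-memory stand-ins (the names and data are illustrative; against MongoDB the counting could instead be done with an aggregation pipeline):

```python
from collections import Counter
from datetime import date

# Illustrative stand-ins for the star-schema collections.
people = [
    {"id": 1, "name": "Joe"},
    {"id": 2, "name": "Bill"},
    {"id": 3, "name": "Karen"},
]
requests = [{"person_id": 1}, {"person_id": 1}, {"person_id": 3}]

def materialize_person_report(report_id):
    """Build one report document of request counts per person."""
    counts = Counter(r["person_id"] for r in requests)
    return {
        "report_id": report_id,
        "report_date": date.today().isoformat(),
        "report_description": "Person_Request_Count",
        "data": [
            {"person": p["name"], "request_count": counts.get(p["id"], 0)}
            for p in people
        ],
    }

# A scheduler (cron, a queue worker, etc.) would run this periodically
# and upsert the result into the report collection.
report = materialize_person_report(2)
```

The dashboard then reads the pre-computed report document directly, so its latency is independent of the size of the request collection; the cost is that the counts are only as fresh as the last materialization run.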