Clustering

Even though it's possible to run ViyaDB on a single node, it's usually required to run several database instances in order to be able to:

  • Scale in case aggregated dataset doesn't fit into a single machine's RAM.
  • Scale out concurrent queries by spreading them between different workers.
  • Failover on data center failure.

This page describes how ViyaDB works when running in a cluster mode.

Architecture

ViyaDB implements shared nothing architecture. That means that the data is partitioned and spread between independent workers.

Clustering

Components

Consul

Consul is used as a coordination service as well as for storing cluster configuration.

There were three different options for coordination service at the beginning: Zookeeper, Consul and etcd, but the choice was made in favor of Consul for its:

  • Installation and configuration simplicity.
  • Great Web interface.
  • Out of the box service registration and lookup support.

Worker

Worker is a ViyaDB process that keeps a partition of data in RAM. Every such process is attached to it's own CPU core (using CPU affinity) for achieving better cache locality.

Every worker listens on its own HTTP port, where it provides a service for loading and querying the data it keeps.

Controller

Controller is the main process that launches ViyaDB workers on a single node, and takes care of keeping them alive. In addition, this process is responsible for coordination between database nodes using Consul. One of cluster controllers becomes a leader.

Controller can act as query aggregator, which means that it can be queried directly, and a query will be sent to workers containing required data partitions, then the result will be combined and returned back to a user.

Communication with workers is performed using HTTP API.

Running

Let's run a cluster of ViyaDB nodes, and load 1 billion of user events into it.

Consul configuration

First, you must have Consul installed somewhere.

Create the following two keys in Consul, which represent ViyaDB cluster configuration:

viyadb/clusters/cluster001/config

{
  "replication_factor": 1,
  "workers": 32,
  "tables": ["events"]
}

viyadb/tables/events/config

{
  "name": "events",
  "dimensions": [
    {"name": "app_id"},
    {"name": "user_id", "type": "uint"},
    {
      "name": "event_time", "type": "time",
      "format": "millis", "granularity": "day"
    },
    {"name": "country"},
    {"name": "city"},
    {"name": "device_type"},
    {"name": "device_vendor"},
    {"name": "ad_network"},
    {"name": "campaign"},
    {"name": "site_id"},
    {"name": "event_type"},
    {"name": "event_name"},
    {"name": "organic", "cardinality": 2},
    {"name": "days_from_install", "type": "ushort"}
  ],
  "metrics": [
    {"name": "revenue" , "type": "double_sum"},
    {"name": "users", "type": "bitset", "field": "user_id", "max": 4294967295},
    {"name": "count" , "type": "count"}
  ],
  "partitioning": {
    "columns": [
      "app_id"
    ]
  }
}

ViyaDB nodes

Then, let's launch four AWS EC2 instances of type r4.2xlarge. This instance type has 8 vCPU, which will allow us to have 8 workers per host (32 workers in total). This will translate into having 32 partitions if using a single replica per partition.

ViyaDB is built against g++-7, that's why it's easier to use one of Ubuntu Artful (17.10) AMI provided by Canonical. Make sure, all TCP ports of ViyaDB nodes are open to each other. Also, make sure there’s enough disk space for downloading and extracting several tens of gigabytes of JSON files representing user events.

Once all four machines are running, proceed with the following steps on every one of the nodes.

Install g++-7 by running the following commands:

apt-get update
apt-get install g++-7
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-7 60 --slave /usr/bin/g++ g++ /usr/bin/g++-7

Create file /etc/viyadb/store.json with the following contents (replace consul-host with real Consul hostname):

{
  "supervise": true,
  "workers": 8,
  "cluster_id": "cluster001",
  "consul_url": "http://consul-host:8500",
  "state_dir": "/var/lib/viyadb"
}

Download ViyaDB package, and extract it:

curl -sS -L https://github.com/viyadb/viyadb/releases/download/v1.0.0-beta/viyadb-1.0.0-beta-linux-x86_64.tgz | tar -zxf -

Run ViyaDB instance:

cd /opt/viyadb-1.0.0-beta-linux-x86_64
./bin/viyadb /etc/viyadb/store.json

Loading data

There's utility under bin/ directory of ViyaDB package called vsql, which can be used for sending SQL queries to ViyaDB cluster.

Open the SQL shell utility, and connect it to one of ViyaDB hosts:

/opt/viyadb-1.0.0-beta-linux-x86_64/bin/vsql viyadb-host 5555

We have 1 billion of user events stored in gzipped TSV format in AWS S3. To load them, issue the following query:

ViyaDB> COPY events (app_id, user_id, event_time, country, city, device_type, device_vendor, ad_network, campaign, site_id, event_type, event_name, organic, days_from_install, revenue, count) FROM 's3://viyadb-test/events/input/'

Querying

Disclaimer: all queries were recorded when running for the second time to exclude query compilation time from the total query running time.

Count different event types for a given list of applications:

ViyaDB> SELECT event_type,count FROM events WHERE app_id IN ('com.skype.raider', 'com.adobe.psmobile', 'com.dropbox.android', 'com.ego360.flatstomach') ORDER BY event_type DESC

event_type  count
----------  -----
session     117927202
install     263444
inappevent  200466796
impression  58431
click       297

Time taken: 0.530 secs

Find 5 top cities in USA by install count for a specific application:

ViyaDB> SELECT city,count FROM events WHERE app_id='com.dropbox.android' AND country='US' AND event_type='install' ORDER BY count DESC LIMIT 5;

city           count
----           -----
Clinton        28
Farmington     20
Madison        18
Oxford         18
Highland Park  18

Time taken: 0.171 secs

Find top 10 ad networks for a specific application:

ViyaDB> SELECT ad_network, count FROM events WHERE app_id='kik.android' AND event_type='install' AND ad_network <> '' ORDER BY count DESC LIMIT 10;

ad_network    count
----------    -----
Twitter       1257
Facebook      1089
Google        904
jiva_network  96
yieldlove     79
i-movad       66
barons_media  57
cpmob         50
branovate     35
somimedia     34

Time taken: 0.089 secs

Find user retention (number of distinct user sessions) for the first week, per three major ad networks:

ViyaDB> SELECT ad_network, days_from_install, users FROM events WHERE app_id='kik.android' AND event_time BETWEEN '2015-01-01' AND '2015-01-31' AND event_type='session' AND ad_network IN('Facebook', 'Google', 'Twitter') AND days_from_install < 8 ORDER BY days_from_install, users DESC

ad_network  days_from_install  users
----------  -----------------  -----
Twitter     0                  33
Google      0                  20
Facebook    0                  14
Twitter     1                  31
Google      1                  18
Facebook    1                  13
Twitter     2                  30
Google      2                  17
Facebook    2                  12
Twitter     3                  29
Google      3                  14
Facebook    3                  11
Twitter     4                  27
Google      4                  13
Facebook    4                  10
Twitter     5                  26
Google      5                  13
Facebook    5                  10
Twitter     6                  23
Google      6                  12
Facebook    6                  9
Twitter     7                  22
Google      7                  12
Facebook    7                  9

Time taken: 0.033 secs

Real-Time Data Processing

To see the complete real-time data processing architecture based on ViyaDB cluster, please visit next section.