Usage

The following section describes how to configure and run a ViyaDB instance on a single node.

General

Interaction with ViyaDB instance is performed using REST API. Sometimes, it doesn't look like REST (See Data Ingestion or Querying sections below), but it can always be thought as a resource you're sending a request to is an action itself.

Configuring DB Instance

This section explains how to configure single instance of ViyaDB. This is not needed when running in a clustered environment.

Store descriptor preresents a configuration file of a single ViyaDB instance. The format is the following:

{
  "query_threads": 1,
  "cpu": [ ... ],
  "workers": 1,
  "http_port": 5000,
  "tables": [ ... ],
  "statsd": {
    "host":  ... ,
    "port": 8125,
    "prefix": "viyadb.%h."
  },
  "supervise": false
}

Parameters:

  • query_threads - Optional number of threads that will serve queries, defaults to 1.
  • cpu_list - Optional list of zero-based CPU indices that this process will use, defaults to all available CPUs.
  • workers - Optional number of workers to start, defaults to the number of available CPUs.
  • port - All workers will be assigned a different port number, starting from this one (defaults to 5000).
  • tables - List of table descriptors.
  • statsd - If specified, some metrics will be reported to the given Statsd host.
  • supervise - Optionally, run a supervisor on top of worker processes. This setting must be true if number of workers is greater than 1.

Creating Tables

Table descriptors can be either a part of store descriptor, or they can be created on-demand using REST API.

{
  "name": "<table name>",
  "dimensions": [ ... ],
  "metrics": [ ... ],
  "watch": {
    "directory": "<path to files>",
    "extensions": [".tsv"]
  }
}

Parameters:

  • name - Table name
  • dimensions - List of dimension descriptors
  • metrics - List of metric descriptors
  • watch - Optional configuration that enables watching directory for new files, and loading them automatically.

To create a table, issue the following command:

curl --data-binary @table.json http://<viyadb-host>:<viyadb-port>/tables

Explanation in SQL terms

An analogy can be drawn between ViyaDB tables and SQL aggregation queries. For example, consider the following SQL statement:

sql SELECT app_id, SUM(revenue) FROM events GROUP BY app_id

This example translates to ViyaDB table events that has a single dimension app_id and a single metric revenue.

Dimensions

There are four types of dimensions:

  • String
  • Numeric
  • Boolean
  • Time
  • Microtime

String Dimension

String dimension is a basic one, and it's used to describe things like: country, user agent, event name, etc. Description format is as follows:

{
  "name": "<dimension name>",
  "field": "<input field name>",
  "type": "string",
  "length": ... ,
  "cardinality": ... ,
  "cardinality_guard": {
    "dimensions": ["<other dimension>", ... ],
    "limit":  ... 
  }
}

Parameters:

  • name - Column name
  • field - Use alternative input field name (defaults to the column name itself)
  • type - Must be string (or can be omitted, since it's default)
  • length - Optionally, specify maximum length for a column value (values exceeding this limit will be stripped).
  • cardinality - Number of distinct values this column holds (optional, but it's recommended to set)

cardinality_guard allows defining a rule of how many distinct values is it possible to store per set of other dimensions. For example, we can decide to store at most 200 distinct event names per event date, per country. All other events will be accounted still, but they will be marked as __exceeded. This is really important option, especially when incoming events are not controlled by yourself, and you don't want your database memory to explode because someone decided to sent some random values.

Cardinality protection

Cardinality protection is built into ViyaDB, which basically means that you can (and should) define the maximum number of distinct elements of any given dimension. This not only allows for filtering out irrelevant values (while still keeping record of their metrics as "Other"), but also makes possible doing optimizations that improve database performance and save memory.

Dimension cardinality can be applied either on a dimension independently or based on a set of other dimensions. For instance, you can disallow more than 100 different event names coming from a single mobile application per single day.

Numeric Dimension

This dimension allows to store numbers as a non-metric column. Column description format is the following:

{
  "name": "<dimension name>",
  "field": "<input field name>",
  "type": "<type>"
}

Parameters:

  • name - Column name
  • field - Use alternative input field name (defaults to the column name itself)
  • type - Numeric type (see below)

Supported numeric types are:

  • byte (-128 to 127)
  • ubyte (0 to 255)
  • short (-32768 to 32767)
  • ushort (0 to 65535)
  • int (-2147483648 to 2147483647)
  • uint (0 to 4294967295)
  • long (-9223372036854775808 to 9223372036854775807)
  • ulong (0 to 18446744073709551615)
  • float (floating point 32 bit number)
  • double (floating point 64 bit number)

Deprecation note

Previously, only uint and ulong types were supported through numeric and max specifications. You should upgrade to the new numeric types if you're still using the old way.

Boolean Dimension

Stores either boolean value of 'true' or 'false'.

{
  "name": "<dimension name>",
  "field": "<input field name>",
  "type": "boolean"
}

Parameters:

  • name - Column name
  • field - Use alternative input field name (defaults to the column name itself)
  • type - Must be boolean.

Time and Microtime Dimensions

This dimension allows to store UTC time. The difference between the two is that time precision is up to seconds, while microtime precision is up to microseconds.

{
  "name": "<dimension name>",
  "field": "<input field name>",
  "type": "time|microtime",
  "format":  ... ,
  "granularity": "<time unit>",
  "rollup_rules": [ ... ]
}

Parameters:

  • name - Column name
  • field - Use alternative input field name (defaults to the column name itself)
  • type - The type is set according to the required precision
  • format - See below
  • granularity - When specified, this time granularity will be used for rolling up events during data ingestion
  • rollup_rules - Rules for dynamic period based rollup

Parse format is one of the following:

  • posix - Input time is in POSIX time format (seconds since UNIX epoch)
  • millis - Input time is in milliseconds since UNIX epoch
  • micros - Input time is in microseconds since UNIX epoch
  • Any other string that uses strptime modifiers for describing format

Supported time units:

  • year
  • month
  • week
  • day
  • hour
  • minute
  • second

Dynamic rollup rules are defined using the following format:

{
  "after": "<num> <time unit>"
  "granularity": "<time unit>"
}

For example, if the rules are:

[{
   "after": "3 month",
   "granularity": "week"
 }, {
   "after": "1 year",
   "granularity": "month"
}]

Then events time column will change granularity dynamically to weekly after 3 months, to monthly after 1 year.

Metrics

There are three supported metric types:

  • Value
  • Count
  • Bitset

Value Metric

Value metric is a numeric value combined with an aggregation function. List of supported numeric types:

  • byte (-128 to 127)
  • ubyte (0 to 255)
  • short (-32768 to 32767)
  • ushort (0 to 65535)
  • int (-2147483648 to 2147483647)
  • uint (0 to 4294967295)
  • long (-9223372036854775808 to 9223372036854775807)
  • ulong (0 to 18446744073709551615)
  • float (floating point 32 bit number)
  • double (floating point 64 bit number)

Supported functions:

  • sum
  • max
  • min
  • avg

The format of defining a value metric is the following:

{
  "name": "<metric name>",
  "field": "<input field name>",
  "type": "<value_type>_<function>"
}

Count Metric

This type of metric just counts number of incoming rows. To define it, use the following format:

{
  "name": "<metric name>",
  "field": "<input field name>",
  "type": "count"
}

Bitset Metric

This metric allows storing numeric values in a memory optimized set structure, which supports operations like intersect and union. This makes possible to run queries like: "what is a value cardinality for a given set of dimensions filtered by a perdicate?". For instance: counting unique mobile app users, which installed an application on last month split by country.

Metric format:

{
  "name": "<metric name>",
  "field": "<input field name>",
  "type": "bitset"
}

Data Ingestion

Loading from TSV files

Save load descriptor into load.json file:

{
  "table": "target_table",
  "format": "tsv",
  "type": "file",
  "columns": [...],
  "file": "/path/to/data.tsv"
}

Post this load descriptor to a running ViyaDB instance:

curl --data-binary @load.json http://<viyadb-host>:<viyadb-port>/load

Important notes:

  • The data.tsv file must be accessible to the ViyaDB instance you're loading the data into.
  • If columns parameter is not provided, order of columns in .tsv file must be as follows: first dimensions, then metrics as they appear in the table descriptor.
  • columns parameter must be used whenever one of table column defines field attribute.

Querying

Supported query types are:

  • Aggregate
  • Search
  • Select

To query, a query request must be submitted using POST method:

curl --data-binary @query.json http://<viyadb-host>:<viyadb-port>/query

Read further to learn more on different query types.

Query compilation

The first time a query is introduced to ViyaDB, a highly optimized C++ source code is generated, and the compiled version of this code is used for physical data access. That allows to minimize branch mis-predictions, increase the level of CPU cache locality, etc. That means that the first execution of a new query type will take some extra time needed for compiling it, all the subsequent queries of the same type will use already compiled version.

Aggregate Query

This kind of query aggregates records selected using filter predicate, and returns them to user (optionally, sorted and/or limited). It's important to know that aggregation is done in a memory, therefore all the result set must fit in.

Query format:

{
  "type": "aggregate",
  "table": "<table name>",
  "select": [ ... ],
  "filter":  ... ,
  "having": ...,
  "sort": [ ... ] ,
  "skip": 0,
  "limit": 0
}

Parameters:

  • table - Table name
  • select - List of parameters describing how to select a column (see below)
  • filter - Filter description (see below)
  • having - Post-aggregation filter description (similar to SQL HAVING clause). The format is the same as in filter attribute.
  • sort - Optional result sorting configuration (see below)
  • skip - Optionally, skip this number of output records
  • limit - Optionally, limit result set size to this number

Column Selector

Column is either dimension or metric. The format of selecting either of them is the following:

{
  "column": "<column name>",
  "format": "<time format>",
  "granularity": "<time granularity>"
}

Parameters:

  • column - Dimension or metric name

Time column has two additional optional parameters:

  • format - Time output format (check strptime documentation for available modifiers). By default, UTC epoch timestamp will be sent
  • granularity - Rollup results by this time unit (see time dimension configuration for supported time units)

Query Filters

Filter is one of mandatory parameters in a query, which allows skipping irrelevant records. There are four different filter types.

Value Operator Filter
{
  "op": "<filter operator>",
  "column": "<column name>",
  "value": "<filter value>"
}

Parameters:

  • op - Filter kind specified by operator
  • column - Dimension or metric name
  • value - Value that filter operates on

Supported operators are:

  • eq - Equals
  • ne - Not equals
  • lt - Less than
  • le - Less or equals to
  • gt - Greater than
  • ge - Greater or equals to
Negation Filter
{
  "op": "not",
  "filter":  ... 
}

Parameters:

  • op - Must be not
  • filter - Other filter descriptor
Inclusion Filter
{
  "op": "in",
  "column": "<column name>",
  "values": ["value1", ... ]
}

Parameters:

  • op - Must be in
  • column - Dimension or metric name
  • values - List of values to filter on
Composite Filter
{
  "op": "<composition operator>",
  "filters": [ ... ]
}

Parameters:

  • op - One of composition operators: or, and
  • filters - List of other filters to compose
Filter Example

Below is an example of using different filter types:

{
  "type": "aggregate",
  "table": "activity",
  "select": [
    {"column": "install_date"},
    {"column": "ad_network"},
    {"column": "install_country"},
    {"column": "installs_count"},
    {"column": "launches_count"},
    {"column": "inapps_count"}
  ],
  "filter": {"op": "and", "filters": [
    {"op": "eq", "column": "app_id", "value": "com.teslacoilsw.notifier"},
    {"op": "ge", "column": "install_date", "value": "2015-01-01"},
    {"op": "lt", "column": "install_date", "value": "2015-01-30"},
    {"op": "in", "column": "install_country", "values": ["US", "IL", "RU"]}
  ]},
  "having": {"op": "gt", "column": "inapps_count", "value": "10"}
}

Sorting Results

You can ask to sort output results by set of columns, and specify sort order on each of them. Column sort configuration goes as follows:

{
   "column": "<column name>",
   "ascending": true|false
}

ascending parameter can be ommitted, in this case the sort order will be descending.

SQL Syntax

SQL syntax for querying aggregations looks similar to standard SQL query with only distinction that there's no need to specify GROUP BY clause, because all metrics are pre-aggregated already:

SELECT <dimension1>, ... <metric1>, ...
FROM <table>
[WHERE <condition>]
[HAVING <condition>]
[ORDER BY <column1> [ASC|DESC], ...]
[LIMIT <number>, <number> | LIMIT <number> OFFSET <number>]

For example, the following query finds all event types number for two mobile applications for period of 2017 January, where total number of a single event type is greater than 100:

SELECT
  event_type, count
FROM
  events
WHERE
  app_id IN ('com.skype.raider', 'com.dropbox.android') AND
  install_time BETWEEN '2017-01-01' AND '2017-01-31'
HAVING
  count > 100
ORDER BY
  event_type DESC

Search Query

This query type retrieves dimension values by a given set of filters. This feature can come in handy when implementing a field values type assist, when developing an analytics user interface filter.

The basic format is the following:

{
  "type": "search",
  "table": "<table name>",
  "dimension": "<dimension name>",
  "term": "<search term>",
  "limit": 0,
  "filter":  ...
}

SQL Syntax

SQL syntax for the Search Query is:

SELECT SEARCH(<dimension>, <term>)
FROM <table>
[WHERE <condition>]
[LIMIT <number>, <number> | LIMIT <number> OFFSET <number>]

For example, the following query looks up for all event name types starting from order_, where revenue of a single event is greater than $1:

SELECT
  SEARCH(event_name, 'order_')
FROM
  events
WHERE
  revenue > 1.0
LIMIT 10

Select Query

This query is useful for viewing raw data as it's stored in ViyaDB without the post-aggregation step. Please note that there is a pre-aggregation step during data ingestion, which cannot be reverted.

To run the query use the following syntax, which is similar to the Aggregate Query syntax:

{
  "type": "aggregate",
  "table": "<table name>",
  "select": [ ... ],
  "filter":  ... ,
  "skip": 0,
  "limit": 0
}

SQL Syntax

SQL syntax for the Select Query is:

SELECT RAW(<dimension1>, ... <metric1>, ...)
FROM <table>
[WHERE <condition>]
[LIMIT <number>, <number> | LIMIT <number> OFFSET <number>]

For example, the following query returns first 100 pre-aggregated rows stored in ViyaDB table events:

SELECT RAW('*') FROM events LIMIT 100

Metadata Queries

There are several endpoints that can be used for retreiving information about database status.

Database Metadata Query

The following query returns basic information about the database:

curl http://<viyadb-host>:<viyadb-port>/database/meta

Table Metadata Query

To see all available table fields, their types as long as some basic statistics use the following query:

curl http://<viyadb-host>:<viyadb-port>/tables/<table name>/meta