Usage¶
The following section describes how to configure and run a ViyaDB instance on a single node.
General¶
Interaction with a ViyaDB instance is performed through its REST API. Sometimes it doesn't look like REST (see the Data Ingestion and Querying sections below), but every request can be thought of as being sent to a resource, even when that resource represents an action.
Configuring DB Instance¶
This section explains how to configure a single instance of ViyaDB. This is not needed when running in a clustered environment.
A store descriptor represents the configuration file of a single ViyaDB instance. The format is the following:
{ "query_threads": 1, "cpu": [ ... ], "workers": 1, "http_port": 5000, "tables": [ ... ], "statsd": { "host": ... , "port": 8125, "prefix": "viyadb.%h." }, "supervise": false }
Parameters:
- query_threads - Optional number of threads that will serve queries (defaults to 1)
- cpu - Optional list of zero-based CPU indices that this process will use (defaults to all available CPUs)
- workers - Optional number of workers to start (defaults to the number of available CPUs)
- http_port - All workers will be assigned a different port number, starting from this one (defaults to 5000)
- tables - List of table descriptors
- statsd - If specified, some metrics will be reported to the given StatsD host
- supervise - Optionally, run a supervisor on top of the worker processes. This setting must be `true` if the number of workers is greater than 1.
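As an illustration, here is a minimal sketch of a store descriptor for a two-worker setup (the table name `events` and the StatsD settings are hypothetical; note that `supervise` is `true` because there is more than one worker):

```json
{
  "query_threads": 2,
  "workers": 2,
  "http_port": 5000,
  "tables": [
    { "name": "events", "dimensions": [ ... ], "metrics": [ ... ] }
  ],
  "statsd": { "host": "localhost", "port": 8125, "prefix": "viyadb.%h." },
  "supervise": true
}
```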
Creating Tables¶
Table descriptors can either be part of the store descriptor, or they can be created on demand using the REST API.
{ "name": "<table name>", "dimensions": [ ... ], "metrics": [ ... ], "watch": { "directory": "<path to files>", "extensions": [".tsv"] } }
Parameters:
- name - Table name
- dimensions - List of dimension descriptors
- metrics - List of metric descriptors
- watch - Optional configuration that enables watching a directory for new files and loading them automatically.
To create a table, issue the following command:
```bash
curl --data-binary @table.json http://<viyadb-host>:<viyadb-port>/tables
```
Explanation in SQL terms
An analogy can be drawn between ViyaDB tables and SQL aggregation queries. For example, consider the following SQL statement:
```sql
SELECT app_id, SUM(revenue) FROM events GROUP BY app_id
```

This example translates to a ViyaDB table `events` that has a single dimension `app_id` and a single metric `revenue`.
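Sketching the descriptor for this table using the formats described below, it might look like this (the `double_sum` metric type combines the `double` value type with the `sum` function, as explained in the Metrics section):

```json
{
  "name": "events",
  "dimensions": [
    { "name": "app_id" }
  ],
  "metrics": [
    { "name": "revenue", "type": "double_sum" }
  ]
}
```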
Dimensions¶
There are five types of dimensions:
- String
- Numeric
- Boolean
- Time
- Microtime
String Dimension¶
The string dimension is the most basic one; it's used to describe things like country, user agent, event name, etc. The descriptor format is as follows:
{ "name": "<dimension name>", "field": "<input field name>", "type": "string", "length": ... , "cardinality": ... , "cardinality_guard": { "dimensions": ["<other dimension>", ... ], "limit": ... } }
Parameters:
- name - Column name
- field - Use alternative input field name (defaults to the column name itself)
- type - Must be `string` (or can be omitted, since it's the default)
- length - Optionally, specify the maximum length of a column value (values exceeding this limit will be truncated)
- cardinality - Number of distinct values this column holds (optional, but recommended to set)
The `cardinality_guard` parameter allows defining a rule for how many distinct values can be stored per set of other dimensions. For example, we can decide to store at most 200 distinct event names per event date, per country. All other events will still be accounted for, but they will be marked as `__exceeded`. This is a really important option, especially when the incoming events are not under your control, and you don't want your database memory to explode because someone decided to send some random values.
Cardinality protection
Cardinality protection is built into ViyaDB, which basically means that you can (and should) define the maximum number of distinct values of any given dimension. This not only allows filtering out irrelevant values (while still keeping a record of their metrics as "Other"), but also makes it possible to perform optimizations that improve database performance and save memory.
Dimension cardinality can be applied either on a dimension independently or based on a set of other dimensions. For instance, you can disallow more than 100 different event names coming from a single mobile application per single day.
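A string dimension descriptor implementing such a rule might be sketched as follows (the dimension names `event_name`, `app_id` and `event_date` are hypothetical):

```json
{
  "name": "event_name",
  "type": "string",
  "cardinality": 1000,
  "cardinality_guard": {
    "dimensions": ["app_id", "event_date"],
    "limit": 100
  }
}
```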
Numeric Dimension¶
This dimension allows storing numbers as a non-metric column. The column descriptor format is the following:
{ "name": "<dimension name>", "field": "<input field name>", "type": "<type>" }
Parameters:
- name - Column name
- field - Use alternative input field name (defaults to the column name itself)
- type - Numeric type (see below)
Supported numeric types are:
- byte (-128 to 127)
- ubyte (0 to 255)
- short (-32768 to 32767)
- ushort (0 to 65535)
- int (-2147483648 to 2147483647)
- uint (0 to 4294967295)
- long (-9223372036854775808 to 9223372036854775807)
- ulong (0 to 18446744073709551615)
- float (floating point 32 bit number)
- double (floating point 64 bit number)
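For instance, a hypothetical `hour_of_day` column only ever holds values 0 through 23, so the compact `ubyte` type is a natural fit:

```json
{
  "name": "hour_of_day",
  "type": "ubyte"
}
```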
Deprecation note
Previously, only `uint` and `ulong` types were supported, through the `numeric` and `max` specifications. You should upgrade to the new numeric types if you're still using the old way.
Boolean Dimension¶
Stores a boolean value of either `true` or `false`.
{ "name": "<dimension name>", "field": "<input field name>", "type": "boolean" }
Parameters:
- name - Column name
- field - Use alternative input field name (defaults to the column name itself)
- type - Must be `boolean`
Time and Microtime Dimensions¶
These dimensions allow storing UTC time. The difference between the two is that `time` precision is up to seconds, while `microtime` precision is up to microseconds.
{ "name": "<dimension name>", "field": "<input field name>", "type": "time|microtime", "format": ... , "granularity": "<time unit>", "rollup_rules": [ ... ] }
Parameters:
- name - Column name
- field - Use alternative input field name (defaults to the column name itself)
- type - The type is set according to the required precision
- format - See below
- granularity - When specified, this time granularity will be used for rolling up events during data ingestion
- rollup_rules - Rules for dynamic period based rollup
The parse format is one of the following:
- posix - Input time is in POSIX time format (seconds since UNIX epoch)
- millis - Input time is in milliseconds since UNIX epoch
- micros - Input time is in microseconds since UNIX epoch
- Any other string using strptime modifiers to describe the format
Supported time units:
- year
- month
- week
- day
- hour
- minute
- second
Dynamic rollup rules are defined using the following format:

```json
{
  "after": "<num> <time unit>",
  "granularity": "<time unit>"
}
```
For example, given the following rules:

```json
[
  { "after": "3 month", "granularity": "week" },
  { "after": "1 year", "granularity": "month" }
]
```

the time column of events will dynamically change its granularity to weekly after 3 months, and to monthly after 1 year.
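Putting the pieces together, a sketch of a `time` dimension with hourly rollup during ingestion and the dynamic rollup rules from the example above might look like this (the column name and input format are hypothetical):

```json
{
  "name": "install_time",
  "type": "time",
  "format": "%Y-%m-%d %H:%M:%S",
  "granularity": "hour",
  "rollup_rules": [
    { "after": "3 month", "granularity": "week" },
    { "after": "1 year", "granularity": "month" }
  ]
}
```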
Metrics¶
There are three supported metric types:
- Value
- Count
- Bitset
Value Metric¶
Value metric is a numeric value combined with an aggregation function. List of supported numeric types:
- byte (-128 to 127)
- ubyte (0 to 255)
- short (-32768 to 32767)
- ushort (0 to 65535)
- int (-2147483648 to 2147483647)
- uint (0 to 4294967295)
- long (-9223372036854775808 to 9223372036854775807)
- ulong (0 to 18446744073709551615)
- float (floating point 32 bit number)
- double (floating point 64 bit number)
Supported functions:
- sum
- max
- min
- avg
The format of defining a value metric is the following:
{ "name": "<metric name>", "field": "<input field name>", "type": "<value_type>_<function>" }
Count Metric¶
This type of metric simply counts the number of incoming rows. To define it, use the following format:
{ "name": "<metric name>", "field": "<input field name>", "type": "count" }
Bitset Metric¶
This metric allows storing numeric values in a memory-optimized set structure, which supports operations like `intersect` and `union`. This makes it possible to run queries like: "what is the cardinality of a value for a given set of dimensions filtered by a predicate?". For instance: counting unique mobile app users that installed an application during the last month, split by country.
Metric format:
{ "name": "<metric name>", "field": "<input field name>", "type": "bitset" }
Data Ingestion¶
Loading from TSV files¶
Save the load descriptor into a `load.json` file:
{ "table": "target_table", "format": "tsv", "type": "file", "columns": [...], "file": "/path/to/data.tsv" }
Post this load descriptor to a running ViyaDB instance:
```bash
curl --data-binary @load.json http://<viyadb-host>:<viyadb-port>/load
```
Important notes:

- The `data.tsv` file must be accessible to the ViyaDB instance you're loading the data into.
- If the `columns` parameter is not provided, the order of columns in the .tsv file must be as follows: first dimensions, then metrics, as they appear in the table descriptor. The `columns` parameter must be used whenever one of the table columns defines the `field` attribute.
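As an illustration, a load descriptor with an explicit column order might be sketched as follows (the table and column names are hypothetical):

```json
{
  "table": "events",
  "format": "tsv",
  "type": "file",
  "columns": ["app_id", "event_name", "revenue"],
  "file": "/path/to/data.tsv"
}
```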
Querying¶
Supported query types are:
- Aggregate
- Search
- Select
To query, submit a query request using the POST method:

```bash
curl --data-binary @query.json http://<viyadb-host>:<viyadb-port>/query
```
Read further to learn more about the different query types.
Query compilation
The first time a query is introduced to ViyaDB, highly optimized C++ source code is generated for it, and the compiled version of this code is used for physical data access. This minimizes branch mispredictions, increases CPU cache locality, and so on. It also means that the first execution of a new query type takes some extra time needed for compilation; all subsequent queries of the same type use the already compiled version.
Aggregate Query¶
This kind of query aggregates records selected by a filter predicate, and returns them to the user (optionally sorted and/or limited). It's important to know that aggregation is done in memory, so the entire result set must fit into it.
Query format:
{ "type": "aggregate", "table": "<table name>", "select": [ ... ], "filter": ... , "having": ..., "sort": [ ... ] , "skip": 0, "limit": 0 }
Parameters:
- table - Table name
- select - List of parameters describing how to select a column (see below)
- filter - Filter description (see below)
- having - Post-aggregation filter description (similar to the SQL HAVING clause). The format is the same as for the `filter` attribute.
- sort - Optional result sorting configuration (see below)
- skip - Optionally, skip this number of output records
- limit - Optionally, limit result set size to this number
Column Selector¶
A column is either a dimension or a metric. The format for selecting either of them is the following:

```json
{
  "column": "<column name>",
  "format": "<time format>",
  "granularity": "<time granularity>"
}
```
Parameters:
- column - Dimension or metric name
A time column has two additional optional parameters:
- format - Time output format (check the strptime documentation for available modifiers). By default, a UTC epoch timestamp is sent.
- granularity - Roll up results by this time unit (see the time dimension configuration for supported time units)
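For example, a time column could be selected with daily granularity and formatted output like this (the column name is hypothetical):

```json
{
  "column": "install_time",
  "format": "%Y-%m-%d",
  "granularity": "day"
}
```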
Query Filters¶
The filter is one of the mandatory parameters in a query; it allows skipping irrelevant records. There are four different filter types.
Value Operator Filter¶
{ "op": "<filter operator>", "column": "<column name>", "value": "<filter value>" }
Parameters:
- op - Filter kind specified by operator
- column - Dimension or metric name
- value - Value that filter operates on
Supported operators are:
- eq - Equals
- ne - Not equals
- lt - Less than
- le - Less or equals to
- gt - Greater than
- ge - Greater or equals to
Negation Filter¶
{ "op": "not", "filter": ... }
Parameters:
- op - Must be `not`
- filter - Other filter descriptor
Inclusion Filter¶
{ "op": "in", "column": "<column name>", "values": ["value1", ... ] }
Parameters:
- op - Must be `in`
- column - Dimension or metric name
- values - List of values to filter on
Composite Filter¶
{ "op": "<composition operator>", "filters": [ ... ] }
Parameters:
- op - One of the composition operators: `or`, `and`
- filters - List of other filters to compose
Filter Example¶
Below is an example of using different filter types:
{ "type": "aggregate", "table": "activity", "select": [ {"column": "install_date"}, {"column": "ad_network"}, {"column": "install_country"}, {"column": "installs_count"}, {"column": "launches_count"}, {"column": "inapps_count"} ], "filter": {"op": "and", "filters": [ {"op": "eq", "column": "app_id", "value": "com.teslacoilsw.notifier"}, {"op": "ge", "column": "install_date", "value": "2015-01-01"}, {"op": "lt", "column": "install_date", "value": "2015-01-30"}, {"op": "in", "column": "install_country", "values": ["US", "IL", "RU"]} ]}, "having": {"op": "gt", "column": "inapps_count", "value": "10"} }
Sorting Results¶
You can ask to sort output results by a set of columns, and specify the sort order for each of them. The column sort configuration goes as follows:

```json
{
  "column": "<column name>",
  "ascending": true|false
}
```

The `ascending` parameter can be omitted, in which case the sort order will be descending.
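For example, sorting by descending install count and then by ascending country might be sketched as follows (the column names are hypothetical):

```json
[
  { "column": "installs_count", "ascending": false },
  { "column": "install_country", "ascending": true }
]
```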
SQL Syntax¶
The SQL syntax for querying aggregations looks similar to a standard SQL query, with the only distinction that there's no need to specify a `GROUP BY` clause, because all metrics are already pre-aggregated:

```sql
SELECT <dimension1>, ... <metric1>, ...
FROM <table>
[WHERE <condition>]
[HAVING <condition>]
[ORDER BY <column1> [ASC|DESC], ...]
[LIMIT <number>, <number> | LIMIT <number> OFFSET <number>]
```
For example, the following query finds the number of events of each type for two mobile applications during January 2017, keeping only event types whose total count is greater than 100:

```sql
SELECT event_type, count FROM events
WHERE app_id IN ('com.skype.raider', 'com.dropbox.android')
AND install_time BETWEEN '2017-01-01' AND '2017-01-31'
HAVING count > 100
ORDER BY event_type DESC
```
Search Query¶
This query type retrieves dimension values matching a given set of filters. This feature can come in handy for implementing type-ahead assistance on field values when developing filters for an analytics user interface.
The basic format is the following:
{ "type": "search", "table": "<table name>", "dimension": "<dimension name>", "term": "<search term>", "limit": 0, "filter": ... }
SQL Syntax¶
The SQL syntax for the Search Query is:

```sql
SELECT SEARCH(<dimension>, <term>) FROM <table>
[WHERE <condition>]
[LIMIT <number>, <number> | LIMIT <number> OFFSET <number>]
```

For example, the following query looks up all event names starting with `order_`, where the revenue of a single event is greater than $1:

```sql
SELECT SEARCH(event_name, 'order_') FROM events WHERE revenue > 1.0 LIMIT 10
```
Select Query¶
This query is useful for viewing raw data as it's stored in ViyaDB without the post-aggregation step. Please note that there is a pre-aggregation step during data ingestion, which cannot be reverted.
To run this query, use the following syntax, which is similar to the Aggregate Query syntax:

```json
{
  "type": "select",
  "table": "<table name>",
  "select": [ ... ],
  "filter": ... ,
  "skip": 0,
  "limit": 0
}
```
SQL Syntax¶
The SQL syntax for the Select Query is:

```sql
SELECT RAW(<dimension1>, ... <metric1>, ...) FROM <table>
[WHERE <condition>]
[LIMIT <number>, <number> | LIMIT <number> OFFSET <number>]
```
For example, the following query returns the first 100 pre-aggregated rows stored in the ViyaDB table `events`:

```sql
SELECT RAW('*') FROM events LIMIT 100
```
Metadata Queries¶
There are several endpoints that can be used for retrieving information about database status.
Database Metadata Query¶
The following query returns basic information about the database:
```bash
curl http://<viyadb-host>:<viyadb-port>/database/meta
```
Table Metadata Query¶
To see all available table fields and their types, as well as some basic statistics, use the following query:

```bash
curl http://<viyadb-host>:<viyadb-port>/tables/<table name>/meta
```