Usage¶
The following section describes how to configure and run a ViyaDB instance on a single node.
General¶
Interaction with a ViyaDB instance is performed through its REST API. Sometimes it doesn't look like REST (see the Data Ingestion and Querying sections below), but every request can be thought of as being sent to a resource, even when that resource represents an action.
Configuring DB Instance¶
This section explains how to configure a single instance of ViyaDB. This is not needed when running in a clustered environment.
A store descriptor represents the configuration file of a single ViyaDB instance. The format is the following:
{ "query_threads": 1, "cpu": [ ... ], "workers": 1, "http_port": 5000, "tables": [ ... ], "statsd": { "host": ... , "port": 8125, "prefix": "viyadb.%h." }, "supervise": false }
Parameters:
- query_threads - Optional number of threads that will serve queries (defaults to 1)
- cpu - Optional list of zero-based CPU indices that this process will use (defaults to all available CPUs)
- workers - Optional number of workers to start (defaults to the number of available CPUs)
- http_port - All workers will be assigned a different port number, starting from this one (defaults to 5000)
- tables - List of table descriptors
- statsd - If specified, some metrics will be reported to the given StatsD host
- supervise - Optionally, run a supervisor on top of the worker processes. This setting must be `true` if the number of workers is greater than 1.
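As an illustration, here is a minimal sketch of a store descriptor for a two-worker setup (the table name `events` and the StatsD settings are hypothetical; note that `supervise` is `true` because there is more than one worker):

```json
{
  "query_threads": 2,
  "workers": 2,
  "http_port": 5000,
  "tables": [
    { "name": "events", "dimensions": [ ... ], "metrics": [ ... ] }
  ],
  "statsd": { "host": "localhost", "port": 8125, "prefix": "viyadb.%h." },
  "supervise": true
}
```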
Creating Tables¶
Table descriptors can either be part of the store descriptor, or they can be created on demand using the REST API.
{ "name": "<table name>", "dimensions": [ ... ], "metrics": [ ... ], "watch": { "directory": "<path to files>", "extensions": [".tsv"] } }
Parameters:
- name - Table name
- dimensions - List of dimension descriptors
- metrics - List of metric descriptors
- watch - Optional configuration that enables watching a directory for new files and loading them automatically.
To create a table, issue the following command:
```bash
curl --data-binary @table.json http://<viyadb-host>:<viyadb-port>/tables
```
Explanation in SQL terms
An analogy can be drawn between ViyaDB tables and SQL aggregation queries. For example, consider the following SQL statement:
```sql
SELECT app_id, SUM(revenue) FROM events GROUP BY app_id
```

This example translates to a ViyaDB table `events` that has a single dimension `app_id` and a single metric `revenue`.
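Sketching the descriptor for this table using the formats described below, it might look like this (the `double_sum` metric type combines the `double` value type with the `sum` function, as explained in the Metrics section):

```json
{
  "name": "events",
  "dimensions": [
    { "name": "app_id" }
  ],
  "metrics": [
    { "name": "revenue", "type": "double_sum" }
  ]
}
```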
Dimensions¶
There are five types of dimensions:
- String
- Numeric
- Boolean
- Time
- Microtime
String Dimension¶
The string dimension is the most basic one; it's used to describe things like country, user agent, event name, etc. The descriptor format is as follows:
{ "name": "<dimension name>", "field": "<input field name>", "type": "string", "length": ... , "cardinality": ... , "cardinality_guard": { "dimensions": ["<other dimension>", ... ], "limit": ... } }
Parameters:
- name - Column name
- field - Use alternative input field name (defaults to the column name itself)
- type - Must be `string` (or can be omitted, since it's the default)
- length - Optionally, specify the maximum length of a column value (values exceeding this limit will be truncated)
- cardinality - Number of distinct values this column holds (optional, but recommended to set)
The `cardinality_guard` parameter allows defining a rule for how many distinct values can be stored per set of other dimensions. For example, we can decide to store at most 200 distinct event names per event date, per country. All other events will still be accounted for, but they will be marked as `__exceeded`. This is a really important option, especially when the incoming events are not under your control, and you don't want your database memory to explode because someone decided to send some random values.
Cardinality protection
Cardinality protection is built into ViyaDB, which basically means that you can (and should) define the maximum number of distinct values of any given dimension. This not only allows filtering out irrelevant values (while still keeping a record of their metrics as "Other"), but also makes it possible to perform optimizations that improve database performance and save memory.
Dimension cardinality can be applied either on a dimension independently or based on a set of other dimensions. For instance, you can disallow more than 100 different event names coming from a single mobile application per single day.
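A string dimension descriptor implementing such a rule might be sketched as follows (the dimension names `event_name`, `app_id` and `event_date` are hypothetical):

```json
{
  "name": "event_name",
  "type": "string",
  "cardinality": 1000,
  "cardinality_guard": {
    "dimensions": ["app_id", "event_date"],
    "limit": 100
  }
}
```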
Numeric Dimension¶
This dimension allows storing numbers as a non-metric column. The column descriptor format is the following:
{ "name": "<dimension name>", "field": "<input field name>", "type": "<type>" }
Parameters:
- name - Column name
- field - Use alternative input field name (defaults to the column name itself)
- type - Numeric type (see below)
Supported numeric types are:
- byte (-128 to 127)
- ubyte (0 to 255)
- short (-32768 to 32767)
- ushort (0 to 65535)
- int (-2147483648 to 2147483647)
- uint (0 to 4294967295)
- long (-9223372036854775808 to 9223372036854775807)
- ulong (0 to 18446744073709551615)
- float (floating point 32 bit number)
- double (floating point 64 bit number)
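For instance, a hypothetical `hour_of_day` column only ever holds values 0 through 23, so the compact `ubyte` type is a natural fit:

```json
{
  "name": "hour_of_day",
  "type": "ubyte"
}
```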
Deprecation note
Previously, only `uint` and `ulong` types were supported, through the `numeric` and `max` specifications. You should upgrade to the new numeric types if you're still using the old way.
Boolean Dimension¶
Stores a boolean value of either `true` or `false`.
{ "name": "<dimension name>", "field": "<input field name>", "type": "boolean" }
Parameters:
- name - Column name
- field - Use alternative input field name (defaults to the column name itself)
- type - Must be `boolean`
Time and Microtime Dimensions¶
These dimensions allow storing UTC time. The difference between the two is that `time` precision is up to seconds, while `microtime` precision is up to microseconds.
{ "name": "<dimension name>", "field": "<input field name>", "type": "time|microtime", "format": ... , "granularity": "<time unit>", "rollup_rules": [ ... ] }
Parameters:
- name - Column name
- field - Use alternative input field name (defaults to the column name itself)
- type - The type is set according to the required precision
- format - See below
- granularity - When specified, this time granularity will be used for rolling up events during data ingestion
- rollup_rules - Rules for dynamic period based rollup
The parse format is one of the following:
- posix - Input time is in POSIX time format (seconds since UNIX epoch)
- millis - Input time is in milliseconds since UNIX epoch
- micros - Input time is in microseconds since UNIX epoch
- Any other string using strptime modifiers to describe the format
Supported time units:
- year
- month
- week
- day
- hour
- minute
- second
Dynamic rollup rules are defined using the following format:

```json
{
  "after": "<num> <time unit>",
  "granularity": "<time unit>"
}
```
For example, given the following rules:

```json
[
  { "after": "3 month", "granularity": "week" },
  { "after": "1 year", "granularity": "month" }
]
```

the time column of events will dynamically change its granularity to weekly after 3 months, and to monthly after 1 year.
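Putting the pieces together, a sketch of a `time` dimension with hourly rollup during ingestion and the dynamic rollup rules from the example above might look like this (the column name and input format are hypothetical):

```json
{
  "name": "install_time",
  "type": "time",
  "format": "%Y-%m-%d %H:%M:%S",
  "granularity": "hour",
  "rollup_rules": [
    { "after": "3 month", "granularity": "week" },
    { "after": "1 year", "granularity": "month" }
  ]
}
```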
Metrics¶
There are three supported metric types:
- Value
- Count
- Bitset
Value Metric¶
Value metric is a numeric value combined with an aggregation function. List of supported numeric types:
- byte (-128 to 127)
- ubyte (0 to 255)
- short (-32768 to 32767)
- ushort (0 to 65535)
- int (-2147483648 to 2147483647)
- uint (0 to 4294967295)
- long (-9223372036854775808 to 9223372036854775807)
- ulong (0 to 18446744073709551615)
- float (floating point 32 bit number)
- double (floating point 64 bit number)
Supported functions:
- sum
- max
- min
- avg
The format of defining a value metric is the following:
{ "name": "<metric name>", "field": "<input field name>", "type": "<value_type>_<function>" }
Count Metric¶
This type of metric simply counts the number of incoming rows. To define it, use the following format:
{ "name": "<metric name>", "field": "<input field name>", "type": "count" }
Bitset Metric¶
This metric allows storing numeric values in a memory-optimized set structure, which supports operations like `intersect` and `union`. This makes it possible to run queries like: "what is the cardinality of a value for a given set of dimensions filtered by a predicate?". For instance: counting unique mobile app users that installed an application during the last month, split by country.
Metric format:
{ "name": "<metric name>", "field": "<input field name>", "type": "bitset" }
Data Ingestion¶
Loading from TSV files¶
Save the load descriptor into a `load.json` file:
{ "table": "target_table", "format": "tsv", "type": "file", "columns": [...], "file": "/path/to/data.tsv" }
Post this load descriptor to a running ViyaDB instance:
```bash
curl --data-binary @load.json http://<viyadb-host>:<viyadb-port>/load
```
Important notes:

- The `data.tsv` file must be accessible to the ViyaDB instance you're loading the data into.
- If the `columns` parameter is not provided, the order of columns in the .tsv file must be as follows: first dimensions, then metrics, as they appear in the table descriptor. The `columns` parameter must be used whenever one of the table columns defines the `field` attribute.
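As an illustration, a load descriptor with an explicit column order might be sketched as follows (the table and column names are hypothetical):

```json
{
  "table": "events",
  "format": "tsv",
  "type": "file",
  "columns": ["app_id", "event_name", "revenue"],
  "file": "/path/to/data.tsv"
}
```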
Querying¶
Supported query types are:
- Aggregate
- Search
- Select
To query, submit a query request using the POST method:

```bash
curl --data-binary @query.json http://<viyadb-host>:<viyadb-port>/query
```
Read further to learn more about the different query types.
Query compilation
The first time a query is introduced to ViyaDB, highly optimized C++ source code is generated for it, and the compiled version of this code is used for physical data access. This minimizes branch mispredictions, increases CPU cache locality, and so on. It also means that the first execution of a new query type takes some extra time needed for compilation; all subsequent queries of the same type use the already compiled version.
Aggregate Query¶
This kind of query aggregates records selected by a filter predicate, and returns them to the user (optionally sorted and/or limited). It's important to know that aggregation is done in memory, so the entire result set must fit into it.
Query format:
{ "type": "aggregate", "table": "<table name>", "select": [ ... ], "filter": ... , "having": ..., "sort": [ ... ] , "skip": 0, "limit": 0 }
Parameters:
- table - Table name
- select - List of parameters describing how to select a column (see below)
- filter - Filter description (see below)
- having - Post-aggregation filter description (similar to the SQL HAVING clause). The format is the same as for the `filter` attribute.
- sort - Optional result sorting configuration (see below)
- skip - Optionally, skip this number of output records
- limit - Optionally, limit result set size to this number
Column Selector¶
A column is either a dimension or a metric. The format for selecting either of them is the following:

```json
{
  "column": "<column name>",
  "format": "<time format>",
  "granularity": "<time granularity>"
}
```
Parameters:
- column - Dimension or metric name
A time column has two additional optional parameters:
- format - Time output format (check the strptime documentation for available modifiers). By default, a UTC epoch timestamp is sent.
- granularity - Roll up results by this time unit (see the time dimension configuration for supported time units)
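For example, a time column could be selected with daily granularity and formatted output like this (the column name is hypothetical):

```json
{
  "column": "install_time",
  "format": "%Y-%m-%d",
  "granularity": "day"
}
```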
Query Filters¶
The filter is one of the mandatory parameters in a query; it allows skipping irrelevant records. There are four different filter types.
Value Operator Filter¶
{ "op": "<filter operator>", "column": "<column name>", "value": "<filter value>" }
Parameters:
- op - Filter kind specified by operator
- column - Dimension or metric name
- value - Value that filter operates on
Supported operators are:
- eq - Equals
- ne - Not equals
- lt - Less than
- le - Less or equals to
- gt - Greater than
- ge - Greater or equals to
Negation Filter¶
{ "op": "not", "filter": ... }
Parameters:
- op - Must be `not`
- filter - Other filter descriptor
Inclusion Filter¶
{ "op": "in", "column": "<column name>", "values": ["value1", ... ] }
Parameters:
- op - Must be `in`
- column - Dimension or metric name
- values - List of values to filter on
Composite Filter¶
{ "op": "<composition operator>", "filters": [ ... ] }
Parameters:
- op - One of the composition operators: `or`, `and`
- filters - List of other filters to compose
Filter Example¶
Below is an example of using different filter types:
{ "type": "aggregate", "table": "activity", "select": [ {"column": "install_date"}, {"column": "ad_network"}, {"column": "install_country"}, {"column": "installs_count"}, {"column": "launches_count"}, {"column": "inapps_count"} ], "filter": {"op": "and", "filters": [ {"op": "eq", "column": "app_id", "value": "com.teslacoilsw.notifier"}, {"op": "ge", "column": "install_date", "value": "2015-01-01"}, {"op": "lt", "column": "install_date", "value": "2015-01-30"}, {"op": "in", "column": "install_country", "values": ["US", "IL", "RU"]} ]}, "having": {"op": "gt", "column": "inapps_count", "value": "10"} }
Sorting Results¶
You can ask to sort output results by a set of columns, and specify the sort order for each of them. The column sort configuration goes as follows:

```json
{
  "column": "<column name>",
  "ascending": true|false
}
```

The `ascending` parameter can be omitted, in which case the sort order will be descending.
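For example, sorting by descending install count and then by ascending country might be sketched as follows (the column names are hypothetical):

```json
[
  { "column": "installs_count", "ascending": false },
  { "column": "install_country", "ascending": true }
]
```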
SQL Syntax¶
The SQL syntax for querying aggregations looks similar to a standard SQL query, with the only distinction that there's no need to specify a `GROUP BY` clause, because all metrics are already pre-aggregated:

```sql
SELECT <dimension1>, ... <metric1>, ...
FROM <table>
[WHERE <condition>]
[HAVING <condition>]
[ORDER BY <column1> [ASC|DESC], ...]
[LIMIT <number>, <number> | LIMIT <number> OFFSET <number>]
```
For example, the following query finds the number of events of each type for two mobile applications during January 2017, keeping only event types whose total count is greater than 100:

```sql
SELECT event_type, count FROM events
WHERE app_id IN ('com.skype.raider', 'com.dropbox.android')
AND install_time BETWEEN '2017-01-01' AND '2017-01-31'
HAVING count > 100
ORDER BY event_type DESC
```
Search Query¶
This query type retrieves dimension values matching a given set of filters. This feature can come in handy for implementing type-ahead assistance on field values when developing filters for an analytics user interface.
The basic format is the following:
{ "type": "search", "table": "<table name>", "dimension": "<dimension name>", "term": "<search term>", "limit": 0, "filter": ... }
SQL Syntax¶
The SQL syntax for the Search Query is:

```sql
SELECT SEARCH(<dimension>, <term>) FROM <table>
[WHERE <condition>]
[LIMIT <number>, <number> | LIMIT <number> OFFSET <number>]
```

For example, the following query looks up all event names starting with `order_`, where the revenue of a single event is greater than $1:

```sql
SELECT SEARCH(event_name, 'order_') FROM events WHERE revenue > 1.0 LIMIT 10
```
Select Query¶
This query is useful for viewing raw data as it's stored in ViyaDB without the post-aggregation step. Please note that there is a pre-aggregation step during data ingestion, which cannot be reverted.
To run this query, use the following syntax, which is similar to the Aggregate Query syntax:

```json
{
  "type": "select",
  "table": "<table name>",
  "select": [ ... ],
  "filter": ... ,
  "skip": 0,
  "limit": 0
}
```
SQL Syntax¶
The SQL syntax for the Select Query is:

```sql
SELECT RAW(<dimension1>, ... <metric1>, ...) FROM <table>
[WHERE <condition>]
[LIMIT <number>, <number> | LIMIT <number> OFFSET <number>]
```
For example, the following query returns the first 100 pre-aggregated rows stored in the ViyaDB table `events`:

```sql
SELECT RAW('*') FROM events LIMIT 100
```
Metadata Queries¶
There are several endpoints that can be used for retrieving information about database status.
Database Metadata Query¶
The following query returns basic information about the database:
```bash
curl http://<viyadb-host>:<viyadb-port>/database/meta
```
Table Metadata Query¶
To see all available table fields and their types, as well as some basic statistics, use the following query:

```bash
curl http://<viyadb-host>:<viyadb-port>/tables/<table name>/meta
```