IOTDB

Since version 3.00.2/2.00.15, DolphinDB provides the IOTDB engine based on the TSDB engine for fine-grained and low-latency management of massive measurement points. Users can create tables with IOTANY columns to store data of different types in a single column. The system automatically caches the latest values of measurement points, significantly optimizing the latest value query and achieving millisecond latency.

Common Concepts

Measurement Point

In IoT scenarios, a measurement point refers to the specific data point or information point of a device. It can be an interface or a parameter of IoT devices (e.g., sensors, actuators and controllers) or an identifier of the device. The data includes static properties (such as ID and name) and dynamic properties (such as temperature, pressure, humidity, status, etc.).

Characteristics of Point Data

Massive Volume
The number of measurement points ranges from millions to hundreds of millions in power grids, the internet of vehicles, and industrial manufacturing scenarios.
High Frequency
The sampling frequencies can reach 100 kHz in extreme scenarios.
Various Data Types
The data types include boolean, integer, floating‑point, and enumerated values.
Diverse Transmission Formats
Data of the same device may be collected simultaneously and uploaded in one batch, or collected at different times and uploaded separately.

IOTANY Column

Namely, a column of IOTANY type supporting storing data of different types in a column.

Latest Value Cache

The IOTDB engine maintains a partition-based in-memory table as the latest value cache. This cache is updated in real time and used as the primary lookup source for the latest values, avoiding disk access. You can enable it by specifying latestKeyCache=true in the CREATE statement or the createPartitionedTable function.

Measurement Point Table

A measurement point table can be:

A table with an IOTANY column.
A partitioned table created using the IOTDB engine with the latest value cache enabled.

Static Table

The IOTDB engine maps all columns specified by sortColumns except the last column to internal IDs and stores them in a static table. These IDs are used to indicate measurement points in storage and cache operations to save space.

Feature Descriptions

IOTANY Column

Users can create at most one IOTANY column in a table using the CREATE statement. The system separately stores values of different types at the level file level. It is required to specify multiple columns that uniquely identify a group of measurement points (typically an ID column + multiple tag columns) and a time column as the sortColumns.

Supported:

Two deduplication strategies:
- ALL: keep all records
- LAST: only keep the last record
Specify sortKeyMappingFunction for dimensionality reduction.

Currently not supported:

Soft deletion.
Snapshot read.
Level 4 compaction.
Vector database.
Adding new IOTANY columns.
Sort columns of DEMICAL type.
Set ignoreNull=true or pass IOTANY vectors when using upsert! function.

Data Types

IOTANY supports the following types:


Categroy	Type
Integral	CHAR
	SHORT
	INT
	LONG
Logical	BOOL
Floating	FLOAT
Floating	DOUBLE
Literal	SYMBOL
Literal	STRING

Latest Value Query Optimization

We usually query the latest value using such syntax:

SELECT deviceId, timestamp, metrics1... FROM pt
WHERE deviceId = xxx CONTEXT BY deviceId CSORT timestamp LIMIT -1

To optimize the latest value query, the IOTDB engine introduces two methods:

Latest value cache: The IOTDB engine maintains a partition-based in-memory table as the latest value cache. This cache is updated in real time and used as the primary lookup source for the latest values, avoiding disk access. Additionally, the IOTDB engine provides a preheating mechanism that automatically loads the cache data of the latest partition after a restart.
Query optimization at the storage engine level: The IOTDB engine implements the latest value query algorithm using CONTEXT BY in the storage engine level and uses the index zonemap for filtering to read only a subset of blocks from the disk. This is designed for the cache miss scenarios of the latest values.

The prerequisites and priority for using the two optimization methods are as follows.


Optimization methods	Prerequisites (any of the following):
Latest value cache	The query has no WHERE clause. The query has a WHERE clause where the filter condition involves only the sort key columns, which are all other sort columns except the last column. Supported filter types: `=, <, <=, >, >=, BETWEEN, IN`. The WHERE clause involves the last column of sort columns and no other columns than the sort columns. Supported filter types: `<, <=, >, >=, BETWEEN`.
Query optimization at the engine level	The WHERE clause must involve all sort key columns. It may also include a temporal column, which supports the following filter types: `=, <, <=, >, >=, BETWEEN`. If there are multiple sort key columns, each column can only be filtered using `=`. If there is only one sort key column, the filter types can be `=` and `IN`.

Note:

For the third prerequisite of the latest value cache optimization, when the filter types of the last column are <, <=, or BETWEEN, the result can be returned from the cache only if all values of the last column in the cache table meet the WHERE condition. For example, if the last column “ts” in the cache table has a range of [2024-01-01, 2024-01-15], and the query condition is ts < 2024-01-16, then the query can be optimized by reading from the latest value table.

The optimization priority is as follows:

The system reads from the latest value table first if any of the prerequisites is met.
If not, the system tries query optimization at the engine level.
If neither is met, the system follows the previous logic.

Suppose an IOTDB partitioned table “test” with sortColumns specified as `c1`c2`timestamp (where “c1” and “c2” are sort key columns, and “timestamp” is the last column). A latest value query can be written in the following format: select [cols] from test [where] context by c1,c2 csort timestamp limit -1.The following are several query examples and the optimization methods applied:


sortColumns	SQL	Latest value cache	Query optimization at the storage engine level
c1, c2, timestamp	`select [cols] from pt context by c1, c2 csort timestamp limit -1` No filter.	√
c1, c2, timestamp	`select [cols] from pt where c1 in [x,x,x] and c2 = xxx context by c1, c2 csort timestamp limit -1` Multiple sort key columns with at least one filter being not `=`.	√
c1, c2, timestamp	`select [cols] from pt where c1 = xx and c2 = xx and timestamp >= xxx context by c1, c2 csort timestamp limit -1` The filters involve all sort key columns and the last column.	√	√
c1, c2, timestamp	`select [cols] from pt where c1 = xx and c2 = xx context by c1, c2 csort timestamp limit -1` The filters involve all sort key columns.	√	√
c1, c2, timestamp	`select [cols] from pt where c1 = xx and timestamp > xxx context by c1, c2 csort timestamp limit -1` The filters involve part of the sort key columns and the last column.	√
c1, timestamp	`select [cols] from pt where c1 in [x,x,x] context by c1, c2 csort timestamp limit -1` The filters are`in`and involve all sort key columns.	√	√
c1, timestamp	`select [cols] from pt where c1 in [x,x,x] and timestamp between xxx and xxx context by c1, c2 csort timestamp limit -1` The filters involve all sort key columns and the last column.	√	√
c1, timestamp	`select [cols] from pt where c1 > x context by c1, c2 csort timestamp limit -1` The filters are not `in` or `=` and involve all sort key columns.	√

Checking Whether Optimization is Applied Using HINT_EXPLAIN

You can add the HINT_EXPLAIN to the SQL queries to check the applied optimization.

Adding HINT_EXPLAIN to the latest value queries allows the lastQuery field in the output. The lastQuery field has three possible values:

The latest value cache optimization is applied:
"lastQuery": { "optimizationMethod": LatestKeyCache },
Query optimization at the storage engine level is applied:
"lastQuery": { "optimizationMethod": LatestKeyQuery },
Neither of the Optimizations is applied:
"lastQuery": { "optimizationMethod": None },

Latest Value Query Semantics with Duplicate Data

When querying the latest values using CONTEXT BY, only one of the records with the same ID and timestamp is retained. For example:


id	timestamp	value
1	10:00	1
1	10:00	2

select * from pt where id = 1 context by context by id csort timestamp limit -1

For in-memory tables, the system randomly retains one record when executing CONTEXT BY queries because no explicit retention rule is defined. For measurement point tables, the system retains the latest record without being affected by deduplication strategies.

Latest Value Table Management

The latest value table stores the latest values of all measurement points by partition. It is updated in real time during data writing and flushed to disk with the cache engine. It is named timeseries.cache and saved in the tableDir directory of each partition.

Cache management follows the same eviction policy as symbol base, periodically removing partitions that have not been accessed recently. The memory usage of the latest value table is controlled by the IOTDBLatestKeyCacheSize parameter, which defaults to 5% of maxMemSize.

IOTDBLatestKeyCacheSize=0.5 //a floating-point in GB

Usage Example

Create a Measurement Point Table

It is recommended to create a table with an IOTANY column using the CREATE statement as the measurement point table.

In the following example, we use a composite strategy to partition data by “deviceId” and “timestamp”, creating a measurement point table named “pt”. This table uses “deviceId” and “location” as the two columns that uniquely identify a measurement point. The latest value cache is enabled and the column “value” is of IOTANY type, allowing it to store data of various types. The compression for the sort key columns is also enabled.

dbName = "dfs://db"
if (existsDatabase(dbName)) {
    dropDatabase(dbName)
}
// create an IOTDB database
create database "dfs://db" partitioned by HASH([INT, 20]),VALUE(2017.08.07..2017.08.11), engine='IOTDB'
// create a measurement point table with deviceId & location as sort key columns and timestamp as the last column
create table "dfs://db"."pt" (
        deviceId INT,
        location SYMBOL,
        timestamp TIMESTAMP,
        value IOTANY
)
partitioned by deviceId, timestamp,
sortColumns = [`deviceId, `location, `timestamp],
latestKeyCache = true

Append Data

For each measurement point, the data type of the IOTANY column is defined upon the first append. It will throw an error if a different data type is appended to the same measurement point afterward.

If the deduplication strategy is ALL or FIRST, we can modify the type of IOTANY column of a measurement point using the UPDATE statement. However, the WHERE clause can only involve the sort key columns. This ensures that all records of the measurement point are updated, maintaining a consistent data type of the IOTANY column for the same point.

We write twice to the previously created table. The first write succeeds, and the type of “value” column of measurement point [1, 'loc1'] is defined as INT.

// first write of point [1, `loc1`] 
pt = loadTable("dfs://db", "pt")
t = table([1] as deviceId,
  [`loc1] as location,
  [now()] as timestamp,
  [int(233)] as value)
pt.append!(t)

On the second write to the same measurement point, the “value” column is of DOUBLE type. Since the type does not match the previously defined type, the write fails.

// second write
t = table([1] as deviceId,
  [`loc1] as location,
  [now()] as timestamp,
  [double(233)] as value)
pt.append!(t)

We can modify the type of the “value” column using the UPDATE statement.

// modify the type of IOTANY column to double
// A failed try because the WHERE clause can only include the sort key columns
update pt set value=double(233) where deviceId = 1 and location=`loc1 and timestamp >= 2017.08.07 
// Successful modification
update pt set value=double(233) where deviceId = 1 and location=`loc1

If we try the second write again, it will succeed.

Update Data

If the deduplication strategy is ALL or FIRST, we can modify the type of IOTANY column of a measurement point using the UPDATE statement.

However, the WHERE clause can only involve the sort key columns. This ensures that all records for the measurement point are updated, maintaining a consistent IOTANY column type for the same measurement point.

t = table([`dev1] as deviceId,
  [now()] as timestamp,
  [`loc1] as location,
  [int(233)] as value)
pt.append!(t) //determine the data type of the IOTANY column as INT
update pt set value = 2.33 where deviceId = `dev1 and location = `loc1
// update the data type of the IOTANY column to DOUBLE 
update pt set value = 2.33 where deviceId = `dev1
// success
update pt set value = 2.33 where deviceId = `dev1 and location = `loc1 and timestamp > xxx
// failed because the WHERE clause can only include the sort columns

Query the IOTANY Column

Query the previously created table:

select deviceId, location, sum(long(value)), timestamp from pt where deviceId = 1 and
  location = `loc1 and timestamp >= 2017.08.07

Calculate the IOTANY Column

Calculate on the previously created table:

// evaluate equality
select * from pt where value = 233

// aggregation
select deviceId, location, sum(value), timestamp from pt where deviceId = 1 and
  location = `loc1 and timestamp >= 2017.08.07

Latest Value Query Example (using CONTEXT BY)

There are two optimization methods for the latest value queries as stated above. Different methods are applied depending on specific conditions. We create a new measurement pointtable for instructions in this example.

dbName = "dfs://test";
if(existsDatabase(dbName)){
	dropDatabase(dbName)
}
// create an IOTDB database
create database "dfs://test" 
partitioned by HASH([LONG,1]),HASH([DATE, 1]),
engine = 'IOTDB'
dummy = table(1000000:0, `date`deviceId`key1`val_any, [DATE, LONG, INT, DOUBLE])
// create a measurement point table with deviceId & location as sort key columns and date as the last column
create table "dfs://test"."pt1"(
        date DATE,
        deviceId LONG,
        key1 INT,
        val_any DOUBLE
)
partitioned by deviceId, date
sortColumns = [`deviceId,`key1,`date]
keepDuplicates = ALL
latestKeyCache = true
pt1 = loadTable(dbName, "pt1")
// append the data for test
n = 200000
date = rand(2012.01.01..2012.12.31, n)
key1 = take(200199..(200199+n), n)
deviceId = take(200..(200+n), n)
val = rand(double(n), n)
t = table(date, deviceId, key1, val)
pt1.append!(t)

The following query uses the latest value cache optimization.

select [HINT_EXPLAIN] * from pt1 where deviceId in [200, 201, 202]
context by deviceId, key1 csort date
limit -1
// HINT_EXPLAIN output:
{
    .......
    "lastQuery": {
        "optimizationMethod": LatestKeyCache
    },
    ......
}

The following query uses query optimization at the storage engine level.

select [HINT_EXPLAIN] * from pt1 where deviceId = 200 and key1 = 200199 
and date > 2010.12.31
context by deviceId, key1 csort date
limit -1
// HINT_EXPLAIN output:
{
    .......
    "lastQuery": {
        "optimizationMethod": LatestKeyQuery
    },
    ......
}

The following query uses neither of them because the filter involves columns other than the sort key columns.

select [HINT_EXPLAIN] * from pt1 where deviceId = 200
and val_any = 123
context by deviceId, key1 csort date
limit -1
// HINT_EXPLAIN output:
{
    .......
    "lastQuery": {
        "optimizationMethod": None
    },
    ......
}