IOTDB
Since version 3.00.2/2.00.15, DolphinDB provides the IOTDB engine based on the TSDB engine for fine-grained and low-latency management of massive measurement points. Users can create tables with IOTANY columns to store data of different types in a single column. The system automatically caches the latest values of measurement points, significantly optimizing the latest value query and achieving millisecond latency.
Common Concepts
Measurement Point
In IoT scenarios, a measurement point refers to the specific data point or information point of a device. It can be an interface or a parameter of IoT devices (e.g., sensors, actuators and controllers) or an identifier of the device. The data includes static properties (such as ID and name) and dynamic properties (such as temperature, pressure, humidity, status, etc.).
Characteristics of Point Data
- Massive Volume
The number of measurement points ranges from millions to hundreds of millions in power grids, the internet of vehicles, and industrial manufacturing scenarios.
- High Frequency
The sampling frequencies can reach 100 kHz in extreme scenarios.
- Various Data Types
The data types include boolean, integer, floating‑point, and enumerated values.
- Diverse Transmission Formats
Data of the same device may be collected simultaneously and uploaded in one batch, or collected at different times and uploaded separately.
IOTANY Column
Namely, a column of IOTANY type supporting storing data of different types in a column.
Latest Value Cache
The IOTDB engine maintains a partition-based in-memory table as the latest value
cache. This cache is updated in real time and used as the primary lookup source
for the latest values, avoiding disk access. You can enable it by specifying
latestKeyCache=true in the CREATE statement or the
createPartitionedTable
function.
Measurement Point Table
A measurement point table can be:
- A table with an IOTANY column.
- A partitioned table created using the IOTDB engine with the latest value cache enabled.
Static Table
The IOTDB engine maps all columns specified by sortColumns except the last column to internal IDs and stores them in a static table. These IDs are used to indicate measurement points in storage and cache operations to save space.
Feature Descriptions
IOTANY Column
Users can create at most one IOTANY column in a table using the CREATE statement. The system separately stores values of different types at the level file level. It is required to specify multiple columns that uniquely identify a group of measurement points (typically an ID column + multiple tag columns) and a time column as the sortColumns.
Supported:
- Three deduplication strategies:
- ALL: keep all records
- LAST: only keep the last record
- FIRST: only keep the first record
- Specify sortKeyMappingFunction for dimensionality reduction.
Currently not supported:
- Soft deletion.
- Snapshot read.
- Level 4 compaction.
- Vector database.
- Adding new IOTANY columns.
- Sort columns of DEMICAL type.
- Set ignoreNull=true or pass IOTANY vectors when using
upsert!
function.
Data Types
IOTANY supports the following types:
Categroy | Type |
---|---|
Integral | CHAR |
SHORT | |
INT | |
LONG | |
Logical | BOOL |
Floating | FLOAT |
DOUBLE | |
Literal | SYMBOL |
STRING |
Latest Value Query Optimization
We usually query the latest value using such syntax:
SELECT deviceId, timestamp, metrics1... FROM pt
WHERE deviceId = xxx CONTEXT BY deviceId CSORT timestamp LIMIT -1
To optimize the latest value query, the IOTDB engine introduces two methods:
- Latest value cache: The IOTDB engine maintains a partition-based in-memory table as the latest value cache. This cache is updated in real time and used as the primary lookup source for the latest values, avoiding disk access. Additionally, the IOTDB engine provides a preheating mechanism that automatically loads the cache data of the latest partition after a restart.
- Query optimization at the storage engine level: The IOTDB engine implements the latest value query algorithm using CONTEXT BY in the storage engine level and uses the index zonemap for filtering to read only a subset of blocks from the disk. This is designed for the cache miss scenarios of the latest values.
The prerequisites and priority for using the two optimization methods are as follows.
Optimization methods | Prerequisites (any of the following): |
---|---|
Latest value cache |
|
Query optimization at the engine level |
|
Note:
For the third prerequisite of the latest value cache optimization, when the
filter types of the last column are <
,
<=
, or BETWEEN
, the result can be
returned from the cache only if all values of the last column in the cache table
meet the WHERE condition. For example, if the last column “ts” in the cache
table has a range of [2024-01-01, 2024-01-15]
, and the query
condition is ts < 2024-01-16
, then the query can be
optimized by reading from the latest value table.
The optimization priority is as follows:
- The system reads from the latest value table first if any of the prerequisites is met.
- If not, the system tries query optimization at the engine level.
- If neither is met, the system follows the previous logic.
Suppose an IOTDB partitioned table “test” with sortColumns specified as
`c1`c2`timestamp
(where “c1” and “c2” are sort key columns,
and “timestamp” is the last column). A latest value query can be written in the
following format: select [cols] from test [where] context by c1,c2 csort
timestamp limit -1
.The following are several query examples and the
optimization methods applied:
sortColumns | SQL | Latest value cache | Query optimization at the storage engine level |
---|---|---|---|
c1, c2, timestamp | select [cols] from pt context by c1, c2
csort timestamp limit -1 No filter. |
√ | |
c1, c2, timestamp | select [cols] from pt where c1 in
[x,x,x] and c2 = xxx context by c1, c2 csort timestamp limit
-1 Multiple sort key columns with at least one
filter being not |
√ | |
c1, c2, timestamp | select [cols] from pt where c1 = xx and
c2 = xx timestamp >= xxx context by c1, c2 csort timestamp
limit -1 The filters involve all sort key columns and the last column. |
√ | √ |
c1, c2, timestamp | select [cols] from pt where c1 = xx and
c2 = xx context by c1, c2 csort timestamp limit
-1 The filters involve all sort key columns. |
√ | √ |
c1, c2, timestamp | select [cols] from pt where c1 = xx and
timestamp > xxx context by c1, c2 csort timestamp limit
-1 The filters involve part of the sort key columns and the last column. |
√ | |
c1, timestamp | select [cols] from pt where c1 in
[x,x,x] context by c1, c2 csort timestamp limit
-1 The filters are |
√ | √ |
c1, timestamp | select [cols] from pt where c1 in
[x,x,x] and timestamp between xxx and xxx context by c1, c2
csort timestamp limit -1 The filters involve all sort key columns and the last column. |
√ | √ |
c1, timestamp | select [cols] from pt where c1 > x
context by c1, c2 csort timestamp limit -1 The
filters are not |
√ |
Checking Whether Optimization is Applied Using HINT_EXPLAIN
You can add the HINT_EXPLAIN to the SQL queries to check the applied optimization.
Adding HINT_EXPLAIN to the latest value queries allows the lastQuery field in the output. The lastQuery field has three possible values:
- The latest value cache optimization is applied:
"lastQuery": { "optimizationMethod": LatestKeyCache },
- Query optimization at the storage engine level is
applied:
"lastQuery": { "optimizationMethod": LatestKeyQuery },
- Neither of the Optimizations is applied:
"lastQuery": { "optimizationMethod": None },
Latest Value Query Semantics with Duplicate Data
When querying the latest values using CONTEXT BY, only one of the records with the same ID and timestamp is retained. For example:
id | timestamp | value |
---|---|---|
1 | 10:00 | 1 |
1 | 10:00 | 2 |
select * from pt where id = 1 context by context by id csort timestamp limit -1
For in-memory tables, the system randomly retains one record when executing CONTEXT BY queries because no explicit retention rule is defined. For measurement point tables, the system retains the latest record without being affected by deduplication strategies.
Latest Value Table Management
The latest value table stores the latest values of all measurement points by partition. It is updated in real time during data writing and flushed to disk with the cache engine. It is named timeseries.cache and saved in the tableDir directory of each partition.
Cache management follows the same eviction policy as symbol base, periodically removing partitions that have not been accessed recently. The memory usage of the latest value table is controlled by the IOTDBLatestKeyCacheSize parameter, which defaults to 5% of maxMemSize.
IOTDBLatestKeyCacheSize=0.5 //a floating-point in GB
Usage Example
Create a Measurement Point Table
It is recommended to create a table with an IOTANY column using the CREATE statement as the measurement point table.
In the following example, we use a composite strategy to partition data by “deviceId” and “timestamp”, creating a measurement point table named “pt”. This table uses “deviceId” and “location” as the two columns that uniquely identify a measurement point. The latest value cache is enabled and the column “value” is of IOTANY type, allowing it to store data of various types. The compression for the sort key columns is also enabled.
dbName = "dfs://db"
if (existsDatabase(dbName)) {
dropDatabase(dbName)
}
// create an IOTDB database
create database "dfs://db" partitioned by HASH([INT, 20]),VALUE(2017.08.07..2017.08.11), engine='IOTDB'
// create a measurement point table with deviceId & location as sort key columns and timestamp as the last column
create table "dfs://db"."pt" (
deviceId INT,
location SYMBOL,
timestamp TIMESTAMP,
value IOTANY
)
partitioned by deviceId, timestamp,
sortColumns = [`deviceId, `location, `timestamp],
latestKeyCache = true
Append Data
For each measurement point, the data type of the IOTANY column is defined upon the first append. It will throw an error if a different data type is appended to the same measurement point afterward.
If the deduplication strategy is ALL or FIRST, we can modify the type of IOTANY column of a measurement point using the UPDATE statement. However, the WHERE clause can only involve the sort key columns. This ensures that all records of the measurement point are updated, maintaining a consistent data type of the IOTANY column for the same point.
We write twice to the previously created table. The first write succeeds, and the
type of “value” column of measurement point [1, 'loc1']
is
defined as INT.
// first write of point [1, `loc1`]
pt = loadTable("dfs://db", "pt")
t = table([1] as deviceId,
[`loc1] as location,
[now()] as timestamp,
[int(233)] as value)
pt.append!(t)
On the second write to the same measurement point, the “value” column is of DOUBLE type. Since the type does not match the previously defined type, the write fails.
// second write
t = table([1] as deviceId,
[`loc1] as location,
[now()] as timestamp,
[double(233)] as value)
pt.append!(t)
We can modify the type of the “value” column using the UPDATE statement.
// modify the type of IOTANY column to double
// A failed try because the WHERE clause can only include the sort key columns
update pt set value=double(233) where deviceId = 1 and location=`loc1 and timestamp >= 2017.08.07
// Successful modification
update pt set value=double(233) where deviceId = 1 and location=`loc1
If we try the second write again, it will succeed.
Update Data
If the deduplication strategy is ALL or FIRST, we can modify the type of IOTANY column of a measurement point using the UPDATE statement.
However, the WHERE clause can only involve the sort key columns. This ensures that all records for the measurement point are updated, maintaining a consistent IOTANY column type for the same measurement point.
t = table([`dev1] as deviceId,
[now()] as timestamp,
[`loc1] as location,
[int(233)] as value)
pt.append!(t) //determine the data type of the IOTANY column as INT
update pt set value = 2.33 where deviceId = `dev1 and location = `loc1
// update the data type of the IOTANY column to DOUBLE
update pt set value = 2.33 where deviceId = `dev1
// success
update pt set value = 2.33 where deviceId = `dev1 and location = `loc1 and timestamp > xxx
// failed because the WHERE clause can only include the sort columns
Query the IOTANY Column
Query the previously created table:
select deviceId, location, sum(long(value)), timestamp from pt where deviceId = 1 and
location = `loc1 and timestamp >= 2017.08.07
Calculate the IOTANY Column
Calculate on the previously created table:
// evaluate equality
select * from pt where value = 233
// aggregation
select deviceId, location, sum(value), timestamp from pt where deviceId = 1 and
location = `loc1 and timestamp >= 2017.08.07
Latest Value Query Example (using CONTEXT BY)
There are two optimization methods for the latest value queries as stated above. Different methods are applied depending on specific conditions. We create a new measurement pointtable for instructions in this example.
dbName = "dfs://test";
if(existsDatabase(dbName)){
dropDatabase(dbName)
}
// create an IOTDB database
create database "dfs://test"
partitioned by HASH([LONG,1]),HASH([DATE, 1]),
engine = 'IOTDB'
dummy = table(1000000:0, `date`deviceId`key1`val_any, [DATE, LONG, INT, DOUBLE])
// create a measurement point table with deviceId & location as sort key columns and date as the last column
create table "dfs://test"."pt1"(
date DATE,
deviceId LONG,
key1 INT,
val_any DOUBLE
)
partitioned by deviceId, date
sortColumns = [`deviceId,`key1,`date]
keepDuplicates = ALL
latestKeyCache = true
pt1 = loadTable(dbName, "pt1")
// append the data for test
n = 200000
date = rand(2012.01.01..2012.12.31, n)
key1 = take(200199..(200199+n), n)
deviceId = take(200..(200+n), n)
val = rand(double(n), n)
t = table(date, deviceId, key1, val)
pt1.append!(t)
The following query uses the latest value cache optimization.
select [HINT_EXPLAIN] * from pt1 where deviceId in [200, 201, 202]
context by deviceId, key1 csort date
limit -1
// HINT_EXPLAIN output:
{
.......
"lastQuery": {
"optimizationMethod": LatestKeyCache
},
......
}
The following query uses query optimization at the storage engine level.
select [HINT_EXPLAIN] * from pt1 where deviceId = 200 and key1 = 200199
and date > 2010.12.31
context by deviceId, key1 csort date
limit -1
// HINT_EXPLAIN output:
{
.......
"lastQuery": {
"optimizationMethod": LatestKeyQuery
},
......
}
The following query uses neither of them because the filter involves columns other than the sort key columns.
select [HINT_EXPLAIN] * from pt1 where deviceId = 200
and val_any = 123
context by deviceId, key1 csort date
limit -1
// HINT_EXPLAIN output:
{
.......
"lastQuery": {
"optimizationMethod": None
},
......
}