context by
context by is a unique feature in DolphinDB. It is an extension to ANSI SQL for convenient time-series data manipulation.
The traditional relational database doesn't support time series data processing. In
RDBMS, a table is a set of rows and the order of rows is not modeled. We can apply
aggregate functions such as min
, max
, avg
, stdev
to a group of rows,
but we can't use order sensitive aggregate functions such as first
, last
, etc, or order
sensitive vector functions such as cumsum
, cummax
, ratios
, deltas
on groups of rows.
DolphinDB supports time series data processing. The context by
clause make it very
convenient to perform time series data processing within each group.
Both context by
and group by
conduct grouping. However, with group by, each group returns
a scalar value; with context by, each group returns a vector of the same size as the
group's records. The group by clause can only be used with aggregate functions, whereas
the context by clause can be used with aggregate functions, moving window functions,
cumulative functions, etc. The context by clause is often used with update statement.
The context by clause can also be used together with the having clause. Please refer to
the section about having.
context by is often used together with time-series functions such as cumsum and mavg etc. The results of these functions are affected by the order of rows within each context by group. We can use keyword csort after context by. csort sorts the rows within each context by group before conducting calculations in select statements. csort can use multiple columns including calculated columns and can sort in both ascending order (asc, the default sorting order) and descending order (desc). csort can be used with top clause to get the most recent observations within each group.
The example below illustrates the difference between group by and context by.
sym = `C`MS`MS`MS`IBM`IBM`C`C`C$SYMBOL
price= 49.6 29.46 29.52 30.02 174.97 175.23 50.76 50.32 51.29
qty = 2200 1900 2100 3200 6800 5400 1300 2500 8800
timestamp = [09:34:07,09:36:42,09:36:51,09:36:59,09:32:47,09:35:26,09:34:16,09:34:26,09:38:12]
t1 = table(timestamp, sym, qty, price);
t1;
timestamp | sym | qty | price |
---|---|---|---|
09:34:07 | C | 2200 | 49.6 |
09:36:42 | MS | 1900 | 29.46 |
09:36:51 | MS | 2100 | 29.52 |
09:36:59 | MS | 3200 | 30.02 |
09:32:47 | IBM | 6800 | 174.97 |
09:35:26 | IBM | 5400 | 175.23 |
09:34:16 | C | 1300 | 50.76 |
09:34:26 | C | 2500 | 50.32 |
09:38:12 | C | 8800 | 51.29 |
select wavg(price,qty) as wvap, sum(qty) as totalqty from t1 group by sym;
sym | wvap | totalqty |
---|---|---|
C | 50.828378 | 14800 |
IBM | 175.085082 | 12200 |
MS | 29.726389 | 7200 |
select sym, price, qty, wavg(price,qty) as wvap, sum(qty) as totalqty from t1 context by sym;
sym | price | qty | wvap | totalqty |
---|---|---|---|---|
C | 49.6 | 2200 | 50.828378 | 14800 |
C | 50.76 | 1300 | 50.828378 | 14800 |
C | 50.32 | 2500 | 50.828378 | 14800 |
C | 51.29 | 8800 | 50.828378 | 14800 |
IBM | 174.97 | 6800 | 175.085082 | 12200 |
IBM | 175.23 | 5400 | 175.085082 | 12200 |
MS | 29.46 | 1900 | 29.726389 | 7200 |
MS | 29.52 | 2100 | 29.726389 | 7200 |
MS | 30.02 | 3200 | 29.726389 | 7200 |
To calculate stock returns for each firm, we cannot use group by. Instead we can use context by. We need to make sure the records are sorted appropriately within each group before we use context by.
select sym, timestamp, price, eachPre(\,price)-1.0 as ret from t1 context by sym;
sym | timestamp | price | ret |
---|---|---|---|
C | 09:34:07 | 49.6 | |
C | 09:34:16 | 50.76 | 0.023387 |
C | 09:34:26 | 50.32 | -0.008668 |
C | 09:38:12 | 51.29 | 0.019277 |
IBM | 09:32:47 | 174.97 | |
IBM | 09:35:26 | 175.23 | 0.001486 |
MS | 09:36:42 | 29.46 | |
MS | 09:36:51 | 29.52 | 0.002037 |
MS | 09:36:59 | 30.02 | 0.016938 |
We can use template function contextby for the same calculation, but the result is a vector instead of a table.
contextby(eachPre{ratio}, t1.price, t1.sym);
// output
[,,1.002037,1.016938,,1.001486,1.023387,0.991332,1.019277]
Here we use a partial application eachPre{ratio}. Please refer to Partial Application for details.
Calculate cumulative sum of trading volume for each stock in every minute:
select *, cumsum(qty) from t1 context by sym, timestamp.minute();
timestamp | sym | qty | price | cumsum_qty |
---|---|---|---|---|
09:34:07 | C | 2200 | 49.6 | 2200 |
09:34:16 | C | 1300 | 50.76 | 3500 |
09:34:26 | C | 2500 | 50.32 | 6000 |
09:38:12 | C | 8800 | 51.29 | 8800 |
09:32:47 | IBM | 6800 | 174.97 | 6800 |
09:35:26 | IBM | 5400 | 175.23 | 5400 |
09:36:42 | MS | 1900 | 29.46 | 1900 |
09:36:51 | MS | 2100 | 29.52 | 4000 |
09:36:59 | MS | 3200 | 30.02 | 7200 |
Use top clause with context by clause:
select top 2 * from t1 context by sym;
timestamp | sym | qty | price |
---|---|---|---|
09:34:07 | C | 2200 | 49.6 |
09:34:16 | C | 1300 | 50.76 |
09:32:47 | IBM | 6800 | 174.97 |
09:35:26 | IBM | 5400 | 175.23 |
09:36:42 | MS | 1900 | 29.46 |
09:36:51 | MS | 2100 | 29.52 |
Please note that we cannot specify a range for the top clause when it is used with the context by clause:
select top 2:3 * from t1 context by sym;
//Syntax Error: [line #2] When top clause uses together with context clause in SQL query, can't specify a range in top clause
Use top clause and csort clause together with context by clause to get the most recent 2 records for each stock:
select top 2 * from t1 context by sym csort timestamp desc;
timestamp | sym | qty | price |
---|---|---|---|
09:38:12 | C | 8800 | 51.29 |
09:34:26 | C | 2500 | 50.32 |
09:35:26 | IBM | 5400 | 175.23 |
09:32:47 | IBM | 6800 | 174.97 |
09:36:59 | MS | 3200 | 30.02 |
09:36:51 | MS | 2100 | 29.52 |
Use context by clause together with limit clause to get the first or the last few records for each stock:
select * from t1 context by sym limit 2;
timestamp | sym | qty | price |
---|---|---|---|
09:34:07 | C | 2200 | 49.6 |
09:36:42 | MS | 1900 | 29.46 |
09:36:51 | MS | 2100 | 29.52 |
09:32:47 | IBM | 6800 | 174.97 |
09:35:26 | IBM | 5400 | 175.23 |
09:34:16 | C | 1300 | 50.76 |
Use context by clause together with csort and limit clause to get the last 2 records for each stock after sorting by qty:
select * from t1 context by sym limit -2;
timestamp | sym | qty | price |
---|---|---|---|
09:36:51 | MS | 2100 | 29.52 |
09:36:59 | MS | 3200 | 30.02 |
09:32:47 | IBM | 6800 | 174.97 |
09:35:26 | IBM | 5400 | 175.23 |
09:34:26 | C | 2500 | 50.32 |
09:38:12 | C | 8800 | 51.29 |
Use context by clause together with csort and limit clause to get the last 2 records for each stock after sorting by qty:
select * from t1 context by sym csort qty limit -2;
timestamp | sym | qty | price |
---|---|---|---|
09:34:26 | C | 2500 | 50.32 |
09:38:12 | C | 8800 | 51.29 |
09:35:26 | IBM | 5400 | 175.23 |
09:32:47 | IBM | 6800 | 174.97 |
09:36:51 | MS | 2100 | 29.52 |
09:36:59 | MS | 3200 | 30.02 |
Calculate fitted values of price from the regression of price on qty for each stock:
select *, ols(price, qty)[0]+ols(price, qty)[1]*qty as fittedPrice from t1 context by sym;
timestamp | sym | qty | price | fittedPrice |
---|---|---|---|---|
09:34:07 | C | 2200 | 49.6 | 50.282221 |
09:34:16 | C | 1300 | 50.76 | 50.156053 |
09:34:26 | C | 2500 | 50.32 | 50.324277 |
09:38:12 | C | 8800 | 51.29 | 51.207449 |
09:32:47 | IBM | 6800 | 174.97 | 174.97 |
09:35:26 | IBM | 5400 | 175.23 | 175.23 |
09:36:42 | MS | 1900 | 29.46 | 29.447279 |
09:36:51 | MS | 2100 | 29.52 | 29.535034 |
09:36:59 | MS | 3200 | 30.02 | 30.017687 |
The context by clause can be used with the order by clause. The order by columns must be among the output columns.
select *, ols(price, qty)[0]+ols(price, qty)[1]*qty as fittedPrice from t1 context by sym order by timestamp;
timestamp | sym | qty | price | fittedPrice |
---|---|---|---|---|
09:32:47 | IBM | 6800 | 174.97 | 174.97 |
09:34:07 | C | 2200 | 49.6 | 50.075318 |
09:34:16 | C | 1300 | 50.76 | 49.911222 |
09:34:26 | C | 2500 | 50.32 | 50.130017 |
09:35:26 | IBM | 5400 | 175.23 | 175.23 |
09:36:42 | MS | 1900 | 29.46 | 29.447279 |
09:36:51 | MS | 2100 | 29.52 | 29.535034 |
09:36:59 | MS | 3200 | 30.02 | 30.017687 |
09:38:12 | C | 8800 | 51.29 | 51.278686 |
context by is different from the contextby function in 3 aspects:
(1) contextby generates a vector while context by is used in a select clause to produce a table.
(2) contextby is limited to one grouping column whereas a context by clause can be used on multiple columns.
(3) contextby calculates one item for every call, whereas a context by clause can calculate multiple items.
Performance Tip
Before we use context by, we should sort the database by the same variable or variables in the context by clause. This could greatly improve the speed of context by clause.
n=1000000
ID=rand(100, n)
x=rand(10.0, n)
ta=table(ID, x)
tb=select * from ta order by ID;
timer select (NULL \:P x)-1 as ret from ta context by ID;
// Time elapsed: 4.018 ms
timer select (NULL \:P x)-1 as ret from tb context by ID;
// Time elapsed: 2.991 ms
DolphinDB optimizes the performance of context by
clause under
certain conditions.
To query the latest records of given groups in a partitioned table, use
context by
clause with keywords csort
and
limit
. The performance of "context by + csort + limit"
statement is optimized if the following conditions are satisfied:
-
The
context by
column is filtered by thewhere
clause. -
The
csort
column is a partitioning column, and the partition type is VALUE or RANGE. -
csort
andcontext by
can only specify one column. -
The
context by
column is specified in theselect
clause.
You can obtain the execution plan by adding keyword [HINT_EXPLAIN] to check whether the statement is optimized.