context by

context by is a unique feature in DolphinDB. It is an extension to ANSI SQL for convenient time-series data manipulation.

The traditional relational database doesn't support time series data processing. In RDBMS, a table is a set of rows and the order of rows is not modeled. We can apply aggregate functions such as min, max, avg, stdev to a group of rows, but we can't use order sensitive aggregate functions such as first, last, etc, or order sensitive vector functions such as cumsum, cummax, ratios, deltas on groups of rows. DolphinDB supports time series data processing. The context by clause make it very convenient to perform time series data processing within each group.

Both context by and group by conduct grouping. However, with group by, each group returns a scalar value; with context by, each group returns a vector of the same size as the group's records. The group by clause can only be used with aggregate functions, whereas the context by clause can be used with aggregate functions, moving window functions, cumulative functions, etc. The context by clause is often used with update statement. The context by clause can also be used together with the having clause. Please refer to the section about having.

context by is often used together with time-series functions such as cumsum and mavg etc. The results of these functions are affected by the order of rows within each context by group. We can use keyword csort after context by. csort sorts the rows within each context by group before conducting calculations in select statements. csort can use multiple columns including calculated columns and can sort in both ascending order (asc, the default sorting order) and descending order (desc). csort can be used with top clause to get the most recent observations within each group.

The example below illustrates the difference between group by and context by.

sym = `C`MS`MS`MS`IBM`IBM`C`C`C$SYMBOL
price= 49.6 29.46 29.52 30.02 174.97 175.23 50.76 50.32 51.29
qty = 2200 1900 2100 3200 6800 5400 1300 2500 8800
timestamp = [09:34:07,09:36:42,09:36:51,09:36:59,09:32:47,09:35:26,09:34:16,09:34:26,09:38:12]
t1 = table(timestamp, sym, qty, price);

t1;
timestamp sym qty price
09:34:07 C 2200 49.6
09:36:42 MS 1900 29.46
09:36:51 MS 2100 29.52
09:36:59 MS 3200 30.02
09:32:47 IBM 6800 174.97
09:35:26 IBM 5400 175.23
09:34:16 C 1300 50.76
09:34:26 C 2500 50.32
09:38:12 C 8800 51.29
select wavg(price,qty) as wvap, sum(qty) as totalqty from t1 group by sym;
sym wvap totalqty
C 50.828378 14800
IBM 175.085082 12200
MS 29.726389 7200
select sym, price, qty, wavg(price,qty) as wvap, sum(qty) as totalqty from t1 context by sym;
sym price qty wvap totalqty
C 49.6 2200 50.828378 14800
C 50.76 1300 50.828378 14800
C 50.32 2500 50.828378 14800
C 51.29 8800 50.828378 14800
IBM 174.97 6800 175.085082 12200
IBM 175.23 5400 175.085082 12200
MS 29.46 1900 29.726389 7200
MS 29.52 2100 29.726389 7200
MS 30.02 3200 29.726389 7200

To calculate stock returns for each firm, we cannot use group by. Instead we can use context by. We need to make sure the records are sorted appropriately within each group before we use context by.

select sym, timestamp, price, eachPre(\,price)-1.0 as ret from t1 context by sym;
sym timestamp price ret
C 09:34:07 49.6
C 09:34:16 50.76 0.023387
C 09:34:26 50.32 -0.008668
C 09:38:12 51.29 0.019277
IBM 09:32:47 174.97
IBM 09:35:26 175.23 0.001486
MS 09:36:42 29.46
MS 09:36:51 29.52 0.002037
MS 09:36:59 30.02 0.016938

We can use template function contextby for the same calculation, but the result is a vector instead of a table.

contextby(eachPre{ratio}, t1.price, t1.sym);
// output
[,,1.002037,1.016938,,1.001486,1.023387,0.991332,1.019277]

Here we use a partial application eachPre{ratio}. Please refer to Partial Application for details.

Calculate cumulative sum of trading volume for each stock in every minute:

select *, cumsum(qty) from t1 context by sym, timestamp.minute();
timestamp sym qty price cumsum_qty
09:34:07 C 2200 49.6 2200
09:34:16 C 1300 50.76 3500
09:34:26 C 2500 50.32 6000
09:38:12 C 8800 51.29 8800
09:32:47 IBM 6800 174.97 6800
09:35:26 IBM 5400 175.23 5400
09:36:42 MS 1900 29.46 1900
09:36:51 MS 2100 29.52 4000
09:36:59 MS 3200 30.02 7200

Use top clause with context by clause:

select top 2 * from t1 context by sym;
timestamp sym qty price
09:34:07 C 2200 49.6
09:34:16 C 1300 50.76
09:32:47 IBM 6800 174.97
09:35:26 IBM 5400 175.23
09:36:42 MS 1900 29.46
09:36:51 MS 2100 29.52

Please note that we cannot specify a range for the top clause when it is used with the context by clause:

select top 2:3 * from t1 context by sym;
// output
Syntax Error: [line #2] When top clause uses together with context clause in SQL query, can't specify a range in top clause

Use top clause and csort clause together with context by clause to get the most recent 2 records for each stock:

select top 2 * from t1 context by sym csort timestamp desc;
timestamp sym qty price
09:38:12 C 8800 51.29
09:34:26 C 2500 50.32
09:35:26 IBM 5400 175.23
09:32:47 IBM 6800 174.97
09:36:59 MS 3200 30.02
09:36:51 MS 2100 29.52

Use context by clause together with limit clause to get the first or the last few records for each stock:

select * from t1 context by sym limit 2;
timestamp sym qty price
09:34:07 C 2200 49.6
09:36:42 MS 1900 29.46
09:36:51 MS 2100 29.52
09:32:47 IBM 6800 174.97
09:35:26 IBM 5400 175.23
09:34:16 C 1300 50.76

Use context by clause together with csort and limit clause to get the last 2 records for each stock after sorting by qty:

select * from t1 context by sym limit -2;
timestamp sym qty price
09:36:51 MS 2100 29.52
09:36:59 MS 3200 30.02
09:32:47 IBM 6800 174.97
09:35:26 IBM 5400 175.23
09:34:26 C 2500 50.32
09:38:12 C 8800 51.29

Use context by clause together with csort and limit clause to get the last 2 records for each stock after sorting by qty:

select * from t1 context by sym csort qty limit -2;
timestamp sym qty price
09:34:26 C 2500 50.32
09:38:12 C 8800 51.29
09:35:26 IBM 5400 175.23
09:32:47 IBM 6800 174.97
09:36:51 MS 2100 29.52
09:36:59 MS 3200 30.02

Calculate fitted values of price from the regression of price on qty for each stock:

select *, ols(price, qty)[0]+ols(price, qty)[1]*qty as fittedPrice from t1 context by sym;
timestamp sym qty price fittedPrice
09:34:07 C 2200 49.6 50.282221
09:34:16 C 1300 50.76 50.156053
09:34:26 C 2500 50.32 50.324277
09:38:12 C 8800 51.29 51.207449
09:32:47 IBM 6800 174.97 174.97
09:35:26 IBM 5400 175.23 175.23
09:36:42 MS 1900 29.46 29.447279
09:36:51 MS 2100 29.52 29.535034
09:36:59 MS 3200 30.02 30.017687

The context by clause can be used with the order by clause. The order by columns must be among the output columns.

select *, ols(price, qty)[0]+ols(price, qty)[1]*qty as fittedPrice from t1 context by sym order by timestamp;
timestamp sym qty price fittedPrice
09:32:47 IBM 6800 174.97 174.97
09:34:07 C 2200 49.6 50.075318
09:34:16 C 1300 50.76 49.911222
09:34:26 C 2500 50.32 50.130017
09:35:26 IBM 5400 175.23 175.23
09:36:42 MS 1900 29.46 29.447279
09:36:51 MS 2100 29.52 29.535034
09:36:59 MS 3200 30.02 30.017687
09:38:12 C 8800 51.29 51.278686

context by is different from the contextby function in 3 aspects:

(1) contextby generates a vector while context by is used in a select clause to produce a table.

(2) contextby is limited to one grouping column whereas a context by clause can be used on multiple columns.

(3) contextby calculates one item for every call, whereas a context by clause can calculate multiple items.

Performance Tip

Before we use context by, we should sort the database by the same variable or variables in the context by clause. This could greatly improve the speed of context by clause.

n=1000000
ID=rand(100, n)
x=rand(10.0, n)
ta=table(ID, x)
tb=select * from ta order by ID;
timer select (NULL \:P x)-1 as ret from ta context by ID;
// output
Time elapsed: 4.018 ms

timer select (NULL \:P x)-1 as ret from tb context by ID;
// output
Time elapsed: 2.991 ms

DolphinDB optimizes the performance of context by clause under certain conditions.

To query the latest records of given groups in a partitioned table, use context by clause with keywords csort and limit. The performance of "context by + csort + limit" statement is optimized if the following conditions are satisfied:

  1. The context by column is filtered by the where clause.

  2. The csort column is a partitioning column, and the partition type is VALUE or RANGE.

  3. csort and context by can only specify one column.

  4. The context by column is specified in the select clause.

You can obtain the execution plan by adding keyword [HINT_EXPLAIN] to check whether the statement is optimized.