Panel Data Analysis in DolphinDB

Panel data is a two-dimensional data form that combines cross-sectional and time-series data. In Python, panel data can be represented by a pandas DataFrame or a NumPy 2D array; in DolphinDB, it can be organized as a table or a matrix. This tutorial introduces how to process panel data with SQL statements or vectorized functions in DolphinDB. A comparison test shows that panel analysis in DolphinDB is over 200 times faster than in Python.

All examples in this tutorial are based on DolphinDB version 1.30.16/2.00.4.

1. Organize and Analyze Panel Data

DolphinDB offers two methods of panel analysis by applying

  • SQL queries or vectorized functions to tables;
  • vectorized functions to matrices.

DolphinDB uses columnar storage for tables and matrices. The following are some commonly used operators and functions applicable to columns (a short example follows the list):

  • Binary operators: +, -, *, /, ratio, %, &&, ||, &, |, pow
  • Order-sensitive functions: ratios, deltas, prev, next, move
  • Moving functions: mcount, mavg, msum, mmax, mimax, mimin, mmin, mprod, mstd, mvar, mmed, mpercentile, mrank, mwavg, mwsum, mbeta, mcorr, mcovar
  • Cumulative functions: cumcount, cumavg, cumsum, cummax, cummin, cumprod, cumstd, cumvar, cummed, cumpercentile, cumPositiveStreak, cumrank, cumwavg, cumwsum, cumbeta, cumcorr, cumcovar
  • Aggregate functions: count, avg, sum, sum2, first, firstNot, last, lastNot, max, min, std, var, med, mode, percentile, atImax, atImin, wavg, wsum, beta, corr, covar
  • Row-based functions (applied to rows): rowCount, rowAvg, rowSum, rowSum2, rowProd, rowMax, rowMin, rowStd, rowVar, rowBeta, rowCorr, rowAnd, rowOr, rowXor
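As a quick illustration, here is a minimal sketch on a small vector; the same call syntax applies to table and matrix columns:

x = 1 2 3 4 5
prev(x)        // order-sensitive function
msum(x, 3)     // moving function with a window of 3
cumsum(x)      // cumulative function
avg(x)         // aggregate function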

2. Process Panel Data with SQL Queries

DolphinDB's SQL syntax is similar to the standard SQL language. It extends support to panel analysis, asof join, window join and window functions. This chapter introduces how to process panel data with SQL queries in DolphinDB.

For panel data represented by a DolphinDB table, a column usually represents a variable (such as open, high, low, close, or volume) and each row can be viewed as an observation at a timestamp. For calculations involving multiple indicators, no data alignment is needed. However, when the panel analysis requires grouped data (e.g., time-series analysis by stock, or cross-sectional calculations by time range), the data must be grouped first. DolphinDB offers the SQL keywords group by, context by, and pivot by for data grouping. As grouping incurs overhead, it is recommended to perform grouped calculations within a single SQL query where possible.

The following script simulates a data set with 3 stocks for later use.

sym = `C`C`C`C`MS`MS`MS`IBM`IBM
timestamp = [09:34:57,09:34:59,09:35:01,09:35:02,09:34:57,09:34:59,09:35:01,09:35:01,09:35:02]
price = 50.6 50.62 50.63 50.64 29.46 29.48 29.5 174.97 175.02
volume = 2200 1900 2100 3200 6800 5400 1300 2500 8800
t = table(sym, timestamp, price, volume);
t;

# output
sym timestamp price  volume
--- --------- ------ ------
C   09:34:57  50.6   2200
C   09:34:59  50.62  1900
C   09:35:01  50.63  2100
C   09:35:02  50.64  3200
MS  09:34:57  29.46  6800
MS  09:34:59  29.48  5400
MS  09:35:01  29.5   1300
IBM 09:35:01  174.97 2500
IBM 09:35:02  175.02 8800

2.1 context by

DolphinDB offers the unique context by clause to simplify time-series manipulation on panel data.

When working with panel data, it is often inconvenient to apply time-series functions to each cross-sectional object separately. Typical examples include calculating the moving average or volatility of stock returns, and the cumulative returns or running maximums of different trading strategies. The group by clause divides data into groups and generates one record per group; it is typically used with aggregate functions to produce one value per group. To keep the number of records in each group unchanged, other database systems (such as SQL Server and PostgreSQL) adopt window functions, whereas DolphinDB offers the context by clause in SQL statements. Together with group by and pivot by, context by forms DolphinDB's grouped computation system. Compared with window functions, the context by clause is more concise, efficient, and powerful. Specifically:

  • context by can be used with the select clause to query data, or with the update clause to update data.
  • Window functions can only group data based on existing fields, while context by can be performed based on calculated fields.
  • Window functions of other database systems only include a limited number of functions, whereas the context by clause has no limits on what functions can be used. Meanwhile, context by can be used with any expression, such as a combination of multiple functions.
  • context by can be used with the having clause to filter the records within each group (see the sketch after this list).
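For instance, here is a minimal sketch based on the table t defined above: it updates a copy of t with context by (first point) and filters whole groups with having (last point). The column name cumVol is chosen for illustration.

t2 = select * from t
// add each stock's cumulative trading volume as a new column
update t2 set cumVol = cumsum(volume) context by sym

// keep only the stocks whose maximum trade volume is at least 8000 (here, only IBM qualifies)
select * from t context by sym having max(volume) >= 8000;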

Examples

(1) Group the data by symbols and apply function ratios to each group to calculate the ratios of trading volume between 2 consecutive trades for each stock:

select timestamp, sym, price, ratios(volume), volume from t context by sym;

# output
timestamp sym price  ratios_volume volume
--------- --- ------ ------------- ------
09:34:57  C   50.6                  2200
09:34:59  C   50.62  0.86           1900
09:35:01  C   50.63  1.11           2100
09:35:02  C   50.64  1.52           3200
09:35:01  IBM 174.97                2500
09:35:02  IBM 175.02 3.52           8800
09:34:57  MS  29.46                 6800
09:34:59  MS  29.48  0.79           5400
09:35:01  MS  29.5   0.24           1300

(2) Calculate the moving average price of each stock over every 3 trades with the m-function mavg:

select *, mavg(price,3) from t context by sym;

# output
sym timestamp price  volume mavg_price
--- --------- ------ ------ -----------
C   09:34:57  50.60  2200
C   09:34:59  50.62  1900
C   09:35:01  50.63  2100   50.62
C   09:35:02  50.64  3200   50.63
IBM 09:35:01  174.97 2500
IBM 09:35:02  175.02 8800
MS  09:34:57  29.46  6800
MS  09:34:59  29.48  5400
MS  09:35:01  29.50  1300   29.48

(3) Calculate the cumulative maximum trading volume of each stock with the cum-function cummax:

select timestamp, sym, price,volume, cummax(volume) from t context by sym;

# output
timestamp sym price  volume cummax_volume
--------- --- ------ ------ -------------
09:34:57  C   50.6   2200   2200
09:34:59  C   50.62  1900   2200
09:35:01  C   50.63  2100   2200
09:35:02  C   50.64  3200   3200
09:35:01  IBM 174.97 2500   2500
09:35:02  IBM 175.02 8800   8800
09:34:57  MS  29.46  6800   6800
09:34:59  MS  29.48  5400   6800
09:35:01  MS  29.5   1300   6800

(4) Calculate the maximum trading volume for each stock every minute with the aggregate function max, and then append the result as a new column to the input table.

select *, max(volume) from t context by sym, timestamp.minute();

# output
sym timestamp price  volume max_volume
--- --------- ------ ------ ----------
C   09:34:57  50.60  2200   2200
C   09:34:59  50.62  1900   2200
C   09:35:01  50.63  2100   3200
C   09:35:02  50.64  3200   3200
IBM 09:35:01  174.97 2500   8800
IBM 09:35:02  175.02 8800   8800
MS  09:34:57  29.46  6800   6800
MS  09:34:59  29.48  5400   6800
MS  09:35:01  29.5   1300   1300

2.2 pivot by

The pivot by clause rearranges one or more columns of a table in two dimensions. It generates a table when used with a select statement, or a matrix when used with an exec statement. As the following example shows, each column in the rearranged table represents a stock and the records are sorted by the column "timestamp".

select price from t pivot by timestamp, sym;

# output
timestamp C     IBM    MS
--------- ----- ------ -----
09:34:57  50.6         29.46
09:34:59  50.62        29.48
09:35:01  50.63 174.97 29.5
09:35:02  50.64 175.02

pivot by can be used with an aggregate function. For example, you can use pivot by with select to generate a table of the average price per minute for each stock.

select avg(price) from t where sym in `C`IBM pivot by minute(timestamp) as minute, sym;

# output
minute C      IBM
------ ------ -------
09:34m 50.61
09:35m 50.635 174.995

Alternatively, you can use pivot by with exec to generate a matrix:

resM = exec avg(price) from t where sym in `C`IBM pivot by minute(timestamp) as minute, sym;
resM

# output
       C      IBM
       ------ -------
09:34m|50.61
09:35m|50.635 174.995

Check the data form of "resM" with typestr:

typestr(resM)

# output
FAST DOUBLE MATRIX

3. Process Panel Data with Vectorized Functions

Use DolphinDB built-in function panel to rearrange one or more columns into a matrix.

Arrange the column price of table t as a matrix:

price = panel(t.timestamp, t.sym, t.price);
price;
 
# output
         C     IBM    MS
         ----- ------ -----
09:34:57|50.60        29.46
09:34:59|50.62        29.48
09:35:01|50.63 174.97 29.5
09:35:02|50.64 175.02

The following script converts the columns price and volume into 2 matrices and returns a tuple:

price, volume = panel(t.timestamp, t.sym, [t.price, t.volume]);

You can specify row and column labels for the matrix with function panel. Note that the labels must be composed of distinct values in ascending order. For example:

rowLabel = 09:34:59..09:35:02;
colLabel = ["C", "MS"];
volume = panel(t.timestamp, t.sym, t.volume, rowLabel, colLabel);
volume;
 
# output
         C    MS
         ---- ----
09:34:59|1900 5400
09:35:00|
09:35:01|2100 1300
09:35:02|3200

Use function rowNames (or colNames) to obtain the row (or column) labels specified by function panel.

volume.rowNames();
volume.colNames();

We recommend using matrices rather than tables for panel analysis, as DolphinDB supports vectorized and binary operations on matrices with optimal performance.

3.1 Examples

The following examples show how to deal with panel data in a matrix.

(1) Calculate the changes in stock prices for each stock with function deltas.

price = panel(t.timestamp, t.sym, t.price);
deltas(price);
 
# output
         C    IBM  MS
         ---- ---- -----
09:34:57|
09:34:59|0.02      0.02
09:35:01|0.01      0.02
09:35:02|0.01 0.05

(2) Calculate the average price of each stock over every two timestamps with the moving function mavg.

mavg(price,2);
 
# output
         C      IBM      MS
         ------ ------ -----------
09:34:57|
09:34:59|50.61         29.47
09:35:01|50.63  174.97 29.49
09:35:02|50.63  175.00 29.50

(3) Calculate the cumulative rank of prices of each stock with the cumulative function cumrank.

cumrank(price);
 
# output
         C IBM MS
         - --- --
09:34:57|0     0
09:34:59|1     1
09:35:01|2 0   2
09:35:02|3 1

(4) Return the lowest price of each stock with the aggregate function min.

min(price);
 
# output
[50.60,174.97,29.46]

(5) Return the lowest price at each timestamp with the aggregate function rowMin.

rowMin(price);
 
 # output
[29.46,29.48,29.5,50.64]

3.2 Binary Operations on Aligned Matrices

Binary operations on standard matrices (without labels) are performed element by element, which requires the matrices to have the same shape. DolphinDB also provides the indexed matrix and indexed series, for which the shapes of the operands do not have to be the same.

  • indexed matrix: a matrix with row/column labels as index
  • indexed series: a vector with label as index

Before version 1.30.20/2.00.8, aligned calculations could only be performed with indexed series or indexed matrices, whose row or column labels (the index) must be strictly increasing. When binary operations are performed on indexed series or indexed matrices, the data is aligned with an outer join before the calculation. Since version 1.30.20/2.00.8, DolphinDB also provides the function align for more alignment methods.

(1) Binary operations on 2 indexed series:

Data is automatically aligned based on the index.

index1 = 2020.11.01..2020.11.06;
value1 = 1..6;
s1 = indexedSeries(index1, value1);
 
index2 = 2020.11.04..2020.11.09;
value2 =4..9;
s2 = indexedSeries(index2, value2);
 
s1+s2;
 
 # output
           #0
           --
2020.11.01|
2020.11.02|
2020.11.03|
2020.11.04|8
2020.11.05|10
2020.11.06|12
2020.11.07|
2020.11.08|
2020.11.09|

(2) Binary operations on indexed matrices:

The alignment of two indexed matrices is the same as that of indexed series.

id1 = 2020.11.01..2020.11.06;
m1 = matrix(1..6, 7..12, 13..18).rename!(id1, `a`b`d)
m1.setIndexedMatrix!()
 
id2 = 2020.11.04..2020.11.09;
m2 = matrix(4..9, 10..15, 16..21).rename!(id2, `a`b`c)
m2.setIndexedMatrix!()
 
m1+m2;
 
 # output
           a  b  c d
           -- -- - -
2020.11.01|        
2020.11.02|        
2020.11.03|        
2020.11.04|8  20   
2020.11.05|10 22   
2020.11.06|12 24   
2020.11.07|        
2020.11.08|        
2020.11.09|

(3) Binary operations on an indexed series and an indexed matrix:

The data is aligned based on the row labels, and the calculation is conducted between the indexed series and each column of the indexed matrix.

m1=matrix(1..6, 11..16);
m1.rename!(2020.11.04..2020.11.09, `A`B);
m1.setIndexedMatrix!();
m1;
 
 # output
           A B
           - --
2020.11.04|1 11
2020.11.05|2 12
2020.11.06|3 13
2020.11.07|4 14
2020.11.08|5 15
2020.11.09|6 16
 
s1;
 
 # output
           #0
           --
2020.11.01|1
2020.11.02|2
2020.11.03|3
2020.11.04|4
2020.11.05|5
2020.11.06|6
 
m1 + s1;
 
 # output
           A B
           - --
2020.11.01|
2020.11.02|
2020.11.03|
2020.11.04|5 15
2020.11.05|7 17
2020.11.06|9 19
2020.11.07|
2020.11.08|
2020.11.09|

(4) Alignment of matrices with align:

x1 = [09:00:00, 09:00:01, 09:00:03]
x2 = [09:00:00, 09:00:03, 09:00:03, 09:00:04]
y1 = `a`a`b
y2 = `a`b`b
m1 = matrix(1 2 3, 2 3 4, 3 4 5).rename!(y1,x1)
m2 = matrix(11 12 13, 12 13 14, 13 14 15, 14 15 16).rename!(y2,x2)
a, b = align(m1, m2, 'ej,aj', false);
a;
 
# output
  09:00:00 09:00:01 09:00:03
  -------- -------- --------
a|1        2        3      
a|2        3        4      
b|3        4        5     
 
b;
 
# output
  09:00:00 09:00:01 09:00:03
  -------- -------- --------
a|11       11       13     
b|12       12       14     
b|13       13       15

3.3 Resample and Frequency Conversion

The DolphinDB built-in functions resample and asfreq perform frequency conversion and resampling on an indexed series or indexed matrix with a temporal index.

3.3.1 resample

The function resample is mainly used for frequency conversion and resampling of time series. An aggregate function must be specified.

Downsample to monthly frequency:

index=2020.01.01..2020.06.30;
s=indexedSeries(index, take(1,size(index)));
s.resample("M",sum);
 
 # output
           #0
           --
2020.01.31|31
2020.02.29|29
2020.03.31|31
2020.04.30|30
2020.05.31|31
2020.06.30|30

3.3.2 asfreq

The function asfreq is used for frequency conversion. Unlike resample, asfreq cannot take an aggregate function to process the data. asfreq is usually used to convert low-frequency data to a higher frequency (upsampling) and is often combined with a filling function.

Upsample the data on daily frequency:

index=2020.01.01 2020.01.05 2020.01.10
s=indexedSeries(index, take(1,size(index)));
s.asfreq("D").ffill()
 
 # output
            #0
           --
2020.01.01|1
2020.01.02|1
2020.01.03|1
2020.01.04|1
2020.01.05|1
2020.01.06|1
2020.01.07|1
2020.01.08|1
2020.01.09|1
2020.01.10|1

3.4 Aggregating Matrices

3.4.1 Aggregating Columns

Built-in functions (including vectorized functions, aggregate functions, and window functions) applied to a matrix perform their calculations column by column.

For example, apply the function sum to a matrix:

m = rand(10, 20)$10:2
sum(m)
 
# output
[69, 38]

Each column is treated as a single vector in calculation.

To apply different arguments to the columns of a matrix, you can use the higher-order function each.

m = rand(10, 20)$10:2
m
 
# output
#0 #1
-- --
0  8 
5  2 
5  3 
6  9 
6  4 
8  2 
9  8 
5  6 
0  4 
4  4 
 
each(msum, m, 3 4)
 
# output
#0 #1
-- --
     
     
10   
16 22
17 18
20 18
23 23
22 20
14 20
9  22

3.4.2 Aggregating Rows

DolphinDB provides a higher-order function byRow and built-in row-based functions to perform calculation by row.

For example, apply function rowCount to a matrix:

m=matrix([4.5 NULL 1.5, 1.5 4.8 5.9, 4.9 2.0 NULL]);
rowCount(m);
 
# output
[3,2,2]

The function rowCount returns a vector whose length is the same as the number of rows in the original matrix. Each element in the vector indicates the number of non-NULL elements in a row.

In addition to the row-based functions, any function can be combined with the higher-order function byRow for row-based calculation. The following example reuses the matrix m defined in section 3.4.1 and applies the aggregate function avg to each row.

byRow(avg, m)
 
# output
[4,3.5,4,7.5,5,5,8.5,5.5,2,4]

3.4.3 Aggregating in Groups

You can use a SQL group by clause on a table, or the function regroup on a matrix, to implement aggregation in groups (a group by sketch is given at the end of this section).

For example, aggregate a matrix based on the timestamp label:

timestamp = 09:00:00 + rand(10000, 1000).sort!()
id= rand(['st1', 'st2'], 1000)
price = (190 + rand(10.0, 2000))$1000:2
regroup(price, minute(timestamp), avg, true)

For a matrix generated with pivot by, you can aggregate it based on the labels obtained with rowNames (or colNames).

n=1000
timestamp = 09:00:00 + rand(10000, n).sort!()
id = take(`st1`st2`st3, n)
vol = 100 + rand(10.0, n)
t = table(timestamp, id, vol)
 
m = exec vol from t pivot by timestamp, id
regroup(m,minute(m.rowNames()), avg)
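
For comparison, here is a sketch of the same per-minute aggregation written with a SQL group by on the table t built above:

select avg(vol) as avgVol from t group by minute(timestamp) as minute, id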

4. DolphinDB vs. Python: Performance of Panel Analysis

This chapter uses a complex example to demonstrate how to process panel data efficiently. The paper 101 Formulaic Alphas presents 101 quant trading alphas used by WorldQuant, one of Wall Street's top quantitative hedge funds.

Take Alpha 98 for example:

alpha_098 = (rank(decay_linear(correlation(((high_0+low_0+open_0+close_0)*0.25), sum(mean(volume_0,5), 26.4719), 4.58418), 7.18088)) -rank(decay_linear(ts_rank(ts_argmin(correlation(rank(open_0), rank(mean(volume_0,15)), 20.8187), 8.62571),6.95668), 8.07206)))

To compare the performance of the methods mentioned above, we use one year's daily stock data with approximately 1 million records (which can be generated by the panelDataSimulation script).

The table contains the following columns:

  • securityid: the stock symbol
  • tradetime: the trading time
  • vol: the trading volume
  • vwap: the weighted average prices
  • open: the opening price
  • close: the closing price

4.1 SQL Queries vs. Vectorized Functions

The following examples use SQL queries and vectorized functions respectively to calculate Alpha 98 in DolphinDB.

  • Use DolphinDB SQL queries:
def alpha98(stock){
	// per-stock moving averages of volume over 5 and 15 rows
	t = select securityid, tradetime, vwap, open, mavg(vol, 5) as adv5, mavg(vol,15) as adv15 from stock context by securityid
	// cross-sectional ranks at each timestamp
	update t set rank_open = rank(open), rank_adv15 = rank(adv15) context by tradetime
	// per-stock time-series factors with linearly decayed weights (1..7 and 1..8)
	update t set decay7 = mavg(mcorr(vwap, msum(adv5, 26), 5), 1..7), decay8 = mavg(mrank(9 - mimin(mcorr(rank_open, rank_adv15, 21), 9), true, 7), 1..8) context by securityid
	// final cross-sectional rank difference at each timestamp
	return select securityid, tradetime, rank(decay7)-rank(decay8) as A98 from t context by tradetime
}

t = loadTable("dfs://k_day_level","k_day")
timer alpha98(t)
  • Use DolphinDB vectorized functions:
def myrank(x){
	// cross-sectional percentile rank: rank within each row divided by the number of columns
	return rowRank(x)\x.columns()
}

// each input is a matrix with timestamps as row labels and stocks as column labels
def alphaPanel98(vwap, open, vol){
	return myrank(mavg(mcorr(vwap, msum(mavg(vol, 5), 26), 5), 1..7)) - myrank(mavg(mrank(9 - mimin(mcorr(myrank(open), myrank(mavg(vol, 15)), 21), 9), true, 7), 1..8))
}
t = select * from loadTable("dfs://k_day_level","k_day")
timer vwap, open, vol = panel(t.tradetime, t.securityid, [t.vwap, t.open, t.vol])
timer res = alphaPanel98(vwap, open, vol)

The second script using vectorized functions is more concise.

To calculate Alpha 98 for a stock on a given day, we need all stocks' data on that day as well as the stock's own data from previous days. Using a matrix therefore makes the calculation simpler, as no dimensional transformation of the intermediate or final results is needed. The SQL implementation, by contrast, involves nested functions and several grouped calculations. In conclusion, the script using the function panel is cleaner and more efficient.

Both scripts are executed with a single thread. The first script with SQL queries takes 610 milliseconds, whereas the second script using a matrix takes 510 milliseconds (function panel takes 70 milliseconds to create the panel data and the calculation takes 440 milliseconds).

The performance of panel analysis with vectorized functions is slightly better. However, panel data represented by a matrix cannot be regrouped, and a single matrix holds only one indicator, so multiple indicators cannot be examined at the same time. With panel data represented by a table, SQL statements can group data by multiple columns and query multiple indicators together, which is better suited to parallel computing on large volumes of data.

4.2 DolphinDB vs. pandas

Part of the pandas script for the Alpha 98 calculation is shown below.

import time
import numpy as np
import pandas as pd

def myrank(x):
    return ((x.rank(axis=1, method='min')) - 1) / x.shape[1]

def imin(x):
    return np.where(x == min(x))[0][0]

def rank(x):
    s = pd.Series(x)
    return (s.rank(ascending=True, method="min")[len(s) - 1]) - 1

def alpha98(vwap, open, vol):
    return myrank(vwap.rolling(5).corr(vol.rolling(5).mean().rolling(26).sum()).rolling(7).apply(lambda x: np.sum(np.arange(1, 8)*x)/np.sum(np.arange(1, 8)))) - myrank((9 - myrank(open).rolling(21).corr(myrank(vol.rolling(15).mean())).rolling(9).apply(imin)).rolling(7).apply(rank).rolling(8).apply(lambda x: np.sum(np.arange(1, 9)*x)/np.sum(np.arange(1, 9))))

start_time = time.time()
re = alpha98(vwap, open, vol)
print("--- %s seconds ---" % (time.time() - start_time))

The pandas script takes 520 seconds to calculate Alpha 98, which is approximately 1000 times slower than using a DolphinDB matrix.

DolphinDB provides optimized built-in functions for time-series processing, such as the m-functions mavg, mcorr, mrank, mimin, and msum used above. These built-in functions outperform their counterparts in other systems by 1-2 orders of magnitude. In particular, mcorr takes only 0.6 seconds in DolphinDB, over 200 times faster than the 142 seconds in pandas. To reduce measurement error, this comparison was repeated 10 times on 10 years of closing prices (around 5.3 million records).