Panel Data Analysis
Panel data is a two-dimensional data form that combines cross-sectional and time-series data. In Python, panel data can be represented by a pandas DataFrame or a NumPy 2d array; In DolphinDB, it can be organized in a table or matrix. This tutorial introduces how to process panel data with SQL statements or vectorized functions in DolphinDB. The comparison test shows that the performance of panel analysis in DolphinDB outperforms Python by over 200 times.
All examples are illustrated based on version 1.30.16/2.00.4.
1. Organize and Analyze Panel Data
DolphinDB offers 2 methods of panel analysis by applying
- SQL queries or vectorized functions to tables;
- vectorized functions to matrices.
DolphinDB uses columnar storage for tables and matrices. The following are some commonly-used operators and functions applicable to columns:
- Binary operators: +, -, *, /, ratio, %, &&, ||, &, |, pow
- Order-sensitive functions: ratios, deltas, prev, next, move
- Moving functions: mcount, mavg, msum, mmax, mimax, mimin, mmin, mprod, mstd, mvar, mmed, mpercentile, mrank, mwavg, mwsum, mbeta, mcorr, mcovar
- Cumulative functions: cumcount, cumavg, cumsum, cummax, cummin, cumprod, cumstd, cumvar, cummed, cumpercentile, cumPositiveStreak, cumrank, cumwavg, cumwsum, cumbeta, cumcorr, cumcovar
- Aggregate functions: count, avg, sum, sum2, first, firstNot, last, lastNot, max, min, std, var, med, mode, percentile, atImax, atImin, wavg, wsum, beta, corr, covar
- Row-based functions (applied to rows): rowCount, rowAvg, rowSum, rowSum2, rowProd, rowMax, rowMin, rowStd, rowVar, rowBeta, rowCorr, rowAnd, rowOr, rowXor
2. Process Panel Data with SQL Queries
DolphinDB's SQL syntax is similar to the standard SQL language. It extends support to panel analysis, asof join, window join and window functions. This chapter introduces how to process panel data with SQL queries in DolphinDB.
For panel data represented by a DolphinDB table, usually a column represents a variable (such as open, high, low, close, or volume) and a row of record can be viewed as an observation at a timestamp. For calculations involving multiple indicators, data alignment is not needed. However, when the panel analysis requires grouped data (e.g., time-series analysis on stocks, or cross-sectional calculations by time range), grouping must be conducted first. DolphinDB offers SQL keywords group by
, context by
, and pivot by
for data grouping. As grouping data has certain overhead, it is recommended to perform grouped calculations within one SQL query.
The following script simulates a data set with 3 stocks for later use.
sym = `C`C`C`C`MS`MS`MS`IBM`IBM
timestamp = [09:34:57,09:34:59,09:35:01,09:35:02,09:34:57,09:34:59,09:35:01,09:35:01,09:35:02]
price= 50.6 50.62 50.63 50.64 29.46 29.48 29.5 174.97 175.02
volume = 2200 1900 2100 3200 6800 5400 1300 2500 8800
t = table(sym, timestamp, price, volume);
t;
# output
sym timestamp price volume
--- --------- ------ ------
C 09:34:57 50.6 2200
C 09:34:59 50.62 1900
C 09:35:01 50.63 2100
C 09:35:02 50.64 3200
MS 09:34:57 29.46 6800
MS 09:34:59 29.48 5400
MS 09:35:01 29.5 1300
IBM 09:35:01 174.97 2500
IBM 09:35:02 175.02 8800
2.1. 2.1 context by
DolphinDB offers a unique feature context by
to simplify the time series manipulation in panel data.
When we work with panel data, it is usually not convenient to apply time-series functions on each cross sectional object. Such examples include calculating moving average/volatility for stock returns, and cumulative returns or running maximums of different trading strategies. The group by
clause divides data into different groups and generates a record for each group. It is often used with aggregations to show one value per grouped field. To keep the number of records in each group unchanged, other database systems (such as SQL Server and PostgreSQL) adopt window functions, whereas DolphinDB offers context by
clause in SQL statements. context by
, as well as group by
and pivot by
, composes the grouped calculation system in DolphinDB. Compared with the window functions, the context by
clause is more concise, efficient and powerful, specifically:
context by
can be used with theselect
clause to query data, or with theupdate
clause to update data.- Window functions can only group data based on existing fields, while
context by
can be performed based on calculated fields. - Window functions of other database systems only include a limited number of functions, whereas
context by
clause has no limits on what functions can be used. Meanwhile,context by
can be used with any expressions, such as a combination of multiple functions. context by
can be used with thehaving
clause to filter the records within each group.
Examples
(1) Group the data by symbols and apply function ratios
to each group to calculate the ratios of trading volume between 2 consecutive trades for each stock:
select timestamp, sym, price, ratios(volume) ,volume from t context by sym;
# output
timestamp sym price ratios_volume volume
--------- --- ------ ------------- ------
09:34:57 C 50.6 2200
09:34:59 C 50.62 0.86 1900
09:35:01 C 50.63 1.106 2100
09:35:02 C 50.64 1.52 3200
09:35:01 IBM 174.97 2500
09:35:02 IBM 175.02 3.52 8800
09:34:57 MS 29.46 6800
09:34:59 MS 29.48 0.79 5400
09:35:01 MS 29.5 0.24 1300
(2) Calculate moving average price of each stock for every 3 trades with the m-function mavg
:
select *, mavg(price,3) from t context by sym;
# output
sym timestamp price volume mavg_price
--- --------- ------ ------ -----------
C 09:34:57 50.60 2200
C 09:34:59 50.62 1900
C 09:35:01 50.63 2100 50.62
C 09:35:02 50.64 3200 50.63
IBM 09:35:01 174.97 2500
IBM 09:35:02 175.02 8800
MS 09:34:57 29.46 6800
MS 09:34:59 29.48 5400
MS 09:35:01 29.50 1300 29.48
(3) Calculate the cumulative maximum trading volume of each stock with the cum-function cummax
:
select timestamp, sym, price,volume, cummax(volume) from t context by sym;
# output
timestamp sym price volume cummax_volume
--------- --- ------ ------ -------------
09:34:57 C 50.6 2200 2200
09:34:59 C 50.62 1900 2200
09:35:01 C 50.63 2100 2200
09:35:02 C 50.64 3200 3200
09:35:01 IBM 174.97 2500 2500
09:35:02 IBM 175.02 8800 8800
09:34:57 MS 29.46 6800 6800
09:34:59 MS 29.48 5400 6800
09:35:01 MS 29.5 1300 6800
(4) Calculate the maximum trading volume for each stock every minute with the aggregate function max
, and then append the result as a new column to the input table.
select *, max(volume) from t context by sym, timestamp.minute();
# output
sym timestamp price volume max_volume
--- --------- ------ ------ ----------
C 09:34:57 50.61 2200 2200
C 09:34:59 50.62 1900 2200
C 09:35:01 50.63 2100 3200
C 09:35:02 50.64 3200 3200
IBM 09:35:01 174.97 2500 8800
IBM 09:35:02 175.02 8800 8800
MS 09:34:57 29.46 6800 6800
MS 09:34:59 29.48 5400 6800
MS 09:35:01 29.5 1300 1300
2.2. 2.2 pivot by
The pivot by
clause rearranges a column (or multiple columns) of a table in two dimensions. It generates a table with select
statement or a matrix with exec
statement. As the following example shows, each column in the rearranged table represents a stock and the records are sorted based on the column "timestamp".
select price from t pivot by timestamp, sym;
# output
timestamp C IBM MS
--------- ----- ------ -----
09:34:57 50.6 29.46
09:34:59 50.62 29.48
09:35:01 50.63 174.97 29.5
09:35:02 50.64 175.02
pivot by
can be used with an aggregate function. For example, you can use pivot by
and select
to generate a table of the average closing price per minute.
select avg(price) from t where sym in `C`IBM pivot by minute(timestamp) as minute, sym;
# output
minute C IBM
------ ------ -------
09:34m 50.61
09:35m 50.635 174.995
Alternatively, you can use pivot by
with exec
to generate a matrix:
resM = exec avg(price) from t where sym in `C`IBM pivot by minute(timestamp) as minute, sym;
resM
# output
C IBM
------ -------
09:34m|50.61
09:35m|50.635 174.995
Check the data form of "resM" with typestr
:
typestr(resM)
# output
FAST DOUBLE MATRIX
3. Process Panel Data with Vectorized Functions
Use DolphinDB built-in function panel
to rearrange one or more columns into a matrix.
Arrange the column price of table t as a matrix:
price = panel(t.timestamp, t.sym, t.price);
price;
# output
C IBM MS
----- ------ -----
09:34:57|50.60 29.46
09:34:59|50.62 29.48
09:35:01|50.63 174.97 29.5
09:35:02|50.64 175.02
The following script converts the columns price and volume into 2 matrices and returns a tuple:
price, volume = panel(t.timestamp, t.sym, [t.price, t.volume]);
You can specify row and column labels for the matrix with function panel
. Note that the labels must be composed of distinct values in ascending order. For example:
rowLabel = 09:34:59..09:35:02;
colLabel = ["C", "MS"];
volume = panel(t.timestamp, t.sym, t.volume, rowLabel, colLabel);
volume;
# output
C MS
---- ----
09:34:59|1900 5400
09:35:00|
09:35:01|2100 1300
09:35:02|3200
Use function rowNames
(or colNames
) to obtain the row (or column) labels specified by function panel
.
volume.rowNames();
volume.colNames();
We recommend you use matrices rather than tables for panel analysis as DolphinDB supports vectorized and binary operations on matrices with optimal performance.
3.1. 3.1 Examples
The following examples show how to deal with panel data in a matrix.
(1) Calculate the changes in stock prices for each stock with function deltas
.
price = panel(t.timestamp, t.sym, t.price);
deltas(price);
# output
C IBM MS
---- ---- -----
09:34:57|
09:34:59|0.02 0.02
09:35:01|0.01 0.02
09:35:02|0.01 0.05
(2) Calculate the average prices of a stock within every two timestamps with the moving function mavg
.
mavg(price,2);
# output
C IBM MS
------ ------ -----------
09:34:57|
09:34:59|50.61 29.47
09:35:01|50.63 174.97 29.49
09:35:02|50.63 175.00 29.50
(3) Calculate the cumulative rank of prices of each stock with the cumulative function cumrank
.
cumrank(price);
# output
C IBM MS
- --- --
09:34:57|0 0
09:34:59|1 1
09:35:01|2 0 2
09:35:02|3 1
(4) Return the lowest price of each stock with the aggregate function min
.
min(price);
# output
[50.60,174.97,29.46]
(5) Return the lowest price at each timestamp with the aggregate function rowMin
.
rowMin(price);
# output
[29.46,29.48,29.5,50.64]
3.2. 3.2 Binary Operations on Aligned Matrices
Binary operations on standard matrices (without labels) are performed on the corresponding elements, which requires the matrices to be of the same shape. DolphinDB provides indexed matrix and indexed series, and the shapes of the matrices do not have to be the same.
- indexed matrix: a matrix with row/column labels as index
- indexed series: a vector with label as index
Before version 1.30.20/2.00.8, indexed series or indexed matrices are used for aligned calculation where the row or column labels (index) must be strictly increasing. When binary operations are performed on indexed series and indexed matrices, the data is aligned with outer join
for calculation. Since version 1.30.20/2.00.8, DolphinDB provides function align
for more alignment methods.
(1) Binary operations on 2 indexed series:
Data is automatically aligned based on the index.
index1 = 2020.11.01..2020.11.06;
value1 = 1..6;
s1 = indexedSeries(index1, value1);
index2 = 2020.11.04..2020.11.09;
value2 =4..9;
s2 = indexedSeries(index2, value2);
s1+s2;
# output
#0
--
2020.11.01|
2020.11.02|
2020.11.03|
2020.11.04|8
2020.11.05|10
2020.11.06|12
2020.11.07|
2020.11.08|
2020.11.09|
(2) Binary operations on indexed matrices:
Alignment of 2 indexed matrices are the same as that on indexed series.
id1 = 2020.11.01..2020.11.06;
m1 = matrix(1..6, 7..12, 13..18).rename!(id1, `a`b`d)
m1.setIndexedMatrix!()
id2 = 2020.11.04..2020.11.09;
m2 = matrix(4..9, 10..15, 16..21).rename!(id2, `a`b`c)
m2.setIndexedMatrix!()
m1+m2;
# output
a b c d
-- -- - -
2020.11.01|
2020.11.02|
2020.11.03|
2020.11.04|8 20
2020.11.05|10 22
2020.11.06|12 24
2020.11.07|
2020.11.08|
2020.11.09|
(3) Binary operations on an indexed series and an indexed matrix:
It aligns the data based on the row labels, and the calculation is conducted on the indexed series with each column of the indexed matrix.
m1=matrix(1..6, 11..16);
m1.rename!(2020.11.04..2020.11.09, `A`B);
m1.setIndexedMatrix!();
m1;
# output
A B
- --
2020.11.04|1 11
2020.11.05|2 12
2020.11.06|3 13
2020.11.07|4 14
2020.11.08|5 15
2020.11.09|6 16
s1;
# output
#0
--
2020.11.01|1
2020.11.02|2
2020.11.03|3
2020.11.04|4
2020.11.05|5
2020.11.06|6
m1 + s1;
# output
A B
- --
2020.11.01|
2020.11.02|
2020.11.03|
2020.11.04|5 15
2020.11.05|7 17
2020.11.06|9 19
2020.11.07|
2020.11.08|
2020.11.09|
(4) Alignment of matrices with align
:
x1 = [09:00:00, 09:00:01, 09:00:03]
x2 = [09:00:00, 09:00:03, 09:00:03, 09:00:04]
y1 = `a`a`b
y2 = `a`b`b
m1 = matrix(1 2 3, 2 3 4, 3 4 5).rename!(y1,x1)
m2 = matrix(11 12 13, 12 13 14, 13 14 15, 14 15 16).rename!(y2,x2)
a, b = align(m1, m2, 'ej,aj', false);
a;
# output
09:00:00 09:00:01 09:00:03
-------- -------- --------
a|1 2 3
a|2 3 4
b|3 4 5
b;
# output
09:00:00 09:00:01 09:00:03
-------- -------- --------
a|11 11 13
b|12 12 14
b|13 13 15
3.3. 3.3 Resample and Frequency Conversion
DolphinDB built-in functions resample
and asfreq
are used for frequency conversion and resampling on the indexed series or indexed matrix with temporal index.
3.3.1. 3.3.1 resample
The function resample
is mainly used for frequency conversion and resampling of time series. An aggregate function must be specified.
Downsample to monthly frequency:
index=2020.01.01..2020.06.30;
s=indexedSeries(index, take(1,size(index)));
s.resample("M",sum);
# output
#0
--
2020.01.31|31
2020.02.29|29
2020.03.31|31
2020.04.30|30
2020.05.31|31
2020.06.30|30
3.3.2. 3.3.2 asfreq
The function asfreq
is used for frequency conversion. Different from resample
, aggregate functions cannot be used in function asfreq
to process the data. asfreq
is usually used to convert low-frequency data to a higher frequency (upsampling) and it is usually used with a filling function.
Upsample the data on daily frequency:
index=2020.01.01 2020.01.05 2020.01.10
s=indexedSeries(index, take(1,size(index)));
s.asfreq("D").ffill()
# output
#0
--
2020.01.01|1
2020.01.02|1
2020.01.03|1
2020.01.04|1
2020.01.05|1
2020.01.06|1
2020.01.07|1
2020.01.08|1
2020.01.09|1
2020.01.10|1
3.4. 3.4 Aggregating Matrices
3.4.1. 3.4.1 Aggregating Columns
The calculations of built-in functions (including vectorized functions, aggregate functions, and window functions) to matrices are conducted based on columns.
For example, apply the function sum to a matrix:
m = rand(10, 20)$10:2
sum(m)
# output
[69, 38]
Each column is treated as a single vector in calculation.
To apply different arguments to the columns of a matrix, you can use the higher-order function each
.
m = rand(10, 20)$10:2
m
# output
#0 #1
-- --
0 8
5 2
5 3
6 9
6 4
8 2
9 8
5 6
0 4
4 4
each(msum, m, 3 4)
# output
#0 #1
-- --
10
16 22
17 18
20 18
23 23
22 20
14 20
9 22
3.4.2. 3.4.2 Aggregating Rows
DolphinDB provides a higher-order function byRow
and built-in row-based functions to perform calculation by row.
For example, apply function rowCount
to a matrix:
m=matrix([4.5 NULL 1.5, 1.5 4.8 5.9, 4.9 2.0 NULL]);
rowCount(m);
# output
[3,2,2]
The function rowCount
returns a vector whose length is the same as the number of rows in the original matrix. Each element in the vector indicates the number of non-NULL elements in a row.
In addition to the row-based functions, you can use functions with the higher-order function byRow
for row-based calculation. The following example reuses function mfunc
and matrix m defined in section 3.4.1.
byRow(mfunc{, 0}, m)
# output
[6,5.5,3.5,5,8,4.5,6,7.5,3.5,3.5]
3.4.3. 3.4.3 Aggregating in Groups
You can use an SQL group by
clause on a table, or function regroup
on a matrix to implement aggregation in groups.
For example, aggregate a matrix based on the timestamp label:
timestamp = 09:00:00 + rand(10000, 1000).sort!()
id= rand(['st1', 'st2'], 1000)
price = (190 + rand(10.0, 2000))$1000:2
regroup(price, minute(timestamp), avg, true)
For a matrix generated with pivot by
, you can aggregate it based on the labels obtained with rowNames
(or colNames
).
n=1000
timestamp = 09:00:00 + rand(10000, n).sort!()
id = take(`st1`st2`st3, n)
vol = 100 + rand(10.0, n)
t = table(timestamp, id, vol)
m = exec vol from t pivot by timestamp, id
regroup(m,minute(m.rowNames()), avg)
4. DolphinDB vs. Python: Performance of Panel Analysis
This chapter uses a complex example to demonstrate how to process panel data efficiently. The paper 101 Formulaic Alphas presents 101 quant trading alphas used by WorldQuant, one of Wall Street's top quantitative hedge funds.
Take Alpha 98 for example:
alpha_098 = (rank(decay_linear(correlation(((high_0+low_0+open_0+close_0)*0.25), sum(mean(volume_0,5), 26.4719), 4.58418), 7.18088)) -rank(decay_linear(ts_rank(ts_argmin(correlation(rank(open_0), rank(mean(volume_0,15)), 20.8187), 8.62571),6.95668), 8.07206)))
To compare the performance of the methods mentioned above, we use one year's daily stock data with approximately 1 million records (which can be generated by the panelDataSimulation script).
The table contains the following columns:
- securityid: the stock symbol
- tradetime: the trading time
- vol: the trading volume
- vwap: the weighted average prices
- open: the opening price
- close: the closing price
4.1. 4.1 SQL Queries vs. Vectorized Functions
The following examples use SQL queries and vectorized functions respectively to calculate Alpha 98 in DolphinDB.
- Use DolphinDB SQL queries:
def alpha98(stock){
t = select securityid, tradetime, vwap, open, mavg(vol, 5) as adv5, mavg(vol,15) as adv15 from stock context by securityid
update t set rank_open = rank(open), rank_adv15 = rank(adv15) context by tradetime
update t set decay7 = mavg(mcorr(vwap, msum(adv5, 26), 5), 1..7), decay8 = mavg(mrank(9 - mimin(mcorr(rank_open, rank_adv15, 21), 9), true, 7), 1..8) context by securityid
return select securityid, tradetime, rank(decay7)-rank(decay8) as A98 from t context by tradetime
}
t = loadTable("dfs://k_day_level","k_day")
timer alpha98(t)
- Use DolphinDB vectorized functions:
def myrank(x){
return rowRank(x)\x.columns()
}
def alphaPanel98(vwap, open, vol){
return myrank(mavg(mcorr(vwap, msum(mavg(vol, 5), 26), 5), 1..7)) - myrank(mavg(mrank(9 - mimin(mcorr(myrank(open), myrank(mavg(vol, 15)), 21), 9), true, 7), 1..8))
}
t = select * from loadTable("dfs://k_day_level","k_day")
timer vwap, open, vol = panel(t.tradetime, t.securityid, [t.vwap, t.open, t.vol])
timer res = alphaPanel98(vwap, open, vol)
The second script using vectorized functions is more concise.
To calculate Alpha 98 for a stock in a day, we need all stocks' data on the day as well as the stock's data in previous days. Therefore, using a matrix to calculate Alpha 98 is simpler as there is no need for dimensional transformations of the intermediate result or the final result. As for the calculation of Alpha 98 with SQL queries, nested functions and grouped calculations are involved. In conclusion, the script using function panel
is cleaner and more efficient.
Both scripts are executed with a single thread. The first script with SQL queries takes 610 milliseconds, whereas the second script using a matrix takes 510 milliseconds (function panel
takes 70 milliseconds to create the panel data and the calculation takes 440 milliseconds).
The performance of panel analysis with vectorized functions is slightly better. Panel data represented by matrices cannot be regrouped, and users cannot observe multiple indicators at the same time. As for panel data represented by tables, you can use SQL statements to group data by multiple columns and query joint columns, which is more suitable for parallel computing with a large volume of data.
4.2. 4.2 DolphinDB vs. pandas
Part of the pandas script of Alpha 98 calculation is shown as follows.
def myrank(x):
return ((x.rank(axis=1,method='min'))-1)/x.shape[1]
def imin(x):
return np.where(x==min(x))[0][0]
def rank(x):
s = pd.Series(x)
return (s.rank(ascending=True, method="min")[len(s)-1])-1
def alpha98(vwap, open, vol):
return myrank(vwap.rolling(5).corr(vol.rolling(5).mean().rolling(26).sum()).rolling(7).apply(lambda x: np.sum(np.arange(1, 8)*x)/np.sum(np.arange(1, 8)))) - myrank((9 - myrank(open).rolling(21).corr(myrank(vol.rolling(15).mean())).rolling(9).apply(imin)).rolling(7).apply(rank).rolling(8).apply(lambda x: np.sum(np.arange(1, 9)*x)/np.sum(np.arange(1, 9))))
start_time = time.time()
re=alpha98(vwap, open, vol)
print("--- %s seconds ---" % (time.time() - start_time))
The pandas script takes 520 seconds to calculate Alpha 98, which is approximately 1000 times slower than using a DolphinDB matrix.
DolphinDB provides optimized built-in functions for time-series processing, such as m-functions mavg
, mcorr
, mrank
, mimin
, and msum
used above. These built-in functions outperform other database systems by 1-2 orders of magnitude. In particular, the calculation of mcorr
takes only 0.6 seconds in DolphinDB, over 200 times faster than 142 seconds in pandas. The comparison tests reduce measurement errors by repeating the execution 10 times on closing prices in 10 years (around 5.3 million records).