pivot by

pivot by is a unique feature in DolphinDB and an extension to the ANSI SQL. It rearranges a column (or multiple columns) of a table (with or without a data transformation function) on two dimensions. If the pivot by clause includes more than 2 columns, the last column determines the columns in the result and all other columns determine the rows in the result. It produces a table with select statement and a matrix with exec statement. When used in conjunction with the pivot by clause, the select clause can specify multiple columns. See pivot and unpivot .

The following versions contain enhancements for the pivot by statement:

Version Enhancements
1.20.4

1. If the pivot by clause includes more than 2 columns, the last column determines the columns in the result and all other columns determine the rows in the result.

2. When used in conjunction with the pivot by clause, the select clause can specify multiple columns.

2.00.9

As of version 1.30.21/2.00.9, query performance has been optimized for pivot by statements which satisfy both of the following conditions:

  • None of the following functions are applied to the queried columns: aggregate functions (e.g., sum), order-sensitive functions (e.g., last), row-based functions (e.g., rowMin), filling functions (e.g., ffill).

  • The last column specified in the pivot by clause is a partitioning column.

2.00.10.4
  1. pivot by supports querying array vector columns in the select statement, and applying aggregate functions (except row-based columns) on array vector columns. It also supports applying aggregate functions that return scalars (such as max, sum etc.) on array vector columns in the exec clause (see Example 6 for details).

  2. When there exist multiple identical combinations of pivot-by column values, all data will be retained without deduplication after applying the asis function to at least one query column if all the following conditions are met:

  • The queried table is a DFS table.

  • The from clause does not contain table joins.

  • The first n-1 pivot-by columns cannot be of data type DECIMAL128.

  • The last pivot-by column is a partitioning column.

  • The selected columns do not contain null values. Aggregate functions, order-sensitive functions, row-based functions, or fill functions are not used in the select clause.

  • The table generated by the pivot by clause can be fully populated by original data, which means:

    • The number of rows for each combination of values in the first n-1 pivot-by columns (which determine the rows in the output table) does not need to be exactly equal.

    • For each combination of values in the first n-1 pivot-by columns, the number of values in the last pivot-by column (which determines the columns in the output table) must be the same.

Examples

Example 1

sym = `C`MS`MS`MS`IBM`IBM`C`C`C
price= 49.6 29.46 29.52 30.02 174.97 175.23 50.76 50.32 51.29
qty = 2200 1900 2100 3200 6800 5400 1300 2500 8800
timestamp = [09:34:07,09:35:42,09:36:51,09:36:59,09:35:47,09:36:26,09:34:16,09:35:26,09:36:12]
t2 = table(timestamp, sym, qty, price);

t2;
timestamp sym qty price
09:34:07 C 2200 49.6
09:35:42 MS 1900 29.46
09:36:51 MS 2100 29.52
09:36:59 MS 3200 30.02
09:35:47 IBM 6800 174.97
09:36:26 IBM 5400 175.23
09:34:16 C 1300 50.76
09:35:26 C 2500 50.32
09:36:12 C 8800 51.29
select price from t2 pivot by timestamp, sym;
timestamp C IBM MS
09:34:07 49.6
09:34:16 50.76
09:35:26 50.32
09:35:42 29.46
09:35:47 174.97
09:36:12 51.29
09:36:26 175.23
09:36:51 29.52
09:36:59 30.02
a=select last(price) from t2 pivot by timestamp.minute(), sym;

a;
minute_timestamp C IBM MS
09:34m 50.76
09:35m 50.32 174.97 29.46
09:36m 51.29 175.23 30.02
typestr a;
// output: TABLE

b=exec count(price) from t2 pivot by timestamp.minute(), sym;
b;
label C IBM MS
09:34m 2
09:35m 1 1 1
09:36m 1 1 2
typestr b;
// output: FAST DOUBLE MATRIX

Example 2. Calculate the basket value of an ETF's constituents. For simplicity, assume an ETF has 2 constituents AAPL and FB.

symbol=take(`AAPL, 6) join take(`FB, 5)
time=2019.02.27T09:45:01.000000000+[146, 278, 412, 445, 496, 789, 212, 556, 598, 712, 989]
price=173.27 173.26 173.24 173.25 173.26 173.27 161.51 161.50 161.49 161.50 161.51
quotes=table(symbol, time, price)
weights=dict(`AAPL`FB, 0.6 0.4)
ETF = select symbol, time, price*weights[symbol] as price from quotes;

select price from ETF pivot by time, symbol;
Time AAPL FB
2019.02.27T09:45:01.000000146 103.962
2019.02.27T09:45:01.000000212 64.604
2019.02.27T09:45:01.000000278 103.956
2019.02.27T09:45:01.000000412 103.944
2019.02.27T09:45:01.000000445 103.95
2019.02.27T09:45:01.000000496 103.956
2019.02.27T09:45:01.000000556 64.6
2019.02.27T09:45:01.000000598 64.596
2019.02.27T09:45:01.000000712 64.6
2019.02.27T09:45:01.000000789 103.962
2019.02.27T09:45:01.000000989 64.604

To calculate the basket value of this ETF's constituents at each timestamp that any of its constituents had a trade, we need to forward fill each stock's price in the pivot table, and then sum each row's weighted price. They can be completed in just one SQL statement with "pivot by".

select rowSum(ffill(price)) from ETF pivot by time, symbol;
Time rowSum
2019.02.27T09:45:01.000000146 103.962
2019.02.27T09:45:01.000000212 168.566
2019.02.27T09:45:01.000000278 168.56
2019.02.27T09:45:01.000000412 168.548
2019.02.27T09:45:01.000000445 168.554
2019.02.27T09:45:01.000000496 168.56
2019.02.27T09:45:01.000000556 168.556
2019.02.27T09:45:01.000000598 168.552
2019.02.27T09:45:01.000000712 168.556
2019.02.27T09:45:01.000000789 168.562
2019.02.27T09:45:01.000000989 168.566

Example 3. The names of the newly created columns from a pivot by clause will be automatically adjusted.

Before version 2.00.2, column names in DolphinDB can only use letters, digits or underscores (_), and must start with letters.

Since version 2.00.2, column names generated by pivot by, addColumn can contain special characters or start with digits.

For details, please refer to section Creating tables in Table.

date = take(2021.08.01  2021.08.02 2021.08.03, 12)
sym = take(["IBM N", "_MSFTN", "3_GOOGS", ""], 12).sort()
value = 1..12
t=table(date, sym, value)
re = select value from t pivot by date, sym
date NULL 3_GOOGS IBM N _MSFTN
2021.08.01 1 4 7 10
2021.08.02 2 5 8 11
2021.08.03 3 6 9 12
select  _"NULL" from re
NULL
1
2
3
select  _"3_GOOGS" from re
3_GOOGS
4
5
6

In the following example, set removeSpecialCharInColumnName = true. The stock symbols "600300" and "600600" are adjusted to "c600300" and "c600600" as column names.

symbol=take(`600300, 6) join take(`600600, 5)
time=2019.02.27T09:45:01.000000000+[146, 278, 412, 445, 496, 789, 212, 556, 598, 712, 989]
price=173.27 173.26 173.24 173.25 173.26 173.27 161.51 161.50 161.49 161.50 161.51
quotes=table(symbol, time, price)
weights=dict(`600300`600600, 0.6 0.4)
ETF = select symbol, time, price*weights[symbol] as price from quotes;

select last(price) from ETF pivot by time, symbol;
time C600300 C600600
2019.02.27T09:45:01.000000146 103.962
2019.02.27T09:45:01.000000212 64.604
2019.02.27T09:45:01.000000278 103.956
2019.02.27T09:45:01.000000412 103.944
2019.02.27T09:45:01.000000445 103.95
2019.02.27T09:45:01.000000496 103.956
2019.02.27T09:45:01.000000556 64.6
2019.02.27T09:45:01.000000598 64.596
2019.02.27T09:45:01.000000712 64.6
2019.02.27T09:45:01.000000789 103.962
2019.02.27T09:45:01.000000989 64.604

Example 4. The pivot by clause includes more than 2 columns.

date = 2020.09.21 + 0 0 0 0 1 1 1 1
sym = `MS`MS`GS`GS`MS`MS`GS`GS$SYMBOL
factorNum = 1 2 1 2 1 2 1 2
factorValue = 1.2 -3.4 -2.5 6.3 1.1 -3.2 -2.1 5.6
t = table(date, sym, factorNum, factorValue);

t;
date sym factorNum factorValue
2020.09.21 MS 1 1.2
2020.09.21 MS 2 -3.4
2020.09.21 GS 1 -2.5
2020.09.21 GS 2 6.3
2020.09.22 MS 1 1.1
2020.09.22 MS 2 -3.2
2020.09.22 GS 1 -2.1
2020.09.22 GS 2 5.6
select factorValue from t pivot by date, sym, factorNum;
date sym 1 2
2020.09.21 GS -2.5 6.3
2020.09.21 MS 1.2 -3.4
2020.09.22 GS -2.1 5.6
2020.09.22 MS 1.1 -3.2

Select multiple columns for each factorNum

select factorValue, factorValue>0 as factorSign from t pivot by date, sym, factorNum;
date sym factorValue_1 factorValue_2 factorSign_1 factorSign_2
2020.09.21 GS -2.5 6.3 0 1
2020.09.21 MS 1.2 -3.4 1 0
2020.09.22 GS -2.1 5.6 0 1
2020.09.22 MS 1.1 -3.2 1 0

Example 5. Retain duplicate values after applying function asis

In this example, the following conditions are met:

  • table pt is a DFS table

  • The from clause does not contain table joins.

  • The first n-1 pivot-by column (i.e., the time column) does not contain DECIMAL128 values.

  • The last pivot-by column (i.e., the sym column) is a partitioning column.

  • The selected column value does not contain null values. Aggregate functions, order-sensitive functions, row-based functions, or fill functions are not used in the select clause.

  • The table generated by the pivot by clause can be fully populated. In this example, for each distinct value in the time column, the number of values in the last pivot-by column sym is always equal. When time=10:20:44, the sym A, B, and C all have one record; When time=10:20:45, the sym A, B, and C all have one record; When time=10:20:46, the sym A, B, and C all have two records.

Apply function asis to the value column, all data can be retained:

time = [10:20:44,10:20:44,10:20:44,10:20:45,10:20:45,10:20:45,10:20:46,10:20:46,10:20:46,10:20:46,10:20:46,10:20:46]
sym = ["A","B","C","A","B","C","A","B","C","A","B","C"]
value = [510,434,999,837,402,615,495,885,745,968,975,165]
t = table(time,sym,value)

db = database("dfs://test_pivot",VALUE,`A`B`C)
pt = db.createPartitionedTable(t,`pt,`sym)
pt.append!(t)

select value from pt pivot by time,sym
time A B C
10:20:44 510 434 999
10:20:45 837 402 615
10:20:46 968 975 165
select asis(value) from pt pivot by time,sym
time A B C
10:20:44 510 434 999
10:20:45 837 402 615
10:20:46 495 885 745
10:20:46 968 975 165

Example 6. Use pivot by with select or exec clause to query and aggregate array vector columns.

Create an in-memory table where the column val1 contains array vectors:

id = `A`B`B`B`C`C`D`D`D
val1 = array(DOUBLE[], 0, 10).append!([1.0, 2.1 2.2 2.4, 2.1 2.2 2.4, 3, 5, 1.6 7, 8.9 100, 2.3, 1.1 4 5.1 4]);
val2 = 2200 1900 2100 3200 6800 5400 1300 2500 8800
timestamp = [09:34:07,09:35:42,09:36:51,09:36:59,09:35:47,09:36:26,09:34:16,09:35:26,09:36:12]
t = table(timestamp, id, val1, val2);

// Query the val1 value for each id every minute
select val1 from t pivot by timestamp.minute(), id;
minute_timestamp A B C D
09:34m [1.0000] [] [] [8.9000,100.0000]
09:35m [] [2.1000,2.2000,2.4000] [5.0000] [2.3000]
09:36m [] [3.0000] [1.6000,7.0000] [1.1000,4.0000,5.1000,4.0000]

Apply aggregation on val1 with select and return a table:

select max(val1) from t pivot by timestamp.minute(), id;
minute_timestamp A B C D
09:34m 1.0000 100.0000
09:35m 2.4000 5.0000 2.3000
09:36m 3.0000 7.0000 5.1000

Apply aggregation on val1 with exec and return a matrix:

exec max(val1) from t pivot by timestamp.minute(), id;
label A B C D
09:34m 1.0000 100.0000
09:35m 2.4000 5.0000 2.3000
09:36m 3.0000 7.0000 5.1000