regroup

Syntax

regroup(X, label, func, [byRow=true])

Arguments

X is a matrix.

label is a vector indicating the column/row labels based on which the matrix is grouped and aggregated. When byRow = true, the length of label must match the number of rows of X. Otherwise, it must match the number of columns of X.

func is a unary aggregate function called on each group of the matrix. It can be built-in or user-defined.

byRow (optional) is a Boolean. The default value is true, indicating that the matrix will be grouped and aggregated by rows. False means to group and aggregate matrix by columns.

Details

Group the data of a matrix based on user-specified column/row labels and apply aggregation on each group.

regroup is similar to the SQL keyword "group by", except that "group by" is applied only on tables whereas this function is applied on matrices.

Note: It is recommended that the func parameter be specified as a built-in aggregate function as built-in functions are optimized internally for optimal performance. (see Example 2)

Examples

Example 1. Perform grouped aggregation on a matrix by row/column labels.

m = rand(20, 4:5)
m;
col1 col2 col3 col4 col5
11 6 6 10 4
6 7 5 2 16
2 16 14 19 9
17 6 13 10 2
// by column labels
label = `A`A`B`A`B
regroup(X=m, label=label, func=firstNot, byRow=false)
A B
11 6
6 5
2 14
17 13
// by row labels
label = 1 2 1 2
regroup(X=m, label=label, func=firstNot, byRow=true)
label col1 col2 col3 col4 col5
1 11 6 6 10 4
2 6 7 5 2 16

Example 2. Compare the performance between applying a built-in function and a user-defined function.

m = rand(1000.0, 10000)$100:100
defg my_avg(v):avg(v)

timer(1000) regroup(m, take(1 2 3 4 5, 100), avg)
// Time elapsed 176.175 ms

timer(1000) regroup(m, take(1 2 3 4 5, 100), my_avg)
// Time elapsed 1062.553 ms

Example 3. Aggregate the panel data by minute

n=1000
timestamp = 09:00:00 + rand(10000, n).sort!()
id = take(`st1`st2`st3, n)
vol = 100 + rand(10.0, n)
vt = table(timestamp, id, vol)
m = exec vol from vt pivot by timestamp, id
regroup(m, minute(m.rowNames()), avg)