regroup
Syntax
regroup(X, label, func, [byRow=true])
Arguments
X is a matrix.
label is a vector indicating the column/row labels based on which the matrix is grouped and aggregated. When byRow = true, the length of label must match the number of rows of X. Otherwise, it must match the number of columns of X.
func is a unary aggregate function called on each group of the matrix. It can be built-in or user-defined.
byRow (optional) is a Boolean. The default value is true, indicating that the matrix will be grouped and aggregated by rows. False means to group and aggregate matrix by columns.
Details
Group the data of a matrix based on user-specified column/row labels and apply aggregation on each group.
regroup
is similar to the SQL keyword "group by", except that "group
by" is applied only on tables whereas this function is applied on
matrices.
Note: It is recommended that the func parameter be specified as a built-in aggregate function as built-in functions are optimized internally for optimal performance. (see Example 2)
Examples
Example 1. Perform grouped aggregation on a matrix by row/column labels.
m = rand(20, 4:5)
m;
col1 | col2 | col3 | col4 | col5 |
---|---|---|---|---|
11 | 6 | 6 | 10 | 4 |
6 | 7 | 5 | 2 | 16 |
2 | 16 | 14 | 19 | 9 |
17 | 6 | 13 | 10 | 2 |
// by column labels
label = `A`A`B`A`B
regroup(X=m, label=label, func=firstNot, byRow=false)
A | B |
---|---|
11 | 6 |
6 | 5 |
2 | 14 |
17 | 13 |
// by row labels
label = 1 2 1 2
regroup(X=m, label=label, func=firstNot, byRow=true)
label | col1 | col2 | col3 | col4 | col5 |
---|---|---|---|---|---|
1 | 11 | 6 | 6 | 10 | 4 |
2 | 6 | 7 | 5 | 2 | 16 |
Example 2. Compare the performance between applying a built-in function and a user-defined function.
m = rand(1000.0, 10000)$100:100
defg my_avg(v):avg(v)
timer(1000) regroup(m, take(1 2 3 4 5, 100), avg)
// Time elapsed 176.175 ms
timer(1000) regroup(m, take(1 2 3 4 5, 100), my_avg)
// Time elapsed 1062.553 ms
Example 3. Aggregate the panel data by minute
n=1000
timestamp = 09:00:00 + rand(10000, n).sort!()
id = take(`st1`st2`st3, n)
vol = 100 + rand(10.0, n)
vt = table(timestamp, id, vol)
m = exec vol from vt pivot by timestamp, id
regroup(m, minute(m.rowNames()), avg)