group by

We use keyword group by followed by column name(s) to group the records and to apply functions on each group of records. The order of columns in the table generated by a SQL statement is: The grouping columns not specified by the select clause are displayed first, then follow the columns specified by both the group by clause and the select clause. Therefore unlike in ANSI SQL, we don't need to include group by columns explicitly in select statements.

Examples

sym = `C`MS`MS`MS`IBM`IBM`C`C`C$SYMBOL
price= 49.6 29.46 29.52 30.02 174.97 175.23 50.76 50.32 51.29
qty = 2200 1900 2100 3200 6800 5400 1300 2500 8800
timestamp = [09:34:07,09:36:42,09:36:51,09:36:59,09:32:47,09:35:26,09:34:16,09:34:26,09:38:12]
t1 = table(timestamp, sym, qty, price);

t1;
timestamp sym qty price
09:34:07 C 2200 49.6
09:36:42 MS 1900 29.46
09:36:51 MS 2100 29.52
09:36:59 MS 3200 30.02
09:32:47 IBM 6800 174.97
09:35:26 IBM 5400 175.2
09:34:16 C 1300 50.76
09:34:26 C 2500 50.32
09:38:12 C 8800 51.29
select count(sym) as counts from t1 group by sym;
sym counts
C 4
MS 3
IBM 2
select top 2:3 count(sym) as counts from t1 group by sym;
sym counts
IBM 2
select max(price), sym from t1 group by sym, minute(timestamp)
minute_timestamp max_price sym
09:34m 50.76 C
09:36m 30.02 MS
09:32m 174.97 IBM
09:35m 175.23 IBM
09:38m 51.29 C
select avg(qty) from t1 group by sym;
sym avg_qty
C 3700
MS 2400
IBM 6100
select wavg(price, qty) as vwap, sum(qty) from t1 group by sym;
sym vwap sum_qty
C 50.828378 14800
IBM 175.085082 12200
MS 29.726389 7200
select wsum(price, qty) as dollarVolume, sum(qty) from t1 group by minute(timestamp) as ts;
ts dollarVolume sum_qty
09:32m 1.189796e+006 6800
09:34m 300908 6000
09:35m 946242 5400
09:36m 214030 7200
09:38m 451352 8800
select sum(qty) from t1 group by sym, timestamp.minute() as minute;
sym minute sum_qty
C 09:34m 6000
C 09:38m 8800
IBM 09:32m 6800
IBM 09:35m 5400
MS 09:36m 7200

When group by is used with order by, the order by column(s) can be one or more of group by columns, or one or more of the new columns generated by group by calculation.

select sum(qty) from t1 group by sym, timestamp.minute() as minute order by minute;
sym minute sum_qty
C 09:00m 1100
IBM 09:32m 6800
C 09:34m 6000
IBM 09:35m 5400
MS 09:36m 7200
C 09:38m 8800

In the examples above, the result of the function for each group is a scalar. In other cases, a function may output a vector or a dictionary rather than a scalar. For examples, the function stat outputs a dictionary, and the function ols outputs a coefficient vector or a dictionary that includes parameter estimates, t-stat, R square, etc. To output the results in multiple columns, we need to convert the vector/dictionary output to multiple scalars. In other words, we need to convert a composite column into multiple columns. This can be accomplished with "as" keyword and a constant string vector for column names.

y=1..15
factor1=3.2 1.2 5.9 6.9 11.1 9.6 1.4 7.3 2.0 0.1 6.1 2.9 6.3 8.4 5.6
factor2=1.7 1.3 4.2 6.8 9.2 1.3 1.4 7.8 7.9 9.9 9.3 4.6 7.8 2.4 8.7
id=take(1 2 3, 15).sort();
t=table(id, y, factor1, factor2);

t;
id y factor1 factor2
1 1 3.2 1.7
1 2 1.2 1.3
1 3 5.9 4.2
1 4 6.9 6.8
1 5 11.1 9.2
2 6 9.6 1.3
2 7 1.4 1.4
2 8 7.3 7.8
2 9 2 7.9
2 10 0.1 9.9
3 11 6.1 9.3
3 12 2.9 4.6
3 13 6.3 7.8
3 14 8.4 2.4
3 15 5.6 8.7
select ols(y,(factor1,factor2),true,0) as `int`factor1`factor2 from t group by id;
id int factor1 factor2
1 1.063991 -0.258685 0.732795
2 6.886877 -0.148325 0.303584
3 11.833867 0.272352 -0.065526
select ols(y,(factor1,factor2),true,2).Coefficient.tstat[1:] as `t1`t2 from t group by id;
id t1 t2
1 -0.891868 2.253451
2 -5.73315 11.433117
3 0.510866 -0.183903

To skip certain elements from the output of the function, we can leave the names of the corresponding elements empty.

select ols(y,(factor1,factor2),true,2).Coefficient.beta as ``factor1`factor2 from t group by id;
id factor1 factor2
1 -0.258685 0.732795
2 -0.148325 0.303584
3 0.272352 -0.065526

To customize the output, we can write a simple wrapper. In the following example, the output includes coefficient estimates and R square:

def myols(y,x) {
  r=ols(y,x,true,2)
  return r.Coefficient.beta join r.RegressionStat.statistics[0]
}

select myols(y,(factor1,factor2)) as `int`factor1`factor2`R2 from t group by id;
id int factor1 factor2 R2
1 1.063991 -0.258685 0.732795 0.946056
2 6.886877 -0.148325 0.303584 0.992413
3 11.833867 0.272352 -0.065526 0.144837