Quantitative Finance Examples

1. Database Design for Market Data

In DolphinDB, data is organized and managed in partitions. For optimal performance, data should be distributed evenly across partitions and the size of raw data (before compression) of each partition is recommended to be between 100 MB and 1 GB. If the partition size is too small, a query may need to work with too many small data files and drag down performance.

High frequency data can usually be partitioned with a COMPO scheme on date and security ticker:

(1) For date, in most cases we can use VALUE partition.

(2) For security ticker, we can use HASH or RANGE partitions.

Market data can be end of day data, level 1, level 2 or level 3 high frequency data, etc. As the data of different categories are of different orders of magnitude, we may use different partitioning arrangements for these data.

2. Import Historical Data

You may have a large number of high frequency data files. For example, quite often the quotes or trades data in each day is saved as a .csv file. If all the high frequency data files are stored in the same directory, we can use the following script to create a partitioned database in DolphinDB and load the data files into the database. Suppose for a typical day the quotes data is about 50 GB and trades data is about 10 GB, we can create a partitioned database with a COMPO partition. The first level of partition is a VALUE partition on date and the second level is a HASH partition with 50 partitions on stock symbols. With TSDB storage engine, you can specify security ticker and timestamp as the sorting columns, which are sparse index within each partition.

login(`admin, `123456)
dbDate = database("", VALUE, 2020.01.01..2025.12.31)
dbSymbol = database("", HASH, [SYMBOL, 20])
db = database("dfs://HF", COMPO, [dbDate, dbSymbol], engine = "TSDB")

//Please change dataDir to your directory
dataDir="/hdd/hdd1/data/HFTextFiles/"

def importTxtFiles(dataDir, db){
    dataFiles = exec filename from files(dataDir) where isDir=false
    for(f in dataFiles){
        loadTextEx(dbHandle = db, tableName = `quotes, partitionColumns = `Date`Symbol, filename = dataDir+f, sortColumns = `Symbol`Timestamp)
    }
}
importTxtFiles(dataDir, db);

Please note that here we didn't specify the schema of the database table as DolphinDB can automatically determine the schema based on a random sample of the data file. We recommend that you check whether the automatically determined schema is correct after creating the table. If not, then you can retrieve the schema with function extractTextSchema, revise the schema with SQL update statement and then specify the parameter schema in function loadTextEx.

Authorization is required to create or access distributed databases and tables. The default username and password are "admin" and "123456" respectively in this tutorial. In all following examples we will skip the login step.

3. Working with Historical Data

3.1. SQL Queries

Load the metadata of table 'quotes' and 'trades' into memory:

db = database("dfs://HF")
quotes = loadTable(db, `quotes);
trades = loadTable(db, `trades);

Calculate the average bid-ask spread of a stock every minute in a day and plot the result:

avgSpread = select max((Offer_Price-Bid_Price)/(Offer_Price+Bid_Price)*2) as avgSpread from quotes where Date=2016.10.24, Symbol=`AAPL, Time between 09:30:00.000000000 : 15:59:59.999999999, Offer_Price>=Bid_Price group by minute(Time) as minute
plot(avgSpread.avgSpread, avgSpread.minute, "Average bid-ask spread per minute")

3.2. `context by` Clause for Panel Data

The context by clause is an extension of ANSI SQL by DolphinDB that significantly simplifies panel data operations.

context by is similar to group by as both conducts calculations within each group. Their differences are:

For each group, group by returns one row whereas context by returns the same number of rows as the group.
group by can only be used with aggregate functions whereas context by can be used with aggregate functions, moving window functions or cumulative functions.

Calculate the maximum trading volume for selected stocks and days in the previous 20 transactions:

t = select Symbol, Date, Time, mmax(Trade_Volume, 20) as volume_mavg20 from trades where Date=2016.10.24, Symbol in `AAPL`NFLX context by Symbol

Calculate the maximum trading volume up to the current moment for selected stocks and days:

t = select Symbol, Date, Time, cummax(Trade_Volume) as volume_cummax from trades where Date=2016.10.24, Symbol in `AAPL`NFLX context by Symbol, Date

Calculate the maximum trading volume for selected stocks and days and assign the result to each row in the input:

t = select Symbol, Date, Time, max(Trade_Volume) as volume_dailyMax from trades where Date=2016.10.24, Symbol in `AAPL`NFLX context by Symbol, Date

Nested moving window functions:

t=select Symbol, Date, Time, mcorr(mavg(Bid_Size,100), mavg(Offer_Size,100), 60) from quotes where Date=2016.10.24, Symbol=`AAPL, Time between 09:30:00.000000000 : 15:59:59.999999999 context by Symbol order by Symbol, Date, Time

3.3. `pivot by` Clause

The pivot by clause is an extension of ANSI SQL by DolphinDB to generate a pivot table. It can be used together with an aggregate function.

select sum(Trade_Volume) as Volume from trades where Date=2016.10.24, Symbol in `DAL`AAL`UAL`LUV`JBLU, Time between 09:30:00.000000000 : 15:59:59.999999999 pivot by minute(Time) as minute, Symbol

The result is:

minute	AAL     DAL     JBLU    LUV     UAL
------  ------- ------  ------- ------  ------
09:30m	214,978	77,439	152,927	88,763	5,001
09:31m	20,950	42,583	19,134	15,255	52,307
09:32m	68,403	71,521	23,395	25,780	28,131
09:33m	27,260	62,254	8,807	29,488	20,510
09:34m	37,784	32,374	22,429	43,823	12,411
09:35m	12,704	19,616	7,254	28,472	32,387
09:36m	22,285	40,953	64,316	25,013	20,959
09:37m	21,865	25,745	22,111	19,484	10,913
09:38m	47,561	42,527	16,508	19,974	32,019
09:39m	30,173	36,081	20,173	46,281	20,150
......

3.4. Asof Join and Window Join

Asof join (function aj) and window join (function wj) are frequently used to calculate trading costs. Trades and quotes are usually saved in different tables. For each stock, asof join gets the last quote before (and up to) each trade and window join conducts calculations on a specified window of quotes relative to each trade. Function aj is faster than the asof join function in pandas by about 200 times.

In the following example, we calculate 2 measures of trading cost (in basis points) with asof join and window join, respectively. For window join, the average quote in the 100 milliseconds before each trade is used instead of the last quote for asof join.

TC1 = select sum(Trade_Volume*abs(Trade_Price-(Bid_Price+Offer_Price)/2))/sum(Trade_Volume*Trade_Price)*10000 as cost from aj(trades, quotes, `Symbol`Time) where Time between 09:30:00.000000000 : 15:59:59.999999999 group by Symbol

TC2 = select sum(Trade_Volume*abs(Trade_Price-(Bid_Price+Offer_Price)/2))/sum(Trade_Volume*Trade_Price)*10000 as cost from pwj(trades, quotes, -100000000:0,<[avg(Offer_Price) as Offer_Price, avg(Bid_Price) as Bid_Price]>,`Symbol`Time) where Time between 09:30:00.000000000 : 15:59:59.999999999 group by Symbol