Quantitative Finance Examples
1. Database Design for Market Data
In DolphinDB, data is organized and managed in partitions. For optimal performance, data should be distributed evenly across partitions, and the raw (pre-compression) size of each partition is recommended to be between 100 MB and 1 GB. If partitions are too small, a query may need to open a large number of small data files, which degrades performance.
High frequency data can usually be partitioned with a COMPO scheme on date and security ticker:
(1) For date, in most cases we can use VALUE partition.
(2) For security ticker, we can use HASH or RANGE partitions.
Market data can be end-of-day data, or level 1, level 2, or level 3 high frequency data, etc. As these categories differ in size by orders of magnitude, we may use different partitioning arrangements for each of them.
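For instance, end-of-day data is small enough for a single VALUE partition on date, while level 2 data calls for a COMPO partition on date and ticker. The following sketch illustrates both; the database paths and the 25-bucket hash count are illustrative assumptions, not part of the original script:
// End-of-day data: a VALUE partition on date is sufficient
dbEOD = database("dfs://EOD", VALUE, 2020.01.01..2025.12.31)
// Level 2 data: COMPO partition, VALUE on date plus HASH on ticker
dbDate = database("", VALUE, 2020.01.01..2025.12.31)
dbTicker = database("", HASH, [SYMBOL, 25])
dbLevel2 = database("dfs://Level2", COMPO, [dbDate, dbTicker])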
2. Import Historical Data
You may have a large number of high frequency data files. For example, quite often the quotes or trades data for each day is saved as a .csv file. If all the high frequency data files are stored in the same directory, we can use the following script to create a partitioned database in DolphinDB and load the data files into it. Suppose that for a typical day the quotes data is about 50 GB and the trades data is about 10 GB. We can create a partitioned database with a COMPO partition: the first level is a VALUE partition on date and the second level is a HASH partition with 50 buckets on stock symbols, so that each daily quotes partition holds roughly 1 GB of raw data, within the recommended range. With the TSDB storage engine, you can specify security ticker and timestamp as the sorting columns, which serve as a sparse index within each partition.
login(`admin, `123456)
// COMPO partition: VALUE on date, then HASH with 50 buckets on symbol
dbDate = database("", VALUE, 2020.01.01..2025.12.31)
dbSymbol = database("", HASH, [SYMBOL, 50])
db = database("dfs://HF", COMPO, [dbDate, dbSymbol], engine = "TSDB")
// Please change dataDir to your directory
dataDir = "/hdd/hdd1/data/HFTextFiles/"
def importTxtFiles(dataDir, db){
    // list the files in the directory (skipping subdirectories) and load each into 'quotes'
    dataFiles = exec filename from files(dataDir) where isDir=false
    for(f in dataFiles){
        loadTextEx(dbHandle=db, tableName=`quotes, partitionColumns=`Date`Symbol, filename=dataDir+f, sortColumns=`Symbol`Timestamp)
    }
}
importTxtFiles(dataDir, db);
Please note that here we didn't specify the schema of the database table, as DolphinDB can automatically determine the schema based on a random sample of the data file. We recommend that you check whether the automatically determined schema is correct after creating the table. If not, you can retrieve the schema with function extractTextSchema, revise it with an SQL update statement, and then specify the parameter schema in function loadTextEx.
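A minimal sketch of that workflow (the file name and the column revised below are hypothetical examples):
// extract the schema DolphinDB infers from a sample of the file
schema = extractTextSchema(dataDir + "quotes_20161024.csv")
// suppose the Symbol column was inferred as STRING; force it to SYMBOL
update schema set type = "SYMBOL" where name = "Symbol"
// pass the revised schema to loadTextEx
loadTextEx(dbHandle=db, tableName=`quotes, partitionColumns=`Date`Symbol, filename=dataDir + "quotes_20161024.csv", schema=schema, sortColumns=`Symbol`Timestamp)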
Authorization is required to create or access distributed databases and tables. The default username and password are "admin" and "123456" respectively in this tutorial. In all following examples we will skip the login step.
3. Working with Historical Data
3.1. SQL Queries
Load the metadata of tables 'quotes' and 'trades' into memory:
db = database("dfs://HF")
quotes = loadTable(db, `quotes);
trades = loadTable(db, `trades);
Calculate the average bid-ask spread of a stock for each minute of a day and plot the result:
avgSpread = select avg((Offer_Price-Bid_Price)/(Offer_Price+Bid_Price)*2) as avgSpread from quotes where Date=2016.10.24, Symbol=`AAPL, Time between 09:30:00.000000000 : 15:59:59.999999999, Offer_Price>=Bid_Price group by minute(Time) as minute
plot(avgSpread.avgSpread, avgSpread.minute, "Average bid-ask spread per minute")
3.2. context by Clause for Panel Data
The context by clause is an extension of ANSI SQL by DolphinDB that significantly simplifies panel data operations. context by is similar to group by as both conduct calculations within each group. Their differences are as follows (a short contrast example appears after the list):
- For each group, group by returns one row whereas context by returns the same number of rows as the group.
- group by can only be used with aggregate functions whereas context by can be used with aggregate functions, moving window functions or cumulative functions.
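To make the contrast concrete, here is a minimal sketch against the 'trades' table loaded above (the average-volume query is illustrative, not from the original):
// group by: one output row per symbol
t1 = select avg(Trade_Volume) as avgVolume from trades where Date=2016.10.24 group by Symbol
// context by: every input row is kept, with the per-symbol average repeated within each group
t2 = select Symbol, Time, avg(Trade_Volume) as avgVolume from trades where Date=2016.10.24 context by Symbol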
For selected stocks and days, calculate the maximum trading volume in a moving window of the previous 20 trades:
t = select Symbol, Date, Time, mmax(Trade_Volume, 20) as volume_mavg20 from trades where Date=2016.10.24, Symbol in `AAPL`NFLX context by Symbol
Calculate the maximum trading volume up to the current moment for selected stocks and days:
t = select Symbol, Date, Time, cummax(Trade_Volume) as volume_cummax from trades where Date=2016.10.24, Symbol in `AAPL`NFLX context by Symbol, Date
Calculate the maximum trading volume for selected stocks and days and assign the result to each row in the input:
t = select Symbol, Date, Time, max(Trade_Volume) as volume_dailyMax from trades where Date=2016.10.24, Symbol in `AAPL`NFLX context by Symbol, Date
Nested moving window functions:
t=select Symbol, Date, Time, mcorr(mavg(Bid_Size,100), mavg(Offer_Size,100), 60) from quotes where Date=2016.10.24, Symbol=`AAPL, Time between 09:30:00.000000000 : 15:59:59.999999999 context by Symbol order by Symbol, Date, Time
3.3. pivot by Clause
The pivot by clause is an extension of ANSI SQL by DolphinDB to generate a pivot table. It can be used together with an aggregate function.
select sum(Trade_Volume) as Volume from trades where Date=2016.10.24, Symbol in `DAL`AAL`UAL`LUV`JBLU, Time between 09:30:00.000000000 : 15:59:59.999999999 pivot by minute(Time) as minute, Symbol
The result is:
minute AAL DAL JBLU LUV UAL
------ ------- ------ ------- ------ ------
09:30m 214,978 77,439 152,927 88,763 5,001
09:31m 20,950 42,583 19,134 15,255 52,307
09:32m 68,403 71,521 23,395 25,780 28,131
09:33m 27,260 62,254 8,807 29,488 20,510
09:34m 37,784 32,374 22,429 43,823 12,411
09:35m 12,704 19,616 7,254 28,472 32,387
09:36m 22,285 40,953 64,316 25,013 20,959
09:37m 21,865 25,745 22,111 19,484 10,913
09:38m 47,561 42,527 16,508 19,974 32,019
09:39m 30,173 36,081 20,173 46,281 20,150
......
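As a possible follow-up, the pivot result can be charted directly with the plot function shown earlier (assigning the result to a variable is our addition, not in the original):
volumes = select sum(Trade_Volume) as Volume from trades where Date=2016.10.24, Symbol in `DAL`AAL`UAL`LUV`JBLU, Time between 09:30:00.000000000 : 15:59:59.999999999 pivot by minute(Time) as minute, Symbol
// plot one line per symbol against the minute axis
cols = select AAL, DAL, JBLU, LUV, UAL from volumes
plot(cols, volumes.minute, "Per-minute trading volume")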
3.4. Asof Join and Window Join
Asof join (function aj) and window join (function wj) are frequently used to calculate trading costs. Trades and quotes are usually saved in different tables. For each stock, asof join gets the last quote before (and up to) each trade, and window join conducts calculations on a specified window of quotes relative to each trade. Function aj is about 200 times faster than the asof join function in pandas.
In the following example, we calculate two measures of trading cost (in basis points) with asof join and window join, respectively. For window join we use pwj, the prevailing variant of wj whose window also includes the last quote before it, and the average quote in the 100 milliseconds before each trade is used instead of the last quote used by asof join.
TC1 = select sum(Trade_Volume*abs(Trade_Price-(Bid_Price+Offer_Price)/2))/sum(Trade_Volume*Trade_Price)*10000 as cost from aj(trades, quotes, `Symbol`Time) where Time between 09:30:00.000000000 : 15:59:59.999999999 group by Symbol
TC2 = select sum(Trade_Volume*abs(Trade_Price-(Bid_Price+Offer_Price)/2))/sum(Trade_Volume*Trade_Price)*10000 as cost from pwj(trades, quotes, -100000000:0,<[avg(Offer_Price) as Offer_Price, avg(Bid_Price) as Bid_Price]>,`Symbol`Time) where Time between 09:30:00.000000000 : 15:59:59.999999999 group by Symbol
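Before aggregating, it can help to inspect the join itself. A minimal sketch (this top-10 inspection query is our addition) showing the quote that asof join matches to each AAPL trade:
// each trade row is paired with the last quote at or before its timestamp
t = select top 10 Symbol, Time, Trade_Price, Bid_Price, Offer_Price from aj(trades, quotes, `Symbol`Time) where Date=2016.10.24, Symbol=`AAPL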