Quick Start Guide for Shark GPLearn
Overview
GPLearn is often used in quantitative finance to extract factors from historical data to guide trading decisions for stocks and futures. However, it currently faces several challenges:
- Low computing performance: Extracting high-quality factors requires increasing the complexity of individual operators, the number of initial formulas, and the number of iterations, all of which increase the computation workload. For example, when mining a single factor from 4 years of stock data, a single round of mining (with 15 generations) takes about 5-24 hours.
- Insufficient operators: The basic mathematical operators provided by Python GPLearn struggle to fit the complex features of financial data.
- Difficulty in handling 3D data: GPLearn can only handle two-dimensional input data, e.g., data of all stocks at the same time (cross-sectional data) or data of the same stock at different times (time series data). For three-dimensional data (different times + different stocks + features), additional grouping operations (group by) are required, which increases the complexity of data processing.
Shark GPLearn
Shark GPLearn is a framework designed for solving symbolic regression using genetic algorithms, which aims to automatically generate factors that fit the data distribution. Compared with Python GPLearn, Shark GPLearn brings the following improvements:
- GPU acceleration: Shark GPLearn utilizes GPUs to significantly improve the efficiency of factor mining and computation.
- Extensive operator library: Shark GPLearn integrates the functions built into the DolphinDB database to enrich its operator library, enabling it to accurately fit the complex features of data.
- Support for 3D data: Shark GPLearn supports the processing of three-dimensional data. Users only need to set the grouping column name for in-group calculation.
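To make "in-group calculation" concrete, here is a minimal Python sketch of what a grouped sliding-window operator has to do on panel data: each stock keeps its own history, so windows never mix values from different stocks. This is an illustration only, not Shark GPLearn's implementation; the function name `grouped_moving_average`, the column names, and the sample rows are hypothetical, and returning 0 for incomplete windows is an assumption borrowed from the appendix of this guide.

```python
from collections import defaultdict

def grouped_moving_average(rows, group_key, value_key, window):
    """Moving average of value_key, computed separately within each group.

    rows must already be sorted by time. Positions whose group has seen
    fewer than `window` values get 0.0 (an assumption of this sketch).
    """
    history = defaultdict(list)  # group -> values seen so far, in time order
    result = []
    for row in rows:
        values = history[row[group_key]]
        values.append(row[value_key])
        if len(values) < window:
            result.append(0.0)
        else:
            result.append(sum(values[-window:]) / window)
    return result

# Hypothetical panel: two stocks interleaved over three timestamps.
panel = [
    {"time": 1, "stock": "A", "close": 10.0},
    {"time": 1, "stock": "B", "close": 100.0},
    {"time": 2, "stock": "A", "close": 12.0},
    {"time": 2, "stock": "B", "close": 104.0},
    {"time": 3, "stock": "A", "close": 14.0},
    {"time": 3, "stock": "B", "close": 102.0},
]
ma = grouped_moving_average(panel, "stock", "close", window=2)
print(ma)  # [0.0, 0.0, 11.0, 102.0, 13.0, 103.0]
```

Without the grouping step, a window of size 2 over the interleaved rows would average stock A's price with stock B's, which is the extra bookkeeping a 2D-only tool pushes onto the user.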
Genetic Programming
A genetic algorithm is a search and optimization method inspired by natural selection and Darwin's theory of evolution, especially the idea of "survival of the fittest". Its key concepts, "population", "crossover (mating)", "mutation", and "selection", all come from evolutionary theory. A population is the set of candidate solutions in the current generation. Crossover generates new solutions by combining parts of two existing solutions. Mutation randomly changes a solution to increase the diversity of the population. Selection keeps the solutions with better fitness.
The process of genetic programming can be summarized into the following steps:
1. Initialization: Generate a population of random formulas.
2. Evaluation: Evaluate the fitness (i.e., the accuracy of fitting the data) of each formula and give it a fitness score.
3. Selection: Select formulas for the next generation based on their fitness scores. Formulas with better fitness scores have a higher probability of being selected.
4. Evolution and mutation: Randomly select two formulas and exchange part of one with part of the other to generate two new formulas (crossover), or randomly modify part of a formula to generate a new formula (mutation).
5. Iteration: Repeat steps 2-4 until the stopping criterion is met, such as reaching the maximum number of iterations or finding a sufficiently good formula.
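The steps above can be sketched in a few dozen lines of plain Python. This is a toy CPU implementation for illustration, not Shark GPLearn's algorithm: the tree encoding, the three-operator set, tournament selection, elitism, the depth cap, and all parameter values are assumptions of this sketch.

```python
import random

# A formula is the variable "x", a float constant, or a tuple (op, left, right).
OPS = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b, "mul": lambda a, b: a * b}
MAX_DEPTH = 6  # cap on tree depth so formulas cannot grow without bound

def random_formula(rng, depth):
    if depth == 0 or rng.random() < 0.3:
        return "x" if rng.random() < 0.7 else float(rng.randint(1, 3))
    return (rng.choice(sorted(OPS)), random_formula(rng, depth - 1), random_formula(rng, depth - 1))

def evaluate(f, x):
    if isinstance(f, tuple):
        return OPS[f[0]](evaluate(f[1], x), evaluate(f[2], x))
    return x if f == "x" else f

def depth(f):
    return 1 + max(depth(f[1]), depth(f[2])) if isinstance(f, tuple) else 0

def fitness(f, xs, ys):
    # Mean absolute error against the target values: lower is better.
    return sum(abs(evaluate(f, x) - y) for x, y in zip(xs, ys)) / len(xs)

def paths(f, path=()):
    # Yield the path (sequence of child indices) to every node of the tree.
    yield path
    if isinstance(f, tuple):
        yield from paths(f[1], path + (1,))
        yield from paths(f[2], path + (2,))

def get(f, path):
    for i in path:
        f = f[i]
    return f

def replace(f, path, new):
    if not path:
        return new
    parts = list(f)
    parts[path[0]] = replace(f[path[0]], path[1:], new)
    return tuple(parts)

def crossover(rng, a, b):
    # Swap a random subtree of a with a random subtree of b.
    pa, pb = rng.choice(list(paths(a))), rng.choice(list(paths(b)))
    return replace(a, pa, get(b, pb)), replace(b, pb, get(a, pa))

def mutate(rng, f):
    # Replace a random subtree with a freshly generated one.
    return replace(f, rng.choice(list(paths(f))), random_formula(rng, 2))

def tournament(rng, population, scores, k=3):
    # Pick k random formulas and keep the fittest one.
    return population[min(rng.sample(range(len(population)), k), key=lambda i: scores[i])]

def evolve(xs, ys, pop_size=60, generations=15, seed=0):
    rng = random.Random(seed)
    population = [random_formula(rng, 3) for _ in range(pop_size)]  # 1. initialization
    history = []
    for _ in range(generations + 1):
        scores = [fitness(f, xs, ys) for f in population]           # 2. evaluation
        best = min(range(pop_size), key=lambda i: scores[i])
        history.append(scores[best])
        if len(history) > generations or scores[best] == 0.0:       # 5. stopping criterion
            break
        next_gen = [population[best]]  # keep the best formula unchanged (elitism)
        while len(next_gen) < pop_size:
            a = tournament(rng, population, scores)                  # 3. selection
            b = tournament(rng, population, scores)
            if rng.random() < 0.8:                                   # 4. crossover ...
                children = crossover(rng, a, b)
            else:                                                    #    ... or mutation
                children = (mutate(rng, a), mutate(rng, b))
            for child, parent in zip(children, (a, b)):
                next_gen.append(child if depth(child) <= MAX_DEPTH else parent)
        population = next_gen[:pop_size]
    return population[best], history

xs = [i / 4 for i in range(-8, 9)]
ys = [x * x + x for x in xs]  # hypothetical target: y = x^2 + x
best_formula, history = evolve(xs, ys)
print(best_formula, history[-1])
```

Because the best formula of each generation is copied unchanged into the next one, the best fitness recorded in `history` can never get worse from one generation to the next. Every formula is re-evaluated on every generation, which is the workload that grows with population size, formula complexity, and data volume.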
Full Example
- Deploy DolphinDB on a GPU-enabled server.
  Shark GPLearn requires a GPU with compute capability over 6.0. See NVIDIA Compute Capability for more information.
- Prepare data for training
    def prepareData(num){
        total = num
        data = table(total:total, `a`b`c`d, [FLOAT,FLOAT,FLOAT,FLOAT]) // 1024*1024*5 rows
        data[`a] = rand(10.0, total) - rand(5.0, total)
        data[`b] = rand(10.0, total) - rand(5.0, total)
        data[`c] = rand(10.0, total) - rand(5.0, total)
        data[`d] = rand(10.0, total) - rand(5.0, total)
        return data
    }
    num = 1024 * 1024 * 5
    source = prepareData(num)
- Prepare data for prediction
    a = source[`a]
    b = source[`b]
    c = source[`c]
    d = source[`d]
    predVec = a*a*a*a/(a*a*a*a+1) + b*b*b*b/(b*b*b*b+1) + c*c*c*c/(c*c*c*c+1) + d*d*d*d/(d*d*d*d+1)
- Perform training
    engine = createGPLearnEngine(source, predVec, functionSet=['add', 'sub', 'mul', 'div', 'sqrt', 'log', 'abs', 'neg', 'max', 'min', 'sin', 'cos', 'tan'], constRange=0, initDepth=[2,5], restrictDepth=true)
    engine.gpFit(10)
- Make predictions
    predict = engine.gpPredict(source, 10)
- Training output:
  - Training process
  - Training
  - Predictions
Related Functions
- createGPLearnEngine: Create a GPLearn engine.
- gpFit: Get trained programs.
- gpPredict: Make predictions.
- setGpFitnessFunc: Set fitness function for the GPLearn engine.
- addGpFunction: Add a user-defined training function to the GPLearn engine.
Appendix
The following table lists available functions for building and evolving programs. The parameter n indicates the sliding window size taken from windowRange. For all m-functions, if the current window is smaller than n, 0 is returned.
Function | Number of Inputs | Description |
---|---|---|
add(x,y) | 2 | Addition |
sub(x,y) | 2 | Subtraction |
mul(x,y) | 2 | Multiplication |
div(x,y) | 2 | Division, returns 1 if the absolute value of the divisor is less than 0.001 |
max(x,y) | 2 | Maximum value |
min(x,y) | 2 | Minimum value |
sqrt(x) | 1 | Square root based on absolute value |
log(x) | 1 | If x < 0.001, returns 0, otherwise returns log(abs(x)) |
neg(x) | 1 | Negation |
reciprocal(x) | 1 | Reciprocal, returns 0 if the absolute value of x is less than 0.001 |
abs(x) | 1 | Absolute value |
sin(x) | 1 | Sine function |
cos(x) | 1 | Cosine function |
tan(x) | 1 | Tangent function |
sig(x) | 1 | Sigmoid function |
mdiff(x, n) | 1 | n-th order difference of x |
mcovar(x, y, n) | 2 | Covariance of x and y with a sliding window of size n |
mcorr(x, y, n) | 2 | Correlation of x and y with a sliding window of size n |
mstd(x, n) | 1 | Sample standard deviation of x with a sliding window of size n |
mmax(x, n) | 1 | Maximum value of x with a sliding window of size n |
mmin(x, n) | 1 | Minimum value of x with a sliding window of size n |
msum(x, n) | 1 | Sum of x with a sliding window of size n |
mavg(x, n) | 1 | Average of x with a sliding window of size n |
mprod(x, n) | 1 | Product of x with a sliding window of size n |
mvar(x, n) | 1 | Sample variance of x with a sliding window of size n |
mvarp(x, n) | 1 | Population variance of x with a sliding window of size n |
mstdp(x, n) | 1 | Population standard deviation of x with a sliding window of size n |
mimin(x, n) | 1 | Index of the minimum value of x with a sliding window of size n |
mimax(x, n) | 1 | Index of the maximum value of x with a sliding window of size n |
mbeta(x, y, n) | 2 | Least squares estimate of the regression coefficient of x on y with a sliding window of size n |
mwsum(x, y, n) | 2 | Inner product of x and y with a sliding window of size n |
mwavg(x, y, n) | 2 | Weighted average of x using y as weights with a sliding window of size n |
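The "protected" behavior described in the table (operators that return a fixed value instead of failing on unsafe inputs) and the zero-fill rule for incomplete windows can be illustrated with a short Python sketch. These are plain re-implementations written from the descriptions above, not Shark GPLearn's own code; the `p_` prefixes and the `EPS` constant are names chosen for this example.

```python
import math

EPS = 0.001  # threshold quoted in the table

def p_div(x, y):
    # Division; returns 1 if the absolute value of the divisor is below the threshold.
    return 1.0 if abs(y) < EPS else x / y

def p_sqrt(x):
    # Square root based on the absolute value, so negative inputs are safe.
    return math.sqrt(abs(x))

def p_log(x):
    # As worded in the table: 0 if x < 0.001, otherwise log(abs(x)).
    return 0.0 if x < EPS else math.log(abs(x))

def p_reciprocal(x):
    # Reciprocal; returns 0 if the absolute value of x is below the threshold.
    return 0.0 if abs(x) < EPS else 1.0 / x

def mavg(xs, n):
    # Sliding-window average; 0 while fewer than n values are available.
    return [0.0 if i + 1 < n else sum(xs[i + 1 - n:i + 1]) / n for i in range(len(xs))]

print(p_div(1.0, 0.0))                 # 1.0
print(p_sqrt(-9.0))                    # 3.0
print(p_reciprocal(0.0005))            # 0.0
print(mavg([1.0, 2.0, 3.0, 4.0], 2))   # [0.0, 1.5, 2.5, 3.5]
```

Protecting the operators this way means a randomly generated formula evaluates to a finite number on these edge cases, so its fitness can still be scored.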