Quick Start Guide for Shark GPLearn
Overview
GPLearn is often used in quant finance to extract factors from historical data to guide trading decisions for stocks and futures. However, it currently faces several challenges:

Low computing performance: Extracting high-quality factors requires increasing the complexity of individual operators, the number of initial formulas, and the number of iterations, all of which add to the computation workload. For example, using 4 years of stock data for single-factor mining, a single round of mining (with 15 generations) takes about 5 to 24 hours.

Insufficient operators: The basic mathematical operators provided by Python GPLearn are insufficient to fit the complex features of financial data.

Difficulty in handling 3D data: GPLearn can only handle two-dimensional input data, e.g., data of all stocks at the same time (cross-sectional data) or data of the same stock at different times (time-series data). For three-dimensional data (different times + different stocks + features), additional grouping operations (group by) are required, which increases the complexity of data processing.
Shark GPLearn
Shark GPLearn is a framework for solving symbolic regression with genetic algorithms, aiming to automatically generate factors that fit the data distribution. Compared with Python GPLearn, Shark GPLearn brings the following improvements:

GPU acceleration: Shark GPLearn utilizes GPUs to significantly improve the efficiency of factor mining and computation.

Extensive operator library: Shark GPLearn integrates the functions built into the DolphinDB database to enrich its operator library, enabling it to accurately fit the complex features of data.

Support for 3D data: Shark GPLearn supports processing three-dimensional data. Users only need to set the grouping column name for in-group calculation.
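To see what in-group calculation means for panel (3D) data, consider the following sketch. It is plain Python with made-up rows, not Shark GPLearn's implementation; it only shows the per-group bookkeeping that a grouping column spares the user from writing by hand:

```python
# Hypothetical panel data: each row is (time, symbol, price).
rows = [
    (1, "A", 10.0), (1, "B", 20.0),
    (2, "A", 11.0), (2, "B", 19.0),
    (3, "A", 12.0), (3, "B", 21.0),
]

def diff_within_group(rows):
    # First-order difference of price, computed independently per symbol.
    # With 2-D GPLearn this per-group logic must be coded manually;
    # Shark GPLearn applies it automatically once a grouping column is set.
    last = {}
    out = []
    for t, sym, px in rows:
        out.append(px - last[sym] if sym in last else 0.0)
        last[sym] = px
    return out

print(diff_within_group(rows))  # [0.0, 0.0, 1.0, -1.0, 1.0, 2.0]
```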
Genetic Programming
A genetic algorithm is a search optimization method inspired by natural selection and Darwin's theory of evolution, especially the idea of "survival of the fittest". There are some key concepts in genetic algorithms, such as "population", "crossover (mating)", "mutation" and "selection", all of which come from the evolutionary theory. A population is the set of all possible solutions. Crossover is a way to generate new solutions by combining parts of two solutions. Mutation is a random change to a solution to increase the diversity of the population. Selection is the process of keeping solutions with better fitness.
The process of genetic programming can be summarized into the following steps:

Initialization: Generate a population of random formulas.

Evaluation: Evaluate the fitness (i.e., the accuracy of fitting the data) of each formula and give a fitness score.

Selection: Select formulas based on their fitness scores for the next generation. Formulas with better fitness scores have a higher probability of being selected.

Crossover and mutation: Randomly select two formulas and exchange parts of them to generate two new formulas, or randomly modify a part of one formula to generate a new formula.

Iteration: Repeat steps 2-4 until the stopping criterion is met, such as reaching the maximum number of iterations or finding a sufficiently good formula.
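The steps above can be sketched as a minimal genetic-programming loop. The following Python toy (none of these names belong to the Shark GPLearn API) evolves formula trees to recover the target y = x*x + x:

```python
import random

# Toy genetic-programming loop for symbolic regression. It only
# illustrates the initialize/evaluate/select/evolve cycle described above.

OPS = {"add": lambda a, b: a + b,
       "sub": lambda a, b: a - b,
       "mul": lambda a, b: a * b}
TERMINALS = ["x", 1.0]

def random_tree(depth):
    # Initialization: a random formula tree.
    if depth == 0 or random.random() < 0.3:
        return random.choice(TERMINALS)
    return (random.choice(list(OPS)),
            random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, x):
    if tree == "x":
        return x
    if isinstance(tree, float):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left, x), evaluate(right, x))

def size(tree):
    return 1 if not isinstance(tree, tuple) else 1 + size(tree[1]) + size(tree[2])

def mae(tree, xs, ys):
    return sum(abs(evaluate(tree, x) - y) for x, y in zip(xs, ys)) / len(xs)

def fitness(tree, xs, ys):
    # Evaluation: error plus a small size penalty to discourage bloat.
    return mae(tree, xs, ys) + 0.001 * size(tree)

def mutate(tree):
    # Mutation: replace a random subtree with a fresh random one.
    if not isinstance(tree, tuple) or random.random() < 0.3:
        return random_tree(2)
    op, left, right = tree
    return (op, mutate(left), right) if random.random() < 0.5 else (op, left, mutate(right))

def crossover(a, b):
    # Simplified crossover: graft the whole donor tree b at a random point
    # in a (real GP swaps randomly chosen subtrees of both parents).
    if not isinstance(a, tuple) or random.random() < 0.3:
        return b
    op, left, right = a
    return (op, crossover(left, b), right) if random.random() < 0.5 else (op, left, crossover(right, b))

def gp(xs, ys, pop_size=60, generations=20):
    pop = [random_tree(3) for _ in range(pop_size)]
    for _ in range(generations):                    # iteration
        scored = sorted(pop, key=lambda t: fitness(t, xs, ys))
        survivors = scored[: pop_size // 3]         # selection
        pop = survivors[:]
        while len(pop) < pop_size:                  # crossover and mutation
            a, b = random.sample(survivors, 2)
            pop.append(crossover(a, b) if random.random() < 0.7 else mutate(a))
    return min(pop, key=lambda t: fitness(t, xs, ys))

random.seed(7)
xs = [i / 10 for i in range(-20, 21)]
ys = [x * x + x for x in xs]          # target formula to recover
best = gp(xs, ys)
print("best error:", round(mae(best, xs, ys), 4))
```

Shark GPLearn runs the same cycle, but with GPU-parallel evaluation and a much richer operator set.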
Full Example

Deploy DolphinDB on a GPU-enabled server.
Shark GPLearn requires a GPU with compute capability above 6.0. See NVIDIA Compute Capability for more information.

Prepare data for training
    def prepareData(num) {
        total = num
        data = table(total:total, `a`b`c`d, [FLOAT, FLOAT, FLOAT, FLOAT])  // 1024*1024*5 rows
        data[`a] = rand(10.0, total) - rand(5.0, total)
        data[`b] = rand(10.0, total) - rand(5.0, total)
        data[`c] = rand(10.0, total) - rand(5.0, total)
        data[`d] = rand(10.0, total) - rand(5.0, total)
        return data
    }
    num = 1024 * 1024 * 5
    source = prepareData(num)

Prepare data for prediction
    a = source[`a]
    b = source[`b]
    c = source[`c]
    d = source[`d]
    predVec = a*a*a*a/(a*a*a*a+1) + b*b*b*b/(b*b*b*b+1) + c*c*c*c/(c*c*c*c+1) + d*d*d*d/(d*d*d*d+1)
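For clarity, the target being fitted applies f(t) = t^4 / (t^4 + 1) to each column and sums the four results. A quick check of this shape in plain Python (illustrative only, not DolphinDB):

```python
# Each column passes through f(t) = t^4 / (t^4 + 1), a bounded function
# that maps any real t into [0, 1); the four column results are summed,
# mirroring the predVec expression in the DolphinDB script.

def f(t):
    t4 = t ** 4
    return t4 / (t4 + 1)

def target(a, b, c, d):
    return f(a) + f(b) + f(c) + f(d)

print(f(0.0))                      # 0.0
print(f(1.0))                      # 0.5
print(target(1.0, 1.0, 1.0, 1.0))  # 2.0
```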

Perform training
    engine = createGPLearnEngine(source, predVec,
        functionSet=['add', 'sub', 'mul', 'div', 'sqrt', 'log', 'abs', 'neg',
                     'max', 'min', 'sin', 'cos', 'tan'],
        constRange=0, initDepth=[2,5], restrictDepth=true)
    engine.gpFit(10)

Make predictions
    predict = engine.gpPredict(source, 10)

Training output:

[Figures: training process and prediction results]
Related Functions
createGPLearnEngine: Create a GPLearn engine.

gpFit: Train the engine and get the trained programs.

gpPredict: Make predictions.

setGpFitnessFunc: Set the fitness function for the GPLearn engine.

addGpFunction: Add a user-defined training function to the GPLearn engine.
Appendix
The following table lists the available functions for building and evolving programs. The parameter n indicates the sliding window size, taken from windowRange. For all m-functions, if the current window contains fewer than n elements, 0 is returned.
| Function | Number of Inputs | Description |
| --- | --- | --- |
| add(x,y) | 2 | Addition |
| sub(x,y) | 2 | Subtraction |
| mul(x,y) | 2 | Multiplication |
| div(x,y) | 2 | Division; returns 1 if the absolute value of the divisor is less than 0.001 |
| max(x,y) | 2 | Maximum value |
| min(x,y) | 2 | Minimum value |
| sqrt(x) | 1 | Square root of the absolute value |
| log(x) | 1 | Returns 0 if the absolute value of x is less than 0.001; otherwise returns log(abs(x)) |
| neg(x) | 1 | Negation |
| reciprocal(x) | 1 | Reciprocal; returns 0 if the absolute value of x is less than 0.001 |
| abs(x) | 1 | Absolute value |
| sin(x) | 1 | Sine function |
| cos(x) | 1 | Cosine function |
| tan(x) | 1 | Tangent function |
| sig(x) | 1 | Sigmoid function |
| mdiff(x, n) | 1 | n-th order difference of x |
| mcovar(x, y, n) | 2 | Covariance of x and y with a sliding window of size n |
| mcorr(x, y, n) | 2 | Correlation of x and y with a sliding window of size n |
| mstd(x, n) | 1 | Sample standard deviation of x with a sliding window of size n |
| mmax(x, n) | 1 | Maximum value of x with a sliding window of size n |
| mmin(x, n) | 1 | Minimum value of x with a sliding window of size n |
| msum(x, n) | 1 | Sum of x with a sliding window of size n |
| mavg(x, n) | 1 | Average of x with a sliding window of size n |
| mprod(x, n) | 1 | Product of x with a sliding window of size n |
| mvar(x, n) | 1 | Sample variance of x with a sliding window of size n |
| mvarp(x, n) | 1 | Population variance of x with a sliding window of size n |
| mstdp(x, n) | 1 | Population standard deviation of x with a sliding window of size n |
| mimin(x, n) | 1 | Index of the minimum value of x with a sliding window of size n |
| mimax(x, n) | 1 | Index of the maximum value of x with a sliding window of size n |
| mbeta(x, y, n) | 2 | Least squares estimate of the regression coefficient of x on y with a sliding window of size n |
| mwsum(x, y, n) | 2 | Inner product of x and y with a sliding window of size n |
| mwavg(x, y, n) | 2 | Weighted average of x using y as weights with a sliding window of size n |
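As a rough illustration of the protected-operator semantics and the window rule described above, here is a Python sketch (not the actual Shark GPLearn implementation; thresholds follow the table):

```python
import math

EPS = 0.001  # threshold used by the protected operators in the table

def p_div(x, y):
    # div(x, y): returns 1 when the divisor is close to zero
    return 1.0 if abs(y) < EPS else x / y

def p_sqrt(x):
    # sqrt(x): square root of the absolute value
    return math.sqrt(abs(x))

def p_log(x):
    # log(x): 0 near zero, otherwise log of the absolute value
    return 0.0 if abs(x) < EPS else math.log(abs(x))

def reciprocal(x):
    return 0.0 if abs(x) < EPS else 1.0 / x

def msum(xs, n):
    # m-function window rule: windows with fewer than n elements yield 0.
    return [0.0 if i + 1 < n else sum(xs[i + 1 - n : i + 1])
            for i in range(len(xs))]

print(p_div(1.0, 0.0))        # 1.0
print(msum([1, 2, 3, 4], 2))  # [0.0, 3, 5, 7]
```

Protecting division, log, and sqrt this way keeps every randomly generated formula numerically defined over the whole input range, so evolution never has to discard a program for producing NaN or infinity.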