Machine Learning
DolphinDB implements machine learning algorithms such as linear regression, random forest, and K-means, allowing users to easily complete regression, classification, and clustering tasks. This tutorial introduces the process of doing machine learning with the DolphinDB scripting language through specific application examples.
All code examples in this tutorial are based on DolphinDB server version 1.10.11.
1. Small-Sample Classification
We use the Wine dataset provided by UCI Machine Learning Repository to train our first random forest classification model.
1.1. Import Data to DolphinDB
Download the dataset and import it into DolphinDB with the loadText function:
wineSchema = table(
    `Label`Alcohol`MalicAcid`Ash`AlcalinityOfAsh`Magnesium`TotalPhenols`Flavanoids`NonflavanoidPhenols`Proanthocyanins`ColorIntensity`Hue`OD280_OD315`Proline as name,
    `INT`DOUBLE`DOUBLE`DOUBLE`DOUBLE`DOUBLE`DOUBLE`DOUBLE`DOUBLE`DOUBLE`DOUBLE`DOUBLE`DOUBLE`DOUBLE as type
)
wine = loadText("/home/dataset/wine.data", schema=wineSchema)
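The UCI Wine dataset contains 178 samples. As an optional quick check that the import succeeded, count the rows of the in-memory table:
wine.size() // 178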
1.2. Data Preprocessing
The DolphinDB randomForestClassifier function requires the class labels to be integers in [0, numClasses), while the Wine dataset uses class labels 1, 2, 3. The following code shifts the labels to 0, 1, 2.
update wine set Label = Label - 1
Split the data into training and testing sets in a 7:3 ratio. In this example, the function trainTestSplit is defined to facilitate the split:
def trainTestSplit(x, testRatio) {
    xSize = x.size()
    testSize = xSize * testRatio
    // shuffle the row indices and use them to split the table randomly
    r = (0..(xSize-1)).shuffle()
    return x[r > testSize], x[r <= testSize]
}
wineTrain, wineTest = trainTestSplit(wine, 0.3)
wineTrain.size() // 124
wineTest.size() // 54
1.3. Random Forest Classification
We perform random forest classification on the training set with the randomForestClassifier function. It has four required parameters:
- ds: The input data source (usually generated with the sqlDS function).
- yColName: The column name of the dependent variable in the data source.
- xColNames: The column names of the independent variables in the data source.
- numClasses: The number of classes.
model = randomForestClassifier(
    sqlDS(<select * from wineTrain>),
    yColName=`Label,
    xColNames=`Alcohol`MalicAcid`Ash`AlcalinityOfAsh`Magnesium`TotalPhenols`Flavanoids`NonflavanoidPhenols`Proanthocyanins`ColorIntensity`Hue`OD280_OD315`Proline,
    numClasses=3
)
Predict the test set with the trained model:
predicted = model.predict(wineTest)
Examine the prediction accuracy:
sum(predicted == wineTest.Label) \ wineTest.size();
0.925926
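The same accuracy calculation recurs later in this tutorial, so it can optionally be wrapped in a small helper (a convenience sketch, not a built-in DolphinDB function):
def accuracy(predicted, actual) {
    // fraction of predictions that match the true labels
    return sum(predicted == actual) \ actual.size()
}
accuracy(predicted, wineTest.Label) // 0.925926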
1.4. Model Persistence
Save the trained model to disk using the saveModel function:
model.saveModel("/home/model/wineModel.bin")
You can load the model from disk with loadModel for predictions:
model = loadModel("/home/model/wineModel.bin")
predicted = model.predict(wineTest)
2. Distributed Machine Learning
The above example uses a small dataset for demonstration purposes. Unlike most machine learning libraries, DolphinDB's is specifically designed for distributed processing and provides reliable support for machine learning algorithms in distributed environments. This chapter introduces how to train a classification model with the logistic regression algorithm on a DolphinDB distributed database.
We use a DFS database partitioned by stock ID. It contains daily OHLC data of stocks from 2010 to 2018.
The following nine variables are used as predictive indicators:
- opening price
- highest price
- lowest price
- closing price
- difference between the opening price of the current day and the closing price of the previous day
- difference between the opening price of the current day and the opening price of the previous day
- 10-day moving average
- 10-day moving correlation between the closing price and its 10-day moving average
- relative strength index (RSI)
We will predict whether the closing price of the next day is greater than that of the current day.
2.1. Data Preprocessing
In this example, missing values in the raw data are filled with the ffill function. The first 10 rows of the 10-day moving average and RSI results are empty, so these records are removed from the output. We use the transDS! function to apply the preprocessing steps to the original data. Calculating RSI uses the ta module of DolphinDB; see DolphinDBModules for details on its usage.
use ta
def preprocess(t) {
    // forward-fill missing OHLC values
    ohlc = select ffill(Open) as Open, ffill(High) as High, ffill(Low) as Low, ffill(Close) as Close from t
    // derive the predictive indicators and the target label (1 if the next close is higher than the current close)
    update ohlc set OpenClose = Open - prev(Close), OpenOpen = Open - prev(Open), S_10 = mavg(Close, 10), RSI = ta::rsi(Close, 10), Target = iif(next(Close) > Close, 1, 0)
    update ohlc set Corr = mcorr(Close, S_10, 10)
    // drop the first 10 rows, where the moving-window indicators are empty
    return ohlc[10:]
}
After loading the data, generate a data source with the sqlDS function, and use transDS! to transform the data source with the preprocessing function preprocess defined above:
ohlc = database("dfs://trades").loadTable("ohlc")
ds = sqlDS(<select * from ohlc>).transDS!(preprocess)
2.2. Model Training
We use the logisticRegression function, which has three required parameters:
- ds: The input data source (usually generated with the sqlDS function).
- yColName: The column name of the dependent variable in the data source.
- xColNames: The column names of the independent variables in the data source.
The data source generated in "Data Preprocessing" can be passed to ds:
model = logisticRegression(ds, `Target, `Open`High`Low`Close`OpenClose`OpenOpen`S_10`RSI`Corr)
Then use the trained model for prediction and check the classification accuracy:
aapl = preprocess(select * from ohlc where Ticker = `AAPL)
predicted = model.predict(aapl)
score = sum(predicted == aapl.Target) \ aapl.size() // 0.756522
3. Using PCA for Dimensionality Reduction
Principal component analysis (PCA) is a popular technique in machine learning for analyzing large datasets containing a high number of dimensions/features per observation. It is often used to reduce the dimensionality of large datasets by transforming a large set of variables into a smaller one that still contains most of the information in the original set. PCA can also be used to transform high-dimensional data into low-dimensional data (2 or 3 dimensions) so that it can be visualized easily.
Returning to the example in "Small-Sample Classification", the input dataset has 13 independent variables. Calling the pca function on the data source reports the variance ratio explained by each principal component. Set the normalize parameter to true to normalize the data.
xColNames = `Alcohol`MalicAcid`Ash`AlcalinityOfAsh`Magnesium`TotalPhenols`Flavanoids`NonflavanoidPhenols`Proanthocyanins`ColorIntensity`Hue`OD280_OD315`Proline
pcaRes = pca(
    sqlDS(<select * from wineTrain>),
    colNames=xColNames,
    normalize=true
)
The return value is a dictionary. The explainedVarianceRatio field shows that the first three principal components already account for a substantial share of the variance, so compressing the data into three dimensions is sufficient for training.
pcaRes.explainedVarianceRatio;
[0.209316,0.201225,0.121788,0.088709,0.077805,0.075314,0.058028,0.045604,0.038463,0.031485,0.021256,0.018073,0.012934]
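As a quick check, the first three components together explain about 53% of the total variance (a sketch assuming standard vector slicing on the returned vector):
sum(pcaRes.explainedVarianceRatio[0:3]) // 0.532329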
Keep only the first three principal components:
components = pcaRes.components.transpose()[:3]
Apply the principal component matrix to the input dataset and call randomForestClassifier for training:
def principalComponents(t, components, yColName, xColNames) {
    // project the feature columns onto the principal component matrix
    res = matrix(t[xColNames]).dot(components).table()
    res[yColName] = t[yColName]
    return res
}
ds = sqlDS(<select * from wineTrain>)
ds.transDS!(principalComponents{, components, `Label, xColNames})
model = randomForestClassifier(ds, yColName=`Label, xColNames=`col0`col1`col2, numClasses=3)
The principal components of the test set also need to be extracted before prediction:
model.predict(wineTest.principalComponents(components, `Label, xColNames))
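As with the full-dimensional model, the classification accuracy can be checked against the test labels (a sketch; no reference value is quoted here since the result depends on the random forest's internal randomness):
predicted = model.predict(wineTest.principalComponents(components, `Label, xColNames))
sum(predicted == wineTest.Label) \ wineTest.size()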
4. Linear Regression: Ridge, Lasso, and ElasticNet
DolphinDB offers the functions ols and olsEx for ordinary least squares regression. These work well for low-dimensional data, but can overfit when working with high-dimensional data. To address this issue, the Ridge, Lasso, and ElasticNet regression methods extend the algorithm with regularization, each tackling the problem from a different angle (the objective functions are sketched after this list):
- Lasso performs L1 regularization, which adds a penalty equivalent to the absolute value of the magnitude of the coefficients to the objective function, and is useful for sparse feature selection.
- Ridge performs L2 regularization, which adds a penalty equivalent to the square of the magnitude of the coefficients.
- ElasticNet employs both L1 and L2 penalties during training.
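In common notation (a sketch of the standard formulations; the exact scaling of alpha in DolphinDB's implementation may differ), the three objectives are:
Lasso: minimize ||y - Xβ||² + α·||β||₁
Ridge: minimize ||y - Xβ||² + α·||β||₂²
ElasticNet: minimize ||y - Xβ||² + α·(l1Ratio·||β||₁ + (1 - l1Ratio)/2·||β||₂²)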
DolphinDB provides the functions lasso, ridge, and elasticNet accordingly. They all have three required parameters:
- ds: An in-memory table or the input data source (usually generated with the sqlDS function).
- yColName: The column name of the dependent variable in the data source.
- xColNames: The column names of the independent variables in the data source.
The function lasso is a special case of elasticNet when l1Ratio = 1; they share the same implementation, which uses coordinate descent to compute the parameters. The function ridge uses an analytical solution, and its solver can be 'svd' or 'cholesky'.
To train a model:
model = lasso(sqlDS(<select * from t>), `y, `x0`x1, alpha=0.5)
To predict on a test set:
model.predict(t)
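The elasticNet and ridge functions are called in the same way (a sketch based on the parameters described above; l1Ratio and solver are assumed to be optional keyword parameters, and the hypothetical table t with columns y, x0, x1 is reused):
model = elasticNet(sqlDS(<select * from t>), `y, `x0`x1, alpha=0.5, l1Ratio=0.5)
model = ridge(sqlDS(<select * from t>), `y, `x0`x1, solver="svd")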
5. Machine Learning With DolphinDB Plugins
In addition to the built-in functions that implement classical machine learning algorithms, DolphinDB also provides plugins for calling third-party libraries for machine learning. This chapter uses the DolphinDB XGBoost plugin as an example.
5.1. Load XGBoost Plugin
Download the pre-compiled XGBoost plugin from xxxx to your local machine. Then run loadPlugin(pathToXgboost) in DolphinDB, where pathToXgboost is the path to the downloaded PluginXgboost.txt file:
pathToXgboost = "C:/DolphinDB/plugin/xgboost/PluginXgboost.txt"
loadPlugin(pathToXgboost)
5.2. Train and Predict
We also use the Wine dataset for model training in this example. The training method xgboost::train(Y, X, [params], [numBoostRound=10], [xgbModel]) is used. The Label column of the wineTrain table is taken as the input Y, and the other columns are kept as X.
Y = exec Label from wineTrain
X = select Alcohol, MalicAcid, Ash, AlcalinityOfAsh, Magnesium, TotalPhenols, Flavanoids, NonflavanoidPhenols, Proanthocyanins, ColorIntensity, Hue, OD280_OD315, Proline from wineTrain
Before training the model, we need to specify a dictionary of params. We will train a multi-class classification model, so the objective is set to "multi:softmax" and the number of classes num_class is set to 3. Refer to the XGBoost documentation for the parameter descriptions.
Parameters in this example are set as below:
params = {
    objective: "multi:softmax",
    num_class: 3,
    max_depth: 5,
    eta: 0.1,
    subsample: 0.9
}
Train the model, predict and calculate the classification accuracy:
model = xgboost::train(Y, X, params)
testX = select Alcohol, MalicAcid, Ash, AlcalinityOfAsh, Magnesium, TotalPhenols, Flavanoids, NonflavanoidPhenols, Proanthocyanins, ColorIntensity, Hue, OD280_OD315, Proline from wineTest
predicted = xgboost::predict(model, testX)
sum(predicted == wineTest.Label) \ wineTest.size() // 0.962963
Similarly, the model can be persisted to or loaded from disk:
xgboost::saveModel(model, "xgboost001.mdl")
model = xgboost::loadModel("xgboost001.mdl")
By specifying the xgbModel parameter of xgboost::train, incremental training can be performed on an existing model:
model = xgboost::train(Y, X, params, , model)
6. Appendix: Built-in Functions for Machine Learning
6.1. A. Training Functions
| Function | Usage | Description | Distributed Processing |
| --- | --- | --- | --- |
| adaBoostClassifier | classification | AdaBoost classification | √ |
| adaBoostRegressor | regression | AdaBoost regression | √ |
| elasticNet | regression | Elastic net regression | × |
| gaussianNB | classification | Naive Bayesian classification | × |
| glm | classification/regression | generalized linear model (GLM) | √ |
| kmeans | clustering | k-means clustering | × |
| knn | classification | k-nearest neighbors (KNN) | × |
| lasso | regression | Lasso regression | × |
| logisticRegression | classification | logistic regression | √ |
| multinomialNB | classification | multinomial Naive Bayesian classification | × |
| ols | regression | ordinary least squares (OLS) regression | × |
| olsEx | regression | ordinary least squares (OLS) regression | √ |
| pca | dimensionality reduction | principal component analysis (PCA) | √ |
| randomForestClassifier | classification | random forest classification | √ |
| randomForestRegressor | regression | random forest regression | √ |
| ridge | regression | ridge regression | √ |
6.2. B. Tools
| Function | Description |
| --- | --- |
| loadModel | load a model |
| saveModel | save a model |
| predict | make predictions |
6.3. C. Plugins
| Plugin | Usage | Description |
| --- | --- | --- |
| XGBoost | classification/regression | gradient boosting based on XGBoost |
| svm | classification/regression | support vector machines based on LIBSVM |