Probability and Statistical Analysis

DolphinDB provides a comprehensive suite of built-in statistical functions designed for efficient probability and statistical analysis. With these functions, users can easily work with probability distributions, generate random numbers, run regression analysis, and perform hypothesis testing. These capabilities enable efficient data modeling and analysis, supporting critical tasks such as financial risk management, investment strategy research, and IoT analytics.

1. Probability and Statistics

DolphinDB provides comprehensive support for univariate and multivariate statistical functions. This section covers the implementation and usage of univariate statistical measures such as mad, kurtosis, and skew, as well as important multivariate metrics such as covariance and correlation coefficients.

1.1 Univariate Statistics

Univariate statistics describe a single variable in isolation, focusing on properties such as central tendency and dispersion. The univariate statistical functions supported by DolphinDB are listed below. Note that DolphinDB ignores null values when calculating univariate statistics.

Statistical Feature | Function
Minimum Value | min
Maximum Value | max
Mean | mean, avg
Unbiased Sample Variance | var
Population Variance | varp
Unbiased Sample Standard Deviation | std
Population Standard Deviation | stdp
Unbiased Standard Error of the Mean | sem
Median | med
Percentile | percentile
Mean Absolute Deviation | mad(X, useMedian=false)
Median Absolute Deviation | mad(X, useMedian=true)
Kurtosis | kurtosis
Skewness | skew
Autocorrelation | acf, autocor
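
To illustrate how these functions are called, and how null values are skipped, here is a quick sketch on a small vector; only the non-null values 1, 3, 5, and 7 enter the calculations.

x = [1.0, NULL, 3.0, 5.0, 7.0]
avg(x)            // 4: the null is ignored
med(x)            // 4
var(x)            // unbiased sample variance of 1, 3, 5, 7
mad(x)            // 2: mean absolute deviation (useMedian=false by default)
mad(x, true)      // 2: median absolute deviation
percentile(x, 75) // 75th percentile of the non-null values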

Common statistical features such as min, max, avg, std, and percentile can be obtained in a single call with the summary function. The following simple example runs a preliminary exploratory analysis with summary on a small synthetic table containing three stock names.

n=2022
data=1..2022
value=take(1..3,n)
name=take(`APPLE`IBM`INTEL,n)
t=table(data,value,name);
res = summary(t, precision=0.001);
res

/*Output
	name	min	max	nonNullCount	count	avg	std	percentile
0	data	1.00	2022.00	2022	2022	1011.50	583.85	[506.25, 1011.50, 1516.75]
1	value	1.00	3.00	2022	2022	2.000	0.8166985591	[1.00, 2.00, 3.00]
*/

The following sections will focus on more complex univariate statistical features such as mad, kurtosis, and skewness.

1.1.1 Mean Absolute Deviation (MAD)

The mean absolute deviation (MAD) serves as a reliable measure of dispersion, reflecting the spread of data points around the mean. When used in statistical inference, however, its statistical properties are inferior to those of the standard deviation. By default, the mad function calculates the mean absolute deviation using the following formula:

  MAD = (1/n) · Σ |xi − x̄|

The following comparison shows that the results from DolphinDB and Python NumPy agree to 8 decimal places.

import numpy as np
import dolphindb as ddb

def calculate_mad(data):
    # mean absolute deviation: average absolute distance from the mean
    m = np.mean(data)
    return np.mean(np.abs(data - m))

# Set random seed to ensure reproducibility
np.random.seed(42)

data = np.random.randint(1, 101, size=10000)

# Calculate the mean absolute deviation using Python NumPy
result1 = calculate_mad(data)
print(f"MAD value using Python NumPy: {result1:.8f}")

# Calculate the mean absolute deviation using DolphinDB (useMedian=false is the default)
s = ddb.session()
s.connect("localhost", 7271)
result2 = s.run("mad{,false}", data)
print(f"MAD value using DolphinDB: {result2:.8f}")

# Data validation
abs_diff = abs(result1 - result2)
if abs_diff < 1e-8:
    print("MAD calculation results are consistent up to 8 decimal places.")
else:
    print(f"MAD calculation results differ by: {abs_diff:.8f}")

s.close()

/* output
Both prints report the same mean absolute deviation, and the final check
confirms that the results are consistent up to 8 decimal places.
*/

1.1.2 Median Absolute Deviation (MAD)

The median absolute deviation (MAD) is a robust measure of statistical dispersion that is far less influenced by extreme outliers than the mean absolute deviation, because it uses the median rather than the mean as the reference point. To calculate the median absolute deviation, call the mad function with useMedian=true. The formula for calculating MAD is as follows:

  MAD = median(|xi − median(X)|)

The following comparison shows that the results from DolphinDB and Python NumPy agree to 8 decimal places.

import numpy as np
import dolphindb as ddb

def calculate_mad(data):
    # median absolute deviation: median absolute distance from the median
    med = np.median(data)
    abs_diffs = np.abs(data - med)
    return np.median(abs_diffs)

# Set random seed to ensure reproducibility
np.random.seed(42)

data = np.random.randint(1, 101, size=10000)

# Calculate the median absolute deviation using Python NumPy
result1 = calculate_mad(data)
print(f"MAD value using Python NumPy: {result1:.8f}")

# Calculate the median absolute deviation using DolphinDB (useMedian=true)
s = ddb.session()
s.connect("localhost", 7271)
result2 = s.run("mad{,true}", data)
print(f"MAD value using DolphinDB: {result2:.8f}")

# Data validation
abs_diff = abs(result1 - result2)
if abs_diff < 1e-8:
    print("MAD calculation results are consistent up to 8 decimal places.")
else:
    print(f"MAD calculation results differ by: {abs_diff:.8f}")
    
s.close()

/* output
MAD value using Python NumPy: 25.00000000
MAD value using DolphinDB: 25.00000000
MAD calculation results are consistent up to 8 decimal places.
*/

MAD can be effectively used to detect outliers. The process typically involves the following steps:

  1. Use the med function to obtain the median of the dataset as the central reference.
  2. Use the mad function (with useMedian=true) to calculate the median absolute deviation of the dataset.
  3. Set an outlier threshold, usually a multiple of the MAD.
  4. Detect and mark the outliers for further analysis.

The following example demonstrates how to use the median absolute deviation to identify outliers, the first step of a winsorization procedure (a sketch of the clipping step follows the example):

/* Identifying outliers based on MAD
   Input : original data
   Output : indexes of outliers
 */
def winsorized(X){
    // calculate the median and the median absolute deviation
    medianX = med(X)
    madX = mad(X, true)
    // 1.4826 * MAD is a consistent estimator of the standard deviation under normality
    sigma = 1.4826*madX
    // calculate the lower and upper limits
    lowerLimit = medianX - 3*sigma
    upperLimit = medianX + 3*sigma
    // return the indexes of the values outside the limits
    indexes = 0..(X.size()-1)
    return indexes.at(X < lowerLimit || X > upperLimit)
}
X = [-5.8, 2, 3, 4, 5, 6, 7, 8, 9, 18]
winsorized(X)
// output: [0, 9]
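
To actually winsorize rather than just flag the outliers, the values outside the limits can be clipped to the nearest limit. Below is a minimal sketch with the same constants; the helper winsorizeByMAD is a hypothetical user-defined function, and with this input the extremes -5.8 and 18 are pulled back to roughly -5.62 and 16.62.

def winsorizeByMAD(X){
    // same limits as above: median ± 3 * 1.4826 * MAD
    medianX = med(X)
    sigma = 1.4826 * mad(X, true)
    lowerLimit = medianX - 3*sigma
    upperLimit = medianX + 3*sigma
    // clip values outside [lowerLimit, upperLimit] to the nearest limit
    return iif(X < lowerLimit, lowerLimit, iif(X > upperLimit, upperLimit, X))
}
X = [-5.8, 2, 3, 4, 5, 6, 7, 8, 9, 18]
winsorizeByMAD(X)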

1.1.3 Skewness

Skewness measures the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be:

  • Positive (right-skewed): The tail is more pronounced on the right side than on the left.
  • Negative (left-skewed): The tail is more pronounced on the left side than on the right.
  • Zero: Values are relatively evenly distributed around the mean.

Skewness can be computed using the skew function when the sample size n ≥ 3. Let x̄ be the sample mean, s the sample standard deviation (computed with the biased, population formula), and m3 the third central moment of the sample. The formula for sample skewness is:

  skew = m3 / s³

Note that the skewness computed in Python and Excel differs from the above. These tools calculate the Fisher-Pearson standardized moment coefficient G1, which adjusts the sample skewness for bias:

  G1 = √(n(n−1)) / (n − 2) · skew

To compute the Fisher-Pearson standardized moment G1 in DolphinDB, you can use the following script. The results show that the calculations from DolphinDB and Python pandas are consistent up to 8 decimal places.

import pandas as pd
import numpy as np
import dolphindb as ddb

# Set random seed to ensure reproducibility
np.random.seed(42)
data = np.random.normal(0, 1, 1000)

# Calculate the Fisher-Pearson standardized moment G1 using Python pandas
result1 = pd.Series(data).skew()
print(f"Skewness value using Python pandas: {result1:.8f}")

# Calculate the Fisher-Pearson standardized moment G1 using DolphinDB
session = ddb.session()
session.connect("localhost", 7271)
session.upload({"n": len(data), "x": data})
result2 = session.run(r"sqrt(n*(n-1))\(n-2)*skew(x)")
print(f"Skewness value using DolphinDB: {result2:.8f}")

# Data validation
abs_diff = abs(result1 - result2)
if abs_diff < 1e-8:
    print("Skewness calculation results are consistent up to 8 decimal places.")
else:
    print(f"Skewness calculation results differ by: {abs_diff:.8f}")

session.close()

/* output
Skewness value using Python pandas: 0.11697637
Skewness value using DolphinDB: 0.11697637
Skewness calculation results are consistent up to 8 decimal places.
*/

1.1.4 Kurtosis

Kurtosis is a descriptive statistic that measures how data are distributed between the center and the tails of a distribution. A higher kurtosis indicates that the variance is driven mainly by infrequent extreme values (far above or below the mean).

In DolphinDB, kurtosis is calculated as an estimate of the standardized fourth central moment (not excess kurtosis). Subtracting 3 from the result yields the excess kurtosis, which aligns with Python's kurtosis calculation.

  • By default, biased=true is used to calculate the sample kurtosis, which is a biased estimate. Let the sample mean be denoted by x̄, the standard deviation (computed with the biased, population formula) by s, and the fourth central moment by m4. The formula for sample kurtosis is:

    kurtosis = m4 / s⁴

  • If biased=false is specified, the kurtosis is calculated with a bias correction, which gives an unbiased estimate for normally distributed data.

Since the kurt function in Python pandas calculates the unbiased estimator of the fourth cumulant (i.e., excess kurtosis), the results of DolphinDB's kurtosis function with biased=false match those of pandas after subtracting 3, as shown in the following script.

import pandas as pd
import numpy as np
import dolphindb as ddb

# Set random seed to ensure reproducibility
np.random.seed(42)
data = np.random.normal(0, 1, 1000)

# Calculate excess kurtosis using Python pandas
result1 = pd.Series(data).kurt()
print(f"Kurtosis (Pandas): {result1:.8f}")

# Calculate kurtosis using DolphinDB (biased=false) and subtract 3 to get excess kurtosis
session = ddb.session()
session.connect("localhost", 7271)
result2 = session.run("kurtosis{,false}", data) - 3
print(f"Kurtosis (DolphinDB): {result2:.8f}")

# Data validation
abs_diff = abs(result1 - result2)
if abs_diff < 1e-8:
    print("Kurtosis calculation results are consistent up to 8 decimal places.")
else:
    print(f"Kurtosis calculation results differ by: {abs_diff:.8f}")

session.close()

/* output
Kurtosis (Pandas): 0.07256220
Kurtosis (DolphinDB): 0.07256220
Kurtosis calculation results are consistent up to 8 decimal places.
*/

1.2 Multivariate Statistics

Multivariate statistics refers to the analysis of relationships between multiple variables, including the calculation of statistical features such as covariance and correlation coefficients. These measures help users understand the linear relationships and the degree of mutual influence between variables. DolphinDB provides a variety of functions to support the statistical analysis of multivariate features, which are listed below. This section will introduce the implementation and usage of these functions in detail.

Statistical Feature | Function
Covariance | covar
Covariance Matrix | covarMatrix
Pearson Correlation Coefficient | corr
Pearson Correlation Matrix | corrMatrix
Spearman's Rank Correlation Coefficient | spearmanr
Kendall Correlation Coefficient | kendall
Weighted Covariance | wcovar
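
As a quick sketch of how the pairwise measures are called, the following applies them to two short return series (the matrix-based variants are covered in the next subsections):

x = [0.01, 0.02, -0.03, 0.01, 0.02]
y = [-0.02, 0.01, 0.03, -0.02, -0.01]
covar(x, y)       // sample covariance
corr(x, y)        // Pearson correlation coefficient
spearmanr(x, y)   // Spearman's rank correlation coefficient
kendall(x, y)     // Kendall correlation coefficient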

1.2.1 Covariance and Covariance Matrix

Covariance measures the joint variability of two random variables. A covariance matrix is a square matrix giving the covariance between each pair of variables across all dimensions. In DolphinDB, the covar function calculates the sample covariance between two vectors using the following formula:

  cov(X, Y) = Σ (xi − x̄)(yi − ȳ) / (n − 1)

For multi-dimensional data, DolphinDB provides the covarMatrix function to compute the covariance matrix, whose (i, j) entry is the sample covariance between the i-th and j-th columns:

  Σij = cov(Xi, Xj)

The following example demonstrates how to construct a risk matrix in DolphinDB:

returns_a = [0.01, 0.02, -0.03, 0.01, 0.02]
returns_b = [-0.02, 0.01, 0.03, -0.02, -0.01]
returns_c = [0.03, -0.01, -0.02, 0.02, 0.01]
m = matrix(returns_a, returns_b, returns_c).rename!(`a`b`c)
covariance_matrix = covarMatrix(m)
/* output:
	a	b	c
a	0.0004300000	-0.0003100000	0.0002300000
b	-0.0003100000	0.0004700000	-0.0004350000
c	0.0002300000	-0.0004350000	0.0004300000
*/

1.2.2 Pearson Correlation Coefficient and Correlation Matrix

The Pearson correlation coefficient measures the linear correlation between two variables. In DolphinDB, it can be calculated using the corr function with the formula:

  r = cov(X, Y) / (σX · σY)

For multi-dimensional data, DolphinDB provides the corrMatrix function to compute the correlation matrix, whose (i, j) entry is the Pearson correlation between the i-th and j-th columns:

  Rij = corr(Xi, Xj)

The Pearson correlation coefficient between a market index and individual stock returns shows the correlation between the stock and the overall market. For example:

// Market index data
market_index = [100, 110, 120, 130, 140]
// Individual stock return data
stock_returns = [0.01, 0.02, 0.03, -0.01, 0.02]
corr(market_index, stock_returns)
// Output: -0.1042572070

In portfolio optimization, the correlation matrix helps reveal the relationships between different assets, allowing investors to select assets with low correlations for better diversification. For example, to find the assets least correlated with the others:

setRandomSeed(42)
m = rand(10.0, 100)$10:10
c= corrMatrix(m)
result = table(1..10 as stock_code, 
               c[c<0.5].abs().sum() as corr_sum).sortBy!(`corr_sum)
result
/* Output
	stock_code	corr_sum
0	8	1.6943835693
1	2	1.9124975741
2	6	1.9699810000
3	10	1.9706433980
4	4	1.9955912515
5	1	2.0791768735
6	9	2.2377748096
7	5	2.2757034717
8	7	2.8368851661
9	3	3.0716024528
*/

1.2.3 Spearman’s Rank Correlation Coefficient

Spearman's rank correlation coefficient is a non-parametric measure of rank correlation; it is essentially the Pearson correlation coefficient applied to the ranks of the variables. In DolphinDB, it can be computed using the spearmanr function. When there are no ties, it can be written as:

  ρ = 1 − 6 Σ di² / (n(n² − 1)),  where di is the difference between the ranks of xi and yi

In quantitative trading, the Information Coefficient (IC) measures the correlation between predicted and actual future returns and is often used to evaluate the forecasting and stock-picking ability of a factor or model. The spearmanr function is commonly used to calculate Rank IC values; its result is consistent with the Pearson correlation of the ranks computed with corr:

setRandomSeed(123)
predicted_returns = rand(1.0, 100)
actual_returns = rand(1.0, 100)
r1 = spearmanr(predicted_returns, actual_returns)
r2 = corr(rank(predicted_returns) + 1, rank(actual_returns) + 1)
eqObj(r1, r2)
// output: true

2. Probability Distributions

DolphinDB supports an extensive range of discrete and continuous probability distributions, offering functions for probability density, cumulative distribution, quantiles (inverse CDF), and random number generation. This section highlights discrete distributions, including the Poisson and Binomial distributions, as well as continuous distributions such as the Normal, t, F, and Chi-square distributions.

Distribution | Cumulative Distribution Function (CDF) | Inverse CDF | Random Number Generator
Normal Distribution | cdfNormal | invNormal | randNormal
Binomial Distribution | cdfBinomial | invBinomial | randBinomial
Poisson Distribution | cdfPoisson | invPoisson | randPoisson
Zipf Distribution | cdfZipf | |
t Distribution | cdfStudent | invStudent | randStudent
Chi-square Distribution | cdfChiSquare | invChiSquare | randChiSquare
F Distribution | cdfF | invF | randF
Exponential Distribution | cdfExp | invExp | randExp
Gamma Distribution | cdfGamma | invGamma | randGamma
Logistic Distribution | cdfLogistic | invLogistic | randLogistic
Uniform Distribution | cdfUniform | invUniform | randUniform
Weibull Distribution | cdfWeibull | invWeibull | randWeibull
Kolmogorov Distribution | cdfKolmogorov | |
Beta Distribution | cdfBeta | invBeta | randBeta
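
Most distributions follow a consistent naming pattern: cdf* returns the cumulative probability, inv* returns the quantile (inverse CDF), and rand* draws samples. A brief sketch with the standard normal distribution:

cdfNormal(0, 1, 1.96)     // P(X <= 1.96) for X ~ N(0, 1), roughly 0.975
invNormal(0, 1, 0.975)    // the 97.5% quantile of N(0, 1), roughly 1.96
randNormal(0.0, 1.0, 5)   // five random draws from N(0, 1)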

2.1 Discrete Distributions

2.1.1 Bernoulli Distribution (0-1 Distribution)

The Bernoulli distribution (0-1 distribution) is a basic univariate distribution and the building block of the Binomial distribution. Suppose an investment strategy generates a daily return that is a binary random variable: either a positive return (1) or a negative return (0), with no break-even case. This corresponds to the following description:

  P(X = 1) = pwin,  P(X = 0) = 1 − pwin

The probability of positive returns pwin determines all the characteristics of the Bernoulli distribution. For example, if pwin=0.5, the probability mass function (PMF) of the 0-1 distribution can be calculated with a user-defined pmfBernoulli function based on DolphinDB's cdfBinomial.

def pmfBernoulli(p){
   cdf = cdfBinomial(1, p, [0,1])
   pmf = table([0,1] as X, cdf - prev(cdf).nullFill(0) as P).sortBy!(`X)
   return pmf
}
p = 0.5
pmfBernoulli(p)
/* Output 
	X	P
0	0	0.50
1	1	0.50
*/

In DolphinDB, once the probability pwin of the 0-1 event and the number of trials N are defined, Bernoulli trials can be simulated with a user-defined randBernoulli function that wraps randBinomial. For instance, assume each positive return corresponds to a logarithmic return of 10% and each negative return to -10%. After running N trials of the strategy, the total return is obtained by summing the logarithmic returns. The following example shows a return of about 22% after 10 investment periods.

setRandomSeed(40393)
def randBernoulli(p, n){
   return randBinomial(1, p, n)
}
x = randBernoulli(0.5, 10)
// Suppose each period's logarithmic return is fixed at 0.1
log_return = 0.1
exp(sum(log_return * iif(x == 0, -1, x))) - 1
// Output: 0.2214

2.1.2 Binomial Distribution

Evaluating the probability and frequency of positive returns can be modeled with the Binomial distribution. A random variable X that follows a Binomial distribution with parameters p (probability of success in a 0-1 event) and n (number of trials) is denoted X ~ B(n, p). The probability mass function (PMF) for X = k is:

  P(X = k) = C(n, k) · p^k · (1 − p)^(n−k),  k = 0, 1, …, n

In DolphinDB, a user-defined pmfBinomial function (again built on cdfBinomial) can compute the PMF of a Binomial distribution. For example, the following result shows that in 10 trials, the probability of achieving exactly 5 positive returns is approximately 24.61%.

def pmfBinomial(trials, p){
    X = 0..trials
    cdf = cdfBinomial(trials, p, X)
    return table(X, cdf - prev(cdf).nullFill(0) as P).sortBy!(`X)
}
pmfBinomial(10, 0.5)

/* output:
X  P                
-- -----------------
0  0.0009765625     
1  0.009765625      
2  0.0439453125     
3  0.1171875        
4  0.205078125      
5  0.246093749999999
6  0.205078125      
7  0.1171875        
8  0.0439453125     
9  0.009765625      
10 0.0009765625   
*/

Using the pmfBinomial function, we can visualize the Binomial distribution under varying probabilities p and trial numbers n. Note that only integer values of X are valid.

plot([pmfBinomial(20,0.5)["P"] <- take(double(),20) as "p=0.5,n=20",
      pmfBinomial(20,0.7)["P"] <- take(double(),20) as "p=0.7,n=20",
      pmfBinomial(40,0.5)["P"] as "p=0.5,n=40"],0..40,"Binomial Distributions")
Figure 1. Binomial Distribution

2.1.3 Poisson Distribution

The Poisson distribution is a statistical model used to describe the occurrence of events over a continuous interval of space or time. It is closely related to the Binomial distribution but differs in focus: while the Binomial distribution models the number of successes in a fixed number of trials, the Poisson distribution models the number of discrete events occurring within a continuous interval. Its probability mass function (PMF) is:

  P(X = k) = λ^k · e^(−λ) / k!,  k = 0, 1, 2, …

In DolphinDB, a user-defined pmfPoisson function can compute the PMF values of a Poisson distribution. Note that only integer values of X are valid; the lines connecting the discrete points in the plot are only a visual aid.

def pmfPoisson(mean, upper=20){
   x = 0..upper
   cdf = cdfPoisson(mean, x)
   pmf = table(x as X, cdf - prev(cdf).nullFill(0) as P).sortBy!(`X)
   return pmf
}
plot([pmfPoisson(1)["P"] as "λ=1",
      pmfPoisson(4)["P"] as "λ=4", 
      pmfPoisson(10)["P"] as "λ=10"],
      0..20,"Poisson Distributions")
Figure 2. Poisson Distributions

2.2 Normal Distribution

The normal (Gaussian) distribution is a probability distribution that is symmetric about its mean, with data near the mean occurring more frequently than data far from it. By the Central Limit Theorem, the mean of a sufficiently large sample drawn from almost any distribution is approximately normally distributed. Mathematically, a normal distribution is characterized by its mean μ and standard deviation σ.

In DolphinDB, the norm function generates data that follows X ~ N(μ, σ), returning a vector or matrix of the specified dimensions. For example:

setRandomSeed(123)
X = norm(2.0, 0.1, 1000);
print mean X // output:  2.003
print std X // output:  0.102

Using the pdfNormal and cdfNormal functions, we can plot the Probability Density Function (PDF) and Cumulative Distribution Function (CDF) for various values of μ and σ:

x = 0.1*(-100..100)
plot([pdfNormal(0,1.0,x) as "μ=0,σ=1.0", pdfNormal(0,0.4,x) as "μ=0,σ=0.4",
      pdfNormal(0,2.2,x) as "μ=0,σ=2.2", pdfNormal(-2,0.7,x) as "μ=-2,σ=0.7"],
      x, "Normal Distributions with Different Parameters μ and σ")
plot([cdfNormal(0,1.0,x) as "μ=0,σ=1.0", cdfNormal(0,0.4,x) as "μ=0,σ=0.4", 
      cdfNormal(0,2.2,x) as "μ=0,σ=2.2", cdfNormal(-2,0.7,x) as "μ=-2,σ=0.7"],
      x, "Normal Distributions with Different Parameters μ and σ")      
Figure 3. PDF of Normal Distributions
Figure 4. CDF of Normal Distributions

For smaller sample sizes, the variability across samples is larger. For example, when 100 points are drawn repeatedly from N(0, 1), the sample means tend toward a normal distribution, but an individual sample may deviate noticeably from it.

The randNormal function (or, equivalently, norm) generates such samples. The example below draws one group of 100 sample points; running it 20 times produces the 20 groups shown in the figure:

// run 20 times
plot(randNormal(0.0, 1.0, 100), , "hist", HISTOGRAM)
plot(norm(0.0, 1.0, 100), , "hist", HISTOGRAM)
Twenty Distributions of 100 Sample Points Generated from the Standard Normal Distribution

Central Limit Theorem

The Central Limit Theorem (CLT) states that the sum (or difference) of multiple independent random variables X1, X2, …, Xn ~ N(μ, σ) that follow the same normal distribution also follows a normal distribution, i.e.,

  X1 + X2 + … + Xn ~ N(nμ, nσ²)

Additionally, for independent random variables {Xi}, even if the data do not follow a normal distribution, their sum approaches a normal distribution N(nμ, nσ²) when the number of variables is large enough, as the sketch below illustrates.
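
A quick sketch of this effect: each Uniform(-1, 1) sample has mean 0 and variance 1/3, so the sum of 12 such samples should have a mean close to 0, a variance close to 4, and a roughly normal histogram.

setRandomSeed(123)
n = 12
total = 0.0
for (i in 0 : n) {
    // add one more independent Uniform(-1, 1) sample to each of the 100,000 sums
    total = total + randUniform(-1.0, 1.0, 100000)
}
mean(total)    // close to n * μ = 0
var(total)     // close to n * σ² = 12 * (1/3) = 4
plot(total, , "hist", HISTOGRAM)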

In DolphinDB, averaging a growing number of uniformly distributed samples shows that, as the number of averaged samples increases, the distribution of the means converges to a smooth, nearly normal shape.

def meanUniform(left, right, n, samples){
   setRandomSeed(123)
   result = []
   for (i in 0 : samples){
      result.append!(randUniform(left, right, n))
   }
   return matrix(result).transpose().mean()
}
x = randUniform(-1.0, 1.0, 1000000)
plot(x, , "hist", HISTOGRAM)
x = meanUniform(-1.0, 1.0, 1000000, 2)
plot(x, , "hist", HISTOGRAM)
x = meanUniform(-1.0,1.0,1000000, 10)
plot(x, , "hist", HISTOGRAM)
Figure 5. Random Data
Figure 6. Averaging Two Samples
Figure 7. Averaging Ten Samples

Application: Log-Normal Distribution

The log-normal distribution is widely used in finance to model the behavior of asset prices and returns. Empirical evidence suggests that long-term log-returns (e.g., annual or monthly returns) of stocks approximate a normal distribution. In DolphinDB, log-returns can be simulated using the norm function.

logreturns = norm(0, 1, 1000).sort()
returns = exp(logreturns).sub(1).sort()
t = select * from table(logreturns, returns) where returns<0.5 and returns>-0.5
plot([t.logreturns, t.returns], t.returns)
Figure 8. Regular Returns vs. Log-Returns

The above comparison between log-returns and regular returns reveals that within the range of -10% to +10%, both distributions largely overlap. However, as the absolute value of the return increases, the regular returns begin to stabilize, while the log-returns continue to maintain their normal distribution characteristics.

Log-returns are additive across time periods: the overall log-return over T periods is the sum of the individual period log-returns (verified in the sketch below). By the central limit theorem, assuming the log-returns of different periods are independent and identically distributed, the average per-period log-return gradually converges to its expectation as T increases.
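
The additivity itself is easy to verify: summing the per-period log-returns gives the same value as the log-return computed directly from the first and last prices (a small sketch with hypothetical prices):

prices = [100.0, 102.0, 99.0, 105.0, 110.0]
// per-period log-returns; the first element is NULL and is ignored by sum
logreturns = log(prices \ prev(prices))
sum(logreturns)                       // total log-return over all periods
log(last(prices) \ first(prices))     // same value, log(1.1) ≈ 0.0953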

When plotting the log-return curve over T periods, the log-returns gradually converge as T increases, but the value they converge to is not fixed. Therefore, when investing an initial capital X0 for the long term, the focus should be on the trend of long-term returns, since greater returns in the final period are what matter. To maximize returns, investors need to consider two factors: the investment horizon T should be long enough for the returns to converge, and the Kelly Criterion can be applied to maximize the expected value of the converged return.

dates = getMarketCalendar("CFFEX", 2010.01.01, 2023.12.31)
logreturns = norm(0, 1, dates.size()).sort()
plot(logreturns, dates)
Figure 9. Distribution of Log Returns

2.3 Continuous Distributions

2.3.1 t-Distribution

The t-distribution is widely used in statistics, particularly for inference about the mean when the population variance is unknown. If x̄ is the sample mean and s is the sample standard deviation of a sample of size n drawn from a population with mean μ, the t statistic is:

  t = (x̄ − μ) / (s / √n),  with n − 1 degrees of freedom

A common application of the t-distribution is calculating the confidence interval for the mean. In DolphinDB, the critical value of the t statistic can be obtained with the invStudent function. For example, for a sample size of n=20 (degrees of freedom df=19) and a confidence level of 95%, the critical value t(α/2) is invStudent(19, 0.975). The script below does the same for a larger sample:

x = randNormal(-100, 1000, 10000)
n = x.size()
df = n-1
rt = invStudent(df, 0.975)
// output: 1.9602012636
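
Combining the critical value with the sample statistics gives the 95% confidence interval for the mean, x̄ ± t(α/2)·s/√n; a short sketch continuing the example above:

// 95% confidence interval for the mean of x
halfWidth = rt * x.std() \ sqrt(n)
lowerBound = x.avg() - halfWidth
upperBound = x.avg() + halfWidth
[lowerBound, upperBound]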

2.3.2 Chi-Square Distribution

If a random variable X ~ N(μ, σ), then (X − μ)/σ ~ N(0, 1). The sum of squares of n independent standard normal random variables follows a Chi-Square distribution with n degrees of freedom:

  Z1² + Z2² + … + Zn² ~ χ²(n)

The ratio of the sample variance S² to the population variance σ², scaled by the degrees of freedom, also follows a Chi-Square distribution:

  (n − 1) S² / σ² ~ χ²(n − 1)

For example, consider stock returns with volatility restricted to σ² = 0.05. To test whether the variance exceeds the permissible value, the critical value of the Chi-Square statistic can be calculated with DolphinDB's invChiSquare function. The analysis shows that the Chi-Square statistic does not exceed the critical value at the 95% confidence level, so the null hypothesis σ² ≤ 0.05 cannot be rejected; the sample return volatility is preliminarily inferred to be within the specified range.

setRandomSeed(123)
x = randUniform(-0.1, 0.1, 100)
test = (x.size() - 1) * pow(x.std(), 2) \ 0.05
test > invChiSquare(x.size() - 1, 0.95)
// output: false

2.3.3 F-Distribution

The F-distribution is named after Ronald Fisher, who introduced it to determine critical values in ANOVA (Analysis of Variance). By calculating the ratio of the variances of two investment portfolios, the F-distribution helps assess whether the two portfolios exhibit the same level of risk:

  F = Sx² / Sy²

In this context, Sx² and Sy² are the sample variances of the stock returns of the two portfolios, and their ratio follows an F-distribution with (n1 − 1, n2 − 1) degrees of freedom. In ANOVA applications, the critical values of the F-distribution are determined by the degrees of freedom of the numerator and denominator and by the significance level, i.e., F(α/2; n1 − 1, n2 − 1) and F(1 − α/2; n1 − 1, n2 − 1).

In DolphinDB, the critical values of the F-distribution can be calculated with the invF function. Assuming the returns of the two portfolios come from different Gaussian distributions, with the second having a much higher standard deviation, the user-defined compareTwoSamples function below performs a two-sided F-test. The result indicates that the null hypothesis σ1² = σ2² can be rejected, which matches the expected outcome.

setRandomSeed(1)
x1 = randNormal(0, 1, 20)
x2 = randNormal(0, 5, 30)

def compareTwoSamples(x1, x2, alpha){
   // F statistic: ratio of the two sample variances
   test = pow(x1.std(), 2) \ pow(x2.std(), 2)
   // two-sided test: reject if the statistic falls outside the critical values
   return test > invF(x1.size()-1, x2.size()-1, 1-alpha\2) || 
          test < invF(x1.size()-1, x2.size()-1, alpha\2)
}

compareTwoSamples(x1, x2, 0.05)
// output: true

3. Random Number Generation

DolphinDB not only supports random number generation using the rand function but also allows generating random numbers from specific probability distributions. Below is a list of commonly used random number generator functions in DolphinDB. This section provides detailed instructions on generating random numbers from specific distributions, performing random sampling, and simulating the price movement of Snowball options in DolphinDB.

Function | Usage | Return Value
rand | rand(X, count) | Returns a vector of elements randomly selected from X.
randDiscrete | randDiscrete(v, p, count) | Generates random samples from vector v based on the given probability distribution p.
randUniform | randUniform(lower, upper, count) | Generates the specified number of uniformly distributed random numbers.
randNormal | randNormal(mean, stdev, count) | Generates the specified number of normally distributed random numbers.
randBeta | randBeta(alpha, beta, count) | Generates the specified number of random numbers following a Beta distribution.
randBinomial | randBinomial(trials, p, count) | Generates the specified number of random numbers following a Binomial distribution.
randChiSquare | randChiSquare(df, count) | Generates the specified number of random numbers following a Chi-square distribution.
randExp | randExp(mean, count) | Generates the specified number of random numbers following an Exponential distribution.
randF | randF(numeratorDF, denominatorDF, count) | Generates the specified number of random numbers following an F-distribution.
randGamma | randGamma(shape, scale, count) | Generates the specified number of random numbers following a Gamma distribution.
randLogistic | randLogistic(mean, s, count) | Generates the specified number of random numbers following a Logistic distribution.
randMultivariateNormal | randMultivariateNormal(mean, covar, count, [sampleAsRow=true]) | Generates random numbers following a multivariate normal distribution.
randPoisson | randPoisson(mean, count) | Generates the specified number of random numbers following a Poisson distribution.
randStudent | randStudent(df, count) | Generates the specified number of random numbers following a t-distribution.
randWeibull | randWeibull(alpha, beta, count) | Generates the specified number of random numbers following a Weibull distribution.

3.1 Generating Random Numbers

Generating Random Vectors/Matrices

In DolphinDB, random number vectors can be generated using the rand function. If X is specified as a scalar, the generated random numbers will be less than or equal to X and follow a uniform distribution.

rand(1.0, 10)

To generate a random matrix, specify count as a pair of numbers representing the dimensions. For example, the following script generates a 2x2 random matrix:

rand(1.0, 2:2)

Generating Random Numbers from Specific Distributions

DolphinDB supports generating random numbers following specific distributions by providing a suite of generators with the distribution name. For example, to generate random numbers that follow a normal distribution, you can use the randNormal function.

fmean = 0.0
fstd = 1.0
random_normal_vector = randNormal(fmean, fstd, 10)
random_normal_vector

3.2 Random Sampling

Sampling with Replacement

The rand function supports random sampling with replacement from the vector X. For example, drawing 5 samples from X:

setRandomSeed(123)
x = 1..10
sample1 = rand(x, 5)
sample1
// output: [7, 8, 3, 5, 3]

Sampling Without Replacement

The shuffle function allows random shuffling of vector X. By using shuffle, we can implement sampling without replacement. The following example uses a user-defined function to sample without replacement from X:

def sampleWithoutReplacement(x, n){
    // shuffle the vector and take the first n elements
    return x.shuffle()[0:n]
}
x = 1..10
sample2 = sampleWithoutReplacement(x, 5)
sample2

Probability Sampling

DolphinDB supports probability sampling by specifying the sampling probabilities p for each element in the population using the randDiscrete function. This allows generating random samples according to a specified probability distribution.

X = `A`B`C`E`F
p = [0.2, 0.3, 0.5, 0.2, 0.2]
randDiscrete(X, p, 5)
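
Drawing a larger sample and tabulating how often each element appears shows the draws following the specified probabilities; a quick sketch using weights that sum to 1:

setRandomSeed(123)
symbols = `A`B`C
probs = [0.2, 0.3, 0.5]
samples = randDiscrete(symbols, probs, 100000)
// count how often each symbol was drawn; the proportions should be close to probs
select count(*) as cnt from table(samples as sym) group by sym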

3.3 Monte Carlo Simulation: Asset Price Movements

The Monte Carlo simulation of asset price movements assumes that asset prices follow a log-normal distribution. Incorporating the differential equation of the Black-Scholes-Merton (BSM) model, the price path of the asset underlying a Snowball product can be simulated step by step as:

  S(t+Δt) = S(t) · exp((r − q − σ²/2)·Δt + σ·√Δt·ε)

Here Δt is the time increment, S(t) is the asset price at time t, r is the risk-free interest rate, q is the dividend yield, σ is the volatility, and ε is a random variable following a standard normal distribution.

In DolphinDB, Tdays*N random numbers following a standard normal distribution can be simulated with norm(0, 1, Tdays * N). Reshaping them into a Tdays x N matrix and accumulating each column with cumsum yields N simulated price paths; the first 128 paths are plotted below.

// Simulation times
N = 1000000
// Underlying asset parameters
vol = 0.13
S0  = 1
r   = 0.03
q   = 0.08
// Date parameters
T = 1 // Simulation horizon in years (assumed value)
days_in_year = 240 // Trading days in a year
dt    = 1\days_in_year
all_t = dt + dt * (0..(T * days_in_year - 1))
Tdays = int(T * days_in_year)
// Assume initial asset price is 1
St = 1
S = St * exp( (r - q - vol*vol/2) * repmat(matrix(all_t),1,N) + 
    vol * sqrt(dt) * cumsum(norm(0, 1, Tdays*N)$Tdays:N))
// Plot simulated asset price movements
plot(S[,:128].rename!(1..128), 1..(T * days_in_year), "Simulated Stock Price Movements")
Figure 10. Simulated Stock Price Movements

4. Regression

DolphinDB provides a wide range of regression functions to address various regression scenarios, ranging from simple methods like ols and wls to advanced techniques. Below is an overview of the regression functions and their corresponding use cases.

Function | Regression | Use Case
ols | Ordinary Least Squares | For standard linear regression problems.
olsEx | Ordinary Least Squares | Extends ols to support distributed computation on partitioned tables.
wls | Weighted Least Squares | For handling heteroscedasticity (non-constant error variance) in regression problems.
logisticRegression | Logistic Regression | For binary classification problems, predicting probabilities, and classification boundaries.
randomForestRegressor | Random Forest Regression | For high-dimensional data and complex relationships in regression problems.
adaBoostRegressor | Ensemble Learning Regression | For linear/non-linear relationships and handling outliers in regression tasks.
lasso | Lasso Regression | For datasets with many features and for performing feature selection to simplify models.
ridge | Ridge Regression | For dealing with multicollinearity among independent variables.
elasticNet | Elastic Net Regression | Combines Lasso and Ridge regression, effective for correlated predictors and noise reduction.
glm | Generalized Linear Model | Used when the dependent variable is discrete or follows a non-normal distribution.
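
As a minimal sketch of the most common case, the following script fits an ordinary least squares model to synthetic data with ols. With the default arguments the call should return the estimated coefficients (intercept first), which are expected to be close to the true values 1, 2, and -0.5 used to generate the data:

setRandomSeed(123)
n = 1000
x1 = randNormal(0.0, 1.0, n)
x2 = randNormal(0.0, 1.0, n)
// y = 1 + 2*x1 - 0.5*x2 + noise
y = 1.0 + 2.0*x1 - 0.5*x2 + randNormal(0.0, 0.1, n)
// fit y against the two regressors; X is passed as a matrix with one column per regressor
ols(y, matrix(x1, x2))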

5. Hypothesis Testing

In DolphinDB, the steps of hypothesis testing are as outlined below:

  1. State the null and alternative hypotheses.
  2. Identify an appropriate test statistic and its distribution and select the corresponding DolphinDB hypothesis testing function.
  3. Set the significance level α.
  4. Use inverse CDF to calculate critical values based on α and the distribution.
  5. Collect data and calculate the test statistic using DolphinDB’s hypothesis testing functions.
  6. Compare the test statistic with the critical values to determine whether to reject the null hypothesis and draw a conclusion.
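
A minimal end-to-end sketch of these steps on synthetic data is shown below; it uses the tTest function from the table that follows, compares the p-value with α (which is equivalent to comparing the statistic with the critical values), and assumes the result dictionary exposes the stat table shown in section 5.1.

setRandomSeed(123)
// Steps 1-2: H0 is that the two samples share the same mean; a two-sample t-test is used
x1 = randNormal(0.0, 1.0, 50)
x2 = randNormal(0.2, 1.0, 50)
// Step 3: significance level
alpha = 0.05
// Steps 4-5: run the test; the two-sided p-value is in the first row of the stat table
result = tTest(x1, x2, 0)
pValue = result[`stat].pValue[0]    // assumes the output structure shown in section 5.1
// Step 6: reject H0 if the p-value is below alpha
pValue < alpha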

DolphinDB provides a rich set of hypothesis testing functions to address various scenarios.

Function | Test | Type | Use Case
tTest | t-Test | Parametric Test | Compares whether the means of two samples differ significantly; suitable for normally distributed samples with unknown variance.
zTest | z-Test | Parametric Test | Compares whether the means of two samples differ significantly; suitable for normally distributed samples with known variance or large samples.
fTest | F-Test | Variance Analysis | Compares whether the variances of two or more samples differ significantly; suitable for normally distributed samples.
chiSquareTest | Chi-Square Test | Goodness-of-Fit or Independence Test | Examines differences between observed and expected frequencies; used for both goodness-of-fit and independence tests.
mannWhitneyUTest | Mann-Whitney U Test | Non-Parametric Test | Compares the medians of two independent samples; suitable for samples that do not satisfy the normality assumption.
shapiroTest | Shapiro-Wilk Test | Normality Test | Tests whether a sample follows a normal distribution; suitable for small sample sizes.
ksTest | Kolmogorov-Smirnov Test | Normality or Distribution Fit Test | Tests whether a sample follows a specified distribution or whether two samples come from the same distribution; suitable for large samples.

5.1 Hypothesis Testing for Mean

5.1.1 z-Test

Using the data from Chapter 4 as an example, the time series graph of stock returns is plotted below.

setRandomSeed(123)
facTable = getAllFactorTable(2, 240)
tmp = select return_day from facTable pivot by record_date, stock_code
plot(tmp[,1:], tmp["record_date"])
Figure 11. Time Series of Stock Returns

We observe that the returns for both stocks are nearly identical, suggesting that their mean returns may be the same. Next, we perform a hypothesis test to verify this assumption:

Assumptions: the return samples of both stocks follow normal distributions with known variances, and each stock has a standard deviation of 6. The null hypothesis is that the difference between the two mean returns is 0.

We use the zTest function to perform the z-test; the output is shown below. The p-value of the z-statistic is about 0.685 (> 0.05), so we fail to reject the null hypothesis and conclude that there is no significant difference between the mean returns of the two stocks.

zTest(tmp["600001"], tmp["600002"], 0, 6, 6)
/* output:
method: 'Two sample z-test'
zValue: 0.4050744166
confLevel: 0.9500000000
stat:
alternativeHypothesis	pValue	lowerBound	upperBound
0	difference of mean is not equal to 0	0.6854233525	-0.8516480908	1.2953848817
1	difference of mean is less than 0	0.6572883237	-∞	1.1227918307
2	difference of mean is greater than 0	0.3427116763	-0.6790550398	∞
*/

5.1.2 t-Test

In practice, since the volatility of stock returns is unknown and sample sizes are often small, the t-test should be used instead of the z-test. Using DolphinDB's tTest function, the output is shown below. The p-value of the t-statistic is 0.13 (> 0.05), so we fail to reject the null hypothesis and conclude that there is no significant difference between the mean returns of the two stocks.

tTest(tmp["600001"], tmp["600002"], 0)
/* output:
method: 'Welch two sample t-test'
tValue: -0.6868291027
df: 473.7980010389
confLevel: 0.9500000000
stat:
    alternativeHypothesis	pValue	lowerBound	upperBound
0	difference of mean is not equal to 0	0.1314	-2.0292	0.2651
1	difference of mean is less than 0	0.0657	-∞	0.0801
2	difference of mean is greater than 0	0.9342	-1.8441	∞
*/

5.2 Hypothesis Testing for Variance

Variance and standard deviation are essential metrics in finance for assessing risk. Hypothesis tests for variance utilize the chi-square and F-tests.

5.2.1 Chi-Square Test

Using the dataset mentioned above, we test whether the volatility of stock 600001's returns exceeds a specified level (a standard deviation of 6, i.e., a variance of 36). The null and alternative hypotheses are:

  H0: σ² ≤ 36,  H1: σ² > 36

The chi-square statistic is calculated as χ² = (n − 1)S²/σ0², and the critical value at the 95% confidence level is obtained with DolphinDB's invChiSquare function. The results show that the chi-square statistic is less than the critical value, so we fail to reject the null hypothesis and conclude that the volatility of stock 600001 does not exceed the specified level.

test_statistic = (size(tmp)-1) * tmp["600001"].var() / 36
print(stringFormat("chi-square test statistic: %.2F", test_statistic))
chi_critical = invChiSquare(size(tmp)-1, 0.95)
print(stringFormat("chi-square critical value: %.2F", chi_critical))
/* output:
chi-square test statistic: 233.59
chi-square critical value: 276.06
*/

5.2.2 F-Test

To compare whether the volatilities (variances) of the returns of the two stocks are consistent, an F-test can be used. The null and alternative hypotheses are:

  H0: σ1² = σ2²,  H1: σ1² ≠ σ2²

Using DolphinDB's fTest function, the output is shown below. The calculated F-statistic is 0.79, which lies between the lower and upper critical values of the F-distribution (and the two-sided p-value of 0.07 is above 0.05). Therefore, we fail to reject the null hypothesis and conclude that there is no significant difference between the variances of the returns of the two stocks.

fTest(tmp["600001"], tmp["600002"])
/*Output 
method: 'F test to compare two variances'
fValue: 0.7907099827
numeratorDf: 239
denominatorDf: 239
confLevel: 0.9500000000
stat:
	alternativeHypothesis	pValue	lowerBound	upperBound
0	ratio of variances is not equal to 1	0.0701139943	0.6132461206	1.0195291184
1	ratio of variances is less than 1	0.0350569972	0.0000000000	0.9786251182
2	ratio of variances is greater than 1	0.9649430028	0.6388782232	∞
*/