DataFrame

Constructor

DataFrame(data, index=None, lazy)

data can be a table (in-memory table or DFS table) or dictionary.
- If data is a table, the column values of the table correspond to the DataFrame's column data and the column names correspond to the DataFrame's column indices.
- If data is a dict, the values of the dict correspond to the DataFrame's column data and the keys correspond to the DataFrame's column indices.
index is the row index of the DataFrame. Multiple indices are currently not supported. Note that the function DataFrame.index can be used to access the index attribute of the DataFrame.
lazy (optional) is a Boolean value representing whether to create a lazy DataFrame. The default value is True. A lazy DataFrame is a view of the original object where calculations are not executedimmediately; a non-lazy DataFrame is a copy of the original object where calculations are executed immediately.
- The function is_lazy() can be used to check whether a DataFrame is lazy.
- Calculations involving a lazy DataFrame generate an intermediate object and do not directly produce execution results until the compute() function is called. Calling compute() triggers a full table scan, which can impact performance when being used within a loop. Therefore, it is recommended to call compute() outside the loop, assign the result to a variable, and use this variable within the loop.

Note:

data cannot be an empty table or dict.
If data is a partitioned table, lazy must be True. index must be a scalar of the name of a specified column and the values of the specified column serve as the row labels of the DataFrame, rather than data.
If data is a non-partitioned in-memory table, the default value of lazy is False. index can be unspecified or specified as a Python list or DolphinDB vector with a length the same as the number of rows of the table. If lazy is set to True, index must be a scalar of the name of a specified column and the values of the specified column serve as the row labels of the DataFrame, rather than data.
If data is a dict, the default value of lazy is False. index can be unspecified or specified as a Python list or DolphinDB vector with a length the same as the number of rows of the table.

Conversion

Currently, only the copy function is supported and the parameter deep is not supported.

Indexing / Iteration

DolphinDB pandas currently supports the following functions for indexing or iteration:


Function	Description	Compatibility Statement
DataFrame.at	Access a single value for a row/column pair by label.	When using `at[i, j]=val`, if the data types of val and the DataFrame differ, an attempt will be made to convert the DataFrame type. If the conversion fails, an error will be raised.
DataFrame.iat	Access a single value for a row/column pair by integer position.	Same as `DataFrame.at`
DataFrame.loc	Access a group of rows and columns by label(s) or a boolean array.	Unallowed inputs are: An alignable index. A `callable` function with one argument (the calling Series or DataFrame). Accessing the same row repetitively, such as `df.loc[["aa", "aa"]]`.
DataFrame.iloc	Access a group of rows and columns by integer position(s).	Same as `DataFrame.loc`
DataFrame.insert	Insert a column into a DataFrame at a specified location.	The allow_duplicates parameter is not supported; a lazy DataFrame does not support this method.
DataFrame.get()	Get the specified column(s) from a DataFrame.

Binary Operator Functions

DolphinDB pandas supports all binary operator functions of Python pandas.DataFrame.

Note:

The parameter level is currently not supported.
The parameter axis can only be specified as 0 (default value) or 1 and cannot be specified as a string ('index' or 'columns').
If the parameter other is a Series, the parameter fill_value cannot be specified.

When using the dot function, it is recommended to use a string for the column name of the DataFrame and the index of the other object, as shown in the following script:

import pandas as pd
// Create a DataFrame with a dict, whose keys are recommended to be strings
df = pd.DataFrame({"A1": [1,2,3],"A2": [11,12,13]},  ["a", "b", "c"])
// The index of the other object must be specified as strings
other = pd.DataFrame({"a": [1,2],"b": [11,12],"c": [1,2]},  ["A1", "A2"])
df.dot(other)

Computations / Descriptive Statistics

DolphinDB pandas supports the following functions related to computations / descriptive stats of Python pandas.DataFrame: abs, all, any, autocorr, between, corr, count, cov, diff, kurt, kurtosis, mad, max, mean, median, min, mode, nlargest, nsmallest, prod, sem, skew, std, sum, var, unique, nunique, is_unique, is_monotonic_increasing, is_monotonic_decreasing, value_counts, rank, round.

Some parameters are currently not supported by certain functions, as shown in the following table.


Function	Compatibility Statement
all	Parameters axis, bool_only, and skipna are not supported.
any	The parameter axis is not supported.
corr	The parameter method can only be 'pearson'.
corrwith	The parameter axis is not supported. The parameter method can only be 'pearson' and numeric_only can only be False.
cummax	No parameter is supported.
cummin	No parameter is supported.
cumprod	No parameter is supported.
cumsum	No parameter is supported.
describe	No parameter is supported.
diff	Parameters periods and axis are not supported.
kurt	The parameter axis is not supported and the parameter skipna can only be True.
kurtosis	The parameter axis is not supported and the parameter skipna can only be True.
max	The parameter skipna is not supported and the parameter axis can only be 0.
mean	Parameters skipna and axis are not supported.
median	Parameters skipna and axis are not supported.
min	The parameter skipna is not supported.
mode	The parameter axis can only be 0 and dropna can only be True.
prod	The parameter axis can only be 1 and parameters skipna and min_count are not supported.
product	Same as `prod`.
quantile	The parameter q can only be a scalar and parameters axis and method are not supported.
rank	The parameter axis can only be 0, method cannot be 'dense', and na_options cannot be 'bottom'. If the parameter pct is specified as True, the result of `DataFrame.rank` is the same as that of DolphinDB's built-in function, `rank`.
round	The parameter decimals cannot be negative.
sem	Parameters axis, skipna, and ddof are not supported.
skew	The parameter skipna can only be True.
sum	Parameters skipna and min_count are not supported. Applying the `sum` function on strings returns a NULL value.
std	Parameters skipna and ddof are not supported.
var	Parameters skipna and ddof are not supported.
value_counts	Parameters normalize and sort are not supported.
nunique	No parameter is supported.

Reindexing / Selection / Label Manipulation

DolphinDB pandas supports some functions related to reindexing / selection / label manipulation of Python pandas.DataFrame, as shown in the following table.


Function	Compatibility Statement
idxmin/idxmax	The parameter skipna is not supported. When axis = 0, the COMPLEX type is not supported; when axis = 1, data types COMPLEX, POINT, INT128, DECIMAL32, DECIMAL64, and DECIMAL128 are not supported.
reindex	Parameters label, columns, axis, copy, level, and tolerance are not supported.
reindex_like	Parameters copy and tolerance are not supported.
align	Only parameters other and join are supported.
drop	The parameter inplace is not supported.
sample	Parameters axis and ignore_index are not supported.
rename	Parameters copy, inplace, level, and error are not supported.
head / tail	The parameter n cannot be 0.

Missing Data Handling

DolphinDB pandas supports all functions related to missing data handling of Python pandas.DataFrame, as shown in the following table:


Function	Compatibility Statement
backfill	No parameter is supported.
bill	No parameter is supported.
dropna	The parameter thresh is not supported. In lazy mode, parameters inplace and ignore_index are not supported and axis cannot be specified as 1 or 'columns'.
ffill	No parameter is supported.
fillna	Parameters axis, downcast, inplace, and limit are not supported.
interpolate	Parameters axis and limitArea are not supported.
isna
isnull
notna
notnull
pad	No parameter is supported.
replace	Parameters inplace and limit are not supported. Nested dictionaries are not supported by parameters to_replace and value, such as {'a': {'b': None}}.

Exclusive Method of DolphinDB pandas

In addition to methods compatible with Python pandas DataFrame, DolphinDB pandas DataFrame provides a distinctive method — to_table. It is used to convert a DataFrame to an in-memory table of DolphinDB, as demonstrated in the following script:

import dolphindb as ddb
import pandas as pd
// Create an empty table t with two columns
t=table(pair(100,0), ['x','y'].toddb(), [ddb.INT, ddb.STRING].toddb())
// Use the to_table method to convert a DataFrame to a DolphinDB in-memory table and append the table to t
t.append!(df.to_table())
// Directly appending a DataFrame to a DolphinDB table is not supported
t.append!(df)   // Error: Only a table can append to another table.

Creating a DataFrame

Creating a DataFrame from a Dictionary

In lazy mode

import pandas as pd

df = pd.DataFrame({"A": [1,2,3], "B": ["a", "b", "C"]}, index=["A1", "A2", "A3"])
df

Output:

A B

A1 1 a

A2 2 b

A3 3 C

In non-lazy mode

df = pd.DataFrame({"A": [1,2,3], "B": ["a", "b", "C"]}, None, False)

Output:

A B

0 1 a

1 2 b

2 3 C

df = pd.DataFrame({"A": [1,2,3], "B": ["a", "b", "C"]}, [`x,`y,`z], False)

Output:

A B

x 1 a

y 2 b

z 3 C

Creating a DataFrame from an In-Memory Table

Create an in-memory table of DolphinDB, tableForDf.

timetag =[2018.01.02, 2018.01.03, 2018.01.04, 2018.01.08].toddb()
name = [`AAPL, `AAPL, `GS, `AAPL].toddb()
flag = [`A, `A, `B, `C].toddb()
p = [10, 20, 30, 40].toddb()
tableForDf = table(p as price, flag as bsFlag, name, timetag)

In lazy mode

When lazy = True, index can only be the name of a specified column in tableForDf.
```
df = pd.DataFrame(tableForDf, 'timetag', True)
df
```
Output:

timetag price bsFlag name

2018.01.02 10 A AAPL

2018.01.03 20 A AAPL

2018.01.04 30 B GS

2018.01.08 40 C AAPL
In non-lazy mode

When lazy = False, index cannot be the name of a specified column in tableForDf.
```
level1 = ['0001', '0002', '0002', '0001']
df = pd.DataFrame(tableForDf, level1, False)
df
```
Output:

price bsFlag name timetag

0001 10 A AAPL 2018.01.02

0002 20 A AAPL 2018.01.03

0002 30 B GS 2018.01.04

0001 40 C AAPL 2018.01.08

Creating a DataFrame from a DFS Table

Only lazy DataFrame can be created from a DFS table.

n=1000
month=take(seq(2000.01M, 2016.12M), n);
x=rand(1.0, n);
t=table(month, x);
dbName="dfs://test_pandas"
if(exists(dbName)):
dropDatabase(dbName)
db=database(dbName, ddb.VALUE, partitionScheme=seq(2000.01M, 2016.12M))
pt=db.createPartitionedTable(t, `pt, `month).append!(t)

// index must be the name of a specified column in the table and lazy must be True
df=pd.DataFrame(pt,index="month", lazy=True)  

df=pd.DataFrame(pt,lazy=False)   // If lazy is set to False, an error occurs: cannot create non-lazy DataFrame with segmented table

Accessing a DataFrame

Create an in-memory table tableForDf.

timetag =[2018.01.02, 2018.01.03, 2018.01.04, 2018.01.08].toddb()
name = [`AAPL, `AAPL, `GS, `AAPL].toddb()
flag = [`A, `A, `B, `C].toddb()
p = [10, 20, 30, 40].toddb()
tableForDf = table(p as price, flag as bsFlag, name, timetag)

Accessing Data by Position

A lazy DataFrame is a view of the original object and does not support accessing data by position. Take a non-lazy DataFrame df_nonlazy as an example.
```
level = ['0001', '0002', '0002', '0001']
df_nonlazy=pd.DataFrame(tableForDf, level, False)
```
- Access an element
```
df_nonlazy.iat[0, 3]
```
  Output: 2018.01.02
- Access column(s)
```
df_nonlazy.iloc[:,1]
```
  Output:
  
  0001 A
  
  0002 A
  
  0002 B
  
  0001 C
  
  dtype: STRING
```
df_nonlazy.iloc[:,1:3] 
```
  Output:
  
  bsFlag name
  
  0001 A AAPL
  
  0002 A AAPL
  
  0002 B GS
  
  0001 C AAPL
```
df_nonlazy.iloc[:,2:]
```
  Output:
  
  name timetag
  
  0001 AAPL 2018.01.02
  
  0002 AAPL 2018.01.03
  
  0002 GS 2018.01.04
  
  0001 AAPL 2018.01.08
```
df_nonlazy.iloc[:,:3] # select col0~col3
```
  Output:
  
  price bsFlag name
  
  0001 10 A AAPL
  
  0002 20 A AAPL
  
  0002 30 B GS
  
  0001 40 C AAPL
- Access row(s)
```
df_nonlazy.iloc[1]
```
  Output:
  
  price 20
  
  bsFlag A
  
  name AAPL
  
  timetag 2018.01.03
  
  dtype: ANY
```
df_nonlazy.iloc[1:3] 
```
  Output:
  
  price bsFlag name timetag
  
  0002 20 A AAPL 2018.01.03
  
  0002 30 B GS 2018.01.04
- Access pair(s) of row(s) and column(s)
```
df_nonlazy.iloc[[0,3], [2]]
```
  Output:
  
  name
  
  0001 AAPL
  
  0001 AAPL
```
df_nonlazy.iloc[0:3, 0:1]
```
  Output:
  
  price
  
  0001 10
  
  0002 20
  
  0002 30
```
df_nonlazy.iat[3, 3]
```
  Output:m 2018.01.08
Accessing Data by Label
- Access data by column label(s)
```
df_nonlazy.price // Accessing data using the DataFrame.columns method is not supported
```
```
df_nonlazy['price']
```
  Output:
  
  0001 10
  
  0002 20
  
  0002 30
  
  0001 40
  
  dtype: INT
```
df_nonlazy[['timetag', 'price']]
```
  Output:
  
  timetag price
  
  0001 2018.01.02 10
  
  0002 2018.01.03 20
  
  0002 2018.01.04 30
  
  0001 2018.01.08 40
```
df_nonlazy.loc[:, ['timetag', 'price']]
```
  Output:
  
  timetag price
  
  0001 2018.01.02 10
  
  0002 2018.01.03 20
  
  0002 2018.01.04 30
  
  0001 2018.01.08 40
- Access data by row label(s)
```
df_nonlazy.loc[`0001]
```
  Output:
  
  price bsFlag name timetag
  
  0001 10 A AAPL 2018.01.02
  
  0001 40 C AAPL 2018.01.08
```
df_nonlazy.loc[[`0001,`0002]]
```
  Output:
  
  price bsFlag name timetag
  
  0001 10 A AAPL 2018.01.02
  
  0002 20 A AAPL 2018.01.03
  
  0002 30 B GS 2018.01.04
  
  0001 40 C AAPL 2018.01.08
- Access pair(s) of row(s) and column(s)
```
df_nonlazy['name'][`0001]
```
  Output:
  
  0001 AAPL
  
  0001 AAPL
  
  dtype: STRING
```
df_nonlazy.at[`0001, 'name']
```
  Output:
  
  0001 AAPL
  
  0001 AAPL
  
  dtype: STRING
```
df_nonlazy.loc[[`0001, `0002], ['name']]
```
  Output:
  
  name
  
  0001 AAPL
  
  0002 AAPL
  
  0002 GS
  
  0001 AAPL
Accessing Data by Boolean Index
```
df_nonlazy.loc[[True,True,False, False]]
df_nonlazy.iloc[[True, True,False, False]]
df_nonlazy[[True, True,False, False]]
```
Output:

price bsFlag name timetag

0001 10 A AAPL 2018.01.02

0002 20 A AAPL 2018.01.03
Accessing Data by Condition
```
df_nonlazy[df_nonlazy["price"]>20]
```
Output:

price bsFlag name timetag

0002 30 B GS 2018.01.04

0001 40 C AAPL 2018.01.08

Operating a DataFrame

A lazy DataFrame does not support adding, changing, or removing data. Take a non-lazy DataFrame df_nonlazy as an example.

Adding data
- Add a column to a DataFrame using the insert method
```
df_nonlazy.insert(0, `tep, [33,33,37,37])
```
- Add a column to a DataFrame by directly assigning values to the added column (currently not supported)
```
row = {"price":30, "bsFlag":'B', "name":"GS", "timetag":2018.01.09}
df_nonlazy.iloc[1] = row
```

Updating data

// Create an example DataFrame
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

// Use methods df.at and df.iat to change the value of a single element
df.at[0, 'A'] = 10
df.iat[1, 1] = 50

// Use the method df.apply to apply a function to a column
df['A'] = df['A'].apply(lambda x: x * 2)

If a DataFrame is created from a table, the modification of its data through value assigning will also be applied to the original table.

// Create an example DataFrame from an in-memory tablee
t1=table([1,2,3].toddb() as value)
df=pd.DataFrame(t1,lazy=False)

// Use methods df.at and df.iat to change the value of a single element
df.at[0, 'value']=10
df.iat[1, 0] = 20
t1

Output:

value

10

20

3

Removing data
- Remove elements using the drop method
```
df_nonlazy.drop(['name'], axis=1)
```