DataFrame
As one of the most common data structures in DolphinDB pandas, DataFrame is a two-dimensional tabular data structure with labeled axes (rows and columns). DataFrame can be used to hold and process various data types, including numeric values, strings, and Boolean values.
Constructor
DataFrame(data, index=None, lazy)
-
data can be a table (in-memory table or DFS table) or dictionary.
-
If data is a table, the column values of the table correspond to the DataFrame's column data and the column names correspond to the DataFrame's column indices.
-
If data is a dict, the values of the dict correspond to the DataFrame's column data and the keys correspond to the DataFrame's column indices.
-
-
index is the row index of the DataFrame. Multiple indices are currently not supported. Note that the function
DataFrame.index
can be used to access the index attribute of the DataFrame. -
lazy (optional) is a Boolean value representing whether to create a lazy DataFrame. The default value is True. A lazy DataFrame is a view of the original object where calculations are not executedimmediately; a non-lazy DataFrame is a copy of the original object where calculations are executed immediately.
-
The function
is_lazy()
can be used to check whether a DataFrame is lazy. -
Calculations involving a lazy DataFrame generate an intermediate object and do not directly produce execution results until the
compute()
function is called. Callingcompute()
triggers a full table scan, which can impact performance when being used within a loop. Therefore, it is recommended to callcompute()
outside the loop, assign the result to a variable, and use this variable within the loop.
-
-
data cannot be an empty table or dict.
-
If data is a partitioned table, lazy must be True. index must be a scalar of the name of a specified column and the values of the specified column serve as the row labels of the DataFrame, rather than data.
-
If data is a non-partitioned in-memory table, the default value of lazy is False. index can be unspecified or specified as a Python list or DolphinDB vector with a length the same as the number of rows of the table. If lazy is set to True, index must be a scalar of the name of a specified column and the values of the specified column serve as the row labels of the DataFrame, rather than data.
-
If data is a dict, the default value of lazy is False. index can be unspecified or specified as a Python list or DolphinDB vector with a length the same as the number of rows of the table.
Conversion
Currently, only the copy function is supported and the parameter deep is not supported.
Indexing / Iteration
DolphinDB pandas currently supports the following functions for indexing or iteration:
Function |
Description |
Compatibility Statement |
---|---|---|
DataFrame.at | Access a single value for a row/column pair by label. | When using at[i, j]=val , if the
data types of val and the DataFrame differ, an attempt will be made
to convert the DataFrame type. If the conversion fails, an error
will be raised. |
DataFrame.iat | Access a single value for a row/column pair by integer position. | Same as DataFrame.at |
DataFrame.loc | Access a group of rows and columns by label(s) or a boolean array. |
Unallowed inputs are:
Accessing the same row repetitively, such as
|
DataFrame.iloc | Access a group of rows and columns by integer position(s). | Same as DataFrame.loc |
DataFrame.insert | Insert a column into a DataFrame at a specified location. | The allow_duplicates parameter is not supported; a lazy DataFrame does not support this method. |
DataFrame.get() | Get the specified column(s) from a DataFrame. |
Binary Operator Functions
DolphinDB pandas supports all binary operator functions of Python pandas.DataFrame.
Note:
-
The parameter level is currently not supported.
-
The parameter axis can only be specified as 0 (default value) or 1 and cannot be specified as a string ('index' or 'columns').
-
If the parameter other is a Series, the parameter fill_value cannot be specified.
-
When using the
dot
function, it is recommended to use a string for the column name of the DataFrame and the index of the other object, as shown in the following script:import pandas as pd // Create a DataFrame with a dict, whose keys are recommended to be strings df = pd.DataFrame({"A1": [1,2,3],"A2": [11,12,13]}, ["a", "b", "c"]) // The index of the other object must be specified as strings other = pd.DataFrame({"a": [1,2],"b": [11,12],"c": [1,2]}, ["A1", "A2"]) df.dot(other)
Computations / Descriptive Statistics
DolphinDB pandas supports the following functions related to computations / descriptive stats of Python
pandas.DataFrame: abs
, all
, any
,
autocorr
, between
, corr
,
count
, cov
, diff
,
kurt
, kurtosis
, mad
,
max
, mean
, median
,
min
, mode
, nlargest
,
nsmallest
, prod
, sem
,
skew
, std
, sum
,
var
, unique
, nunique
,
is_unique
, is_monotonic_increasing
,
is_monotonic_decreasing
, value_counts
,
rank
, round
.
Some parameters are currently not supported by certain functions, as shown in the following table.
Function |
Compatibility Statement |
---|---|
all | Parameters axis, bool_only, and skipna are not supported. |
any | The parameter axis is not supported. |
corr | The parameter method can only be 'pearson'. |
corrwith | The parameter axis is not supported. The parameter method can only be 'pearson' and numeric_only can only be False. |
cummax | No parameter is supported. |
cummin | No parameter is supported. |
cumprod | No parameter is supported. |
cumsum | No parameter is supported. |
describe | No parameter is supported. |
diff | Parameters periods and axis are not supported. |
kurt | The parameter axis is not supported and the parameter skipna can only be True. |
kurtosis | The parameter axis is not supported and the parameter skipna can only be True. |
max | The parameter skipna is not supported and the parameter axis can only be 0. |
mean | Parameters skipna and axis are not supported. |
median | Parameters skipna and axis are not supported. |
min | The parameter skipna is not supported. |
mode | The parameter axis can only be 0 and dropna can only be True. |
prod | The parameter axis can only be 1 and parameters skipna and min_count are not supported. |
product | Same as prod . |
quantile | The parameter q can only be a scalar and parameters axis and method are not supported. |
rank | The parameter axis can only be 0,
method cannot be 'dense', and na_options cannot be
'bottom'. If the parameter pct is specified as True, the
result of DataFrame.rank is the same as that of
DolphinDB's built-in function, rank . |
round | The parameter decimals cannot be negative. |
sem | Parameters axis, skipna, and ddof are not supported. |
skew | The parameter skipna can only be True. |
sum | Parameters skipna and min_count are
not supported. Applying the sum function on strings
returns a null value. |
std | Parameters skipna and ddof are not supported. |
var | Parameters skipna and ddof are not supported. |
value_counts | Parameters normalize and sort are not supported. |
nunique | No parameter is supported. |
Reindexing / Selection / Label Manipulation
DolphinDB pandas supports some functions related to reindexing / selection / label manipulation of Python pandas.DataFrame, as shown in the following table.
Function |
Compatibility Statement |
---|---|
idxmin/idxmax | The parameter skipna is not supported. When axis = 0, the COMPLEX type is not supported; when axis = 1, data types COMPLEX, POINT, INT128, DECIMAL32, DECIMAL64, and DECIMAL128 are not supported. |
reindex | Parameters label, columns, axis, copy, level, and tolerance are not supported. |
reindex_like | Parameters copy and tolerance are not supported. |
align | Only parameters other and join are supported. |
drop | The parameter inplace is not supported. |
sample | Parameters axis and ignore_index are not supported. |
rename | Parameters copy, inplace, level, and error are not supported. |
head / tail | The parameter n cannot be 0. |
Missing Data Handling
DolphinDB pandas supports all functions related to missing data handling of Python pandas.DataFrame, as shown in the following table:
Function |
Compatibility Statement |
---|---|
backfill | No parameter is supported. |
bill | No parameter is supported. |
dropna | The parameter thresh is not supported. In lazy mode, parameters inplace and ignore_index are not supported and axis cannot be specified as 1 or 'columns'. |
ffill | No parameter is supported. |
fillna | Parameters axis, downcast, inplace, and limit are not supported. |
interpolate | Parameters axis and limitArea are not supported. |
isna | |
isnull | |
notna | |
notnull | |
pad | No parameter is supported. |
replace | Parameters inplace and limit are not supported. Nested dictionaries are not supported by parameters to_replace and value, such as {'a': {'b': None}}. |
Exclusive Method of DolphinDB pandas
In addition to methods compatible with Python pandas DataFrame, DolphinDB pandas
DataFrame provides a distinctive method — to_table
. It is used to
convert a DataFrame to an in-memory table of DolphinDB, as demonstrated in the
following script:
import dolphindb as ddb
import pandas as pd
// Create an empty table t with two columns
t=table(pair(100,0), ['x','y'].toddb(), [ddb.INT, ddb.STRING].toddb())
// Use the to_table method to convert a DataFrame to a DolphinDB in-memory table and append the table to t
t.append!(df.to_table())
// Directly appending a DataFrame to a DolphinDB table is not supported
t.append!(df) // Error: Only a table can append to another table.
Creating a DataFrame
Creating a DataFrame from a Dictionary
-
In lazy mode
import pandas as pd df = pd.DataFrame({"A": [1,2,3], "B": ["a", "b", "C"]}, index=["A1", "A2", "A3"]) df
Output:
A B
A1 1 a
A2 2 b
A3 3 C
-
In non-lazy mode
df = pd.DataFrame({"A": [1,2,3], "B": ["a", "b", "C"]}, None, False)
Output:
A B
0 1 a
1 2 b
2 3 C
df = pd.DataFrame({"A": [1,2,3], "B": ["a", "b", "C"]}, [`x,`y,`z], False)
Output:
A B
x 1 a
y 2 b
z 3 C
Creating a DataFrame from an In-Memory Table
Create an in-memory table of DolphinDB, tableForDf.
timetag =[2018.01.02, 2018.01.03, 2018.01.04, 2018.01.08].toddb()
name = [`AAPL, `AAPL, `GS, `AAPL].toddb()
flag = [`A, `A, `B, `C].toddb()
p = [10, 20, 30, 40].toddb()
tableForDf = table(p as price, flag as bsFlag, name, timetag)
-
In lazy mode
When lazy = True, index can only be the name of a specified column in tableForDf.
df = pd.DataFrame(tableForDf, 'timetag', True) df
Output:
timetag price bsFlag name
2018.01.02 10 A AAPL
2018.01.03 20 A AAPL
2018.01.04 30 B GS
2018.01.08 40 C AAPL
-
In non-lazy mode
When lazy = False, index cannot be the name of a specified column in tableForDf.
level1 = ['0001', '0002', '0002', '0001'] df = pd.DataFrame(tableForDf, level1, False) df
Output:
price bsFlag name timetag
0001 10 A AAPL 2018.01.02
0002 20 A AAPL 2018.01.03
0002 30 B GS 2018.01.04
0001 40 C AAPL 2018.01.08
Creating a DataFrame from a DFS Table
Only lazy DataFrame can be created from a DFS table.
n=1000
month=take(seq(2000.01M, 2016.12M), n);
x=rand(1.0, n);
t=table(month, x);
dbName="dfs://test_pandas"
if(exists(dbName)):
dropDatabase(dbName)
db=database(dbName, ddb.VALUE, partitionScheme=seq(2000.01M, 2016.12M))
pt=db.createPartitionedTable(t, `pt, `month).append!(t)
// index must be the name of a specified column in the table and lazy must be True
df=pd.DataFrame(pt,index="month", lazy=True)
df=pd.DataFrame(pt,lazy=False) // If lazy is set to False, an error occurs: cannot create non-lazy DataFrame with segmented table
Accessing a DataFrame
Create an in-memory table tableForDf.
timetag =[2018.01.02, 2018.01.03, 2018.01.04, 2018.01.08].toddb()
name = [`AAPL, `AAPL, `GS, `AAPL].toddb()
flag = [`A, `A, `B, `C].toddb()
p = [10, 20, 30, 40].toddb()
tableForDf = table(p as price, flag as bsFlag, name, timetag)
-
Accessing Data by Position
A lazy DataFrame is a view of the original object and does not support accessing data by position. Take a non-lazy DataFrame df_nonlazy as an example.
level = ['0001', '0002', '0002', '0001'] df_nonlazy=pd.DataFrame(tableForDf, level, False)
-
Access an element
df_nonlazy.iat[0, 3]
Output: 2018.01.02
-
Access column(s)
df_nonlazy.iloc[:,1]
Output:
0001 A
0002 A
0002 B
0001 C
dtype: STRING
df_nonlazy.iloc[:,1:3]
Output:
bsFlag name
0001 A AAPL
0002 A AAPL
0002 B GS
0001 C AAPL
df_nonlazy.iloc[:,2:]
Output:
name timetag
0001 AAPL 2018.01.02
0002 AAPL 2018.01.03
0002 GS 2018.01.04
0001 AAPL 2018.01.08
df_nonlazy.iloc[:,:3] # select col0~col3
Output:
price bsFlag name
0001 10 A AAPL
0002 20 A AAPL
0002 30 B GS
0001 40 C AAPL
-
Access row(s)
df_nonlazy.iloc[1]
Output:
price 20
bsFlag A
name AAPL
timetag 2018.01.03
dtype: ANY
df_nonlazy.iloc[1:3]
Output:
price bsFlag name timetag
0002 20 A AAPL 2018.01.03
0002 30 B GS 2018.01.04
-
Access pair(s) of row(s) and column(s)
df_nonlazy.iloc[[0,3], [2]]
Output:
name
0001 AAPL
0001 AAPL
df_nonlazy.iloc[0:3, 0:1]
Output:
price
0001 10
0002 20
0002 30
df_nonlazy.iat[3, 3]
Output:m 2018.01.08
-
-
Accessing Data by Label
-
Access data by column label(s)
df_nonlazy.price // Accessing data using the DataFrame.columns method is not supported
df_nonlazy['price']
Output:
0001 10
0002 20
0002 30
0001 40
dtype: INT
df_nonlazy[['timetag', 'price']]
Output:
timetag price
0001 2018.01.02 10
0002 2018.01.03 20
0002 2018.01.04 30
0001 2018.01.08 40
df_nonlazy.loc[:, ['timetag', 'price']]
Output:
timetag price
0001 2018.01.02 10
0002 2018.01.03 20
0002 2018.01.04 30
0001 2018.01.08 40
-
Access data by row label(s)
df_nonlazy.loc[`0001]
Output:
price bsFlag name timetag
0001 10 A AAPL 2018.01.02
0001 40 C AAPL 2018.01.08
df_nonlazy.loc[[`0001,`0002]]
Output:
price bsFlag name timetag
0001 10 A AAPL 2018.01.02
0002 20 A AAPL 2018.01.03
0002 30 B GS 2018.01.04
0001 40 C AAPL 2018.01.08
-
Access pair(s) of row(s) and column(s)
df_nonlazy['name'][`0001]
Output:
0001 AAPL
0001 AAPL
dtype: STRING
df_nonlazy.at[`0001, 'name']
Output:
0001 AAPL
0001 AAPL
dtype: STRING
df_nonlazy.loc[[`0001, `0002], ['name']]
Output:
name
0001 AAPL
0002 AAPL
0002 GS
0001 AAPL
-
-
Accessing Data by Boolean Index
df_nonlazy.loc[[True,True,False, False]] df_nonlazy.iloc[[True, True,False, False]] df_nonlazy[[True, True,False, False]]
Output:
price bsFlag name timetag
0001 10 A AAPL 2018.01.02
0002 20 A AAPL 2018.01.03
-
Accessing Data by Condition
df_nonlazy[df_nonlazy["price"]>20]
Output:
price bsFlag name timetag
0002 30 B GS 2018.01.04
0001 40 C AAPL 2018.01.08
Operating a DataFrame
A lazy DataFrame does not support adding, changing, or removing data. Take a non-lazy DataFrame df_nonlazy as an example.
-
Adding data
-
Add a column to a DataFrame using the
insert
methoddf_nonlazy.insert(0, `tep, [33,33,37,37])
-
Add a column to a DataFrame by directly assigning values to the added column (currently not supported)
row = {"price":30, "bsFlag":'B', "name":"GS", "timetag":2018.01.09} df_nonlazy.iloc[1] = row
-
-
Updating data
// Create an example DataFrame data = {'A': [1, 2, 3], 'B': [4, 5, 6]} df = pd.DataFrame(data) // Use methods df.at and df.iat to change the value of a single element df.at[0, 'A'] = 10 df.iat[1, 1] = 50 // Use the method df.apply to apply a function to a column df['A'] = df['A'].apply(lambda x: x * 2)
If a DataFrame is created from a table, the modification of its data through value assigning will also be applied to the original table.
// Create an example DataFrame from an in-memory tablee t1=table([1,2,3].toddb() as value) df=pd.DataFrame(t1,lazy=False) // Use methods df.at and df.iat to change the value of a single element df.at[0, 'value']=10 df.iat[1, 0] = 20 t1
Output:
value
10
20
3
-
Removing data
-
Remove elements using the
drop
methoddf_nonlazy.drop(['name'], axis=1)
-