DataFrame

As one of the most common data structures in DolphinDB pandas, DataFrame is a two-dimensional tabular data structure with labeled axes (rows and columns). DataFrame can be used to hold and process various data types, including numeric values, strings, and Boolean values.

Constructor

DataFrame(data, index=None, lazy)
  • data can be a table (in-memory table or DFS table) or dictionary.

    • If data is a table, the column values of the table correspond to the DataFrame's column data and the column names correspond to the DataFrame's column indices.

    • If data is a dict, the values of the dict correspond to the DataFrame's column data and the keys correspond to the DataFrame's column indices.

  • index is the row index of the DataFrame. Multiple indices are currently not supported. Note that the function DataFrame.index can be used to access the index attribute of the DataFrame.

  • lazy (optional) is a Boolean value representing whether to create a lazy DataFrame. The default value is True. A lazy DataFrame is a view of the original object where calculations are not executedimmediately; a non-lazy DataFrame is a copy of the original object where calculations are executed immediately.

    • The function is_lazy() can be used to check whether a DataFrame is lazy.

    • Calculations involving a lazy DataFrame generate an intermediate object and do not directly produce execution results until the compute() function is called. Calling compute() triggers a full table scan, which can impact performance when being used within a loop. Therefore, it is recommended to call compute() outside the loop, assign the result to a variable, and use this variable within the loop.

Note:
  • data cannot be an empty table or dict.

  • If data is a partitioned table, lazy must be True. index must be a scalar of the name of a specified column and the values of the specified column serve as the row labels of the DataFrame, rather than data.

  • If data is a non-partitioned in-memory table, the default value of lazy is False. index can be unspecified or specified as a Python list or DolphinDB vector with a length the same as the number of rows of the table. If lazy is set to True, index must be a scalar of the name of a specified column and the values of the specified column serve as the row labels of the DataFrame, rather than data.

  • If data is a dict, the default value of lazy is False. index can be unspecified or specified as a Python list or DolphinDB vector with a length the same as the number of rows of the table.

Conversion

Currently, only the copy function is supported and the parameter deep is not supported.

Indexing / Iteration

DolphinDB pandas currently supports the following functions for indexing or iteration:

Function

Description

Compatibility Statement

DataFrame.at Access a single value for a row/column pair by label. When using at[i, j]=val, if the data types of val and the DataFrame differ, an attempt will be made to convert the DataFrame type. If the conversion fails, an error will be raised.
DataFrame.iat Access a single value for a row/column pair by integer position. Same as DataFrame.at
DataFrame.loc Access a group of rows and columns by label(s) or a boolean array.

Unallowed inputs are:

  • An alignable index.

  • A callable function with one argument (the calling Series or DataFrame).

Accessing the same row repetitively, such as df.loc[["aa", "aa"]].

DataFrame.iloc Access a group of rows and columns by integer position(s). Same as DataFrame.loc
DataFrame.insert Insert a column into a DataFrame at a specified location. The allow_duplicates parameter is not supported; a lazy DataFrame does not support this method.
DataFrame.get() Get the specified column(s) from a DataFrame.

Binary Operator Functions

DolphinDB pandas supports all binary operator functions of Python pandas.DataFrame.

Note:

  • The parameter level is currently not supported.

  • The parameter axis can only be specified as 0 (default value) or 1 and cannot be specified as a string ('index' or 'columns').

  • If the parameter other is a Series, the parameter fill_value cannot be specified.

  • When using the dot function, it is recommended to use a string for the column name of the DataFrame and the index of the other object, as shown in the following script:

    import pandas as pd
    // Create a DataFrame with a dict, whose keys are recommended to be strings
    df = pd.DataFrame({"A1": [1,2,3],"A2": [11,12,13]},  ["a", "b", "c"])
    // The index of the other object must be specified as strings
    other = pd.DataFrame({"a": [1,2],"b": [11,12],"c": [1,2]},  ["A1", "A2"])
    df.dot(other)

Computations / Descriptive Statistics

DolphinDB pandas supports the following functions related to computations / descriptive stats of Python pandas.DataFrame: abs, all, any, autocorr, between, corr, count, cov, diff, kurt, kurtosis, mad, max, mean, median, min, mode, nlargest, nsmallest, prod, sem, skew, std, sum, var, unique, nunique, is_unique, is_monotonic_increasing, is_monotonic_decreasing, value_counts, rank, round.

Some parameters are currently not supported by certain functions, as shown in the following table.

Function

Compatibility Statement

all Parameters axis, bool_only, and skipna are not supported.
any The parameter axis is not supported.
corr The parameter method can only be 'pearson'.
corrwith The parameter axis is not supported. The parameter method can only be 'pearson' and numeric_only can only be False.
cummax No parameter is supported.
cummin No parameter is supported.
cumprod No parameter is supported.
cumsum No parameter is supported.
describe No parameter is supported.
diff Parameters periods and axis are not supported.
kurt The parameter axis is not supported and the parameter skipna can only be True.
kurtosis The parameter axis is not supported and the parameter skipna can only be True.
max The parameter skipna is not supported and the parameter axis can only be 0.
mean Parameters skipna and axis are not supported.
median Parameters skipna and axis are not supported.
min The parameter skipna is not supported.
mode The parameter axis can only be 0 and dropna can only be True.
prod The parameter axis can only be 1 and parameters skipna and min_count are not supported.
product Same as prod.
quantile The parameter q can only be a scalar and parameters axis and method are not supported.
rank The parameter axis can only be 0, method cannot be 'dense', and na_options cannot be 'bottom'. If the parameter pct is specified as True, the result of DataFrame.rank is the same as that of DolphinDB's built-in function, rank.
round The parameter decimals cannot be negative.
sem Parameters axis, skipna, and ddof are not supported.
skew The parameter skipna can only be True.
sum Parameters skipna and min_count are not supported. Applying the sum function on strings returns a null value.
std Parameters skipna and ddof are not supported.
var Parameters skipna and ddof are not supported.
value_counts Parameters normalize and sort are not supported.
nunique No parameter is supported.

Reindexing / Selection / Label Manipulation

DolphinDB pandas supports some functions related to reindexing / selection / label manipulation of Python pandas.DataFrame, as shown in the following table.

Function

Compatibility Statement

idxmin/idxmax The parameter skipna is not supported. When axis = 0, the COMPLEX type is not supported; when axis = 1, data types COMPLEX, POINT, INT128, DECIMAL32, DECIMAL64, and DECIMAL128 are not supported.
reindex Parameters label, columns, axis, copy, level, and tolerance are not supported.
reindex_like Parameters copy and tolerance are not supported.
align Only parameters other and join are supported.
drop The parameter inplace is not supported.
sample Parameters axis and ignore_index are not supported.
rename Parameters copy, inplace, level, and error are not supported.
head / tail The parameter n cannot be 0.

Missing Data Handling

DolphinDB pandas supports all functions related to missing data handling of Python pandas.DataFrame, as shown in the following table:

Function

Compatibility Statement

backfill No parameter is supported.
bill No parameter is supported.
dropna The parameter thresh is not supported. In lazy mode, parameters inplace and ignore_index are not supported and axis cannot be specified as 1 or 'columns'.
ffill No parameter is supported.
fillna Parameters axis, downcast, inplace, and limit are not supported.
interpolate Parameters axis and limitArea are not supported.
isna
isnull
notna
notnull
pad No parameter is supported.
replace Parameters inplace and limit are not supported. Nested dictionaries are not supported by parameters to_replace and value, such as {'a': {'b': None}}.

Exclusive Method of DolphinDB pandas

In addition to methods compatible with Python pandas DataFrame, DolphinDB pandas DataFrame provides a distinctive method — to_table. It is used to convert a DataFrame to an in-memory table of DolphinDB, as demonstrated in the following script:

import dolphindb as ddb
import pandas as pd
// Create an empty table t with two columns
t=table(pair(100,0), ['x','y'].toddb(), [ddb.INT, ddb.STRING].toddb())
// Use the to_table method to convert a DataFrame to a DolphinDB in-memory table and append the table to t
t.append!(df.to_table())
// Directly appending a DataFrame to a DolphinDB table is not supported
t.append!(df)   // Error: Only a table can append to another table.

Creating a DataFrame

Creating a DataFrame from a Dictionary

  • In lazy mode

    import pandas as pd
    
    df = pd.DataFrame({"A": [1,2,3], "B": ["a", "b", "C"]}, index=["A1", "A2", "A3"])
    df

    Output:

    A B

    A1 1 a

    A2 2 b

    A3 3 C

  • In non-lazy mode

    df = pd.DataFrame({"A": [1,2,3], "B": ["a", "b", "C"]}, None, False)

    Output:

    A B

    0 1 a

    1 2 b

    2 3 C

    df = pd.DataFrame({"A": [1,2,3], "B": ["a", "b", "C"]}, [`x,`y,`z], False)

    Output:

    A B

    x 1 a

    y 2 b

    z 3 C

Creating a DataFrame from an In-Memory Table

Create an in-memory table of DolphinDB, tableForDf.

timetag =[2018.01.02, 2018.01.03, 2018.01.04, 2018.01.08].toddb()
name = [`AAPL, `AAPL, `GS, `AAPL].toddb()
flag = [`A, `A, `B, `C].toddb()
p = [10, 20, 30, 40].toddb()
tableForDf = table(p as price, flag as bsFlag, name, timetag)
  • In lazy mode

    When lazy = True, index can only be the name of a specified column in tableForDf.

    df = pd.DataFrame(tableForDf, 'timetag', True)
    df

    Output:

    timetag price bsFlag name

    2018.01.02 10 A AAPL

    2018.01.03 20 A AAPL

    2018.01.04 30 B GS

    2018.01.08 40 C AAPL

  • In non-lazy mode

    When lazy = False, index cannot be the name of a specified column in tableForDf.

    level1 = ['0001', '0002', '0002', '0001']
    df = pd.DataFrame(tableForDf, level1, False)
    df

    Output:

    price bsFlag name timetag

    0001 10 A AAPL 2018.01.02

    0002 20 A AAPL 2018.01.03

    0002 30 B GS 2018.01.04

    0001 40 C AAPL 2018.01.08

Creating a DataFrame from a DFS Table

Only lazy DataFrame can be created from a DFS table.

n=1000
month=take(seq(2000.01M, 2016.12M), n);
x=rand(1.0, n);
t=table(month, x);
dbName="dfs://test_pandas"
if(exists(dbName)):
dropDatabase(dbName)
db=database(dbName, ddb.VALUE, partitionScheme=seq(2000.01M, 2016.12M))
pt=db.createPartitionedTable(t, `pt, `month).append!(t)

// index must be the name of a specified column in the table and lazy must be True
df=pd.DataFrame(pt,index="month", lazy=True)  

df=pd.DataFrame(pt,lazy=False)   // If lazy is set to False, an error occurs: cannot create non-lazy DataFrame with segmented table

Accessing a DataFrame

Create an in-memory table tableForDf.

timetag =[2018.01.02, 2018.01.03, 2018.01.04, 2018.01.08].toddb()
name = [`AAPL, `AAPL, `GS, `AAPL].toddb()
flag = [`A, `A, `B, `C].toddb()
p = [10, 20, 30, 40].toddb()
tableForDf = table(p as price, flag as bsFlag, name, timetag)
  1. Accessing Data by Position

    A lazy DataFrame is a view of the original object and does not support accessing data by position. Take a non-lazy DataFrame df_nonlazy as an example.

    level = ['0001', '0002', '0002', '0001']
    df_nonlazy=pd.DataFrame(tableForDf, level, False)
    • Access an element

      df_nonlazy.iat[0, 3]

      Output: 2018.01.02

    • Access column(s)

      df_nonlazy.iloc[:,1]

      Output:

      0001 A

      0002 A

      0002 B

      0001 C

      dtype: STRING

      df_nonlazy.iloc[:,1:3] 

      Output:

      bsFlag name

      0001 A AAPL

      0002 A AAPL

      0002 B GS

      0001 C AAPL

      df_nonlazy.iloc[:,2:]

      Output:

      name timetag

      0001 AAPL 2018.01.02

      0002 AAPL 2018.01.03

      0002 GS 2018.01.04

      0001 AAPL 2018.01.08

      df_nonlazy.iloc[:,:3] # select col0~col3

      Output:

      price bsFlag name

      0001 10 A AAPL

      0002 20 A AAPL

      0002 30 B GS

      0001 40 C AAPL

    • Access row(s)

      df_nonlazy.iloc[1]

      Output:

      price 20

      bsFlag A

      name AAPL

      timetag 2018.01.03

      dtype: ANY

      df_nonlazy.iloc[1:3] 

      Output:

      price bsFlag name timetag

      0002 20 A AAPL 2018.01.03

      0002 30 B GS 2018.01.04

    • Access pair(s) of row(s) and column(s)

      df_nonlazy.iloc[[0,3], [2]]

      Output:

      name

      0001 AAPL

      0001 AAPL

      df_nonlazy.iloc[0:3, 0:1]

      Output:

      price

      0001 10

      0002 20

      0002 30

      df_nonlazy.iat[3, 3]

      Output:m 2018.01.08

  2. Accessing Data by Label

    • Access data by column label(s)

      df_nonlazy.price // Accessing data using the DataFrame.columns method is not supported
      df_nonlazy['price']

      Output:

      0001 10

      0002 20

      0002 30

      0001 40

      dtype: INT

      df_nonlazy[['timetag', 'price']]

      Output:

      timetag price

      0001 2018.01.02 10

      0002 2018.01.03 20

      0002 2018.01.04 30

      0001 2018.01.08 40

      df_nonlazy.loc[:, ['timetag', 'price']]

      Output:

      timetag price

      0001 2018.01.02 10

      0002 2018.01.03 20

      0002 2018.01.04 30

      0001 2018.01.08 40

    • Access data by row label(s)

      df_nonlazy.loc[`0001]

      Output:

      price bsFlag name timetag

      0001 10 A AAPL 2018.01.02

      0001 40 C AAPL 2018.01.08

      df_nonlazy.loc[[`0001,`0002]]

      Output:

      price bsFlag name timetag

      0001 10 A AAPL 2018.01.02

      0002 20 A AAPL 2018.01.03

      0002 30 B GS 2018.01.04

      0001 40 C AAPL 2018.01.08

    • Access pair(s) of row(s) and column(s)

      df_nonlazy['name'][`0001]

      Output:

      0001 AAPL

      0001 AAPL

      dtype: STRING

      df_nonlazy.at[`0001, 'name']

      Output:

      0001 AAPL

      0001 AAPL

      dtype: STRING

      df_nonlazy.loc[[`0001, `0002], ['name']]

      Output:

      name

      0001 AAPL

      0002 AAPL

      0002 GS

      0001 AAPL

  3. Accessing Data by Boolean Index

    df_nonlazy.loc[[True,True,False, False]]
    df_nonlazy.iloc[[True, True,False, False]]
    df_nonlazy[[True, True,False, False]]

    Output:

    price bsFlag name timetag

    0001 10 A AAPL 2018.01.02

    0002 20 A AAPL 2018.01.03

  4. Accessing Data by Condition

    df_nonlazy[df_nonlazy["price"]>20]

    Output:

    price bsFlag name timetag

    0002 30 B GS 2018.01.04

    0001 40 C AAPL 2018.01.08

Operating a DataFrame

A lazy DataFrame does not support adding, changing, or removing data. Take a non-lazy DataFrame df_nonlazy as an example.

  1. Adding data

    • Add a column to a DataFrame using the insert method

      df_nonlazy.insert(0, `tep, [33,33,37,37])
    • Add a column to a DataFrame by directly assigning values to the added column (currently not supported)

      row = {"price":30, "bsFlag":'B', "name":"GS", "timetag":2018.01.09}
      df_nonlazy.iloc[1] = row
  2. Updating data

    // Create an example DataFrame
    data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
    df = pd.DataFrame(data)
    
    // Use methods df.at and df.iat to change the value of a single element
    df.at[0, 'A'] = 10
    df.iat[1, 1] = 50
    
    // Use the method df.apply to apply a function to a column
    df['A'] = df['A'].apply(lambda x: x * 2)

    If a DataFrame is created from a table, the modification of its data through value assigning will also be applied to the original table.

    // Create an example DataFrame from an in-memory tablee
    t1=table([1,2,3].toddb() as value)
    df=pd.DataFrame(t1,lazy=False)
    
    // Use methods df.at and df.iat to change the value of a single element
    df.at[0, 'value']=10
    df.iat[1, 0] = 20
    t1

    Output:

    value

    10

    20

    3

  3. Removing data

    • Remove elements using the drop method

      df_nonlazy.drop(['name'], axis=1)