Parquet Plugin

Apache Parquet is a columnar storage format for efficient data storage and retrieval. This tutorial describes how to use the DolphinDB Parquet plugin to import Parquet files into DolphinDB and export DolphinDB tables to Parquet files.

The DolphinDB Parquet plugin has two branches: release200 and release130. Each branch corresponds to a DolphinDB server version. You are reading the plugin documentation for release200. If you use a different DolphinDB server version, please refer to the corresponding branch of the plugin documentation.

Install the Plugin

Download Precompiled Binaries

You can download the precompiled binaries for the DolphinDB Parquet plugin here. They can be installed directly on Linux (see Chap 02).

Please note that the plugin version should be consistent with your DolphinDB server version. You can switch branches to obtain the expected version.

(Optional) Build the Plugin

You can also compile the Parquet plugin manually with CMake on Linux by following these instructions:

(1) Install CMake

sudo apt-get install cmake

(2) Install zlib

sudo apt-get install zlib1g

(3) Compile Parquet Development Kit

git clone https://github.com/apache/arrow.git
cd arrow/cpp
mkdir build
cd build
cmake .. -DARROW_PARQUET=ON \
        -DARROW_IPC=ON \
        -DARROW_BUILD_INTEGRATION=ON \
        -DARROW_BUILD_STATIC=ON \
        -DPARQUET_BUILD_SHARED=OFF \
        -DARROW_BUILD_SHARED=OFF \
        -DARROW_DEPENDENCY_USE_SHARED=OFF \
        -DARROW_WITH_ZLIB=ON \
        -DARROW_WITH_SNAPPY=ON \
        -DARROW_WITH_ZSTD=ON \
        -DARROW_WITH_LZ4=ON \
        -DARROW_WITH_BZ2=ON
make

Note: The libstdc++ dependency library in the compilation environment must be the same as that of the DolphinDB server. The precompiled Parquet plugin supports five compression algorithms: zlib, snappy, zstd, lz4 and bz2. You can change the compression options as needed. For more details on the optional build components, see Building Arrow C++.

(4) After compiling, copy the following files to the target directories:

| Files | Target Directory |
| ------ | ------ |
| arrow/cpp/build/src/parquet/parquet_version.h | ./parquetApi/include/parquet |
| arrow/cpp/src/arrow | ./parquetApi/include |
| arrow/cpp/src/parquet | ./parquetApi/include |
| arrow/cpp/src/generated | ./parquetApi/include |
| arrow/cpp/build/release/libarrow.a | ./parquetApi/lib/linux |
| arrow/cpp/build/release/libparquet.a | ./parquetApi/lib/linux |
| arrow/cpp/build/thrift_ep-install/lib/libthrift.a | ./parquetApi/lib/linux |
| arrow/cpp/build/utf8proc_ep-install/lib/libutf8proc.a | ./parquetApi/lib/linux |
| arrow/cpp/build/jemalloc_ep-prefix/src/jemalloc_ep/lib/libjemalloc_pic.a | ./parquetApi/lib/linux |
| arrow/cpp/build/zstd_ep-install/lib64/libzstd.a | ./parquetApi/lib/linux |
| arrow/cpp/build/zlib_ep/src/zlib_ep-install/lib/libz.a | ./parquetApi/lib/linux |
| arrow/cpp/build/snappy_ep/src/snappy_ep-install/lib/libsnappy.a | ./parquetApi/lib/linux |
| arrow/cpp/build/lz4_ep-prefix/src/lz4_ep/lib/liblz4.a | ./parquetApi/lib/linux |
| arrow/cpp/build/bzip2_ep-install/lib/libbz2.a | ./parquetApi/lib/linux |
| arrow/cpp/build/boost_ep-prefix/src/boost_ep/stage/lib/libboost_regex.a | ./parquetApi/lib/linux |

(5) Build the entire project

mkdir build
cd build
cmake ../path_to_parquet_plugin/
make

Load the Plugin

Load the precompiled plugin provided by DolphinDB or the manually compiled plugin on Linux.

(1) Add the plugin path to the library search path LD_LIBRARY_PATH

export LD_LIBRARY_PATH=/path_to_parquet_plugin/:$LD_LIBRARY_PATH

(2) Start DolphinDB server and load the plugin

loadPlugin("/path_to_parquet_plugin/PluginParquet.txt")

Methods

parquet::extractParquetSchema

Syntax

parquet::extractParquetSchema(fileName)

Parameters

  • fileName: a STRING scalar indicating the Parquet file name.

Details

Return the schema of the input Parquet file as a table with two columns: name (the column names) and type (their data types).

Examples

parquet::extractParquetSchema("userdata1.parquet")

parquet::loadParquet

Syntax

parquet::loadParquet(fileName,[schema],[column],[rowGroupStart],[rowGroupNum])

Parameters

  • fileName: a STRING scalar indicating the Parquet file name.

  • schema: a table containing the column names and their data types. Specify this parameter to modify the data types of the columns that are determined by DolphinDB automatically. It is an optional parameter.

  • column: a vector of integers indicating the indices of the columns to be imported. It is an optional parameter; all columns will be read if it is not specified.

  • rowGroupStart: a non-negative integer indicating the index of the row group from which the data import starts. It is an optional parameter, and the file will be read from the beginning if it is not specified.

  • rowGroupNum: an integer indicating the number of row groups to be read. It is an optional parameter, and the method will read to the end of the file if it is not specified.

Details

Import a Parquet file to a DolphinDB in-memory table.

Regarding data type conversion, please see Data Type Mappings.

Examples

parquet::loadParquet("userdata1.parquet")

parquet::loadParquetEx

Syntax

parquet::loadParquetEx(dbHandle,tableName,[partitionColumns],fileName,[schema],[column],[rowGroupStart],[rowGroupNum],[transform])

Parameters

  • dbHandle: a database handle

  • tableName: a STRING indicating the table name

  • partitionColumns: a STRING scalar or vector indicating the partitioning column(s). For a composite partition, it is a vector.

  • fileName: a STRING scalar indicating the Parquet file name.

  • schema: a table with the column names and their data types. Specify the parameter to modify the data types of the columns that are determined by DolphinDB automatically. It is an optional parameter.

  • column: a vector of integers indicating the indices of the columns to be imported. It is an optional parameter; all columns will be read if it is not specified.

  • rowGroupStart: a non-negative integer indicating the index of the row group from which the data import starts. It is an optional parameter, and the file will be read from the beginning if it is not specified.

  • rowGroupNum: an integer indicating the number of row groups to be read. It is an optional parameter, and the method will read to the end of the file if it is not specified.

  • transform: a unary function whose input argument must be a table. If it is specified, a partitioned table must be created before loading the file. The method first applies the function to the loaded data, and then saves the result to the partitioned table.

Details

Load a Parquet file to a DolphinDB partitioned table and return a table object with metadata of the table.

  • If dbHandle is specified and is not an empty string "": load the file to a DFS database.

  • If dbHandle is an empty string "" or unspecified: load the file to a partitioned in-memory table.

Regarding data type conversion, please see Data Type Mappings.

Examples

  • Import to a partitioned DFS table
db = database("dfs://rangedb", RANGE, 0 500 1000)
parquet::loadParquetEx(db,`tb,`id,"userdata1.parquet")
  • Import to a partitioned in-memory table
db = database("", RANGE, 0 500 1000)
parquet::loadParquetEx(db,`tb,`id,"userdata1.parquet")
  • Specify the parameter transform to convert data loaded with a default type (e.g. the integer 20200101) to the target type (e.g. DATE)
dbPath="dfs://DolphinDBdatabase"
db=database(dbPath,VALUE,2020.01.01..2020.01.30)
dataFilePath="level.parquet"
schemaTB=parquet::extractParquetSchema(dataFilePath)
update schemaTB set type="DATE" where name="date"
tb=table(1:0,schemaTB.name,schemaTB.type)
tb1=db.createPartitionedTable(tb,`tb1,`date);
def i2d(mutable t){
    return t.replaceColumn!(`date,datetimeParse(string(t.date),"yyyyMMdd"))
}
t = parquet::loadParquetEx(db,`tb1,`date,dataFilePath,,,,,i2d)

parquet::parquetDS

Syntax

parquet::parquetDS(fileName,[schema])

Parameters

  • fileName: a STRING scalar indicating the Parquet file name.

  • schema: a table with the column names and their data types. Specify the parameter to modify the data types of the columns that are determined by DolphinDB automatically. It is an optional parameter.

Details

Create data sources based on the input file name. The number of data sources is the same as the number of row groups in the file.

Examples

>ds = parquet::parquetDS("userdata1.parquet")
>size ds;
1
>ds[0];
DataSource< loadParquet("userdata1.parquet",,,0,1) >
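
A data source is evaluated lazily: the data is loaded only when the data source is consumed, e.g., by the mr function. A minimal sketch that loads all row groups and merges them into one in-memory table (assuming unionAll as the final merge step):

ds = parquet::parquetDS("userdata1.parquet")
// map each row-group data source to a table, then union the partial results
tb = mr(ds, def(t){ return t }, , unionAll)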

parquet::saveParquet

Syntax

parquet::saveParquet(table, fileName)

Parameters

  • table: the table to be exported

  • fileName: a STRING scalar indicating the Parquet file name to save the table

Details

Export a DolphinDB table to a Parquet file.

Examples

parquet::saveParquet(tb, "userdata1.parquet")
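
A round-trip sketch (the table contents and output path are illustrative):

// build a small in-memory table
tb = table(1..3 as id, `a`b`c as name)
// export it to a Parquet file, then load it back
parquet::saveParquet(tb, "/tmp/demo.parquet")
tb2 = parquet::loadParquet("/tmp/demo.parquet")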

Data Type Mappings

Import

When a Parquet file is imported into DolphinDB, the data types are converted based on the LogicalType annotation in the file. If no LogicalType or ConvertedType is defined, the conversion is performed based on the physical type.

| Logical Type in Parquet | TimeUnit in Parquet | Type in DolphinDB |
| ------ | ------ | ------ |
| INT(bit_width=8,is_signed=true) | \ | CHAR |
| INT(bit_width=8,is_signed=false or bit_width=16,is_signed=true) | \ | SHORT |
| INT(bit_width=16,is_signed=false or bit_width=32,is_signed=true) | \ | INT |
| INT(bit_width=32,is_signed=false or bit_width=64,is_signed=true) | \ | LONG |
| INT(bit_width=64,is_signed=false) | \ | LONG |
| ENUM | \ | SYMBOL |
| DECIMAL | \ | DOUBLE |
| DATE | \ | DATE |
| TIME | MILLIS\MICROS\NANOS | TIME\NANOTIME\NANOTIME |
| TIMESTAMP | MILLIS\MICROS\NANOS | TIMESTAMP\NANOTIMESTAMP\NANOTIMESTAMP |
| INTEGER | \ | INT\LONG |
| STRING | \ | STRING |
| JSON | \ | not supported |
| BSON | \ | not supported |
| UUID | \ | not supported |
| MAP | \ | not supported |
| LIST | \ | not supported |
| NIL | \ | not supported |
| Converted Type in Parquet | Type in DolphinDB |
| ------ | ------ |
| INT_8 | CHAR |
| UINT_8\INT_16 | SHORT |
| UINT_16\INT_32 | INT |
| TIMESTAMP_MICROS | NANOTIMESTAMP |
| TIMESTAMP_MILLIS | TIMESTAMP |
| DECIMAL | DOUBLE |
| UINT_32\INT_64\UINT_64 | LONG |
| TIME_MICROS | NANOTIME |
| TIME_MILLIS | TIME |
| DATE | DATE |
| ENUM | SYMBOL |
| UTF8 | STRING |
| MAP | not supported |
| LIST | not supported |
| JSON | not supported |
| BSON | not supported |
| MAP_KEY_VALUE | not supported |
| Physical Type in Parquet | Type in DolphinDB |
| ------ | ------ |
| BOOLEAN | BOOL |
| INT32 | INT |
| INT64 | LONG |
| INT96 | NANOTIMESTAMP |
| FLOAT | FLOAT |
| DOUBLE | DOUBLE |
| BYTE_ARRAY | STRING |
| FIXED_LEN_BYTE_ARRAY | STRING |

Note:

  • Conversion of the Parquet repeated fields is not supported.
  • DECIMAL can be used to convert data of the following physical types: INT32, INT64 and FIXED_LEN_BYTE_ARRAY.
  • DolphinDB does not support unsigned data types. Therefore, in case of UINT_64 overflow when loading a Parquet file, the data will be converted to NULL values in DolphinDB.

Export

When exporting data from DolphinDB to a Parquet file, the system will convert the data types to Parquet types based on the given table schema.

| Type in DolphinDB | Physical Type in Parquet | Logical Type in Parquet |
| ------ | ------ | ------ |
| BOOL | BOOLEAN | \ |
| CHAR | FIXED_LEN_BYTE_ARRAY | \ |
| SHORT | INT32 | INT(16) |
| INT | INT32 | INT(32) |
| LONG | INT64 | INT(64) |
| DATE | INT32 | DATE |
| MONTH | INT32 | DATE |
| TIME | INT32 | TIME_MILLIS |
| MINUTE | INT32 | TIME_MILLIS |
| SECOND | INT32 | TIME_MILLIS |
| DATETIME | INT64 | TIMESTAMP_MILLIS |
| TIMESTAMP | INT64 | TIMESTAMP_MILLIS |
| NANOTIME | INT64 | TIME_NANOS |
| NANOTIMESTAMP | INT64 | TIMESTAMP_NANOS |
| FLOAT | FLOAT | \ |
| DOUBLE | DOUBLE | \ |
| STRING | BYTE_ARRAY | STRING |
| SYMBOL | BYTE_ARRAY | STRING |