Parquet
Apache Parquet is a columnar storage format designed for efficient data storage and retrieval. This tutorial introduces how to use the DolphinDB Parquet plugin to import Parquet files into DolphinDB and export DolphinDB tables to Parquet files.
Install the Plugin
Download Precompiled Binaries
You can download precompiled binaries for the DolphinDB Parquet plugin here. They can be loaded directly on Linux (see Chap 02).
Please note that the plugin version must be consistent with your DolphinDB server version. You can switch branches to obtain the expected version.
(Optional) Build a Plugin
You can also compile the Parquet plugin manually with CMake on Linux by following these instructions:
(1) Install CMake
sudo apt-get install cmake
(2) Install zlib
sudo apt-get install zlib1g
(3) Compile Parquet Development Kit
git clone https://github.com/apache/arrow.git
cd arrow/cpp
mkdir build
cd build
cmake .. -DARROW_PARQUET=ON \
    -DARROW_IPC=ON \
    -DARROW_BUILD_INTEGRATION=ON \
    -DARROW_BUILD_STATIC=ON \
    -DPARQUET_BUILD_SHARED=OFF \
    -DARROW_BUILD_SHARED=OFF \
    -DARROW_DEPENDENCY_USE_SHARED=OFF \
    -DARROW_WITH_ZLIB=ON \
    -DARROW_WITH_SNAPPY=ON \
    -DARROW_WITH_ZSTD=ON \
    -DARROW_WITH_LZ4=ON \
    -DARROW_WITH_BZ2=ON
Note: The libstdc++ library in the compilation environment must be the same version as the one used by the DolphinDB server. The precompiled Parquet plugin supports 5 compression algorithms: zlib, snappy, zstd, lz4 and bz2. You can change the compression options as needed. For more details on the optional build components, see Building Arrow C++.
(4) After compiling, copy the following files to the target directories.
Files | Target Directory |
---|---|
arrow/cpp/build/src/parquet/parquet_version.h | ./parquetApi/include/parquet |
arrow/cpp/src/arrow<br>arrow/cpp/src/parquet<br>arrow/cpp/src/generated | ./parquetApi/include |
arrow/cpp/build/release/libarrow.a<br>arrow/cpp/build/release/libparquet.a<br>arrow/cpp/build/thrift_ep-install/lib/libthrift.a<br>arrow/cpp/build/utf8proc_ep-install/lib/libutf8proc.a<br>arrow/cpp/build/jemalloc_ep-prefix/src/jemalloc_ep/lib/libjemalloc_pic.a<br>arrow/cpp/build/zstd_ep-install/lib64/libzstd.a<br>arrow/cpp/build/zlib_ep/src/zlib_ep-install/lib/libz.a<br>arrow/cpp/build/snappy_ep/src/snappy_ep-install/lib/libsnappy.a<br>arrow/cpp/build/lz4_ep-prefix/src/lz4_ep/lib/liblz4.a<br>arrow/cpp/build/bzip2_ep-install/lib/libbz2.a<br>arrow/cpp/build/boost_ep-prefix/src/boost_ep/stage/lib/libboost_regex.a | ./parquetApi/lib/linux |
(5) Build the entire project
mkdir build
cd build
cmake ../path_to_parquet_plugin/
make
Load the Plugin
Load the precompiled plugin provided by DolphinDB or the manually compiled plugin on Linux.
(1) Add the plugin path to the library search path LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/path_to_parquet_plugin/:$LD_LIBRARY_PATH
(2) Start DolphinDB server and load the plugin
loadPlugin("/path_to_parquet_plugin/PluginParquet.txt")
Methods
parquet::extractParquetSchema
Syntax
parquet::extractParquetSchema(fileName)
Parameters
- fileName: a STRING scalar indicating the Parquet file name.
Details
Return the schema of the input Parquet file as a table with 2 columns: the column names and their data types.
Examples
parquet::extractParquetSchema("userdata1.parquet")
parquet::loadParquet
Syntax
parquet::loadParquet(fileName,[schema],[column],[rowGroupStart],[rowGroupNum])
Parameters
- fileName: a STRING scalar indicating the Parquet file name.
- schema: a table with the column names and their data types. Specify the parameter to modify the data types of the columns generated by the system. It is an optional parameter.
- column: a vector of integers indicating the indices of the columns to be imported. It is an optional parameter, and all columns will be read if it is not specified.
- rowGroupStart: a non-negative integer indicating the index of the row group from which the data import starts. It is an optional parameter, and the file will be read from the beginning if it is not specified.
- rowGroupNum: an integer indicating the number of row groups to be read. It is an optional parameter, and the method will read to the end of the file if it is not specified.
Details
Import a Parquet file to a DolphinDB in-memory table.
Regarding data type conversion, please see Data Type Mappings.
Examples
parquet::loadParquet("userdata1.parquet")
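The optional parameters can be combined to read only part of a file. A minimal sketch, assuming userdata1.parquet has at least three columns and one row group (the indices below are illustrative):

```dolphindb
// read only columns 0 and 2, starting at row group 0, reading 1 row group
t = parquet::loadParquet("userdata1.parquet",,[0,2],0,1)
```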
parquet::loadParquetEx
Syntax
parquet::loadParquetEx(dbHandle,tableName,[partitionColumns],fileName,[schema],[column],[rowGroupStart],[rowGroupNum],[transform])
Parameters
- dbHandle: a database handle.
- tableName: a STRING scalar indicating the table name.
- partitionColumns: a STRING scalar or vector indicating the partitioning column(s). For a composite partition, it is a vector.
- fileName: a STRING scalar indicating the Parquet file name.
- schema: a table with the column names and their data types. Specify the parameter to modify the data types of the columns that are determined by DolphinDB automatically. It is an optional parameter.
- column: a vector of integers indicating the indices of the columns to be imported. It is an optional parameter, and all columns will be read if it is not specified.
- rowGroupStart: a non-negative integer indicating the index of the row group from which the data import starts. It is an optional parameter, and the file will be read from the beginning if it is not specified.
- rowGroupNum: an integer indicating the number of row groups to be read. It is an optional parameter, and the method will read to the end of the file if it is not specified.
- transform: a unary function whose input argument must be a table. If it is specified, a partitioned table must be created before loading the file. The method will first apply the specified function to the data, and then save the result to the partitioned table.
Details
Load a Parquet file to a DolphinDB partitioned table and return a table object with metadata of the table.
If dbHandle is specified and is not an empty string "": load the file to a DFS database.
If dbHandle is an empty string "" or unspecified: load the file to a partitioned in-memory table.
Regarding data type conversion, please see Data Type Mappings.
Examples
- Import to a partitioned DFS table
db = database("dfs://rangedb", RANGE, 0 500 1000)
parquet::loadParquetEx(db,`tb,`id,"userdata1.parquet")
- Import to a partitioned in-memory table
db = database("", RANGE, 0 500 1000)
parquet::loadParquetEx(db,`tb,`id,"userdata1.parquet")
- Specify the parameter transform to convert data loaded with a default type (e.g. values like 20200101) into the desired type (e.g. DATE)
dbPath="dfs://DolphinDBdatabase"
db=database(dbPath,VALUE,2020.01.01..2020.01.30)
dataFilePath="level.parquet"
schemaTB=parquet::extractParquetSchema(dataFilePath)
update schemaTB set type="DATE" where name="date"
tb=table(1:0,schemaTB.name,schemaTB.type)
tb1=db.createPartitionedTable(tb,`tb1,`date);
def i2d(mutable t){
    return t.replaceColumn!(`date,datetimeParse(t.date,"yyyy.MM.dd"))
}
t = parquet::loadParquetEx(db,`tb1,`date,dataFilePath,,,,,i2d)
parquet::parquetDS
Syntax
parquet::parquetDS(fileName,[schema])
Parameters
- fileName: a STRING scalar indicating the Parquet file name.
- schema: a table with the column names and their data types. Specify the parameter to modify the data types of the columns that are determined by DolphinDB automatically. It is an optional parameter.
Details
Create data sources based on the input file name. The number of data sources is the same as the number of row groups in the file.
Examples
>ds = parquet::parquetDS("userdata1.parquet")
>size ds;
1
>ds[0];
DataSource< loadParquet("userdata1.parquet",,,0,1) >
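The data sources returned by parquetDS are typically consumed with DolphinDB's mr function, which loads the row groups in parallel. A minimal sketch, assuming an identity map function and unionAll as the final merge step:

```dolphindb
ds = parquet::parquetDS("userdata1.parquet")
// map each data source to itself and union the partial tables into one in-memory table
t = mr(ds, def(x){ return x }, , unionAll)
```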
parquet::saveParquet
Syntax
parquet::saveParquet(table, fileName)
Parameters
- table: the table to be exported.
- fileName: a STRING scalar indicating the Parquet file name to save the table.
Details
Export a DolphinDB table to a Parquet file.
Examples
parquet::saveParquet(tb, "userdata1.parquet")
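A round-trip sketch (the table below is made up for illustration): export an in-memory table, then load it back. Note that the reloaded column types follow the mappings in the next section; for example, a SYMBOL column comes back as STRING.

```dolphindb
// build a small table, export it, and read it back
tb = table(1..5 as id, rand(10.0, 5) as price)
parquet::saveParquet(tb, "tmp.parquet")
t2 = parquet::loadParquet("tmp.parquet")
```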
Data Type Mappings
Import
When a Parquet file is imported to DolphinDB, the data types are converted based on the LogicalType annotated in the file. If neither the LogicalType nor the ConvertedType is defined, the conversion is performed based on the physical types.
Logical Type in Parquet | TimeUnit in Parquet | Type in DolphinDB |
---|---|---|
INT(bit_width=8,is_signed=true) | \ | CHAR |
INT(bit_width=8,is_signed=false or bit_width=16,is_signed=true) | \ | SHORT |
INT(bit_width=16,is_signed=false or bit_width=32,is_signed=true) | \ | INT |
INT(bit_width=32,is_signed=false or bit_width=64,is_signed=true) | \ | LONG |
INT(bit_width=64,is_signed=false) | \ | LONG |
ENUM | \ | SYMBOL |
DECIMAL | \ | DOUBLE |
DATE | \ | DATE |
TIME | MILLIS\MICROS\NANOS | TIME\NANOTIME\NANOTIME |
TIMESTAMP | MILLIS\MICROS\NANOS | TIMESTAMP\NANOTIMESTAMP\NANOTIMESTAMP |
INTEGER | \ | INT\LONG |
STRING | \ | STRING |
JSON | \ | not supported |
BSON | \ | not supported |
UUID | \ | not supported |
MAP | \ | not supported |
LIST | \ | not supported |
NIL | \ | not supported |
Converted Type in Parquet | Type in DolphinDB |
---|---|
INT_8 | CHAR |
UINT_8\INT_16 | SHORT |
UINT_16\INT_32 | INT |
TIMESTAMP_MICROS | NANOTIMESTAMP |
TIMESTAMP_MILLIS | TIMESTAMP |
DECIMAL | DOUBLE |
UINT_32\INT_64\UINT_64 | LONG |
TIME_MICROS | NANOTIME |
TIME_MILLIS | TIME |
DATE | DATE |
ENUM | SYMBOL |
UTF8 | STRING |
MAP | not supported |
LIST | not supported |
JSON | not supported |
BSON | not supported |
MAP_KEY_VALUE | not supported |
Physical Type in Parquet | Type in DolphinDB |
---|---|
BOOLEAN | BOOL |
INT32 | INT |
INT64 | LONG |
INT96 | NANOTIMESTAMP |
FLOAT | FLOAT |
DOUBLE | DOUBLE |
BYTE_ARRAY | STRING |
FIXED_LEN_BYTE_ARRAY | STRING |
Note:
- Conversion of the Parquet repeated fields is not supported.
- DECIMAL can be used to convert data of the following physical types: INT32, INT64 and FIXED_LEN_BYTE_ARRAY.
- DolphinDB does not support unsigned data types. Therefore, in case of UINT_64 overflow when loading a Parquet file, the data will be converted to NULL values in DolphinDB.
Export
When exporting data from DolphinDB to a Parquet file, the system will convert the data types to Parquet types based on the given table schema.
Type in DolphinDB | Physical Type in Parquet | Logical Type in Parquet |
---|---|---|
BOOL | BOOLEAN | \ |
CHAR | FIXED_LEN_BYTE_ARRAY | \ |
SHORT | INT32 | INT(16) |
INT | INT32 | INT(32) |
LONG | INT64 | INT(64) |
DATE | INT32 | DATE |
MONTH | INT32 | DATE |
TIME | INT32 | TIME_MILLIS |
MINUTE | INT32 | TIME_MILLIS |
SECOND | INT32 | TIME_MILLIS |
DATETIME | INT64 | TIMESTAMP_MILLIS |
TIMESTAMP | INT64 | TIMESTAMP_MILLIS |
NANOTIME | INT64 | TIME_NANOS |
NANOTIMESTAMP | INT64 | TIMESTAMP_NANOS |
FLOAT | FLOAT | \ |
DOUBLE | DOUBLE | \ |
STRING | BYTE_ARRAY | STRING |
SYMBOL | BYTE_ARRAY | STRING |