HDF5
The DolphinDB hdf5 plugin imports HDF5 datasets into DolphinDB and supports data type conversions.
Installation (with installPlugin)
Required server version: DolphinDB 2.00.10 or higher
OS: Windows and Linux x86-64
Installation Steps:
(1) Use listRemotePlugins to check plugin information in the plugin repository.
Note: For plugins not included in the provided list, you can install through precompiled binaries or compile from source. These files can be accessed from our GitHub repository by switching to the appropriate version branch.
login("admin", "123456")
listRemotePlugins()
(2) Use installPlugin to install the plugin.
installPlugin("hdf5")
(3) Use loadPlugin to load the plugin before using the plugin methods.
loadPlugin("hdf5")
Method References
ls
Syntax
ls(fileName)
Parameters
- fileName: A STRING scalar indicating the HDF5 file name.
Details
List all the objects (including datasets and groups) and their types in an HDF5 file. For each dataset, the method also returns its number of columns and rows. For example, DataSet{(7,3)} represents 7 columns and 3 rows.
Example
hdf5::ls("/smpl_numeric.h5")
/* output:
objName objType
--------------------
/ Group
/double DataSet{(7,3)}
/float DataSet{(7,3)}
/schar DataSet{(7,3)}
/sint DataSet{(7,3)}
/slong DataSet{(7,3)}
/sshort DataSet{(7,3)}
/uchar DataSet{(7,3)}
/uint DataSet{(7,3)}
/ulong DataSet{(1,1)}
/ushort DataSet{(7,3)}
*/
hdf5::ls("/named_type.h5")
/* output:
objName objType
----------------------
/ Group
/type_name NamedDataType
*/
lsTable
Syntax
lsTable(fileName)
Parameters
- fileName: A STRING scalar indicating the HDF5 file name.
Details
List all the table information in an HDF5 file, i.e., HDF5 dataset information, including table name, dimension, and type.
Example
hdf5::lsTable("/smpl_numeric.h5")
/* output:
tableName tableDims tableType
/double 7,3 H5T_NATIVE_DOUBLE
/float 7,3 H5T_NATIVE_FLOAT
/schar 7,3 H5T_NATIVE_SCHAR
/sint 7,3 H5T_NATIVE_INT
/slong 7,3 H5T_NATIVE_LLONG
/sshort 7,3 H5T_NATIVE_SHORT
/uchar 7,3 H5T_NATIVE_UCHAR
/uint 7,3 H5T_NATIVE_UINT
/ulong 1,1 H5T_NATIVE_ULLONG
/ushort 7,3 H5T_NATIVE_USHORT
*/
extractHDF5Schema
Syntax
extractHDF5Schema(fileName, datasetName)
Parameters
- fileName: A STRING scalar indicating the HDF5 file name.
- datasetName: A STRING scalar indicating the dataset name, i.e., the table name. It can be obtained using ls or lsTable.
Details
Generate the schema table for the specified dataset in the HDF5 file. The schema table contains 2 columns: column names and data types.
Example
hdf5::extractHDF5Schema("/smpl_numeric.h5","sint")
/* output:
name type
col_0 INT
col_1 INT
col_2 INT
col_3 INT
col_4 INT
col_5 INT
col_6 INT
*/
hdf5::extractHDF5Schema("/compound.h5","com")
/* output:
name type
fs STRING
vs STRING
d DOUBLE
t TIMESTAMP
l LONG
f FLOAT
i INT
s SHORT
c CHAR
*/
loadHDF5
Syntax
loadHDF5(fileName,datasetName,[schema],[startRow],[rowNum])
Parameters
- fileName: A STRING scalar indicating the HDF5 file name.
- datasetName: A STRING scalar indicating the dataset name, i.e., the table name. It can be obtained using ls or lsTable.
- schema (optional): A table containing column names and types. To override a data type that the system determines automatically, modify the schema table and pass it as this parameter.
- startRow (optional): An integer indicating the start row from which to read the HDF5 dataset. If not specified, the dataset will be read from the beginning.
- rowNum (optional): An integer indicating the number of rows to read from the HDF5 dataset. If not specified, the reading continues until the end of the dataset.
Details
Load an HDF5 dataset into a DolphinDB in-memory table. rowNum specifies the number of rows to read from the HDF5 dataset, not the number of rows in the output DolphinDB table. For supported data types and data conversion rules, refer to the "Data Type Mappings" section.
Example
hdf5::loadHDF5("/smpl_numeric.h5","sint")
/* output:
col_0 col_1 col_2 col_3 col_4 col_5 col_6
(758) 8 (325,847) 87 687 45 90
61 0 28 77 546 789 45
799 5,444 325,847 678 90 54 0
*/
scm = table(`a`b`c`d`e`f`g as name, `CHAR`BOOL`SHORT`INT`LONG`DOUBLE`FLOAT as type)
hdf5::loadHDF5("../hdf5/h5file/smpl_numeric.h5","sint",scm,1,1)
/* output:
a b c d e f g
'=' false 28 77 546 789 45
*/
Note: The dimension of the dataset must be 2 or less. Only 2D or 1D tables can be parsed.
loadPandasHDF5
Syntax
loadPandasHDF5(fileName,groupName,[schema],[startRow],[rowNum])
Parameters
- fileName: A STRING scalar indicating the HDF5 file name.
- groupName: The identifier of the group, i.e. the key name.
- schema (optional): A table containing column names and types. To override a data type that the system determines automatically, modify the schema table and pass it as this parameter.
- startRow (optional): An integer indicating the start row from which to read the HDF5 dataset. If not specified, the dataset will be read from the beginning.
- rowNum (optional): An integer indicating the number of rows to read from the HDF5 dataset. If not specified, the reading continues until the end of the dataset.
Details
Load an HDF5 file saved by Pandas into a DolphinDB in-memory table. rowNum specifies the number of rows to read from the HDF5 dataset, instead of the output DolphinDB table. For supported data types and data conversion rules, refer to the "Data Type Mappings" section.
Example
hdf5::loadPandasHDF5("/home/ffliu/Data/data.h5","/s",,1,1)
/* output:
A B C D E
28 77 54 78 9
*/
loadHDF5Ex
Syntax
loadHDF5Ex(dbHandle,tableName,[partitionColumns],fileName,datasetName,[schema],[startRow],[rowNum],[transform])
Parameters
- dbHandle: The database handle specified to save the input data into a distributed database.
- tableName: The table name specified to save the input data into a distributed database.
- partitionColumns (optional): A STRING scalar/vector indicating partitioning column(s), which needs to be specified when the database is not sequentially (SEQ) partitioned. In composite partitioning, partitionColumns is a STRING vector.
- fileName: A STRING scalar indicating the HDF5 file name.
- datasetName: A STRING scalar indicating the dataset name, i.e., the table name. It can be obtained using ls or lsTable.
- schema (optional): A table containing column names and types. To override a data type that the system determines automatically, modify the schema table and pass it as this parameter.
- startRow (optional): An integer indicating the start row from which to read the HDF5 dataset. If not specified, the dataset will be read from the beginning.
- rowNum (optional): An integer indicating the number of rows to read from the HDF5 dataset. If not specified, the reading continues until the end of the dataset.
- transform (optional): A unary function that takes a table as its parameter. If specified, a partitioned table must be created before loading the data. The system applies the function to the data read from the HDF5 file and saves the result into the database.
Details
Convert the datasets in an HDF5 file into a DolphinDB distributed table. The metadata of the table is loaded into the memory. rowNum specifies the number of rows to read from the HDF5 dataset, instead of the output DolphinDB table. For supported data types and data conversion rules, refer to the "Data Type Mappings" section.
Example
SEQ partitioned table on disk
db = database("seq_on_disk", SEQ, 16)
hdf5::loadHDF5Ex(db,`tb,,"/large_file.h5", "large_table")
SEQ partitioned table in memory
db = database("", SEQ, 16)
hdf5::loadHDF5Ex(db,`tb,,"/large_file.h5", "large_table")
Non-SEQ partitioned table on disk
db = database("non_seq_on_disk", RANGE, 0 500 1000)
hdf5::loadHDF5Ex(db,`tb,`col_4,"/smpl_numeric.h5","sint")
Non-SEQ partitioned table in memory
db = database("", RANGE, 0 500 1000)
t0 = hdf5::loadHDF5Ex(db,`tb,`col_4,"/smpl_numeric.h5","sint")
Specify the transform parameter to convert numeric date values (e.g., 20200101) to the DATE type (e.g., 2020.01.01).
dbPath="dfs://DolphinDBdatabase"
db=database(dbPath,VALUE,2020.01.01..2020.01.30)
dataFilePath="/transform.h5"
datasetName="/SZ000001/data"
schemaTB=hdf5::extractHDF5Schema(dataFilePath,datasetName)
update schemaTB set type="DATE" where name="trans_time"
tb=table(1:0,schemaTB.name,schemaTB.type)
tb1=db.createPartitionedTable(tb,`tb1,`trans_time);
def i2d(mutable t){
    return t.replaceColumn!(`trans_time,datetimeParse(string(t.trans_time),"yyyyMMdd"))
}
t = hdf5::loadHDF5Ex(db,`tb1,`trans_time,dataFilePath,datasetName,,,,i2d)
HDF5DS
Syntax
HDF5DS(fileName,datasetName,[schema],[dsNum])
Parameters
- fileName: A STRING scalar indicating the HDF5 file name.
- datasetName: A STRING scalar indicating the dataset name, i.e., the table name. It can be obtained using ls or lsTable.
- schema (optional): A table containing column names and types. To override a data type that the system determines automatically, modify the schema table and pass it as this parameter.
- dsNum (optional): The number of data sources to be generated. HDF5DS divides the whole table equally into dsNum tables. If not specified, it will generate one data source.
Details
Generate a tuple of data sources according to the input file name and dataset name.
Example
ds = hdf5::HDF5DS("/smpl_numeric.h5","sint")
size ds;
// output:1
ds[0];
//output:DataSource< loadHDF5("/smpl_numeric.h5", "sint", , 0, 3) >
ds = hdf5::HDF5DS("/smpl_numeric.h5","sint",,3)
size ds;
// output:3
ds[0];
// output:DataSource< loadHDF5("/smpl_numeric.h5", "sint", , 0, 1) >
ds[1];
// output:DataSource< loadHDF5("/smpl_numeric.h5", "sint", , 1, 1) >
ds[2];
// output:DataSource< loadHDF5("/smpl_numeric.h5", "sint", , 2, 1) >
Note: HDF5 does not support concurrent reads. For example, the following call is invalid:
ds = hdf5::HDF5DS("/smpl_numeric.h5", "sint", ,3)
res = mr(ds, def(x) : x)
To correct this error, set the parallel parameter of function mr to false:
ds = hdf5::HDF5DS("/smpl_numeric.h5", "sint", ,3)
res = mr(ds, def(x) : x,,,false)
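The (startRow, rowNum) pairs shown in the DataSource output above can be reproduced with a small Python sketch. It is only an illustration of dividing a dataset into dsNum equal ranges: the function name `split_rows` is hypothetical, and the handling of row counts that do not divide evenly (extra rows go to the earliest ranges) is an assumption, since the plugin's behavior in that case is not documented here.

```python
def split_rows(total_rows, ds_num=1):
    """Return (start_row, row_num) pairs that cover total_rows in ds_num ranges."""
    if ds_num < 1:
        raise ValueError("ds_num must be at least 1")
    base, extra = divmod(total_rows, ds_num)
    ranges = []
    start = 0
    for i in range(ds_num):
        # Assumption: leftover rows are given to the earliest ranges.
        count = base + (1 if i < extra else 0)
        ranges.append((start, count))
        start += count
    return ranges


# The "sint" dataset in the example has 3 rows.
print(split_rows(3))     # [(0, 3)]
print(split_rows(3, 3))  # [(0, 1), (1, 1), (2, 1)]
```

Each pair corresponds to the last two arguments of one generated loadHDF5 call.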
saveHDF5
Syntax
saveHDF5(table, fileName, datasetName, [append], [stringMaxLength])
Parameters
- table: The table to be saved.
- fileName: A STRING scalar indicating the HDF5 file name.
- datasetName: A STRING scalar indicating the dataset name, i.e., the table name. It can be obtained using ls or lsTable.
- append (optional): A BOOL value indicating whether to append data to an existing table. The default value is false.
- stringMaxLength (optional): A numeric value indicating the maximum length of strings, which only applies to the data of STRING and SYMBOL type in the table. The default value is 16.
Details
Save the DolphinDB in-memory table to a specified dataset in an HDF5 file. For supported data types and data conversion rules, refer to the "Data Type Mappings" section.
Example
hdf5::saveHDF5(tb, "example.h5", "dataset name in hdf5")
Note:
NULL values are not supported in HDF5 files. If there are NULL values in DolphinDB tables, they will be saved as the default value defined in the "Data Type Mappings" section.
To read the h5 files generated by the hdf5 plugin in Python, you can use the h5py library as follows:
import h5py
f = h5py.File("/home/workDir/dolphindb_src/build/test.h5", 'r')
print(f['aaa']['TimeStamp'])
print(f['aaa']['StockID'])
Data Type Mappings
The floating point and integer types in the HDF5 file will be converted to H5T_NATIVE_* type (via H5Tget_native_type).
Integer
Type in HDF5 | Default Value in HDF5 | Type in C | Type in DolphinDB |
---|---|---|---|
H5T_NATIVE_CHAR | '\0' | signed char / unsigned char | char/short |
H5T_NATIVE_SCHAR | '\0' | signed char | char |
H5T_NATIVE_UCHAR | '\0' | unsigned char | short |
H5T_NATIVE_SHORT | 0 | short | short |
H5T_NATIVE_USHORT | 0 | unsigned short | int |
H5T_NATIVE_INT | 0 | int | int |
H5T_NATIVE_UINT | 0 | unsigned int | long |
H5T_NATIVE_LONG | 0 | long | int/long |
H5T_NATIVE_ULONG | 0 | unsigned long | long |
H5T_NATIVE_LLONG | 0 | long long | long |
H5T_NATIVE_ULLONG | 0 | unsigned long long | long |
- In DolphinDB, all numeric types are signed. To prevent overflow, every unsigned type except the 64-bit ones is converted to the next wider signed type (for example, unsigned int is converted to LONG). 64-bit unsigned types are converted to the 64-bit signed type; if a value overflows during this conversion, it is replaced by the maximum value of the 64-bit signed type.
- H5T_NATIVE_CHAR corresponds to the char type in C. Whether char is signed or unsigned depends on the compiler and platform. Signed char is converted to the CHAR type in DolphinDB and unsigned char is converted to the SHORT type.
- H5T_NATIVE_LONG and H5T_NATIVE_ULONG correspond to the long and unsigned long types in C, whose width depends on the platform.
- All integer types can be converted to the numeric types (BOOL, CHAR, SHORT, INT, LONG, FLOAT, and DOUBLE) in DolphinDB. An overflow may occur during the conversion. For example, the conversion from LONG to INT returns the maximum or minimum value of INT.
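The integer rules above can be illustrated with a short Python sketch. It is not the plugin's implementation: the names `convert_unsigned` and `saturate` are hypothetical, and `saturate` uses the plain two's-complement range, whereas DolphinDB may reserve the smallest value of each type for NULL, so the actual lower bound can differ.

```python
INT64_MAX = 2**63 - 1

# Unsigned bit width -> DolphinDB target type, following the table above.
UNSIGNED_TO_DOLPHINDB = {8: "SHORT", 16: "INT", 32: "LONG", 64: "LONG"}


def convert_unsigned(value, bits):
    """Map an unsigned HDF5 integer to (DolphinDB type name, stored value)."""
    if bits not in UNSIGNED_TO_DOLPHINDB:
        raise ValueError("unsupported width: %d" % bits)
    if not 0 <= value < 2**bits:
        raise ValueError("value does not fit in %d unsigned bits" % bits)
    target = UNSIGNED_TO_DOLPHINDB[bits]
    # Only a 64-bit unsigned value can exceed the signed 64-bit range.
    if value > INT64_MAX:
        value = INT64_MAX
    return target, value


def saturate(value, bits):
    """Clamp value to a signed range of the given width (assumed bounds)."""
    low, high = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return max(low, min(high, value))


print(convert_unsigned(65535, 16))      # ('INT', 65535)
print(convert_unsigned(2**64 - 1, 64))  # ('LONG', 9223372036854775807)
print(saturate(2**40, 32))              # 2147483647
```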
Floating Point
Type in HDF5 | Default Value in HDF5 | Type in C | Type in DolphinDB |
---|---|---|---|
H5T_NATIVE_FLOAT | +0.0f | float | float |
H5T_NATIVE_DOUBLE | +0.0 | double | double |
- IEEE754 floating point types are all signed.
- All floating point types can be converted to the numeric types (BOOL, CHAR, SHORT, INT, LONG, FLOAT, and DOUBLE) in DolphinDB. An overflow may occur during the conversion. For example, the conversion from DOUBLE to FLOAT returns the maximum or minimum value of FLOAT.
Temporal
Type in HDF5 | Default Value in HDF5 | Type in C | Type in DolphinDB |
---|---|---|---|
H5T_UNIX_D32BE | 1970.01.01T00:00:00 | 4 bytes integer | DT_TIMESTAMP |
H5T_UNIX_D32LE | 1970.01.01T00:00:00 | 4 bytes integer | DT_TIMESTAMP |
H5T_UNIX_D64BE | 1970.01.01T00:00:00.000 | 8 bytes integer | DT_TIMESTAMP |
H5T_UNIX_D64LE | 1970.01.01T00:00:00.000 | 8 bytes integer | DT_TIMESTAMP |
- The predefined temporal types of HDF5 are 32-bit or 64-bit POSIX time. Due to the lack of an official definition for temporal types in HDF5, this plugin interprets 32-bit temporal types as the number of seconds since 1970 and 64-bit temporal types as milliseconds. All temporal types are uniformly converted by the plugin into a 64-bit integer and then converted to the timestamp type in DolphinDB.
- The aforementioned temporal types can be converted to time-related types in DolphinDB: DATE, MONTH, TIME, MINUTE, SECOND, DATETIME, TIMESTAMP, NANOTIME, NANOTIMESTAMP.
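As a rough illustration of the interpretation described above (32-bit values as seconds, 64-bit values as milliseconds, both normalized to a 64-bit millisecond count), consider the following Python sketch. The helper names are hypothetical and the use of UTC is an assumption made only to keep the example deterministic.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_timestamp_ms(raw, bits):
    """Normalize a raw HDF5 time value to milliseconds since 1970.01.01."""
    if bits == 32:
        return raw * 1000  # 32-bit values are read as seconds
    if bits == 64:
        return raw         # 64-bit values are read as milliseconds
    raise ValueError("only 32-bit and 64-bit temporal types are predefined")


def to_datetime(ms):
    """Render a millisecond count as a datetime (UTC assumed for this sketch)."""
    return EPOCH + timedelta(milliseconds=ms)


print(to_timestamp_ms(86400, 32))  # 86400000
print(to_datetime(to_timestamp_ms(86400, 32)).isoformat())
# 1970-01-02T00:00:00+00:00
```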
String
Type in HDF5 | Default Value in HDF5 | Type in C | Type in DolphinDB |
---|---|---|---|
H5T_C_S1 | "" | char* | DT_STRING |
- H5T_C_S1 includes fixed-length strings and variable-length strings.
- The string type can be converted to string-related types in DolphinDB: STRING and SYMBOL.
Enum
Type in HDF5 | Corresponding C Type | Corresponding DolphinDB Type |
---|---|---|
ENUM | enum | DT_SYMBOL |
- The enum type is converted to a SYMBOL variable in DolphinDB. Note that the numeric values of the enum members and their relative order are not preserved. For example, the enum HDF5_ENUM{"a"=100,"b"=2000,"c"=30000} might be converted to SYMBOL{"a"=3,"b"=1,"c"=2}.
- The enum type can be converted to string-related types in DolphinDB: STRING and SYMBOL.
Compound and Array
Type in HDF5 | Corresponding C Type | Corresponding DolphinDB Type |
---|---|---|
H5T_COMPOUND | struct | \ |
H5T_ARRAY | array | \ |
- Compound and array types can be parsed as long as they do not contain unsupported types. Nested data types can also be parsed.
- The conversion of complex types depends on their internal subtypes.
Table Structure
Simple Data Type Table Structure
Tables containing simple data types in HDF5 keep the same structure after being imported into DolphinDB.
Simple Table in HDF5
Row | 1 | 2 |
---|---|---|
1 | int(10) | int(67) |
2 | int(20) | int(76) |
Simple Table in DolphinDB
Row | col_1 | col_2 |
---|---|---|
1 | 10 | 67 |
2 | 20 | 76 |
Complex Data Type Table Structure
For tables containing complex data types in HDF5, their structure after being imported into DolphinDB depends on their specific data type.
Table of Compound Data in HDF5
Row | 1 | 2 |
---|---|---|
1 | struct | struct |
2 | struct | struct |
Table of Compound Data in DolphinDB
Row | a | b | c |
---|---|---|---|
1 | 1 | 2 | 3.7 |
2 | 11 | 21 | 31.7 |
3 | 12 | 22 | 32.7 |
4 | 13 | 23 | 33.7 |
Table of Arrays in HDF5
Row | 1 | 2 |
---|---|---|
1 | array(1,2,3) | array(4,5,6) |
2 | array(8,9,10) | array(15,16,17) |
Table of Arrays in DolphinDB
Row | array_1 | array_2 | array_3 |
---|---|---|---|
1 | 1 | 2 | 3 |
2 | 4 | 5 | 6 |
3 | 8 | 9 | 10 |
4 | 15 | 16 | 17 |
Table of Nested Data in HDF5
For tables containing nested data types in HDF5, the column names of the converted DolphinDB table carry a prefix: "A" marks a column that comes from an array and "C" marks a column that comes from a compound type.
Row | 1 | 2 |
---|---|---|
1 | struct{a:array(1,2,3) b:2 c:struct{d:"abc"}} | struct{a:array(7,8,9) b:5 c:struct{d:"def"}} |
2 | struct{a:array(11,21,31) b:0 c:struct{d:"opq"}} | struct{a:array(51,52,53) b:24 c:struct{d:"hjk"}} |
Table of Nested Data in DolphinDB
Row | Aa_1 | Aa_2 | Aa_3 | b | Cc_d |
---|---|---|---|---|---|
1 | 1 | 2 | 3 | 2 | abc |
2 | 7 | 8 | 9 | 5 | def |
3 | 11 | 21 | 31 | 0 | opq |
4 | 51 | 52 | 53 | 24 | hjk |
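The flattening shown in the tables above can be sketched in Python. This is a simplified model, not the plugin's code: the functions `flatten` and `flatten_grid` are hypothetical, HDF5 structs are modeled as dicts and arrays as lists, and the naming applied to levels deeper than those shown in the example is an assumption.

```python
def flatten(value, name=None):
    """Return (column_name, scalar) pairs for one HDF5 cell."""
    if isinstance(value, dict):  # compound type
        cols = []
        for field, sub in value.items():
            # Top-level fields keep their name; nested structs get a "C" prefix.
            child = field if name is None else "C%s_%s" % (name, field)
            cols.extend(flatten(sub, child))
        return cols
    if isinstance(value, (list, tuple)):  # array type
        # Top-level arrays are named array_N; nested arrays get an "A" prefix.
        prefix = "array" if name is None else "A%s" % name
        cols = []
        for i, item in enumerate(value, start=1):
            cols.extend(flatten(item, "%s_%d" % (prefix, i)))
        return cols
    return [(name, value)]  # scalar leaf


def flatten_grid(grid):
    """Turn a 2D grid of cells into one output row per cell, in row-major order."""
    return [dict(flatten(cell)) for row in grid for cell in row]


grid = [
    [{"a": [1, 2, 3], "b": 2, "c": {"d": "abc"}},
     {"a": [7, 8, 9], "b": 5, "c": {"d": "def"}}],
    [{"a": [11, 21, 31], "b": 0, "c": {"d": "opq"}},
     {"a": [51, 52, 53], "b": 24, "c": {"d": "hjk"}}],
]
for row in flatten_grid(grid):
    print(row)
# First row: {'Aa_1': 1, 'Aa_2': 2, 'Aa_3': 3, 'b': 2, 'Cc_d': 'abc'}
```

Under these assumptions the four printed rows match the "Table of Nested Data in DolphinDB" above, and a top-level array such as [1, 2, 3] yields the columns array_1, array_2, and array_3.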
Performance
Environment
- CPU: i7-7700 3.60 GHz
- SSD: sequential read speed of 460 to 500 MB/s
Dataset import performance
- Int
  - Row number: 1024 * 1024 * 16
  - Column number: 64
  - File size: 4 GB
  - Elapsed time: 8 s
- Unsigned int
  - Row number: 1024 * 1024 * 16
  - Column number: 64
  - File size: 4 GB
  - Elapsed time: 9 s
- Variable-length string
  - Row number: 1024 * 1024
  - Column number: 64
  - File size: 3.6 GB
  - Elapsed time: 17 s
- Compound
  - Row number: 1024 * 1024 * 62
  - Column number: 9 for subtypes (STR, STR, DOUBLE, INT, LONG, FLOAT, INT, SHORT, CHAR)
  - File size: 3.9 GB
  - Elapsed time: 10 s
- Compound array
  - Row number: 1024 * 128 * 62
  - Column number: 8 * 9 for subtypes (STR, STR, DOUBLE, INT, LONG, FLOAT, INT, SHORT, CHAR)
  - File size: 3.9 GB
  - Elapsed time: 15 s