Importing Text Files

DolphinDB provides the following 4 functions to import text files into memory or database:

  • loadText: Import a text file as an in-memory table.
  • ploadText: Import a text file in parallel as a partitioned in-memory tables. It is faster than function loadText.
  • loadTextEx: Import a text file into a database either on disk or in memory.
  • textChunkDS: Divide a text file into multiple data sources, then use function mr for fast and flexible data processing.

Importing text files in DolphinDB is very flexible and is also very fast. Compared with popular systems such as Clickhouse, MemSQL, Druid or Pandas, DolphinDB is faster up to an order of magnitude with single-threaded importing and is even faster with multi-threaded parallel importing.

1. Automatic Identification of Schema

In most other systems, users need to specify column names and data types when importing text files. For the convenience of users, DolphinDB can automatically identify the schema when importing text files.

If none of the columns in the first line of the file starts with a number, the system considers the first line to be the file header with column names. DolphinDB will take a small number of rows as a sample and automatically infer the data type of each column. As only a subset of data are used to infer data types, the data types of some columns may be incorrectly identified. For most text files, however, users don't need to manually specify column names and their data types when importing these files.

Note: The versions prior to 1.20.0 do not support importing UUID, IPADDR or INT128 types. To import these data types, please make sure to use version 1.20.0 or later.

The following example shows how to use function loadText to import a text file as an in-memory table. About the text file used in the examples, please refer to "Appendix".

dataFilePath = "/home/data/candle_201801.csv"
tmpTB = loadText(filename=dataFilePath)
select top 5 * from tmpTB;

symbol exchange cycle tradingDay date       time     open  high  low   close volume  turnover   unixTime
------ -------- ----- ---------- ---------- -------- ----- ----- ----- ----- ------- ---------- -------------
000001 SZSE     1     2018.01.02 2018.01.02 93100000 13.35 13.39 13.35 13.38 2003635 2.678558E7 1514856660000
000001 SZSE     1     2018.01.02 2018.01.02 93200000 13.37 13.38 13.33 13.33 867181  1.158757E7 1514856720000
000001 SZSE     1     2018.01.02 2018.01.02 93300000 13.32 13.35 13.32 13.35 903894  1.204971E7 1514856780000
000001 SZSE     1     2018.01.02 2018.01.02 93400000 13.35 13.38 13.35 13.35 1012000 1.352286E7 1514856840000
000001 SZSE     1     2018.01.02 2018.01.02 93500000 13.35 13.37 13.35 13.37 1601939 2.140652E7 1514856900000

Call function schema to view the table schema (column name, data type and other information):

tmpTB.schema().colDefs;

name       typeString typeInt comment
---------- ---------- ------- -------
symbol     SYMBOL     17
exchange   SYMBOL     17
cycle      INT        4
tradingDay DATE       6
date       DATE       6
time       INT        4
open       DOUBLE     16
high       DOUBLE     16
low        DOUBLE     16
close      DOUBLE     16
volume     INT        4
turnover   DOUBLE     16
unixTime   LONG       5

2. Specify Schema

All 4 text file loading functions covered in this tutorial support parameter 'schema', which specifies a table that may contain the following 4 columns:

Column nameMeaning
namea string indicating the column name
typea string indicating the data type of the column
formata string indicating the format of a date or time column
colan integer indicating the index of the column to be imported. The values in this column must be in ascending order.

The columns 'name' and 'type' are required and must be the first two columns. The columns 'format' and 'col' are optional.

For example, we can assign the following table to parameter 'schema':

nametype
timestampSECOND
IDINT
qtyINT
priceDOUBLE

2.1. Get Text File Schema

Function extractTextSchema is used to obtain the schema of a text file.

For example, to get the schema of the sample file in this tutorial:

dataFilePath = "/home/data/candle_201801.csv"
schemaTB = extractTextSchema(dataFilePath)
schemaTB;

name       type
---------- ------
symbol     SYMBOL
exchange   SYMBOL
cycle      INT
tradingDay DATE
date       DATE
time       INT
open       DOUBLE
high       DOUBLE
low        DOUBLE
close      DOUBLE
volume     INT
turnover   DOUBLE
unixTime   LONG

2.2. Specify Column Names and Types

If the columns names or data types automatically inferred by the system are not expected, we can specify the column names and data types by modifying the schema table generated by function extractTextSchema or creating the schema table directly.

For example, if the column 'volume' is automatically recognized as INT, and the required type is LONG, we need to specify the data type of column 'volume' as LONG in the schema table.

dataFilePath = "/home/data/candle_201801.csv"
schemaTB = extractTextSchema(dataFilePath)
update schemaTB set type="LONG" where name="volume";

Use function loadText to import a text file and import the data into the database according to the field data type specified by schemaTB.

tmpTB = loadText(filename = dataFilePath, schema = schemaTB);

We can use the same method to modify column names.

Please note that if automatically inferred date or time related data types do not meet expectations, please refer to "Specify the Format of Temporal Types".

2.3. Specify the Format of Temporal Types

For temporal values, if the automatically inferred data types do not meet expectations, we need to specify not only the data type, but also the format (represented by a string) in column 'format' of the schema table, such as "MM/dd/yyyy". For more details please refer to Parsing and Format of Temporal Variables.

Execute the following script in DolphinDB to generate the data file for this example.

dataFilePath = "/home/data/timeData.csv"
t = table(["20190623 14:54:57", "20190623 15:54:23", "20190623 16:30:25"] as time, `AAPL`MS`IBM as sym, 2200 5400 8670 as qty, 54.78 59.64 65.23 as price)
saveText(t, dataFilePath);

Before loading the file, use function extractTextSchema to get the schema of the data file:

schemaTB = extractTextSchema(dataFilePath)
schemaTB;

name  type
----- ------
time  SECOND
sym   SYMBOL
qty   INT
price DOUBLE

The automatically inferred data type of column 'time' is SECOND while the expected data type is DATETIME. If the file is loaded without specifying parameter 'schema', column 'time' will be empty. In order to load the file correctly, we need to specify the data type of column 'time' as DATETIME and the format of the column as "yyyyMMdd HH:mm:ss".

update schemaTB set type="DATETIME" where name="time"
schemaTB[`format]=["yyyyMMdd HH:mm:ss",,,]
tmpTB = loadText(dataFilePath,,schemaTB)
tmpTB;

time                sym  qty  price
------------------- ---- ---- -----
2019.06.23T14:54:57 AAPL 2200 54.78
2019.06.23T15:54:23 MS   5400 59.64
2019.06.23T16:30:25 IBM  8670 65.23

2.4. Import Selected Columns

We can import some columns instead of all columns from a text file by modifying the schema table.

In the following example, we only import 7 columns: symbol, date, open, high, close, volume and turnover from a text file.

First, get the text file schema:

dataFilePath = "/home/data/candle_201801.csv"
schemaTB = extractTextSchema(dataFilePath);

Use function rowNo to generate indices of the columns in the text file and assign them to column 'col' in the schema table. Modify the schema table so that only the rows indicating the 7 columns to be imported remain.

update schemaTB set col = rowNo(name)
schemaTB = select * from schemaTB where name in `symbol`date`open`high`close`volume`turnover;

Please note:

  1. Column indices start from 0. In the example above, the column index for the first column 'symbol' is 0.
  2. The order of the columns cannot be changed when loading text files. To adjust the order of the columns, use function reorderColumns! after loading the text file.

Finally, use function loadText and specify parameter 'schema' to import the selected columns from the text file.

tmpTB = loadText(filename=dataFilePath, schema=schemaTB);

Only the selected columns are imported:

select top 5 * from tmpTB

symbol date       open   high  close volume turnover
------ ---------- ------ ----- ----- ------ ----------
000001 2018.01.02 9.31E7 13.35 13.35 13     2.003635E6
000001 2018.01.02 9.32E7 13.37 13.33 13     867181
000001 2018.01.02 9.33E7 13.32 13.32 13     903894
000001 2018.01.02 9.34E7 13.35 13.35 13     1.012E6
000001 2018.01.02 9.35E7 13.35 13.35 13     1.601939E6

2.5. Skip Lines

We can skip a number of lines in the beginning of a text file (could be file description) by specifying parameter 'skipRows'. All 4 text file loading functions covered in this tutorial support parameter 'skipRows'. The maximum value allowed for parameter 'skipRows' is 1024.

dataFilePath = "/home/data/candle_201801.csv"
tmpTB = loadText(filename=dataFilePath)
select count(*) from tmpTB;

count
-----
5040

select top 5 * from tmpTB;

symbol exchange cycle tradingDay date       time     open  high  low   close volume  turnover   unixTime
------ -------- ----- ---------- ---------- -------- ----- ----- ----- ----- ------- ---------- -------------
000001 SZSE     1     2018.01.02 2018.01.02 93100000 13.35 13.39 13.35 13.38 2003635 2.678558E7 1514856660000
000001 SZSE     1     2018.01.02 2018.01.02 93200000 13.37 13.38 13.33 13.33 867181  1.158757E7 1514856720000
000001 SZSE     1     2018.01.02 2018.01.02 93300000 13.32 13.35 13.32 13.35 903894  1.204971E7 1514856780000
000001 SZSE     1     2018.01.02 2018.01.02 93400000 13.35 13.38 13.35 13.35 1012000 1.352286E7 1514856840000
000001 SZSE     1     2018.01.02 2018.01.02 93500000 13.35 13.37 13.35 13.37 1601939 2.140652E7 1514856900000

Now set skipRows=1000 in function loadText to skip the first 1000 lines of the text file when importing the file:

tmpTB = loadText(filename=dataFilePath, skipRows=1000)
select count(*) from tmpTB;

count
-----
4041

select top 5 * from tmpTB;

col0   col1 col2 col3       col4       col5      col6  col7  col8  col9  col10  col11      col12
------ ---- ---- ---------- ---------- --------- ----- ----- ----- ----- ------ ---------- -------------
000001 SZSE 1    2018.01.08 2018.01.08 101000000 13.13 13.14 13.12 13.14 646912 8.48962E6  1515377400000
000001 SZSE 1    2018.01.08 2018.01.08 101100000 13.13 13.14 13.13 13.14 453647 5.958462E6 1515377460000
000001 SZSE 1    2018.01.08 2018.01.08 101200000 13.13 13.14 13.12 13.13 700853 9.200605E6 1515377520000
000001 SZSE 1    2018.01.08 2018.01.08 101300000 13.13 13.14 13.12 13.12 738920 9.697166E6 1515377580000
000001 SZSE 1    2018.01.08 2018.01.08 101400000 13.13 13.14 13.12 13.13 469800 6.168286E6 1515377640000

Note: As shown in the example above, if we skip the first n rows and the first row is file header, file header will be skipped as the first of the n rows.

In the example above, as the file header is skipped in importing, the column names in the imported table are the default column names: col0, col1, col2, and so on. To retain the original column names and skip the first n rows, we can get the schema of the text file with function extractTextSchema, and then specify parameter 'schema' when importing:

schema=extractTextSchema(dataFilePath)
tmpTB=loadText(filename=dataFilePath,schema=schema,skipRows=1000)
select count(*) from tmpTB;

count
-----
4041

select top 5 * from tmpTB;

symbol exchange cycle tradingDay date       time      open  high  low   close volume turnover   unixTime
------ -------- ----- ---------- ---------- --------- ----- ----- ----- ----- ------ ---------- -------------
000001 SZSE     1     2018.01.08 2018.01.08 101000000 13.13 13.14 13.12 13.14 646912 8.48962E6  1515377400000
000001 SZSE     1     2018.01.08 2018.01.08 101100000 13.13 13.14 13.13 13.14 453647 5.958462E6 1515377460000
000001 SZSE     1     2018.01.08 2018.01.08 101200000 13.13 13.14 13.12 13.13 700853 9.200605E6 1515377520000
000001 SZSE     1     2018.01.08 2018.01.08 101300000 13.13 13.14 13.12 13.12 738920 9.697166E6 1515377580000
000001 SZSE     1     2018.01.08 2018.01.08 101400000 13.13 13.14 13.12 13.13 469800 6.168286E6 1515377640000

3. Import Data in Parallel

3.1. Import a Single File in Parallel

Function ploadText loads a text file into memory in a multi-threaded manner. The syntax of the syntax of ploadText is the same as loadText. The difference is that function ploadText loads a large file more quickly and generates a partitioned in-memory table. It makes full use of multi-core CPUs to load a file in parallel. The degree of parallelism depends on the number of CPU cores and the specification of parameter 'localExecutors'.

The following compares the performance of functions loadText and ploadText.

First generate a text file of about 4GB:

filePath="/home/data/testFile.csv"
appendRows=100000000
t=table(rand(100,appendRows) as int,take(string('A'..'Z'),appendRows) as symbol,take(2010.01.01..2018.12.30,appendRows) as date,rand(float(100),appendRows) as float,00:00:00.000 + rand(86400000,appendRows) as time)
t.saveText(filePath);

Load the file with loadText and ploadText respectively. A 6-core 12-thread CPU is used for this test.

timer loadText(filePath);
// Time elapsed: 12629.492 ms

timer ploadText(filePath);
// Time elapsed: 2669.702 ms

The results show that with the configuration of this test, the performance of ploadText is about 4.5 times as fast as that of loadText.

3.2. Import Multiple Files in Parallel

In big data applications, data import is often about the batch import of dozens or even hundreds of large files instead of importing one or two files. For optimal performance, we should import a large number of files in parallel.

Function loadTextEx imports a text file into a specified database, either on disk or in memory. As the partitioned tables in DolphinDB support concurrent reading and writing, they support multi-threaded data import.

When we use function loadTextEx to import a text file into a distributed database, the system imports data into memory first, and then writes the data to the database on disk. These two steps are completed by the same function to ensure optimal performance.

The following example shows how to batch write multiple files on disk to a DolphinDB partitioned database. First, execute the following script in DolphinDB to generate 100 files. These files have about 10 million records and have a total size of about 778MB.

n=100000
dataFilePath="/home/data/multi/multiImport_"+string(1..100)+".csv"
for (i in 0..99){
    trades=table(sort(take(100*i+1..100,n)) as id,rand(`IBM`MSFT`GM`C`FB`GOOG`V`F`XOM`AMZN`TSLA`PG`S,n) as sym,take(2000.01.01..2000.06.30,n) as date,10.0+rand(2.0,n) as price1,100.0+rand(20.0,n) as price2,1000.0+rand(200.0,n) as price3,10000.0+rand(2000.0,n) as price4,10000.0+rand(3000.0,n) as price5)
    trades.saveText(dataFilePath[i])
};

Create a partitioned database and a table in the database:

login(`admin,`123456)
dbPath="dfs://DolphinDBdatabase"
db=database(dbPath,VALUE,1..10000)
tb=db.createPartitionedTable(trades,`tb,`id);

Function cut can group elements in a vector. The following script calls function cut to group the file paths to be imported, and then calls function submitJob to assign write jobs to each thread.

def writeData(db,file){
   loop(loadTextEx{db,`tb,`id,},file)
}
parallelLevel=10
for(x in dataFilePath.cut(100/parallelLevel)){
    submitJob("loadData"+parallelLevel,"loadData",writeData{db,x})
};

Please note: In DolphinDB, multiple threads are not allowed to write to the same partition of a database at the same time. In the example above, each text file has a different value of the partitioning column (column 'id'), so we don't need to worry that multiple threads may write to the same partition at the same time. When designing concurrent reads and writes of partitioned tables in DolphinDB, please make sure that only one thread writes to the same partition at the same time.

Use function getRecentJobs to get the status of the last n batch jobs on the local node. The following script calculates the time consumed to import batch files in parallel. The result shows it took about 1.59 seconds with a 6-core 12-threaded CPU.

select max(endTime) - min(startTime) from getRecentJobs() where jobId like "loadData"+string(parallelLevel)+"%";

max_endTime_sub
---------------
1590

The following script shows it took 8.65 seconds to sequentially import 100 files into the database with a single thread.

timer writeData(db, dataFilePath);
// Time elapsed: 8647.645 ms

With the configuration of this test, the speed of importing 10 threads in parallel is about 5.5 times that of single thread import.

The number of records in the table:

select count(*) from loadTable("dfs://DolphinDBdatabase", `tb);

count
------
10000000

4. Data Preprocessing before Importing Text Files

Before importing text files, if we need to preprocess the data, such as changing data types or filling null values, we can specify parameter 'transform' when calling function loadTextEx. The parameter 'transform' accepts a function that accepts only one parameter. Both the input and the output of the function is an unpartitioned in-memory table. Only function loadTextEx supports parameter 'transform'.

4.1. Specify the Data Types of Temporal Columns

4.1.1. Convert Integers into Specified Temporal Types

A column representing time in a text file may be of type INT or LONG. We often need to save this type of data as a temporal type in the database. We can make convert the data type of columns while loading the text file with parameter 'transform' of function loadTextEx.

First, create a partitioned database and a table in the database.

login(`admin,`123456)
dataFilePath="/home/data/candle_201801.csv"
dbPath="dfs://DolphinDBdatabase"
db=database(dbPath,VALUE,2018.01.02..2018.01.30)
schemaTB=extractTextSchema(dataFilePath)
update schemaTB set type="TIME" where name="time"
tb=table(1:0,schemaTB.name,schemaTB.type)
tb=db.createPartitionedTable(tb,`tb1,`date);

The user-defined function i2t preprocesses the data.

def i2t(mutable t){
    return t.replaceColumn!(`time,time(t.time/10))
}

Please note: When processing data in a user-defined function, please try to use in-place modifications (functions that finish with !) if possible for optimal performance.

Call function loadTextEx and assign function i2t to parameter 'transform'. The system executes function i2t with the imported data, and then saves the result to the database.

tmpTB=loadTextEx(dbHandle=db, tableName=`tb1, partitionColumns=`date, filename=dataFilePath, transform=i2t);

Column 'time' is stored as data type TIME instead of INT in the text file:

select top 5 * from loadTable(dbPath,`tb1);

symbol exchange cycle tradingDay date       time               open  high  low   close volume  turnover   unixTime
------ -------- ----- ---------- ---------- ------------------ ----- ----- ----- ----- ------- ---------- -------------
000001 SZSE     1     2018.01.02 2018.01.02 02:35:10.000000000 13.35 13.39 13.35 13.38 2003635 2.678558E7 1514856660000
000001 SZSE     1     2018.01.02 2018.01.02 02:35:20.000000000 13.37 13.38 13.33 13.33 867181  1.158757E7 1514856720000
000001 SZSE     1     2018.01.02 2018.01.02 02:35:30.000000000 13.32 13.35 13.32 13.35 903894  1.204971E7 1514856780000
000001 SZSE     1     2018.01.02 2018.01.02 02:35:40.000000000 13.35 13.38 13.35 13.35 1012000 1.352286E7 1514856840000
000001 SZSE     1     2018.01.02 2018.01.02 02:35:50.000000000 13.35 13.37 13.35 13.37 1601939 2.140652E7 1514856900000

4.1.2. Data Type Conversion between Temporal Types

We may need to load a DATE type column in a text file as MONTH type to the database. We can use the same methodology as in the previous section.

login(`admin,`123456)
dbPath="dfs://DolphinDBdatabase"
db=database(dbPath,VALUE,2018.01.02..2018.01.30)
schemaTB=extractTextSchema(dataFilePath)
update schemaTB set type="MONTH" where name="tradingDay"
tb=table(1:0,schemaTB.name,schemaTB.type)
tb=db.createPartitionedTable(tb,`tb1,`date)
def d2m(mutable t){
    return t.replaceColumn!(`tradingDay,month(t.tradingDay))
}
tmpTB=loadTextEx(dbHandle=db,tableName=`tb1,partitionColumns=`date,filename=dataFilePath,transform=d2m);

Column 'tradingDay' is stored as MONTH in the database instead of as DATE in the text file:

select top 5 * from loadTable(dbPath,`tb1);

symbol exchange cycle tradingDay date       time     open  high  low   close volume  turnover   unixTime
------ -------- ----- ---------- ---------- -------- ----- ----- ----- ----- ------- ---------- -------------
000001 SZSE     1     2018.01M   2018.01.02 93100000 13.35 13.39 13.35 13.38 2003635 2.678558E7 1514856660000
000001 SZSE     1     2018.01M   2018.01.02 93200000 13.37 13.38 13.33 13.33 867181  1.158757E7 1514856720000
000001 SZSE     1     2018.01M   2018.01.02 93300000 13.32 13.35 13.32 13.35 903894  1.204971E7 1514856780000
000001 SZSE     1     2018.01M   2018.01.02 93400000 13.35 13.38 13.35 13.35 1012000 1.352286E7 1514856840000
000001 SZSE     1     2018.01M   2018.01.02 93500000 13.35 13.37 13.35 13.37 1601939 2.140652E7 1514856900000

4.2. Fill Null Values

As parameter 'transform' only takes a function with one parameter, to assign a DolphinDB built-in function with multiple parameters to 'transform', we can use Partial Application to convert the function into a function with only one parameter. The following example assigns function nullFill! to parameter 'transform' to fill the null values in the text file with 0.

db=database(dbPath,VALUE,2018.01.02..2018.01.30)
tb=db.createPartitionedTable(tb,`tb1,`date)
tmpTB=loadTextEx(dbHandle=db,tableName=`pt,partitionColumns=`date,filename=dataFilePath,transform=nullFill!{,0});

5. Custom Data Import with Map-Reduce

DolphinDB supports the use of Map-Reduce in custom data import. We can divide a text file into multiple parts and import all or some of the parts into DolphinDB with Map-Reduce.

We can use function textChunkDS to divide a text file into multiple data sources with each data source meaning a part of the text file, and then use function mr to write the data sources to the database. Function mr can also process data before saving data to database.

5.1. Save Stocks and Futures Data in Two Separate Tables

Execute the following script in DolphinDB to generate a text file that includes both stocks data and futures data. The size of the file is about 1GB.

n=10000000
dataFilePath="/home/data/chunkText.csv"
trades=table(rand(`stock`futures,n) as type, rand(`IBM`MSFT`GM`C`FB`GOOG`V`F`XOM`AMZN`TSLA`PG`S,n) as sym,take(2000.01.01..2000.06.30,n) as date,10.0+rand(2.0,n) as price1,100.0+rand(20.0,n) as price2,1000.0+rand(200.0,n) as price3,10000.0+rand(2000.0,n) as price4,10000.0+rand(3000.0,n) as price5,10000.0+rand(4000.0,n) as price6,rand(10,n) as qty1,rand(100,n) as qty2,rand(1000,n) as qty3,rand(10000,n) as qty4,rand(10000,n) as qty5,rand(10000,n) as qty6)
trades.saveText(dataFilePath);

Create a partitioned databases and tables to store stocks data and futures data, respectively:

login(`admin,`123456)
dbPath1="dfs://stocksDatabase"
dbPath2="dfs://futuresDatabase"
db1=database(dbPath1,VALUE,`IBM`MSFT`GM`C`FB`GOOG`V`F`XOM`AMZN`TSLA`PG`S)
db2=database(dbPath2,VALUE,2000.01.01..2000.06.30)
tb1=db1.createPartitionedTable(trades,`stock,`sym)
tb2=db2.createPartitionedTable(trades,`futures,`date);

Define a function to divide data into stocks data and futures data, and to save the data to corresponding databases.

def divideImport(tb, mutable stockTB, mutable futuresTB)
{
	tdata1=select * from tb where type="stock"
	tdata2=select * from tb where type="futures"
	append!(stockTB, tdata1)
	append!(futuresTB, tdata2)
}

Use function textChunkDS to divide the text file into multiple parts. In this example the size of each part is 300MB. The file is divided into 4 parts.

ds=textChunkDS(dataFilePath,300)
ds;

(DataSource<readTableFromFileSegment, DataSource<readTableFromFileSegment, DataSource<readTableFromFileSegment, DataSource<readTableFromFileSegment)

Call function mr and specify the result of function textChunkDS as the data source to import the file into the database. As the map function (specified by parameter 'mapFunc') only accepts a table as the input, here we use partial application to convert function divideImport into a function that only takes a table as the input.

mr(ds=ds, mapFunc=divideImport{,tb1,tb2}, parallel=false);

Please note that in this example, different data sources generated by function textChunkDS may contain data for the same partition. As DolphinDB does not allow multiple threads to write to the same partition at the same time, we need to set parallel=false in function mr, otherwise an exception will be thrown.

The stocks table contains only stocks data and the futures table contains only futures data.

stocks table:

select top 5 * from loadTable("dfs://DolphinDBTickDatabase", `stock);

type  sym  date       price1    price2     price3      price4       price5       price6       qty1 qty2 qty3 qty4 qty5 qty6
----- ---- ---------- --------- ---------- ----------- ------------ ------------ ------------ ---- ---- ---- ---- ---- ----
stock AMZN 2000.02.14 11.224234 112.26763  1160.926836 11661.418403 11902.403305 11636.093467 4    53   450  2072 9116 12
stock AMZN 2000.03.29 10.119057 111.132165 1031.171855 10655.048121 12682.656303 11182.317321 6    21   651  2078 7971 6207
stock AMZN 2000.06.16 11.61637  101.943971 1019.122963 10768.996906 11091.395164 11239.242307 0    91   857  3129 3829 811
stock AMZN 2000.02.20 11.69517  114.607763 1005.724332 10548.273754 12548.185724 12750.524002 1    39   270  4216 8607 6578
stock AMZN 2000.02.23 11.534805 106.040664 1085.913295 11461.783565 12496.932604 12995.461331 4    35   488  4042 6500 4826

futures table:

select top 5 * from loadTable("dfs://DolphinDBFuturesDatabase", `futures);

type    sym  date       price1    price2     price3      price4       price5       price6       qty1 qty2 qty3 qty4 qty5 ...
------- ---- ---------- --------- ---------- ----------- ------------ ------------ ------------ ---- ---- ---- ---- ---- ---
futures MSFT 2000.01.01 11.894442 106.494131 1000.600933 10927.639217 10648.298313 11680.875797 9    10   241  524  8325 ...
futures S    2000.01.01 10.13728  115.907379 1140.10161  11222.057315 10909.352983 13535.931446 3    69   461  4560 2583 ...
futures GM   2000.01.01 10.339581 112.602729 1097.198543 10938.208083 10761.688725 11121.888288 1    1    714  6701 9203 ...
futures IBM  2000.01.01 10.45422  112.229537 1087.366764 10356.28124  11829.206165 11724.680443 0    47   741  7794 5529 ...
futures TSLA 2000.01.01 11.901426 106.127109 1144.022732 10465.529256 12831.721586 10621.111858 4    43   136  9858 8487 ...

5.2. Quickly Load Data from the Beginning and the End of Large Files

We can use function textChunkDS to divide a large file into multiple small data sources (chunks), then load the first and the last one. Execute the following script in DolphinDB to generate the data file:

n=10000000
dataFilePath="/home/data/chunkText.csv"
trades=table(rand(`IBM`MSFT`GM`C`FB`GOOG`V`F`XOM`AMZN`TSLA`PG`S,n) as sym,sort(take(2000.01.01..2000.06.30,n)) as date,10.0+rand(2.0,n) as price1,100.0+rand(20.0,n) as price2,1000.0+rand(200.0,n) as price3,10000.0+rand(2000.0,n) as price4,10000.0+rand(3000.0,n) as price5,10000.0+rand(4000.0,n) as price6,rand(10,n) as qty1,rand(100,n) as qty2,rand(1000,n) as qty3,rand(10000,n) as qty4, rand(10000,n) as qty5, rand(1000,n) as qty6)
trades.saveText(dataFilePath);

Then use function textChunkDS to divide the text file into multiple parts. In this example the size of each part is 10MB.

ds=textChunkDS(dataFilePath, 10);

Call function mr to load the first chunk and the last chunk:

head_tail_tb = mr(ds=[ds.head(), ds.tail()], mapFunc=x->x, finalFunc=unionAll{,false});

View the number of records in table 'head_tail_tb':

select count(*) from head_tail_tb;

count
------
192262

6. Other Considerations

6.1. Encoding of Strings

As strings in DolphinDB are encoded in UTF-8, if a text file with string columns is not UTF-8 encoded, the string columns must be changed to UTF-8 encoding after importing. DolphinDB provides functions convertEncode, fromUTF8 and toUTF8 to convert encoding of strings.

For example, use function convertEncode to convert the encoding of column 'exchange' in the tmpTB table:

dataFilePath="/home/data/candle_201801.csv"
tmpTB=loadText(filename=dataFilePath, skipRows=0)
tmpTB.replaceColumn!(`exchange, convertEncode(tmpTB.exchange,"gbk","utf-8"));

6.2. Parsing Numeric Types

This section explains the automatic identification of numeric data types (CHAR, SHORT, INT, LONG, FLOAT and DOUBLE). The system can automatically recognize the following formats of numeric data:

-Numeric values, such as 123 -Numeric values with thousands separators, such as 100,000 -Numeric values with decimal point (floating point number), such as 1.231 -Scientific notation, such as 1.23E5

If a column is specified as a numeric type, the system ignores letters or other symbols before and after numeric values when importing. If there are no numbers in an entry, it will be loaded as a null value. Please see the following example for more details.

First, execute the following script to create a text file.

dataFilePath="/home/data/test.csv"
id=1..3
prices1=["2131","$2,131", "N/A"]
prices2=["213.1","$213.1", "N95C21t/2"]
price3 = [2.658E2, 2.658E3, 2.658E1]
total=["2.658E7","-2.658e7","2.658e-7"]
tt=table(id as id, prices1 as price1, prices2 as price2, price3 as price3, total as total)
saveText(tt,dataFilePath);

In the text file, both column 'price1' and column 'price2' contain numbers and characters. If we load the text file without specifying parameter 'schema', the system will identify these 2 columns as SYMBOL types:

tmpTB=loadText(dataFilePath)
tmpTB;

id price1 price2 total
-- ------ ------ --------
1  2131   213.1  2.658E7
2  $2,131 $213.1 -2.658E7
3  N/A    N/A    2.658E-7

tmpTB.schema().colDefs;

name   typeString typeInt comment
------ ---------- ------- -------
id     INT        4
price1 SYMBOL     17
price2 SYMBOL     17
total  DOUBLE     16

If column 'price1' is specified as INT type and column 'price2' is specified as DOUBLE type, the system will ignore characters and symbols before and after the number when importing. If there are no numbers, an entry is recognized as a null value.

schemaTB=table(`id`price1`price2`total as name, `INT`INT`DOUBLE`DOUBLE as type) 
tmpTB=loadText(dataFilePath,,schemaTB)
tmpTB;

id price1 price2 total
-- ------ ------ --------
1  2131   213.1  2.658E7
2  2131   213.1  -2.658E7
3                2.658E-7

6.3. Automatically Remove Double Quotation Marks

In CSV files, double quotation marks are sometimes used in numeric columns with special characters (such as delimiters). In DolphinDB, the system automatically strips double quotation marks in these columns.

In the text file of the following example, column 'num' contains numeric values with thousands separators.

dataFilePath="/home/data/testSym.csv"
tt=table(1..3 as id,  ["\"500\"","\"3,500\"","\"9,000,000\""] as num)
saveText(tt,dataFilePath);

We can see that the system automatically strips the double quotation marks when importing the text file.

tmpTB=loadText(dataFilePath)
tmpTB;

id num
-- -------
1  500
2  3500
3  9000000