loadText

Syntax

loadText(filename, [delimiter], [schema], [skipRows=0], [arrayDelimiter], [containHeader], [arrayMarker])

Arguments

filename is the input text file name with its absolute path. Currently only .csv files are supported.

delimiter (optional) is a STRING scalar indicating the table column separator. It can consist of one or more characters, with the default being a comma (',').

schema (optional) is a table. It can have the following columns, among which "name" and "type" columns are required.
Column Data Type Description
name STRING scalar column name
type STRING scalar data type
format STRING scalar the format of temporal columns
col INT scalar or vector the columns to be loaded
Note:

If "type" specifies a temporal data type, the format of the source data must match a DolphinDB temporal data type. If the format of the source data and the DolphinDB temporal data types are incompatible, you can specify the column type as STRING when loading the data and convert it to a DolphinDB temporal data type using the temporalParse function afterwards.

skipRows (optional) is an integer between 0 and 1024 indicating the rows in the beginning of the text file to be ignored. The default value is 0.

arrayDelimiter (optional) is a single character indicating the delimiter for columns holding the array vectors in the file. You must use the schema parameter to update the data type of the type column with the corresponding array vector data type before import.

containHeader (optional) is a Boolean value indicating whether the file contains a header row. The default value is NULL.

arrayMarker is a string containing 2 characters or a CHAR pair. These two characters represent the identifiers for the left and right boundaries of an array vector. The default identifiers are double quotes (").

  • It cannot contain spaces, tabs (\t), or newline characters (\t or \n).

  • It cannot contain digits or letters.

  • If one is a double quote ("), the other must also be a double quote.

  • If the identifier is ', ", or \, a backslash ( \ ) escape character should be used as appropriate. For example, arrayMarker="\"\"".

  • If delimiter specifies a single character, arrayMarker cannot contain the same character.

  • If delimiter specifies multiple characters, the left boundary of arrayMarker cannot be the same as the first character of delimiter.

Details

Load a text file into memory as a table. loadText loads data in single thread. To load data in multiple threads, use ploadText .

  • How a header row is determined:

    • When containHeader is NULL, the first row of the file is read in string format, and the column names are parsed from that data. Please note that the upper limit for the first row is 256 KB. If none of the columns in the first row of the file starts with a number, the first row is treated as the header with column names of the text file. If at least one of the columns in the first row of the file starts with a number, the system uses col0, col1, … as the column names;

    • When containHeader is true, the first row is determined as the header row, and the column names are parsed from that data;

    • When containHeader is false, the system uses col0, col1, … as the column names.

  • How the column types are determined:

    • When loading a text file, the system determines the data type of each column based on a random sample of rows. This convenient feature may not always accurately determine the data type of all columns. We recommend users check the data type of each column with the extractTextSchema function after loading.
    • When the input file contains dates and times:
      • For data with delimiters (date delimiters "-", "/" and ".", and time delimiter ":"), it will be converted to the corresponding type. For example, "12:34:56" is converted to the SECOND type; "23.04.10" is converted to the DATE type.
      • For data without delimiters, data in the format of "yyMMdd" that meets 0<=yy<=99, 0<=MM<=12, 1<=dd<=31, will be preferentially parsed as DATE; data in the format of "yyyyMMdd" that meets 1900<=yyyy<=2100, 0<=MM<=12, 1<=dd<=31 will be preferentially parsed as DATE.
    • If a column does not have the expected data type, then we need to enter the correct data type of the column in the schema table. Users can also specify data types for all columns. For a temporal column, if it does not have the expected data type, we also need to specify a format such as "MM/dd/yyyy" in the schema table. For details about temporal formats please refer to Parsing and Format of Temporal Variables.

To load a subset of columns, specify the column index in the "col" column of schema.

As string in DolphinDB is encoded in UTF-8, we require input text files be encoded in UTF-8.

Column names in DolphinDB must only contain letters, numbers or underscores and must start with a letter. If a column name in the text file does not meet the requirements, the system automatically adjusts it:
  • If the column name contains characters other than letters, numbers or underscores, these characters are converted into underscores.

  • If the column name does not start with a letter, add "c" to the column name so that it starts with "c".

A few examples:

Column name in data files Adjusted column name
1_test c1_test
test-a! test_a_
[test] c_test_
Note: Starting from version 1.30.22/2.00.10, loadText supports files containing multiple line breaks in a single record.

Examples

Use the following script to generate the data file to be used for the examples:

n=10
sym=rand(`AAPL`ORCL`MS`SUN,n)
permno=take(10001,n)
date=rand(2019.06.01..2019.06.10,n)
open=rand(100.0,n)
high=rand(200.0,n)
close=rand(200.0,n)
pre_close=rand(200.0,n)
change=rand(100.0,n)
vol=rand(10000,n)
amount=rand(100000.0,n)
t=table(sym,permno,date,open,high,close,pre_close,change,vol,amount)
saveText(t,"C:/DolphinDB/Data/stock.csv");

Example: Use loadText without specifying any optional parameters:

tt=loadText("C:/DolphinDB/Data/stock.csv");
tt;
sym permno date open high close pre_close change vol amount
MS 10001 2019.06.06 90.346594 80.530542 96.474428 146.305659 0.720236 1045 90494.568297
AAPL 10001 2019.06.07 91.165315 8.482074 85.514922 16.259077 76.797829 7646 91623.485996
AAPL 10001 2019.06.03 45.361885 14.077451 149.848419 89.110375 45.499145 9555 98171.601654
MS 10001 2019.06.04 8.98688 0.591778 155.54643 132.423187 69.95799 1202 3512.927634
MS 10001 2019.06.07 62.866173 33.465237 174.20712 102.695818 74.580523 3524 61943.64517
MS 10001 2019.06.09 32.819915 13.319577 136.729618 63.980405 60.66375 7078 85138.216568
MS 10001 2019.06.07 90.210866 22.728777 150.212291 59.454705 73.916303 5306 19883.845607
AAPL 10001 2019.06.06 83.752686 71.3501 98.211979 145.60098 94.428343 8852 9236.020781
ORCL 10001 2019.06.01 81.64719 129.702202 182.784373 117.575967 74.84595 2942 43394.871242
AAPL 10001 2019.06.02 10.068382 80.875383 181.674585 138.783821 25.298267 1088 82981.043775
schema(tt).colDefs;
name typeString typeInt comment
sym SYMBOL 17
permno INT 4
date DATE 6
open DOUBLE 16
high DOUBLE 16
close DOUBLE 16
pre_close DOUBLE 16
change DOUBLE 16
vol INT 4
amount DOUBLE 16

Example: Specify the data type of a column before loading the file.

We may want to change the data type of column "permno" to be SYMBOL. For this, we need to use function extractTextSchema to get the schema table, update it, then load the text file with the revised schema table.

schema=extractTextSchema("C:/DolphinDB/Data/stock.csv");
update schema set type=`SYMBOL where name=`permno;
tt=loadText("C:/DolphinDB/Data/stock.csv",,schema);
schema(tt).colDefs;
name typeString typeInt comment
sym SYMBOL 17
permno SYMBOL 17
date DATE 6
open DOUBLE 16
high DOUBLE 16
close DOUBLE 16
pre_close DOUBLE 16
change DOUBLE 16
vol INT 4
amount DOUBLE 16

You can also specify the data types of all columns:

schematable=table(`sym`permno`date`open`high`close`pre_close`change`vol`amount as name,`SYMBOL`SYMBOL`DATE`DOUBLE`DOUBLE`DOUBLE`DOUBLE`DOUBLE`INT`DOUBLE as type)
tt=loadText("C:/DolphinDB/Data/stock.csv",,schematable)
schema(tt).colDefs;
name typeString typeInt comment
sym SYMBOL 17
permno SYMBOL 17
date DATE 6
open DOUBLE 16
high DOUBLE 16
close DOUBLE 16
pre_close DOUBLE 16
change DOUBLE 16
vol INT 4
amount DOUBLE 16

Example: Load only a subset of columns.

For example, we may only need to load the following 7 columns: sym, date, open, high, close, vol, amount. Please note that we cannot change the order of columns when loading data. To change the order of columns in the loaded table, use function reorderColumns!.

schema=extractTextSchema("C:/DolphinDB/Data/stock.csv");
schema=select * from schema where name in `sym`date`open`high`close`vol`amount
schema[`col]=[0,2,3,4,5,8,9]

tt=loadText("C:/DolphinDB/Data/stock.csv",,schema);
tt;
sym date open high close vol amount
SUN 2019.06.10 18.675316 72.754005 136.463909 1376 31371.319038
AAPL 2019.06.05 42.098717 196.873587 41.513899 3632 9950.864129
ORCL 2019.06.05 62.223474 197.099027 123.785675 3069 38035.800937
SUN 2019.06.03 0.18163 50.669866 4.652098 6213 1842.198893
SUN 2019.06.06 32.54134 67.012502 130.312294 4891 55744.156823
SUN 2019.06.07 56.899091 81.709825 61.786176 1133 69057.849515
AAPL 2019.06.08 77.026838 38.504431 22.68496 3672 34420.187073
ORCL 2019.06.07 62.752656 39.33621 48.483091 4382 41601.601639
AAPL 2019.06.02 8.5487 17.623418 141.88325 8092 15449.159988
AAPL 2019.06.02 26.178685 197.320455 110.52407 5541 14616.820449

Example: Skip the first 2 rows when loading.

Please note that as the header is the first line of the text file, it is also skipped.

re=loadText(filename="C:/DolphinDB/Data/stock.csv",skipRows=2)
select count(*) from re;
count
9

Example: Specify the temporal format when loading the file.

Generate the text file to be used:

time=["20190623145457","20190623155423","20190623163025"]
sym=`AAPL`MS`IBM
qty=2200 5400 8670
price=54.78 59.64 65.23
t=table(time,sym,qty,price)
saveText(t,"C:/DolphinDB/Data/t2.csv");

Obtain the text schema with extractTextSchema before loading the file:

extractTextSchema("C:/DolphinDB/Data/t2.csv");
name type
time LONG
sym SYMBOL
qty INT
price DOUBLE

From the example above, if we load this text file without specifying the format of column "time", column "time" is empty as the system cannot parse the raw data correctly. For this scenario we must specify the format of the column.

schema=extractTextSchema("C:/DolphinDB/Data/t2.csv")
update schema set type = "DATETIME" where name = "time"
schema[`format]=["yyyyMMddHHmmss",,,];

loadText("C:/DolphinDB/Data/t2.csv",,schema);
time sym qty price
2019.06.23T14:54:57 AAPL 2200 54.78
2019.06.23T15:54:23 MS 5400 59.64
2019.06.23T16:30:25 IBM 8670 65.23

Example: Load array vectors

Create a .csv file
bid = array(DOUBLE[], 0, 20).append!([1.4799 1.479 1.4787, 1.4796 1.479 1.4784, 1.4791 1.479 1.4784])
ask = array(DOUBLE[], 0, 20).append!([1.4821 1.4825 1.4828, 1.4818 1.482 1.4821, 1.4814 1.4818 1.482])
TradeDate = 2022.01.01 + 1..3
SecurityID = rand(`APPL`AMZN`IBM, 3)
t = table(SecurityID as `sid, TradeDate as `date, bid as `bid, ask as `ask)
t;
saveText(t,filename="/home/t.csv",delimiter=',',append=true)
Load with loadText
path = "/home/t.csv"
schema=extractTextSchema(path);
update schema set type = "DOUBLE[]" where name="bid" or name ="ask"
t = loadText(path, schema=schema, arrayDelimiter=",")
t;
sid date bid ask
AMZN 2022.01.02 [1.4799,1.479,1.4787] [1.4821,1.4825,1.4828]
AMZN 2022.01.03 [1.4796,1.479,1.4784] [1.4818,1.482,1.4821]
IBM 2022.01.04 [1.4791,1.479,1.4784] [1.4814,1.4818,1.482]
When exporting data to a .csv file, the default delimiter for array vectors are double quotes ("). For example, a default exported t.csv would be:
sid,date,bid,ask
APPL,2022.01.02,"1.4799,1.479,1.4787","1.4821,1.4825,1.4828"
IBM,2022.01.03,"1.4796,1.479,1.4784","1.4818,1.482,1.4821"
APPL,2022.01.04,"1.4791,1.479,1.4784","1.4814,1.4818,1.482"
If the boundary identifier is not double quote for array vectors, you can load the file with arrayMarker specified. For example, if the t.csv is in the following format:
sid,date,bid,ask
APPL,2022.01.02,[1.4799,1.479,1.4787],[1.4821,1.4825,1.4828]
IBM,2022.01.03,[1.4796,1.479,1.4784],[1.4818,1.482,1.4821]
APPL,2022.01.04,[1.4791,1.479,1.4784],[1.4814,1.4818,1.482]

Load it with the following script:

path = "/home/DolphinDB/Data/t.csv"
schema=extractTextSchema(path);
update schema set type = "DOUBLE[]" where name="bid" or name ="ask"
t = loadText(path, schema=schema, arrayDelimiter=",",arrayMarker="[]")
t;