loadText
- swordfish.function.loadText()
Load a text file into memory as a table. loadText loads data in a single thread. To load data in multiple threads, use ploadText.
How a header row is determined:
When containHeader is null, the first row of the file is read as strings, and the column names are parsed from it. Note that the first row can be at most 256 KB. If none of the columns in the first row starts with a number, the first row is treated as the header and supplies the column names. If at least one of the columns in the first row starts with a number, the system uses col0, col1, … as the column names.
When containHeader is true, the first row is treated as the header row, and the column names are parsed from it.
When containHeader is false, the system uses col0, col1, … as the column names.
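The three cases above can be sketched in plain Python. This is only an illustration of the stated rules, not the library's implementation: the helper name `infer_column_names` is hypothetical, "starts with a number" is assumed to mean "starts with a digit", and quoted fields are not handled.

```python
def infer_column_names(first_line, delimiter=",", contain_header=None):
    """Return (column_names, first_row_is_header) for the first line of a file.

    Hypothetical helper that mirrors the containHeader rules described above.
    """
    fields = first_line.rstrip("\r\n").split(delimiter)
    generated = [f"col{i}" for i in range(len(fields))]
    if contain_header is None:
        # Assumption: "starts with a number" means the first character is a digit.
        contain_header = not any(field[:1].isdigit() for field in fields)
    names = fields if contain_header else generated
    return names, contain_header


print(infer_column_names("id,price,qty"))  # header detected
print(infer_column_names("1,AAPL,3.5"))    # falls back to col0, col1, col2
```

Passing `contain_header=True` or `contain_header=False` bypasses the digit check, which corresponds to setting containHeader explicitly.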
How the column types are determined:
When loading a text file, the system determines the data type of each column based on a random sample of rows. This sampling-based inference is convenient, but it may not determine the data type of every column correctly. We recommend that users check the data type of each column with the extractTextSchema function after loading.
When the input file contains dates and times:
Data with delimiters (date delimiters “-”, “/” and “.”, and time delimiter “:”) is converted to the corresponding type. For example, “12:34:56” is converted to the SECOND type; “23.04.10” is converted to the DATE type.
For data without delimiters, data in the format of “yyMMdd” that meets 0<=yy<=99, 0<=MM<=12, 1<=dd<=31, will be preferentially parsed as DATE; data in the format of “yyyyMMdd” that meets 1900<=yyyy<=2100, 0<=MM<=12, 1<=dd<=31 will be preferentially parsed as DATE.
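As a rough illustration of the range checks for data without delimiters, the following Python sketch tests whether a string falls within the stated “yyMMdd” or “yyyyMMdd” bounds. The function name is hypothetical, and the sketch reproduces only the numeric ranges quoted above; the actual type inference also depends on row sampling and on the other values in the column.

```python
def looks_like_undelimited_date(text):
    """Return True if text is within the yyMMdd or yyyyMMdd ranges quoted above.

    Hypothetical helper; it checks numeric ranges only, not calendar validity.
    """
    if not (text.isascii() and text.isdigit()):
        return False
    if len(text) == 6:
        yy, mm, dd = int(text[:2]), int(text[2:4]), int(text[4:])
        return 0 <= yy <= 99 and 0 <= mm <= 12 and 1 <= dd <= 31
    if len(text) == 8:
        yyyy, mm, dd = int(text[:4]), int(text[4:6]), int(text[6:])
        return 1900 <= yyyy <= 2100 and 0 <= mm <= 12 and 1 <= dd <= 31
    return False


print(looks_like_undelimited_date("230410"))    # True
print(looks_like_undelimited_date("18991231"))  # False: year below 1900
```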
If a column is not assigned the expected data type, specify the correct data type for that column in the schema table. Users can also specify data types for all columns. For a temporal column that is not assigned the expected data type, also specify a format such as “MM/dd/yyyy” in the schema table. For details about temporal formats, refer to Parsing and Format of Temporal Variables.
To load a subset of columns, specify the column index in the “col” column of schema.
Because strings in DolphinDB are encoded in UTF-8, input text files must also be encoded in UTF-8.
Column names in DolphinDB must only contain letters, numbers or underscores and must start with a letter. If a column name in the text file does not meet the requirements, the system automatically adjusts it:
If the column name contains characters other than letters, numbers or underscores, these characters are converted into underscores.
If the column name does not start with a letter, the system prepends “c” to the column name so that it starts with “c”.
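The two adjustment rules can be approximated in Python as follows. This is a sketch under assumptions, not the system's code: the helper name is hypothetical, "letters" is taken to mean ASCII letters, and the replacement is assumed to happen before the “c” prefix is added.

```python
import re


def adjust_column_name(name):
    """Approximate the automatic column-name adjustment described above."""
    # Rule 1: anything other than letters, digits or underscores becomes "_".
    adjusted = re.sub(r"[^A-Za-z0-9_]", "_", name)
    # Rule 2: a name that does not start with a letter gets a leading "c".
    if not adjusted[:1].isalpha():
        adjusted = "c" + adjusted
    return adjusted


print(adjust_column_name("bid price"))  # bid_price
print(adjust_column_name("1st"))        # c1st
```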
- Parameters:
filename (Constant) – The input text file name with its absolute path. Currently only .csv files are supported.
delimiter (Constant, optional) – A STRING scalar indicating the table column separator. It can consist of one or more characters, with the default being a comma (‘,’).
schema (Constant, optional) –
A table. It can have the following columns, among which “name” and “type” columns are required.
| Column | Data Type | Description |
| --- | --- | --- |
| name | STRING scalar | column name |
| type | STRING scalar | data type |
| format | STRING scalar | the format of temporal columns |
| col | INT scalar or vector | the columns to be loaded |
Note
If “type” specifies a temporal data type, the format of the source data must match a DolphinDB temporal data type. If the format of the source data and the DolphinDB temporal data types are incompatible, you can specify the column type as STRING when loading the data and convert it to a DolphinDB temporal data type using the temporalParse function afterwards.
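To make the layout of the schema table concrete, the sketch below describes three columns as plain Python records and checks that the required “name” and “type” fields are present. Everything here is illustrative: the record layout, the helper `check_schema_rows` and the sample column names are hypothetical, and converting such records into an actual table for the schema parameter is left out because it depends on the swordfish API.

```python
REQUIRED = ("name", "type")
OPTIONAL = ("format", "col")


def check_schema_rows(rows):
    """Raise ValueError if a row lacks a required field or has an unknown one."""
    for i, row in enumerate(rows):
        missing = [key for key in REQUIRED if not row.get(key)]
        unknown = [key for key in row if key not in REQUIRED + OPTIONAL]
        if missing or unknown:
            raise ValueError(f"row {i}: missing={missing}, unknown={unknown}")
    return True


# Hypothetical example: load file columns 0, 1 and 3 only, and give the
# temporal column an explicit format.
schema_rows = [
    {"name": "tradeDate", "type": "DATE", "format": "MM/dd/yyyy", "col": 0},
    {"name": "symbol", "type": "SYMBOL", "col": 1},
    {"name": "price", "type": "DOUBLE", "col": 3},
]
check_schema_rows(schema_rows)
```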
skipRows (Constant, optional) – An integer between 0 and 1024 indicating the number of rows at the beginning of the text file to be ignored. The default value is 0.
arrayDelimiter (Constant, optional) – A single character indicating the delimiter for columns holding array vectors in the file. Before import, use the schema parameter to set the “type” of each such column to the corresponding array vector data type.
containHeader (Constant, optional) – A Boolean value indicating whether the file contains a header row. The default value is null.
arrayMarker (Constant, optional) –
A string containing 2 characters or a CHAR pair. These two characters represent the identifiers for the left and right boundaries of an array vector. The default identifiers are double quotes (").
It cannot contain spaces, tabs (\t), or newline characters (\r or \n).
It cannot contain digits or letters.
If one is a double quote ("), the other must also be a double quote.
If the identifier is ', ", or \, a backslash (\) escape character should be used as appropriate. For example, arrayMarker="\"\"".
If delimiter specifies a single character, arrayMarker cannot contain the same character.
If delimiter specifies multiple characters, the left boundary of arrayMarker cannot be the same as the first character of delimiter.
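The arrayMarker constraints can be checked before calling loadText with a small validator like the one below. It is a sketch of the rules as listed, not part of the library: the function name is hypothetical, and treating \r as a newline character is an assumption.

```python
def check_array_marker(marker, delimiter=","):
    """Return a list of violated arrayMarker rules (empty if none are violated)."""
    if len(marker) != 2:
        return ["must contain exactly 2 characters"]
    problems = []
    left, right = marker
    if any(ch in " \t\r\n" for ch in marker):
        problems.append("cannot contain spaces, tabs or newline characters")
    if any(ch.isalnum() for ch in marker):
        problems.append("cannot contain digits or letters")
    if '"' in marker and left != right:
        problems.append("a double quote must be paired with a double quote")
    if len(delimiter) == 1 and delimiter in marker:
        problems.append("cannot contain the single-character delimiter")
    if len(delimiter) > 1 and left == delimiter[0]:
        problems.append("left boundary cannot equal the delimiter's first character")
    return problems


print(check_array_marker('""'))                  # []
print(check_array_marker("[]", delimiter="[|"))  # one violation reported
```

Returning a list rather than raising on the first failure lets a caller report every violated rule at once.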