Binary File Processing
DolphinDB provides a wide range of functions for binary file processing, from reading and writing raw bytes to reading and writing high-level objects.
Read and Write Raw Bytes
The writeBytes function writes the entire buffer to the file. The buffer must be a CHAR scalar or CHAR vector. If the operation succeeds, the function returns the number of bytes actually written; otherwise, an IOException is raised. The readBytes function reads a given number of bytes from the file. If the end of the file is reached or an IO error occurs, an IOException is raised; otherwise, a buffer containing the requested number of bytes is returned. Therefore, one must know the exact number of bytes to read before calling readBytes.
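// define a file copy function
def fileCopy(source, target){
    s = file(source)
    // seek to the end to get the file length, then rewind to the start
    len = s.seek(0,TAIL)
    s.seek(0,HEAD)
    t = file(target,"w")
    if(len==0) return
    do{
        // readBytes needs an exact count, so never request more than remains
        buf = s.readBytes(min(len,1024))
        t.writeBytes(buf)
        len -= buf.size()
    }while(len)
}
fileCopy("test.txt","testcopy.txt");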
The readBytes function always returns a new CHAR vector. As discussed in the section on text file processing, creating a new vector buffer takes time. To improve performance, we can create a buffer once and reuse it. read! is a function that accepts an existing buffer. Another advantage of read! is that one doesn't have to know the exact number of bytes to read. The function returns when the end of the file is reached or the given number of bytes has been read. If the returned count is less than the requested count, the file has reached the end.
// define a file copy function using the read! and write functions
def fileCopy2(source, target){
    s = file(source)
    t = file(target,"w")
    buf = array(CHAR,1024)
    do{
        // numByte is the number of bytes actually read into buf
        numByte = s.read!(buf,0,1024)
        t.write(buf,0, numByte)
    }while(numByte==1024)
}
fileCopy2("test.txt","testcopy.txt");
The performance of the file copy function is dominated by the write part. To compare the performance of readBytes with read!, we design another experiment below that only reads the file.
fileLen = file("test.txt").seek(0, TAIL)
timer(1000){
    fin = file("test.txt")
    len = fileLen
    do{
        buf = fin.readBytes(min(len,1024))
        len -= buf.size()
    }while(len)
};
// Time elapsed: 210.593 ms
timer(1000){
    fin = file("test.txt")
    buf = array(CHAR,1024)
    do{numBytes = fin.read!(buf,0,1024)}while(numBytes==1024)
};
// Time elapsed: 194.519 ms
In this test, read! is faster than readBytes: 194.519 ms versus 210.593 ms over 1000 runs, or roughly 8% less time.
The readRecord! function converts a binary file to a DolphinDB object. The binary file is read row by row, and each row must contain fields with fixed data types and lengths. For example, if a binary file contains 6 data fields with the following types (length in bytes): char(1), boolean(1), short(2), int(4), long(8), and double(8), readRecord! treats every 24 bytes as a new row.
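The 24-byte row width can be checked with a small round trip. The sketch below is only an illustration: it assumes that writeRecord, which this section does not cover, is available as the counterpart of readRecord! and that it writes rows without padding. The file name records.bin and the column names f1 to f6 are hypothetical.

```
// Hypothetical two-row table with the six field types listed above
t = table(char(1 2) as f1, bool(1 0) as f2, short(3 4) as f3, 5 6 as f4, long(7 8) as f5, 1.5 2.5 as f6)

// Assumption: writeRecord stores the table as fixed-width binary rows
fout = file("records.bin", "w")
fout.writeRecord(t)
fout.close()

// If rows are packed without padding, the size is 2 * (1+1+2+4+8+8) = 48 bytes
fin = file("records.bin")
print(fin.seek(0, TAIL))
fin.seek(0, HEAD)

// Read the rows back into an empty table with the same schema
tb = table(10:0, `f1`f2`f3`f4`f5`f6, [CHAR, BOOL, SHORT, INT, LONG, DOUBLE])
fin.readRecord!(tb)
fin.close()
tb
```

If the printed size differs from 48, the no-padding assumption does not hold for that setup, and the schema passed to readRecord! would need to account for it.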
The following example shows how to import the binary file binSample.bin with readRecord!.
Create an in-memory table
tb=table(1000:0, `id`date`time`last`volume`value`ask1`ask_size1`bid1`bid_size1, [INT,INT,INT,FLOAT,INT,FLOAT,FLOAT,INT,FLOAT,INT])
Open the file with the file function, then import it with readRecord!. The data is loaded into table tb.
dataFilePath="/home/DolphinDB/binSample.bin"
f=file(dataFilePath)
f.readRecord!(tb);
select top 5 * from tb;
id | date | time | last | volume | value | ask1 | ask_size1 | bid1 | bid_size1 |
---|---|---|---|---|---|---|---|---|---|
1 | 20190902 | 91804000 | 0 | 0 | 0 | 11.45 | 200 | 11.45 | 200 |
2 | 20190902 | 92007000 | 0 | 0 | 0 | 11.45 | 200 | 11.45 | 200 |
3 | 20190902 | 92046000 | 0 | 0 | 0 | 11.45 | 1200 | 11.45 | 1200 |
4 | 20190902 | 92346000 | 0 | 0 | 0 | 11.45 | 1200 | 11.45 | 1200 |
5 | 20190902 | 92349000 | 0 | 0 | 0 | 11.45 | 5100 | 11.45 | 5100 |
readRecord! does not support the STRING type, so the date and time columns are loaded as INT. To turn them into temporal types, convert the integers to strings, parse the strings with temporalParse, and replace the original columns with replaceColumn!.
tb.replaceColumn!(`date, tb.date.string().temporalParse("yyyyMMdd"))
tb.replaceColumn!(`time, tb.time.format("000000000").temporalParse("HHmmssSSS"))
select top 5 * from tb;
id | date | time | last | volume | value | ask1 | ask_size1 | bid1 | bid_size1 |
---|---|---|---|---|---|---|---|---|---|
1 | 2019.09.02 | 09:18:04.000 | 0 | 0 | 0 | 11.45 | 200 | 11.45 | 200 |
2 | 2019.09.02 | 09:20:07.000 | 0 | 0 | 0 | 11.45 | 200 | 11.45 | 200 |
3 | 2019.09.02 | 09:20:46.000 | 0 | 0 | 0 | 11.45 | 1200 | 11.45 | 1200 |
4 | 2019.09.02 | 09:23:46.000 | 0 | 0 | 0 | 11.45 | 1200 | 11.45 | 1200 |
5 | 2019.09.02 | 09:23:49.000 | 0 | 0 | 0 | 11.45 | 5100 | 11.45 | 5100 |
Read and Write Multi-byte Integers and Floating-Point Numbers
The write function converts the specified buffer to a stream of bytes and saves it to the file. The buffer can be a scalar or a vector of various types. If an error occurs, an IOException is raised; otherwise, the function returns the number of elements (not the number of bytes) written. The read! function reads a given number of elements into the buffer. For example, if the buffer is an INT vector, the function converts the bytes from the file to INT values. Both write and read! convert between streams of bytes and multi-byte words, and the byte order used in that conversion is called endianness. Big-endian order stores the most significant byte at the lowest address, whereas little-endian order stores the least significant byte at the lowest address. The write function always uses the endianness of the operating system. The read! function can convert the endianness if the endianness of the file differs from that of the operating system. The file function takes an optional boolean argument that indicates whether the file uses the little-endian format. By default, the endianness of the operating system is assumed.
x=10h
y=0h
file("C:/DolphinDB/test.bin","w").write(x);
// output: 1
file("C:/DolphinDB/test.bin","r",true).read!(y);
// assume the file format is little endianness
// output: 1
y;
// output: 10
file("C:/DolphinDB/test.bin","r",false).read!(y);
// assume the file format is big endianness
// output: 1
y;
// output: 2560
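One way to see where 2560 comes from is to look at the two bytes that write stored. The sketch below reuses the functions shown on this page and assumes a little-endian machine, as the readouts above suggest. The conversion with int is added here only to display the byte values as numbers.

```
// Write the SHORT value 10 (0x000A) using the machine's byte order
fout = file("C:/DolphinDB/test.bin", "w")
fout.write(10h)
fout.close()

// Read the same file back as two raw bytes instead of one SHORT
fin = file("C:/DolphinDB/test.bin")
bytes = fin.readBytes(2)
fin.close()

// On a little-endian machine the low byte comes first: 10, then 0
print(int(bytes))

// Read as big-endian, the first byte becomes the high byte: 10 * 256 + 0
print(10 * 256 + 0)
// 2560
```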
In the experiment above, we write a short integer (2 bytes) with the value 10 to a file and read it back into another short integer variable y under both byte orders, little-endian and big-endian. As expected, the two readouts are 10 and 2560, respectively. If all file operations are performed on the same machine, one doesn't have to worry about endianness. In a distributed system, however, one must pay attention to the endianness of network streams and files.
The SHORT example uses a scalar as the buffer for read and write. The next example takes an INT vector as the buffer. It generates one million random integers between 0 and 10000, saves them to a file, then reads them back using a small buffer and calculates the sum.
n=1000000
x=rand(10000,n)
file("test.bin","w").write(x,0,n)
sum=0
buf=array(INT,1024)
fin=file("test.bin")
do{
    len = fin.read!(buf,0, 1024)
    if(len==1024)
        sum +=buf.sum()
    else
        sum += buf.subarray(0:len).sum()
}while(len == 1024)
fin.close()
sum;
// output: 4994363593
In addition to numbers, strings can also be saved to files in binary format. A null character (a byte with value zero) is appended to each string as its delimiter, so a string of n bytes takes n+1 bytes in the file. The example below demonstrates the use of write and read! with strings. We first generate one million random stock tickers and save them to a file in binary format. Then we use a small buffer to read the entire file sequentially. After each read, we use the dictUpdate! function to count the occurrences of each ticker.
file("test.bin","w").write(rand(`IBM`MSFT`GOOG`YHOO`C`FORD`MS`GS`BIDU,1000000));
// output: 1000000
words=dict(STRING,LONG)
buf=array(STRING,1024)
counts=array(LONG,1024,0,1)
fin=file("test.bin")
do{
    len = fin.read!(buf,0,1024)
    if(len==1024)
        dictUpdate!(words, +, buf, counts)
    else
        dictUpdate!(words, +, buf.subarray(0:len), counts.subarray(0:len))
}while(len==1024)
fin.close();
words;
/* output
MSFT->111294
BIDU->110800
FORD->110916
GS->111233
MS->110859
C->110591
YHOO->111069
GOOG->111972
IBM->111266
*/
words.values().sum();
// output: 1000000
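The n+1 rule described above can be checked on a tiny file. This is a minimal sketch that uses only functions already shown on this page; the file name tickers.bin is hypothetical.

```
// Two tickers: "IBM" (3 bytes) and "MSFT" (4 bytes)
fout = file("tickers.bin", "w")
fout.write(`IBM`MSFT)
fout.close()

// Seeking to the end returns the file size in bytes
fin = file("tickers.bin")
size = fin.seek(0, TAIL)
fin.close()

// Under the n+1 rule the expected size is (3 + 1) + (4 + 1) = 9
print(size)
```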
Read and Write Object
The read! and write functions offer a lot of flexibility for reading and writing binary data. However, one has to know the exact number of elements to write and read as well as their data types. Therefore, when dealing with complex data structures such as a matrix, table, or tuple, one has to design a complicated protocol to coordinate the write and the read. DolphinDB offers two high-level functions, readObject and writeObject, for reading and writing whole objects. All data structures, including scalar, vector, matrix, set, dictionary, and table, can be used with these two functions.
a1=10.5
a2=1..10
a3=cross(*,1..5,1..10)
a4=set(`IBM`MSFT`GOOG`YHOO)
a5=dict(a4.keys(),125.6 53.2 702.3 39.7)
a6=table(1 2 3 as id, `Jenny`Tom`Jack as name)
a7=(1 2 3, "hello world!", 25.6)
fout=file("test.bin","w")
fout.writeObject(a1)
fout.writeObject(a2)
fout.writeObject(a3)
fout.writeObject(a4)
fout.writeObject(a5)
fout.writeObject(a6)
fout.writeObject(a7)
fout.close();
The script above writes 7 different types of objects to a file. The script below reads out those seven objects from the file and prints out a short description of the objects.
fin = file("test.bin")
for(i in 0:7) print typestr fin.readObject()
fin.close();
/* output
DOUBLE
FAST INT VECTOR
INT MATRIX
STRING SET
STRING->DOUBLE Dictionary
TABLE
ANY VECTOR
*/