Efficient Storage and Querying of News Data Using TextDB
As information volumes continue to explode, efficiently managing and searching news data has become a key challenge for data analysts. DolphinDB, a powerful data processing platform, provides rich tools for reading, storing, and retrieving various types of news files. This tutorial walks you through handling news data with special formats in DolphinDB and introduces several storage schemes under its multi-model storage architecture. We also compare the performance and accuracy of different storage methods in text retrieval, demonstrating the unique advantages and optimal use cases of TextDB for storing and querying news data.
DolphinDB supports the TextDB engine starting from version 3.00.2. All examples in this tutorial require DolphinDB server version 3.00.2 or later.
1. News Data Processing
News data typically refers to unstructured or semi-structured news primarily composed of text, such as market-related news, announcements, research reports, and social media content. These data may affect asset prices, market sentiment, or investment decisions. In DolphinDB, news data usually needs to be processed into a specific format and stored in distributed tables for further use.
1.1 News File Format
We use a JSONL-formatted news file as an example to demonstrate how to process, store, and utilize news data. In a JSONL file, each line is a JSON object representing an individual news record. A sample line in our tutorial:
{
"msgType": 3,
"dsCode": "116984908",
"scrapingTime": 1696089600000,
"delFlag": 0,
"rid": "c2d00b6723d54950ac06cb7b414a49b6",
"title": "Biomedicine...",
"_ext": {"index": "wechat_orginal_warrenq", "type": "wechat_orginal_warrenq"},
"sid": "897594237461088",
"newsUrl": "http://mp.weixin.qq.com/s?__biz=476E295a2edd",
"host": "mp.weixin.qq.com",
"id": "897594237461088",
"pushTime": "09:10:53",
"sector": "http://mp.weixin.qq.com",
"_hitScore": "",
"compared": 1,
"industryName": "Biomedicine",
"nature": "",
"introduce": "In-depth coverage of biomedical research and pharmaceutical innovations",
"updateUser": 1,
"updateTime": "2023-10-01 09:10:53",
"title_disp": "Biomedicine...",
"contentTxt_disp": "## Panasonic Energy...",
"repeatFlag": 0,
"msgSourceUrl": "http://mp.weixin.qq.com",
"collectType": 1,
"newsId": "1801450915567536",
"contentTxt": "## Panasonic Energy...",
"createTime": "2023-10-01 09:10:53",
"createUser": 1,
"pushDate": "2023-10-01",
"_id": "c2d00b6723d54950ac06cb7b414a49b6",
"mediaSource": "Pharma Trends",
"syncStatus": "init"
}
1.2 Parsing News Files
In DolphinDB, you can use the readLines function to read a JSONL file in batches, ultimately obtaining a string vector containing the complete file content:
// Read a JSONL file into a string vector; the parameter is the file path
filePath = "./example.jsonl"
def readJsonl(filePath){
    f = file(filePath)
    jsonl = array(STRING, 0, 100)
    do {
        // readLines reads a batch of lines (1024 by default);
        // a shorter batch signals the end of the file
        x = f.readLines()
        jsonl.append!(x)
    } while (x.size() == 1024)
    f.close()
    return jsonl
}
This function returns a vector in which each element is one JSON string from the file.
For JSON strings, you can use the parseExpr function to convert them into metacode and eval to obtain dictionaries; applying each for vectorized processing converts the JSON string vector into a dictionary vector:
// Parse a vector of JSON strings into a vector of dictionaries
def parseJsonl(jsonl){
    dictionary = each(eval, parseExpr(jsonl))
    return dictionary
}
For the dictionary vector returned by the above function, first filter the fields that need to be stored, then use the transpose function to convert each dictionary into a table, merge the tables with unionAll, and finally convert certain fields to the required data types to obtain the final result:
// Extract the fields to be stored; parameters are the fields to keep and the original dictionary
def getColumns(cols, d){
    newD = dict(STRING, ANY)
    newD[cols] = d[cols]
    return newD
}
// Convert the filtered dictionaries into tables one by one, merge them,
// then convert the specified fields to the required data types to get the final result.
// Parameters: fields to store, the dictionary vector, and a data type conversion dictionary
def transDict(columns, dictionary, transCols){
    // Note: avoid naming this variable `table`, which would shadow the built-in function
    result = unionAll(each(transpose, eachRight(getColumns, columns, dictionary)), false)
    for (col in transCols.keys()){
        a = <exec temporalParse(_$col, transCols[col]) as m from result>.eval()
        replaceColumn!(result, col, a)
    }
    return result
}
In this tutorial, we need to store the fields msgType, title, newsUrl, host, industryName, contentTxt_disp, createTime, pushDate, and mediaSource. The createTime and pushDate fields need to be converted to the corresponding time types, so the input parameters columns and transCols are as follows:
columns = `msgType`title`newsUrl`host`industryName`contentTxt_disp`createTime`pushDate`mediaSource
cols = `pushDate`createTime
type = ["yyyy-MM-dd","yyyy-MM-dd HH:mm:ss"]
transCols = dict(cols, type)
Through the above steps, the JSONL file is converted into an in-memory table, ready for subsequent writing to distributed tables.
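To recap, the complete parsing pipeline can be run end to end; a minimal sketch reusing the functions and parameters defined above:
// Parse the JSONL file into an in-memory table t
jsonl = readJsonl(filePath)                 // vector of JSON strings
dicts = parseJsonl(jsonl)                   // vector of dictionaries
t = transDict(columns, dicts, transCols)    // in-memory table with converted types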
The complete code for this section is provided in the Appendix.
2. News Data Storage
Based on DolphinDB’s multi-model storage architecture, users can flexibly choose the appropriate storage engine according to their specific scenarios to optimize performance. The TSDB engine supports storing text data but does not support text indexing, making it suitable for scenarios where text is primarily stored but queried infrequently. In contrast, the PKEY engine supports building inverted indexes on text columns, significantly improving performance for full-text search operations. This meets the needs of modern applications for efficient retrieval and fast response over massive text data.
2.1 Storage with TSDB Engine
News data can be treated as a special type of time series data and stored using the TSDB engine, which is designed for efficient storage and analysis of time series data.
Before creating databases and tables, you can estimate the daily data volume using SQL and the strlen function to determine the partitioning scheme:
// Get the names of all STRING columns in the parsed table t
stringCols = exec name from t.schema().colDefs where typeString == "STRING"
// Daily string data volume in MB: total string length / number of distinct days / 1024 / 1024
len = select sum(rowLen)\nunique(pushDate)\1024\1024 from <select pushDate, sum(_$$stringCols) as rowLen from strlen(t)>.eval()
According to the estimation results, the daily data volume used in this tutorial is approximately 7 MB. Therefore, we use createTime as the partitioning column and apply monthly value partitioning (roughly 7 MB × 30 ≈ 210 MB per partition), keeping each partition within the recommended size range of 100 MB to 1 GB.
The TSDB engine allows frequently queried fields to be configured as index columns in sortColumns to enhance query performance. In this example, mediaSource and title are commonly used query fields, so they are included as index columns, with createTime serving as the time column in sortColumns.
If deduplication based on sortColumns is required, you can set the keepDuplicates parameter to FIRST or LAST in the table creation statement (see the sketch after the creation statement below). For more details, refer to createPartitionedTable.
Based on the above design, the database and table creation statements are as follows:
// Create database and table (TSDB)
if (existsDatabase("dfs://news_TSDB")){
    dropDatabase("dfs://news_TSDB")
}
db = database(directory="dfs://news_TSDB", partitionType=VALUE, partitionScheme=2020.01M..2020.02M, engine='TSDB')
data = createPartitionedTable(
    dbHandle = database("dfs://news_TSDB"),
    table = table(1:0, `pushDate`createTime`mediaSource`contentTxt_disp`industryName`host`newsUrl`title`msgType, [DATE,DATETIME,STRING,STRING,STRING,STRING,STRING,STRING,INT]),
    tableName = `data,
    partitionColumns = `createTime,
    sortColumns = [`mediaSource,`title,`createTime],
    sortKeyMappingFunction = [hashBucket{,25}, hashBucket{,25}])
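As noted above, deduplication on sortColumns can be enabled by adding the keepDuplicates parameter to the same statement. A minimal sketch with a hypothetical table name dataDedup, keeping only the latest record per sort key:
// Hypothetical variant: keep only the most recent record for each
// (mediaSource, title, createTime) combination
dataDedup = createPartitionedTable(
    dbHandle = database("dfs://news_TSDB"),
    table = table(1:0, `pushDate`createTime`mediaSource`contentTxt_disp`industryName`host`newsUrl`title`msgType, [DATE,DATETIME,STRING,STRING,STRING,STRING,STRING,STRING,INT]),
    tableName = `dataDedup,
    partitionColumns = `createTime,
    sortColumns = [`mediaSource,`title,`createTime],
    keepDuplicates = LAST,
    sortKeyMappingFunction = [hashBucket{,25}, hashBucket{,25}])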
2.2 Storage with PKEY Engine
The PKEY engine enforces primary key uniqueness and supports real-time updates and efficient queries, making it well suited for storing news data.
Since the recommended partition size for the PKEY engine is the same as for the TSDB engine, we keep the same partitioning strategy: createTime as the partitioning column with monthly value partitioning.
The PKEY engine allows you to define primary key columns. When new records share the same primary key, they overwrite existing records, thereby achieving deduplication. In this example, the combination of createTime, mediaSource, and newsUrl uniquely identifies a record, so these fields are specified as the primaryKey.
When using the PKEY engine to store news data, you can leverage TextDB, introduced in version 3.00.2. TextDB enables the creation of inverted indexes on text columns, significantly improving full-text search performance. This is especially beneficial for modern applications that require efficient querying of massive volumes of text data.
For the PKEY engine, you can configure text indexes on non-primary-key columns via the indexes parameter. This enables index-based text search on those fields and improves query efficiency. In this tutorial, both contentTxt_disp and title are commonly queried text fields that are not part of the primary key, so text indexes are applied to them.
Based on the above design, the database and table creation statements are as follows:
// Create database and table (PKEY)
if (existsDatabase("dfs://news_PKEY")){
    dropDatabase("dfs://news_PKEY")
}
db2 = database(directory="dfs://news_PKEY", partitionType=VALUE, partitionScheme=2020.01M..2020.02M, engine='PKEY')
data2 = createPartitionedTable(
    dbHandle = database("dfs://news_PKEY"),
    table = table(1:0, `pushDate`createTime`mediaSource`contentTxt_disp`industryName`host`newsUrl`title`msgType, [DATE,DATETIME,STRING,STRING,STRING,STRING,STRING,STRING,INT]),
    tableName = `data,
    partitionColumns = `createTime,
    primaryKey = `createTime`mediaSource`newsUrl,
    indexes = {"contentTxt_disp":"textIndex(parser=chinese,full=true,lowercase=true,stem=false)",
               "title":"textIndex(parser=chinese,full=true,lowercase=true,stem=false)"})
3. Storage Methods and Text Search
After writing news data to distributed tables, users often need to search for specific content. When dealing with large data volumes, query performance and accuracy become especially critical.
In engines other than TextDB, text search typically relies on the like function. In contrast, TextDB provides a suite of functions specifically designed for text search, significantly improving query efficiency. These functions apply only to columns with text indexes and must be used within the WHERE clause of a query statement; they cannot be called independently.
For a detailed list of supported text search functions, refer to: TextDB.
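As a quick illustration using the table created in Section 2.2, the functions applied later in this tutorial (matchAll, matchAny, and matchPrefix) appear only as filter conditions on an indexed column:
// Text search functions work only as WHERE-clause filters on text-indexed columns
select count(*) from loadTable("dfs://news_PKEY", "data") where matchAll(title, "Tesla")
// Calling matchAll outside a query (e.g., on a plain in-memory vector) is not supported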
3.1 Search Performance
Based on the code provided in the previous section, we created distributed tables
using two different storage engines and imported the same dataset into both. We
then performed a series of text search tests to compare the performance of the
like
function in the TSDB engine with that of dedicated
text search functions in the TextDB engine.
The dataset used in the tests is simulated news data from October 2023, containing 100,000 records and occupying approximately 400 MB before compression.
We begin with a single-word search, querying records from October 2023 where the contentTxt_disp field contains the word "Tesla":
// Performance Test: Search Single Given Word
timer t1 = select * from loadTable("dfs://news_TSDB", "data") where month(createTime)=2023.10M and contentTxt_disp like "%Tesla%"
>> Time elapsed: 1230.376 ms
timer t2 = select * from loadTable("dfs://news_PKEY", "data") where month(createTime)=2023.10M and matchAll(contentTxt_disp,"Tesla")
>> Time elapsed: 63.552 ms
Next, we conduct a multi-word search requiring all specified words, querying data from October 2023 where the contentTxt_disp field contains both "OpenAI" and "intelligence":
// Performance Test: Search Content Containing All Given Words
timer t1 = select * from loadTable("dfs://news_TSDB", "data") where month(createTime)=2023.10M and contentTxt_disp like "%OpenAI%" and contentTxt_disp like "%intelligence%"
>> Time elapsed: 1432.834 ms
timer t2 = select * from loadTable("dfs://news_PKEY", "data") where month(createTime)=2023.10M and matchAll(contentTxt_disp, "OpenAI intelligence")
>> Time elapsed: 37.248 ms
Finally, we conduct a multi-word search matching any of the given words, querying data from October 2023 where the contentTxt_disp field contains "Speed" or "Throughput":
// Performance Test: Search Content Containing Any Given Words
timer t1 = select * from loadTable("dfs://news_TSDB", "data") where month(createTime)=2023.10M and (contentTxt_disp like "%Speed%" or contentTxt_disp like "%Throughput%")
>> Time elapsed: 2082.541 ms
timer t2 = select * from loadTable("dfs://news_PKEY", "data") where month(createTime)=2023.10M and matchAny(contentTxt_disp,"Speed Throughput")
>> Time elapsed: 499.213 ms
The test results are summarized in the table below:
| | TSDB | TextDB |
| --- | --- | --- |
| Single Keyword Search | 1230.376 ms | 63.552 ms |
| Search Matching All Keywords | 1432.834 ms | 37.248 ms |
| Search Matching Any Keyword | 2082.541 ms | 499.213 ms |
The results clearly show that TextDB's specialized query functions offer significantly better search performance. As the dataset grows and matching records become sparser, the performance advantage over TSDB becomes even more pronounced.
3.2 Search Accuracy
TextDB uses tokenizers to split text into sequences of words and builds inverted indexes to enhance search efficiency. As a result, for most text data that can be accurately tokenized, TextDB delivers precise search results.
For example, we query the number of records from October 2023 where the title field contains the word "Biomedicine", using both TSDB and TextDB:
t1 = select count(*) from loadTable("dfs://news_TSDB", "data") where month(createTime) = 2023.10M and title like "%Biomedicine%"
// 24985 records
t2 = select count(*) from loadTable("dfs://news_PKEY", "data") where month(createTime) = 2023.10M and matchAll(title, "Biomedicine")
// 24985 records
As shown above, TextDB returns an accurate match count.
However, for more complex texts, tokenization may be imperfect. In such cases, search results in TextDB might not be fully accurate.
For instance, the word "art" is commonly used as an independent search term, but
it may also appear as a prefix or suffix within other words. We execute a query
on October 2023 data to search records where the
contentTxt_disp
field contains the single word "art", or
words starting with "art", using both engines:
t1 = select count(*) from loadTable("dfs://news_TSDB", "data") where month(createTime) = 2023.10M and contentTxt_disp like "% art%"
// 24971 records
t2 = select count(*) from loadTable("dfs://news_PKEY", "data") where month(createTime) = 2023.10M and matchAll(contentTxt_disp,"art")
// 0 records
In this example, "art" never appears as a standalone token in the data, so TextDB returns 0 results, whereas TSDB's like pattern also matches words that merely begin with "art".
To handle such cases, TextDB provides additional functions that keep text search efficient while preserving accuracy as much as possible. For example, you can use the matchPrefix function to query text containing words with "art" as a prefix:
t3 = select count(*) from loadTable("dfs://news_PKEY", "data") where month(createTime) = 2023.10M and matchPrefix(contentTxt_disp, "art")
// 24971 records
As we can see, the result now matches the like-based count exactly.
4. Application Scenarios
Based on the comparison in the previous section, TextDB is particularly well-suited for scenarios where high search performance is required, but strict accuracy is not critical—especially when working with large volumes of data.
The following two examples illustrate typical application scenarios where TextDB excels.
4.1 News Hotspot Analysis
In news hotspot analysis, TextDB is widely used for its fuzzy matching capabilities and fast query performance. Rather than requiring exact matches, hotspot analysis emphasizes efficiently counting the occurrences of trending keywords across large datasets, making TextDB an ideal fit for this purpose.
For example, using the news data from this tutorial, we calculate the proportion of records from October 2023 in which the contentTxt_disp field contains the keyword "Lithography":
num = exec count(*) from loadTable("dfs://news_PKEY", "data") where month(createTime) = 2023.10M and matchAll(contentTxt_disp,"Lithography")
all = exec count(*) from loadTable("dfs://news_PKEY", "data") where month(createTime) = 2023.10M
num \ all
// 0.14
In this scenario, TextDB-based querying quickly obtains the proportion of news containing a hotspot keyword from large volumes of data without processing individual records in detail, which is exactly what hotspot analysis requires.
4.2 Real-time Sentiment Monitoring
In sentiment monitoring scenarios, billions of user-generated short texts may need to be processed each day. Such applications often rely on fuzzy matching to rapidly identify relevant content and extract high-frequency keywords. TextDB is also suitable for such scenarios.
For example, using the news data from this tutorial, we calculate the hourly frequency of the keyword "Intel" in the contentTxt_disp field on October 23, 2023, and visualize the frequency trend over time:
num = select count(*) as number from loadTable("dfs://news_PKEY", "data") where date(createTime) = 2023.10.23 and matchAll(contentTxt_disp,"Intel") group by interval(createTime,1H,`null) as time
all = select count(*) as total from loadTable("dfs://news_PKEY", "data") where date(createTime) = 2023.10.23 group by interval(createTime,1H,`null) as time
p = select time, number\total as prob from num left join all on num.time = all.time
plot(p.prob, p.time, , LINE)

In this scenario, TextDB-based querying quickly calculates keyword frequencies across time intervals without inspecting the content of individual records, making it an effective solution for real-time sentiment analysis and event tracking.
5. Summary
This article systematically explains how to parse and store news data in DolphinDB, and based on DolphinDB's multi-model storage architecture, provides guidance on selecting appropriate storage engines according to specific requirements. It then presents detailed comparative tests showcasing the high performance of the TextDB engine in text search scenarios and illustrates typical use cases where TextDB is most effective. In appropriate scenarios, TextDB can serve as a powerful tool for users to efficiently store and query news data.