createStreamDispatchEngine

Syntax

createStreamDispatchEngine(name, dummyTable, keyColumn, outputTable, [dispatchType='hash'], [hashByBatch=false], [outputLock=true], [queueDepth=4096], [outputElapsedTime=false], [mode='buffer'])

Details

The createStreamDispatchEngine function creates a stream dispatch engine that distributes incoming data streams to specified output tables for computational load balancing. The output tables can be in-memory tables, DFS tables, or other streaming engines. The function returns a table object.

Key characteristics of the stream dispatch engine:

Supports multithreaded input and output of streams.
Used only for data dispatching, not for metrics computing.

Typical usage:

The stream dispatch engine can distribute market data to one or more computational streaming engines that calculate factors. This achieves optimal performance by balancing the computational load.

Parameters

name is a string indicating the name of the engine. It is the unique identifier of an engine. It can contain letters, numbers and underscores but must start with a letter.

dummyTable is a table whose schema must be the same as the input stream table. Whether dummyTable contains data does not matter.

keyColumn is a string. If provided, the ingested data will be distributed to output tables based on the values in this column. Unique values in keyColumn are treated as keys.

outputTable is one or more tables that the engine outputs data to. When outputElapsedTime = false, outputTable must have the same schema as dummyTable; when outputElapsedTime = true, outputTable must have two additional columns: a LONG column and a NANOTIMESTAMP column (see outputElapsedTime).

Up to 100 tables can be specified for outputTable. The engine starts a thread for each output table to process the distributed data. To specify multiple output tables, pass a tuple, embedding sub-tuples if needed. For example:

To distribute evenly to 4 tables, specify outputTable=[table1, table2, table3, table4]
To distribute evenly to 2 replicated table sets, specify [[table1_1, table1_2], [table2_1, table2_2]]. This maintains 2 replicas of the ingested data - replica 1 distributed across table1_1 and table1_2, and replica 2 distributed across table2_1 and table2_2.
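
The replicated layout described above can be sketched as follows. This is a minimal illustration rather than a reference setup: the engine name replicaDemo and the table names rep1_a, rep1_b, rep2_a and rep2_b are hypothetical, and the row totals noted in the comments are what the description above implies, not measured output.

```dolphindb
// Hypothetical names, used only for illustration
dummy = table(1:0, `sym`price, [STRING, DOUBLE])
share streamTable(1000:0, `sym`price, [STRING, DOUBLE]) as rep1_a
share streamTable(1000:0, `sym`price, [STRING, DOUBLE]) as rep1_b
share streamTable(1000:0, `sym`price, [STRING, DOUBLE]) as rep2_a
share streamTable(1000:0, `sym`price, [STRING, DOUBLE]) as rep2_b

// Each sub-tuple is one replica set; the ingested data is distributed within each set
replicaEngine = createStreamDispatchEngine(name="replicaDemo", dummyTable=dummy, keyColumn=`sym,
    outputTable=[[rep1_a, rep1_b], [rep2_a, rep2_b]])

n = 10000
t = table(take("A" + string(1..10), n) as sym, rand(100.0, n) as price)
replicaEngine.append!(t)
sleep(1000)

// Per the description above, each replica set is expected to hold all n rows in total
rep1_a.size() + rep1_b.size()
rep2_a.size() + rep2_b.size()
```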

dispatchType (optional) is a string. It can be:

"hash" (default) - Apply a hash algorithm on keyColumn and distribute records based on the hash result. Hash distribution can be uneven across tables.
"uniform" - Evenly distribute records across output tables based on keyColumn values.
"saltedHash" - Apply a salted hash algorithm on keyColumn and distribute records based on the hash result. Because of the salt, the same key value does not have to produce the same hash result in every engine. This option is more suitable for scenarios that involve multi-level hash distribution (e.g., a dispatch engine whose nested engines all use hash for data distribution).

The default "hash" is recommended unless data distribution is highly uneven and impacts performance. In that case, try "uniform".

hashByBatch (optional) is a Boolean value. The default is false, which means that for each batch of data ingested into the engine, the engine first groups records by keyColumn values and then distributes the groups across tables based on dispatchType.

To set hashByBatch to true, dispatchType must be 'hash'. In this case, for each ingested batch of data, the engine randomly selects a key, computes its hash value, and distributes the entire batch based on the hash result.

Note: Setting hashByBatch = false ensures that records with identical keys are output to the same table. However, grouping records by key adds processing cost.

outputLock (optional) is a Boolean value indicating whether to apply a lock to the output table(s) to prevent concurrent write conflicts. The default is true (recommended). False means that no lock is applied to the output table(s).

An output table, essentially an in-memory table, does not allow concurrent writes. As threads working for other streaming engines or subscriptions may also write to the output tables, the lock ensures thread safety. However, locking comes at a performance cost. If it can be guaranteed that no other threads will write to the output tables concurrently, outputLock can be set to false to optimize performance.
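
As a minimal sketch of the unlocked case, the snippet below disables outputLock for two output tables. It assumes that the dispatch engine is the only writer to these tables; the names dispatch_nolock, out_nolock_0 and out_nolock_1 are hypothetical.

```dolphindb
dummy = table(1:0, `sym`price, [STRING, DOUBLE])
// Assumption: nothing else (no other engine or subscription) writes to these two tables
share streamTable(1000:0, `sym`price, [STRING, DOUBLE]) as out_nolock_0
share streamTable(1000:0, `sym`price, [STRING, DOUBLE]) as out_nolock_1

// outputLock=false skips locking the output tables on write
dispatchNoLock = createStreamDispatchEngine(name="dispatch_nolock", dummyTable=dummy, keyColumn=`sym,
    outputTable=[out_nolock_0, out_nolock_1], outputLock=false)

n = 10000
t = table(take(`A`B`C`D`E, n) as sym, rand(100.0, n) as price)
dispatchNoLock.append!(t)
sleep(1000)

// Total rows across both output tables
out_nolock_0.size() + out_nolock_1.size()
```

If any other writer is later attached to these tables, outputLock should go back to its default of true.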

queueDepth (optional) is a positive integer controlling the queue or buffer size for each output thread. The default is 4096 (records).

When mode = "buffer", queueDepth sets the size of the cache table for each thread of an output table;
When mode = "queue", queueDepth sets the maximum depth of each output queue.

Set queueDepth based on the expected data volume: if the ingested amount is small, a large queueDepth wastes memory; if the ingested amount is large, a small queueDepth may cause output blocking.

outputElapsedTime (optional) is a Boolean value indicating whether to output the elapsed time for processing each ingested batch, from ingestion to output. The default is false. If outputElapsedTime = true, two extra columns are added to each output table: a LONG column for the time elapsed in microseconds to process each data batch internally, and a NANOTIMESTAMP column for the time at which each batch was output.

mode (optional) is a string. It can be:

"buffer" (default) - For each thread working for an output table, the engine creates an in-memory cache table to buffer pending writes. It copies data into the cache before writing to the output table. Use this mode when (1) the input source(s) may have concurrent reads/writes while ingesting data into the engine; or (2) the input source(s) frequently append small batches of data to the engine.
"queue" - For each thread working for an output table, the engine maintains a queue of references to the input data. Input data is not copied, only referenced. This requires that there are no concurrent reads/writes to the input source(s) during ingestion. This mode is best when the input source(s) infrequently append large batches of data.
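
The following sketch shows how mode and queueDepth might be set together for a source that appends large batches infrequently. The names dispatch_queue, out_queue_0 and out_queue_1 are hypothetical, and the queueDepth value is an arbitrary illustration, not a tuning recommendation.

```dolphindb
dummy = table(1:0, `sym`price, [STRING, DOUBLE])
share streamTable(1000:0, `sym`price, [STRING, DOUBLE]) as out_queue_0
share streamTable(1000:0, `sym`price, [STRING, DOUBLE]) as out_queue_1

// mode="queue": input data is referenced rather than copied;
// queueDepth sets the maximum depth of each output queue
dispatchQueue = createStreamDispatchEngine(name="dispatch_queue", dummyTable=dummy, keyColumn=`sym,
    outputTable=[out_queue_0, out_queue_1], queueDepth=8192, mode="queue")

// One large batch; t is neither read nor modified elsewhere while it is being ingested
n = 50000
t = table(take(`A`B`C`D, n) as sym, rand(100.0, n) as price)
dispatchQueue.append!(t)
sleep(1000)

// Total rows across both output tables once the queues have drained
out_queue_0.size() + out_queue_1.size()
```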

Returns

A table object.

Examples

Example 1 Distribute data of a stream table to 3 reactive state streaming engines for metric computation using the stream dispatch engine. The final results are output into one single table.

// define the input stream table for the stream dispatch engine and the shared result table
share streamTable(1:0, `sym`price, [STRING,DOUBLE]) as tickStream
share streamTable(1000:0, `sym`factor1, [STRING,DOUBLE]) as resultStream

// create 3 reactive state engines that all write to resultStream
for(i in 0..2){
    rse = createReactiveStateEngine(name="reactiveDemo"+string(i), metrics =<cumavg(price)>, dummyTable=tickStream, outputTable=resultStream, keyColumn="sym")
}
// create the stream dispatch engine
dispatchEngine=createStreamDispatchEngine(name="dispatchDemo", dummyTable=tickStream, keyColumn=`sym, outputTable=[getStreamEngine("reactiveDemo0"),getStreamEngine("reactiveDemo1"),getStreamEngine("reactiveDemo2")])

// the stream dispatch engine subscribes to the stream table
subscribeTable(tableName=`tickStream, actionName="sub", handler=tableInsert{dispatchEngine}, msgAsTable = true)
    
// ingest data to the stream dispatch engine
n=100000
symbols=take(("A" + string(1..10)),n)
prices=100+rand(1.0,n)
t=table(symbols as sym, prices as price)
tickStream.append!(t)

select count(*) from resultStream
// output: 100,000

// check the status of the reactive state engines
getStreamEngineStat().ReactiveStreamEngine


name	user	status	lastErrMsg	numGroups	numRows	numMetrics	memoryInUsed
reactiveDemo2	admin	OK		1	10,000	1	921	-1
reactiveDemo1	admin	OK		5	50,000	1	1,437	-1
reactiveDemo0	admin	OK		4	40,000	1	1,308	-1

Example 2 Data from 30 stocks continuously streams in. To improve the throughput of factor calculation, use a stream dispatch engine to distribute the data by stock code sym to 3 reactive state engines, each of which independently calculates the cumavg factor.

Use hash distribution (dispatchType='hash')

dummy = table(1:0, `sym`price, [STRING, DOUBLE])
// Define output tables
share streamTable(1000:0, `sym`factor1, [STRING, DOUBLE]) as result_hash0
share streamTable(1000:0, `sym`factor1, [STRING, DOUBLE]) as result_hash1
share streamTable(1000:0, `sym`factor1, [STRING, DOUBLE]) as result_hash2
// Define reactive state engines
for(i in 0..2) {
    createReactiveStateEngine(name="rse_hash" + string(i), metrics=<cumavg(price)>,
        dummyTable=dummy, outputTable=objByName("result_hash" + string(i)), keyColumn="sym")
}
// Define the streaming engine
dispatchHash = createStreamDispatchEngine(name="dispatch_hash", dummyTable=dummy, keyColumn=`sym,
    outputTable=[getStreamEngine("rse_hash0"), getStreamEngine("rse_hash1"), getStreamEngine("rse_hash2")],
    dispatchType='hash')

// Simulate 90,000 rows of data for 30 stocks
n = 90000
t = table(take("A" + string(1..30), n) as sym, (100 + rand(1.0, n)) as price)
dispatchHash.append!(t)
sleep(1000)

result_hash0.size() //27,000
result_hash1.size() //36,000
result_hash2.size() //27,000

Use uniform distribution (dispatchType='uniform')

dummy = table(1:0, `sym`price, [STRING, DOUBLE])
// Define output tables
share streamTable(1000:0, `sym`factor1, [STRING, DOUBLE]) as result_uniform0
share streamTable(1000:0, `sym`factor1, [STRING, DOUBLE]) as result_uniform1
share streamTable(1000:0, `sym`factor1, [STRING, DOUBLE]) as result_uniform2
// Define reactive state engines
for(i in 0..2) {
    createReactiveStateEngine(name="rse_uniform" + string(i), metrics=<cumavg(price)>,
        dummyTable=dummy, outputTable=objByName("result_uniform" + string(i)), keyColumn="sym")
}
// Define the streaming engine
dispatchUniform = createStreamDispatchEngine(name="dispatch_uniform", dummyTable=dummy, keyColumn=`sym,
    outputTable=[getStreamEngine("rse_uniform0"), getStreamEngine("rse_uniform1"), getStreamEngine("rse_uniform2")],
    dispatchType='uniform')

// Simulate 90,000 rows of data for 30 stocks
n = 90000
t = table(take("A" + string(1..30), n) as sym, (100 + rand(1.0, n)) as price)
dispatchUniform.append!(t)
sleep(1000)

result_uniform0.size() //30,000
result_uniform1.size() //30,000
result_uniform2.size() //30,000

It can be seen that the row counts of the tables are more evenly distributed with uniform distribution.

Example 3 For multi-level distribution scenarios: engineA distributes to engine1 and engine2, engine1 distributes to table1_1 and table1_2, and engine2 distributes to table2_1 and table2_2. Using dispatchType='saltedHash' can effectively avoid uneven hash distribution.

dummy = table(1:0, `sym`price, [STRING, DOUBLE])
// Define output tables
share streamTable(1000:0, `sym`price, [STRING, DOUBLE]) as s_t1_1
share streamTable(1000:0, `sym`price, [STRING, DOUBLE]) as s_t1_2
share streamTable(1000:0, `sym`price, [STRING, DOUBLE]) as s_t2_1
share streamTable(1000:0, `sym`price, [STRING, DOUBLE]) as s_t2_2
// Define two levels of streaming engines
salt1 = createStreamDispatchEngine(name="salt1", dummyTable=dummy, keyColumn=`sym,
    outputTable=[s_t1_1, s_t1_2], dispatchType='saltedHash')
salt2 = createStreamDispatchEngine(name="salt2", dummyTable=dummy, keyColumn=`sym,
    outputTable=[s_t2_1, s_t2_2], dispatchType='saltedHash')

saltA = createStreamDispatchEngine(name="saltA", dummyTable=dummy, keyColumn=`sym,
    outputTable=[getStreamEngine("salt1"), getStreamEngine("salt2")], dispatchType='saltedHash')
// Write data
n = 100000
t = table(take("S" + string(1..100), n) as sym, rand(100.0, n) as price)
saltA.append!(t)
sleep(1000)

s_t1_1.size()//27,000
s_t1_2.size()//20,000
s_t2_1.size()//29,000
s_t2_2.size()//24,000

If hash distribution is used instead, the distribution becomes clearly uneven in multi-level scenarios: in the run below, h_t1_2 and h_t2_2 receive no data.

dummy = table(1:0, `sym`price, [STRING, DOUBLE])
share streamTable(1000:0, `sym`price, [STRING, DOUBLE]) as h_t1_1
share streamTable(1000:0, `sym`price, [STRING, DOUBLE]) as h_t1_2
share streamTable(1000:0, `sym`price, [STRING, DOUBLE]) as h_t2_1
share streamTable(1000:0, `sym`price, [STRING, DOUBLE]) as h_t2_2

hash1 = createStreamDispatchEngine(name="hash1", dummyTable=dummy, keyColumn=`sym,
    outputTable=[h_t1_1, h_t1_2], dispatchType='hash')
hash2 = createStreamDispatchEngine(name="hash2", dummyTable=dummy, keyColumn=`sym,
    outputTable=[h_t2_1, h_t2_2], dispatchType='hash')

hashA = createStreamDispatchEngine(name="hashA", dummyTable=dummy, keyColumn=`sym,
    outputTable=[getStreamEngine("hash1"), getStreamEngine("hash2")], dispatchType='hash')

n = 100000
t = table(take("S" + string(1..100), n) as sym, rand(100.0, n) as price)
hashA.append!(t)
sleep(1000)

h_t1_1.size()//56,000
h_t1_2.size()//0
h_t2_1.size()//44,000
h_t2_2.size()//0

Example 4 hashByBatch can be used to specify whether to send the entire batch of data to the same table.

dummy = table(1:0, `sym`price, [STRING, DOUBLE])

share streamTable(1000:0, `sym`price, [STRING, DOUBLE]) as out_batchOn_0
share streamTable(1000:0, `sym`price, [STRING, DOUBLE]) as out_batchOn_1

share streamTable(1000:0, `sym`price, [STRING, DOUBLE]) as out_batchOff_0
share streamTable(1000:0, `sym`price, [STRING, DOUBLE]) as out_batchOff_1

// hashByBatch=true: all data in one batch is output to the same table
dispatchBatchOn = createStreamDispatchEngine(name="dispatch_batchOn", dummyTable=dummy, keyColumn=`sym,
    outputTable=[out_batchOn_0, out_batchOn_1],
    dispatchType='hash', hashByBatch=true)

// hashByBatch=false: data in the same batch is grouped by key and then distributed to different tables
dispatchBatchOff = createStreamDispatchEngine(name="dispatch_batchOff", dummyTable=dummy, keyColumn=`sym,
    outputTable=[out_batchOff_0, out_batchOff_1],
    dispatchType='hash', hashByBatch=false)

// Write data
t = table(take(`A`B`C`D`E, 1000) as sym, rand(100.0, 1000) as price)
dispatchBatchOn.append!(t)
dispatchBatchOff.append!(t)
sleep(500)

out_batchOn_0.size()//0
out_batchOn_1.size()//1,000

out_batchOff_0.size()//600
out_batchOff_1.size()//400

It can be seen that when hashByBatch is enabled, the entire batch of data is sent to the same table; when it is disabled, both tables contain data.

Example 5 After outputElapsedTime is enabled, the output table records the distribution elapsed time and output timestamp.

dummy = table(1:0, `sym`price, [STRING, DOUBLE])

// When outputElapsedTime=true, the output table must have two more columns than dummyTable: LONG (elapsed time in microseconds) + NANOTIMESTAMP (output timestamp)
share streamTable(1000:0, `sym`price`elapsed`ts, [STRING, DOUBLE, LONG, NANOTIMESTAMP]) as out_elapsed_0
share streamTable(1000:0, `sym`price`elapsed`ts, [STRING, DOUBLE, LONG, NANOTIMESTAMP]) as out_elapsed_1

dispatchElapsed = createStreamDispatchEngine(name="dispatch_elapsed", dummyTable=dummy, keyColumn=`sym,
    outputTable=[out_elapsed_0, out_elapsed_1],
    outputElapsedTime=true)

t = table(take(`A`B`C, 3000) as sym, rand(100.0, 3000) as price)
dispatchElapsed.append!(t)
sleep(500)

select top 5 * from out_elapsed_0


sym	price	elapsed	ts
B	5.156729068876414	81	2026.03.21 19:39:01.860162086
C	14.33727939481318	81	2026.03.21 19:39:01.860162086
B	56.25857125414968	81	2026.03.21 19:39:01.860162086
C	65.14644184269626	81	2026.03.21 19:39:01.860162086
B	48.80402196403686	81	2026.03.21 19:39:01.860162086