# mr

## Syntax

`mr(ds, mapFunc, [reduceFunc], [finalFunc], [parallel=true])`

## Arguments

**ds** is the list of data sources. This required parameter must be a tuple and
each element of the tuple is a data source object. Even if there is only one data
source, we still need a tuple to wrap the data source.

**mapFunc** is the map function. It accepts one and only one argument, which is
the materialized data entity from a data source. If we would like the map function
to accept more parameters in addition to the materialized data source, we can use a
PartialApplication to convert a multiple-parameter function to a unary
function. The number of map function calls is the same as the number of data
sources. The map function returns a regular object (scalar, pair, array, matrix,
table, set, or dictionary) or a tuple (containing multiple regular objects).

**reduceFunc** (optional) is the binary reduce function that combines two map
function call results. The reduce function in most cases is trivial. An example is
the addition function. The reduce function is optional. If the reduce function is
not specified, the system returns all individual map call results to the final
function.

**finalFunc** (optional) is the final function which accepts one and only one
parameter. The output of the last reduce function call is the input of the final
function. If it is not specified, the system returns the individual map function
call results.

**parallel** (optional) is a boolean flag indicating whether to execute the map
function in parallel locally. The default value is true, i.e., enabling parallel
computing. When there is very limited available memory and each map call needs a
large amount of memory, we can disable parallel computing to prevent the
out-of-memory problem. We may also want to disable the parallel option in other
scenarios. For example, we may need to disable the parallel option to prevent
multiple threads from writing to the same partition simultaneously.

## Details

The Map-Reduce function is the core function of DolphinDB's generic distributed computing framework.

## Examples

The following is an example of distributed linear regression. Suppose *X* is the
matrix of independent variables and *y* is the dependent variable. *X* and
*y* are stored in multiple data sources. To estimate the ordinary least
square parameters, we need to calculate X ^{T}X and X ^{T}y. We can
calculate the tuple of (X ^{T}X, X ^{T}y) from each data source,
then aggregate the results from all data sources to get X ^{T}X and
X^{T} y for the entire dataset.

```
def myOLSMap(table, yColName, xColNames, intercept){
if(intercept)
x = matrix(take(1.0, table.rows()), table[xColNames])
else
x = matrix(table[xColNames])
xt = x.transpose();
return xt.dot(x), xt.dot(table[yColName])
}
def myOLSFinal(result){
xtx = result[0]
xty = result[1]
return xtx.inv().dot(xty)[0]
}
def myOLSEx(ds, yColName, xColNames, intercept){
return mr(ds, myOLSMap{, yColName, xColNames, intercept}, +, myOLSFinal)
}
```

In the example above, we define the map function and final function. In practice, we may define transformation functions for data sources as well. These functions only need to be defined in the local instance. Users don't need to compile them or deploy them to the remote instances. The distributed computing framework in DolphinDB handles these complicated issues for end users on the fly.

As a frequently used analytics tool, the distributed ordinary least square linear regression is implemented in the core library of DolphinDB already. The built-in version (olsEx) provides more features.