This section of the math expressions user guide covers machine learning functions.
Feature Scaling
Before performing machine learning operations its often necessary to scale the feature vectors so they can be compared at the same scale.
All the scaling function operate on vectors and matrices. When operating on a matrix the rows of the matrix are scaled.
Min/Max Scaling
The minMaxScale
function scales a vector or matrix between a minimum and maximum value. By default it will scale between 0 and 1 if min/max values are not provided.
Below is a simple example of min/max scaling between 0 and 1. Notice that once brought into the same scale the vectors are the same.
let(a=array(20, 30, 40, 50),
b=array(200, 300, 400, 500),
c=matrix(a, b),
d=minMaxScale(c))
This expression returns the following response:
{
"resultset": {
"docs": [
{
"d": [
[
0,
0.3333333333333333,
0.6666666666666666,
1
],
[
0,
0.3333333333333333,
0.6666666666666666,
1
]
]
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
Standardization
The standardize
function scales a vector so that it has a mean of 0 and a standard deviation of 1. Standardization can be used with machine learning algorithms, such as Support Vector Machine (SVM), that perform better when the data has a normal distribution.
let(a=array(20, 30, 40, 50),
b=array(200, 300, 400, 500),
c=matrix(a, b),
d=standardize(c))
This expression returns the following response:
{
"resultset": {
"docs": [
{
"d": [
[
1.161895003862225,
0.3872983346207417,
0.3872983346207417,
1.161895003862225
],
[
1.1618950038622249,
0.38729833462074165,
0.38729833462074165,
1.1618950038622249
]
]
},
{
"EOF": true,
"RESPONSE_TIME": 17
}
]
}
}
Unit Vectors
The unitize
function scales vectors to a magnitude of 1. A vector with a magnitude of 1 is known as a unit vector. Unit vectors are preferred when the vector math deals with vector direction rather than magnitude.
let(a=array(20, 30, 40, 50),
b=array(200, 300, 400, 500),
c=matrix(a, b),
d=unitize(c))
This expression returns the following response:
{
"resultset": {
"docs": [
{
"d": [
[
0.2721655269759087,
0.40824829046386296,
0.5443310539518174,
0.6804138174397716
],
[
0.2721655269759087,
0.4082482904638631,
0.5443310539518174,
0.6804138174397717
]
]
},
{
"EOF": true,
"RESPONSE_TIME": 6
}
]
}
}
Distance and Distance Measures
The distance
function computes the distance for two numeric arrays or a distance matrix for the columns of a matrix.
There are five distance measure functions that return a function that performs the actual distance calculation:

euclidean
(default) 
manhattan

canberra

earthMovers

haversineMeters
(Geospatial distance measure)
The distance measure functions can be used with all machine learning functions that support distance measures.
Below is an example for computing Euclidean distance for two numeric arrays:
let(a=array(20, 30, 40, 50),
b=array(21, 29, 41, 49),
c=distance(a, b))
This expression returns the following response:
{
"resultset": {
"docs": [
{
"c": 2
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
Below the distance is calculated using Manahattan distance.
let(a=array(20, 30, 40, 50),
b=array(21, 29, 41, 49),
c=distance(a, b, manhattan()))
This expression returns the following response:
{
"resultset": {
"docs": [
{
"c": 4
},
{
"EOF": true,
"RESPONSE_TIME": 1
}
]
}
}
Below is an example for computing a distance matrix for columns of a matrix:
let(a=array(20, 30, 40),
b=array(21, 29, 41),
c=array(31, 40, 50),
d=matrix(a, b, c),
c=distance(d))
This expression returns the following response:
{
"resultset": {
"docs": [
{
"e": [
[
0,
15.652475842498529,
34.07345007480164
],
[
15.652475842498529,
0,
18.547236990991408
],
[
34.07345007480164,
18.547236990991408,
0
]
]
},
{
"EOF": true,
"RESPONSE_TIME": 24
}
]
}
}
KMeans Clustering
The kmeans
functions performs kmeans clustering of the rows of a matrix. Once the clustering has been completed there are a number of useful functions available for examining the clusters and centroids.
The examples below cluster term vectors. The section Text Analysis and Term Vectors offers a full explanation of these features.
Centroid Features
In the example below the kmeans
function is used to cluster a result set from the Enron email dataset and then the top features are extracted from the cluster centroids.
let(a=select(random(enron, q="body:oil", rows="500", fl="id, body"), (1)
id,
analyze(body, body_bigram) as terms),
b=termVectors(a, maxDocFreq=.10, minDocFreq=.05, minTermLength=14, exclude="_,copyright"),(2)
c=kmeans(b, 5), (3)
d=getCentroids(c), (4)
e=topFeatures(d, 5)) (5)
Letâ€™s look at what data is assigned to each variable:

a
: Therandom
function returns a sample of 500 documents from the "enron" collection that match the query "body:oil". Theselect
function selects theid
and and annotates each tuple with the analyzed bigram terms from thebody
field. 
b
: ThetermVectors
function creates a TFIDF term vector matrix from the tuples stored in variablea
. Each row in the matrix represents a document. The columns of the matrix are the bigram terms that were attached to each tuple. 
c
: Thekmeans
function clusters the rows of the matrix into 5 clusters. The kmeans clustering is performed using the Euclidean distance measure. 
d
: ThegetCentroids
function returns a matrix of cluster centroids. Each row in the matrix is a centroid from one of the 5 clusters. The columns of the matrix are the same bigrams terms of the term vector matrix. 
e
: ThetopFeatures
function returns the column labels for the top 5 features of each centroid in the matrix. This returns the top 5 bigram terms for each centroid.
This expression returns the following response:
{
"resultset": {
"docs": [
{
"e": [
[
"enron enronxgate",
"north american",
"energy services",
"conference call",
"power generation"
],
[
"financial times",
"chief financial",
"financial officer",
"exchange commission",
"houston chronicle"
],
[
"southern california",
"california edison",
"public utilities",
"utilities commission",
"rate increases"
],
[
"rolling blackouts",
"public utilities",
"electricity prices",
"federal energy",
"price controls"
],
[
"california edison",
"regulatory commission",
"southern california",
"federal energy",
"power generators"
]
]
},
{
"EOF": true,
"RESPONSE_TIME": 982
}
]
}
}
Cluster Features
The example below examines the top features of a specific cluster. This example uses the same techniques as the centroids example but the top features are extracted from a cluster rather than the centroids.
let(a=select(random(collection3, q="body:oil", rows="500", fl="id, body"),
id,
analyze(body, body_bigram) as terms),
b=termVectors(a, maxDocFreq=.09, minDocFreq=.03, minTermLength=14, exclude="_,copyright"),
c=kmeans(b, 25),
d=getCluster(c, 0), (1)
e=topFeatures(d, 4)) (2)

The
getCluster
function returns a cluster by its index. Each cluster is a matrix containing term vectors that have been clustered together based on their features. 
The
topFeatures
function is used to extract the top 4 features from each term vector in the cluster.
This expression returns the following response:
{
"resultset": {
"docs": [
{
"e": [
[
"electricity board",
"maharashtra state",
"power purchase",
"state electricity",
"reserved enron"
],
[
"electricity board",
"maharashtra state",
"state electricity",
"purchase agreement",
"independent power"
],
[
"maharashtra state",
"reserved enron",
"federal government",
"state government",
"dabhol project"
],
[
"purchase agreement",
"power purchase",
"electricity board",
"maharashtra state",
"state government"
],
[
"investment grade",
"portland general",
"general electric",
"holding company",
"transmission lines"
],
[
"state government",
"state electricity",
"purchase agreement",
"electricity board",
"maharashtra state"
],
[
"electricity board",
"state electricity",
"energy management",
"maharashtra state",
"energy markets"
],
[
"electricity board",
"maharashtra state",
"state electricity",
"state government",
"second quarter"
]
]
},
{
"EOF": true,
"RESPONSE_TIME": 978
}
]
}
}
Multi KMeans Clustering
Kmeans clustering will produce different results depending on the initial placement of the centroids. Kmeans is fast enough that multiple trials can be performed and the best outcome selected.
The multiKmeans
function runs the kmeans clustering algorithm for a given number of trials and selects the best result based on which trial produces the lowest intracluster variance.
The example below is identical to centroids example except that it uses multiKmeans
with 100 trials, rather than a single trial of the kmeans
function.
let(a=select(random(collection3, q="body:oil", rows="500", fl="id, body"),
id,
analyze(body, body_bigram) as terms),
b=termVectors(a, maxDocFreq=.09, minDocFreq=.03, minTermLength=14, exclude="_,copyright"),
c=multiKmeans(b, 5, 100),
d=getCentroids(c),
e=topFeatures(d, 5))
This expression returns the following response:
{
"resultset": {
"docs": [
{
"e": [
[
"enron enronxgate",
"energy trading",
"energy markets",
"energy services",
"unleaded gasoline"
],
[
"maharashtra state",
"electricity board",
"state electricity",
"energy trading",
"chief financial"
],
[
"price controls",
"electricity prices",
"francisco chronicle",
"wholesale electricity",
"power generators"
],
[
"southern california",
"california edison",
"public utilities",
"francisco chronicle",
"utilities commission"
],
[
"california edison",
"power purchases",
"system operator",
"term contracts",
"independent system"
]
]
},
{
"EOF": true,
"RESPONSE_TIME": 1182
}
]
}
}
Fuzzy KMeans Clustering
The fuzzyKmeans
function is a soft clustering algorithm which allows vectors to be assigned to more then one cluster. The fuzziness
parameter is a value between 1 and 2 that determines how fuzzy to make the cluster assignment.
After the clustering has been performed the getMembershipMatrix
function can be called on the clustering result to return a matrix describing which clusters each vector belongs to. There is a row in the matrix for each vector that was clustered. There is a column in the matrix for each cluster. The values in the columns are the probability that the vector belonged to the specific cluster.
A simple example will make this more clear. In the example below 300 documents are analyzed and then turned into a term vector matrix. Then the fuzzyKmeans
function clusters the term vectors into 12 clusters with a fuzziness factor of 1.25.
let(a=select(random(collection3, q="body:oil", rows="300", fl="id, body"),
id,
analyze(body, body_bigram) as terms),
b=termVectors(a, maxDocFreq=.09, minDocFreq=.03, minTermLength=14, exclude="_,copyright"),
c=fuzzyKmeans(b, 12, fuzziness=1.25),
d=getMembershipMatrix(c), (1)
e=rowAt(d, 0), (2)
f=precision(e, 5)) (3)

The
getMembershipMatrix
function is used to return the membership matrix; 
and the first row of membership matrix is retrieved with the
rowAt
function. 
The
precision
function is then applied to the first row of the matrix to make it easier to read.
This expression returns a single vector representing the cluster membership probabilities for the first term vector. Notice that the term vector has the highest association with the 12^{th} cluster, but also has significant associations with the 3^{rd}, 5^{th}, 6^{th} and 7^{th} clusters:
{
"resultset": {
"docs": [
{
"f": [
0,
0,
0.178,
0,
0.17707,
0.17775,
0.16214,
0,
0,
0,
0,
0.30504
]
},
{
"EOF": true,
"RESPONSE_TIME": 2157
}
]
}
}
KNearest Neighbor (KNN)
The knn
function searches the rows of a matrix for the knearest neighbors of a search vector. The knn
function returns a matrix of the knearest neighbors.
The knn
function supports changing of the distance measure by providing one of these distance measure functions as the fourth parameter:

euclidean
(Default) 
manhattan

canberra

earthMovers
The example below builds on the clustering examples to demonstrate the knn
function.
let(a=select(random(collection3, q="body:oil", rows="500", fl="id, body"),
id,
analyze(body, body_bigram) as terms),
b=termVectors(a, maxDocFreq=.09, minDocFreq=.03, minTermLength=14, exclude="_,copyright"),
c=multiKmeans(b, 5, 100),
d=getCentroids(c), (1)
e=rowAt(d, 0), (2)
g=knn(b, e, 3), (3)
h=topFeatures(g, 4)) (4)

In the example, the centroids matrix is set to variable
d
. 
The first centroid vector is selected from the matrix with the
rowAt
function. 
Then the
knn
function is used to find the 3 nearest neighbors to the centroid vector in the term vector matrix (variableb
). 
The
topFeatures
function is used to request the top 4 featurs of the term vectors in the knn matrix.
The knn
function returns a matrix with the 3 nearest neighbors based on the default distance measure which is euclidean. Finally, the top 4 features of the term vectors in the nearest neighbor matrix are returned:
{
"resultset": {
"docs": [
{
"h": [
[
"california power",
"electricity supply",
"concerned about",
"companies like"
],
[
"maharashtra state",
"california power",
"electricity board",
"alternative energy"
],
[
"electricity board",
"maharashtra state",
"state electricity",
"houston chronicle"
]
]
},
{
"EOF": true,
"RESPONSE_TIME": 1243
}
]
}
}
KNearest Neighbor Regression
Knearest neighbor regression is a nonlinear, multivariate regression method. Knn regression is a lazy learning technique which means it does not fit a model to the training set in advance. Instead the entire training set of observations and outcomes are held in memory and predictions are made by averaging the outcomes of the knearest neighbors.
The knnRegress
function prepares the training set for use with the predict
function.
Below is an example of the knnRegress
function. In this example 10,000 random samples are taken, each containing the variables filesize_d
, service_d
and response_d
. The pairs of filesize_d
and service_d
will be used to predict the value of response_d
.
let(samples=random(collection1, q="*:*", rows="10000", fl="filesize_d, service_d, response_d"),
filesizes=col(samples, filesize_d),
serviceLevels=col(samples, service_d),
outcomes=col(samples, response_d),
observations=transpose(matrix(filesizes, serviceLevels)),
lazyModel=knnRegress(observations, outcomes , 5))
This expression returns the following response. Notice that knnRegress
returns a tuple describing the regression inputs:
{
"resultset": {
"docs": [
{
"lazyModel": {
"features": 2,
"robust": false,
"distance": "EuclideanDistance",
"observations": 10000,
"scale": false,
"k": 5
}
},
{
"EOF": true,
"RESPONSE_TIME": 170
}
]
}
}
Prediction and Residuals
The output of knnRegress
can be used with the predict
function like other regression models.
In the example below the predict
function is used to predict results for the original training data. The sumSq of the residuals is then calculated.
let(samples=random(collection1, q="*:*", rows="10000", fl="filesize_d, service_d, response_d"),
filesizes=col(samples, filesize_d),
serviceLevels=col(samples, service_d),
outcomes=col(samples, response_d),
observations=transpose(matrix(filesizes, serviceLevels)),
lazyModel=knnRegress(observations, outcomes , 5),
predictions=predict(lazyModel, observations),
residuals=ebeSubtract(outcomes, predictions),
sumSqErr=sumSq(residuals))
This expression returns the following response:
{
"resultset": {
"docs": [
{
"sumSqErr": 1920290.1204126712
},
{
"EOF": true,
"RESPONSE_TIME": 3796
}
]
}
}
Setting Feature Scaling
If the features in the observation matrix are not in the same scale then the larger features will carry more weight in the distance calculation then the smaller features. This can greatly impact the accuracy of the prediction. The knnRegress
function has a scale
parameter which can be set to true
to automatically scale the features in the same range.
The example below shows knnRegress
with feature scaling turned on.
Notice that when feature scaling is turned on the sumSqErr
in the output is much lower. This shows how much more accurate the predictions are when feature scaling is turned on in this particular example. This is because the filesize_d
feature is significantly larger then the service_d
feature.
let(samples=random(collection1, q="*:*", rows="10000", fl="filesize_d, service_d, response_d"),
filesizes=col(samples, filesize_d),
serviceLevels=col(samples, service_d),
outcomes=col(samples, response_d),
observations=transpose(matrix(filesizes, serviceLevels)),
lazyModel=knnRegress(observations, outcomes , 5, scale=true),
predictions=predict(lazyModel, observations),
residuals=ebeSubtract(outcomes, predictions),
sumSqErr=sumSq(residuals))
This expression returns the following response:
{
"resultset": {
"docs": [
{
"sumSqErr": 4076.794951120683
},
{
"EOF": true,
"RESPONSE_TIME": 3790
}
]
}
}
Setting Robust Regression
The default prediction approach is to take the mean of the outcomes of the knearest neighbors. If the outcomes contain outliers the mean value can be skewed. Setting the robust
parameter to true
will take the median outcome of the knearest neighbors. This provides a regression prediction that is robust to outliers.
Setting the Distance Measure
The distance measure can be changed for the knearest neighbor search by adding a distance measure function to the knnRegress
parameters. Below is an example using manhattan
distance.
let(samples=random(collection1, q="*:*", rows="10000", fl="filesize_d, service_d, response_d"),
filesizes=col(samples, filesize_d),
serviceLevels=col(samples, service_d),
outcomes=col(samples, response_d),
observations=transpose(matrix(filesizes, serviceLevels)),
lazyModel=knnRegress(observations, outcomes, 5, manhattan(), scale=true),
predictions=predict(lazyModel, observations),
residuals=ebeSubtract(outcomes, predictions),
sumSqErr=sumSq(residuals))
This expression returns the following response:
{
"resultset": {
"docs": [
{
"sumSqErr": 4761.221942288098
},
{
"EOF": true,
"RESPONSE_TIME": 3571
}
]
}
}