This section of the user guide explores techniques for retrieving streams of data from Solr and vectorizing the numeric fields.
See the section Text Analysis and Term Vectors for how to vectorize text fields.
Streams
Streaming Expressions has a wide range of stream sources that can be used to retrieve data from Solr Cloud collections. Math expressions can be used to vectorize and analyze the result sets.
Below are some of the key stream sources:

facet
: Multidimensional aggregations are a powerful tool for generating co-occurrence counts for categorical data. The facet function uses the JSON facet API under the covers to provide fast, distributed, multidimensional aggregations. With math expressions the aggregated results can be pivoted into a co-occurrence matrix which can be mined for correlations and hidden similarities within the data.
random
: Random sampling is widely used in statistics, probability and machine learning. The random function returns a random sample of search results that match a query. The random samples can be vectorized and operated on by math expressions, and the results can be used to describe and make inferences about the entire population.
timeseries
: The timeseries expression provides fast, distributed time series aggregations, which can be vectorized and analyzed with math expressions.
knnSearch
: K-nearest neighbor is a core machine learning algorithm. The knnSearch function is a specialized knn algorithm optimized to find the k-nearest neighbors of a document in a distributed index. Once the nearest neighbors are retrieved they can be vectorized and operated on by machine learning and text mining algorithms.
sql
: SQL is the primary query language used by data scientists. The sql function supports data retrieval using a subset of SQL which includes both full text search and fast distributed aggregations. The result sets can then be vectorized and operated on by math expressions.
jdbc
: The jdbc function allows data from any JDBC-compliant data source to be combined with streams originating from Solr. Result sets from outside data sources can be vectorized and operated on by math expressions in the same manner as result sets originating from Solr.
topic
: Messaging is an important foundational technology for large-scale computing. The topic function provides publish/subscribe messaging capabilities by treating Solr Cloud as a distributed message queue. Topics are extremely powerful because they allow subscription by query. Topics can be used to support a broad set of use cases including bulk text mining operations and AI alerting.
nodes
: Graph queries are frequently used by recommendation engines and are an important machine learning tool. The nodes function provides fast, distributed, breadth-first graph traversal over documents in a Solr Cloud collection. The node sets collected by the nodes function can be operated on by statistical and machine learning expressions to gain more insight into the graph.
search
: Ranked search results are a powerful tool for finding the most relevant documents from a large document corpus. The search expression returns the top N ranked search results that match any Solr query, including geospatial queries. The smaller set of relevant documents can then be explored with statistical, machine learning and text mining expressions to gather insights about the data set.
Assigning Streams to Variables
The output of any streaming expression can be set to a variable. Below is a very simple example using the random function to fetch three random samples from collection1. The random samples are returned as tuples which contain name/value pairs.
let(a=random(collection1, q="*:*", rows="3", fl="price_f"))
When this expression is sent to the /stream handler it responds with:
{
"resultset": {
"docs": [
{
"a": [
{
"price_f": 0.7927976
},
{
"price_f": 0.060795486
},
{
"price_f": 0.55128294
}
]
},
{
"EOF": true,
"RESPONSE_TIME": 11
}
]
}
}
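For readers who call the handler from a client, the following is a minimal Python sketch of how the request and response above might be handled. The base URL http://localhost:8983/solr and the helper names stream_url and tuples_for are assumptions made for illustration and are not part of Solr. The sketch only builds the request URL and parses a response body shaped like the one shown above; it does not contact a server.

```python
import json
from urllib.parse import urlencode


def stream_url(base_url, collection, expr):
    """Build a /stream handler URL, passing the expression in the expr parameter."""
    return f"{base_url}/{collection}/stream?{urlencode({'expr': expr})}"


def tuples_for(response_text, variable):
    """Return the value bound to `variable`, ignoring the trailing EOF tuple."""
    docs = json.loads(response_text)["resultset"]["docs"]
    for doc in docs:
        if not doc.get("EOF") and variable in doc:
            return doc[variable]
    raise KeyError(variable)


expr = 'let(a=random(collection1, q="*:*", rows="3", fl="price_f"))'
# Assumed host and port; adjust for the actual deployment.
url = stream_url("http://localhost:8983/solr", "collection1", expr)

# Response body copied from the example above.
sample_response = """
{"resultset": {"docs": [
  {"a": [{"price_f": 0.7927976}, {"price_f": 0.060795486}, {"price_f": 0.55128294}]},
  {"EOF": true, "RESPONSE_TIME": 11}
]}}
"""
a = tuples_for(sample_response, "a")
print(url)
print(a)
```

The final tuple in every response carries EOF and RESPONSE_TIME rather than data, which is why the helper skips it.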
Creating a Vector with the col Function
The col function iterates over a list of tuples and copies the values from a specific column into an array.
The output of the col function is a numeric array that can be set to a variable and operated on by math expressions.
Below is an example of the col function:
let(a=random(collection1, q="*:*", rows="3", fl="price_f"),
b=col(a, price_f))
When this expression is sent to the /stream handler it responds with:
{
"resultset": {
"docs": [
{
"b": [
0.42105234,
0.85237443,
0.7566981
]
},
{
"EOF": true,
"RESPONSE_TIME": 9
}
]
}
}
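The behavior described for col can be pictured with a small client-side Python analogue. This is a sketch for intuition only and not Solr's implementation; the Python function name col merely mirrors the expression, and skipping tuples that lack the field is a choice made in this sketch.

```python
def col(tuples, field):
    """Copy the values of one field from a list of tuples (dicts) into a numeric list."""
    # Tuples without the field are skipped; this is an assumption of the sketch.
    return [float(t[field]) for t in tuples if field in t]


# Tuples shaped like the output of the random function.
a = [{"price_f": 0.42105234}, {"price_f": 0.85237443}, {"price_f": 0.7566981}]
b = col(a, "price_f")
print(b)
```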
Applying Math Expressions to the Vector
Once a vector has been created, any math expression that operates on vectors can be applied. In the example below the mean function is applied to the vector assigned to variable b.
let(a=random(collection1, q="*:*", rows="15000", fl="price_f"),
b=col(a, price_f),
c=mean(b))
When this expression is sent to the /stream handler it responds with:
{
"resultset": {
"docs": [
{
"c": 0.5016035594638814
},
{
"EOF": true,
"RESPONSE_TIME": 306
}
]
}
}
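As a rough client-side analogue of this pipeline, the Python sketch below extracts a column from a list of tuples and takes its arithmetic mean. The helper name column_mean and the sample values are illustrative assumptions; the sketch shows the arithmetic only and says nothing about how Solr computes the result.

```python
from statistics import fmean


def column_mean(tuples, field):
    """Vectorize one numeric field and return the arithmetic mean of the vector."""
    vector = [float(t[field]) for t in tuples]
    return fmean(vector)


# Hypothetical sample tuples, chosen so the mean is easy to check by hand.
sample = [{"price_f": 1.0}, {"price_f": 2.0}, {"price_f": 3.0}]
c = column_mean(sample, "price_f")
print(c)
```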
Creating Matrices
Matrices can be created by vectorizing multiple numeric fields and adding them to a matrix. The matrices can then be operated on by any math expression that operates on matrices.
Tip

Note that this section deals with the creation of matrices from numeric data. The section Text Analysis and Term Vectors describes how to build TF-IDF term vector matrices from text fields.
Below is a simple example where four random samples are taken from different subpopulations in the data. The price_f field of each random sample is vectorized and the vectors are added as rows to a matrix. Then the sumRows function is applied to the matrix to return a vector containing the sum of each row.
let(a=random(collection1, q="market:A", rows="5000", fl="price_f"),
b=random(collection1, q="market:B", rows="5000", fl="price_f"),
c=random(collection1, q="market:C", rows="5000", fl="price_f"),
d=random(collection1, q="market:D", rows="5000", fl="price_f"),
e=col(a, price_f),
f=col(b, price_f),
g=col(c, price_f),
h=col(d, price_f),
i=matrix(e, f, g, h),
j=sumRows(i))
When this expression is sent to the /stream handler it responds with:
{
"resultset": {
"docs": [
{
"j": [
154390.1293375,
167434.89453,
159293.258493,
149773.42769
]
},
{
"EOF": true,
"RESPONSE_TIME": 9
}
]
}
}
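The row-wise structure can be illustrated with a short Python sketch in which each vector becomes one row of a matrix and each row is then summed. The function names matrix and sum_rows mirror the expressions for readability only; the equal-length check is an assumption of this sketch, and the numbers are made up for illustration.

```python
def matrix(*vectors):
    """Stack equal-length vectors as the rows of a matrix (a list of lists)."""
    if len({len(v) for v in vectors}) > 1:
        raise ValueError("all vectors must have the same length")
    return [list(v) for v in vectors]


def sum_rows(m):
    """Return a vector holding the sum of each row."""
    return [sum(row) for row in m]


# Hypothetical vectors standing in for the vectorized price_f samples.
e = [1.0, 2.0, 3.0]
f = [4.0, 5.0, 6.0]
i = matrix(e, f)
j = sum_rows(i)
print(j)
```

With four sampled markets the result has four entries, one per row, which matches the shape of the response above.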
Facet Cooccurrence Matrices
The facet function can be used to quickly perform multidimensional aggregations of categorical data from records stored in a Solr Cloud collection. These multidimensional aggregations can represent co-occurrence counts for the values in the dimensions. The pivot function can be used to move two-dimensional aggregations into a co-occurrence matrix. The co-occurrence matrix can then be clustered or analyzed for correlations to learn about the hidden connections within the data.
In the example below the facet expression is used to generate a two-dimensional faceted aggregation. The first dimension is the US State that a car was purchased in and the second dimension is the car model. This two-dimensional facet generates the co-occurrence counts for the number of times a particular car model was purchased in a particular state.
facet(collection1, q="*:*", buckets="state, model", bucketSorts="count(*) desc", rows=5, count(*))
When this expression is sent to the /stream handler it responds with:
{
"resultset": {
"docs": [
{
"state": "NY",
"model": "camry",
"count(*)": 13342
},
{
"state": "NJ",
"model": "accord",
"count(*)": 13002
},
{
"state": "NY",
"model": "civic",
"count(*)": 12901
},
{
"state": "CA",
"model": "focus",
"count(*)": 12892
},
{
"state": "TX",
"model": "f150",
"count(*)": 12871
},
{
"EOF": true,
"RESPONSE_TIME": 171
}
]
}
}
The pivot function can be used to move the facet results into a co-occurrence matrix. In the example below the pivot function is used to create a matrix where the rows of the matrix are the US States (state) and the columns of the matrix are the car models (model). The values in the matrix are the co-occurrence counts (count(*)) from the facet results. Once the co-occurrence matrix has been created the US States can be clustered by car model, or the matrix can be transposed and car models can be clustered by the US States where they were bought.
let(a=facet(collection1, q="*:*", buckets="state, model", bucketSorts="count(*) desc", rows="-1", count(*)),
b=pivot(a, state, model, count(*)),
c=kmeans(b, 7))
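To make the pivot step concrete, here is a minimal Python sketch that turns facet tuples like those shown earlier into a co-occurrence matrix with row and column labels. It is a client-side illustration and not Solr's implementation; sorting the labels alphabetically and filling missing state/model pairs with 0 are assumptions of this sketch.

```python
def pivot(tuples, row_field, col_field, value_field):
    """Arrange (row, column, value) tuples into a dense matrix with labels."""
    row_labels = sorted({t[row_field] for t in tuples})
    col_labels = sorted({t[col_field] for t in tuples})
    row_index = {label: n for n, label in enumerate(row_labels)}
    col_index = {label: n for n, label in enumerate(col_labels)}
    # Pairs that never co-occur keep the default count of 0.
    m = [[0] * len(col_labels) for _ in row_labels]
    for t in tuples:
        m[row_index[t[row_field]]][col_index[t[col_field]]] += t[value_field]
    return row_labels, col_labels, m


# The five facet buckets from the example response above.
facet_tuples = [
    {"state": "NY", "model": "camry", "count(*)": 13342},
    {"state": "NJ", "model": "accord", "count(*)": 13002},
    {"state": "NY", "model": "civic", "count(*)": 12901},
    {"state": "CA", "model": "focus", "count(*)": 12892},
    {"state": "TX", "model": "f150", "count(*)": 12871},
]
states, models, counts = pivot(facet_tuples, "state", "model", "count(*)")
for state, row in zip(states, counts):
    print(state, row)
```

Each row of the resulting matrix is a vector of model counts for one state, which is the form a clustering step would operate on.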
Latitude / Longitude Vectors
The latlonVectors function wraps a list of tuples and parses a lat/lon location field into a matrix of lat/lon vectors. Each row in the matrix is a vector that contains the lat/lon pair for the corresponding tuple in the list. The row labels for the matrix are automatically set to the id field in the tuples. The lat/lon matrix can then be operated on by distance-based machine learning functions using the haversineMeters distance measure.
The latlonVectors function takes two parameters: a list of tuples and a named parameter called field, which tells the latlonVectors function which field to parse the lat/lon vectors from.
Below is an example of the latlonVectors function.
let(a=random(collection1, q="*:*", fl="id, loc_p", rows="5"),
b=latlonVectors(a, field="loc_p"))
When this expression is sent to the /stream handler it responds with:
{
"resultset": {
"docs": [
{
"b": [
[
42.87183530723629,
76.74102353397778
],
[
42.91372904094898,
76.72874889228416
],
[
42.911528804897564,
76.70537292977619
],
[
42.91143870500213,
76.74749913047408
],
[
42.904666267479705,
76.73933236046092
]
]
},
{
"EOF": true,
"RESPONSE_TIME": 21
}
]
}
}
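The Python sketch below illustrates the two ideas in this section: parsing a location field into lat/lon vectors labeled by id, and measuring the haversine distance in meters between two such vectors. It is not Solr's implementation. It assumes the location field arrives as a "lat,lon" string, uses hypothetical ids, and uses a mean Earth radius of 6371008.8 meters, which may differ from the constant behind haversineMeters.

```python
import math

EARTH_RADIUS_M = 6371008.8  # assumed mean Earth radius in meters


def latlon_vectors(tuples, field):
    """Parse "lat,lon" strings into rows of [lat, lon], labeled by each tuple's id."""
    labels, rows = [], []
    for t in tuples:
        lat, lon = (float(part) for part in t[field].split(","))
        labels.append(t["id"])
        rows.append([lat, lon])
    return labels, rows


def haversine_meters(p, q):
    """Great-circle distance in meters between two [lat, lon] vectors in degrees."""
    lat1, lon1, lat2, lon2 = (math.radians(v) for v in (p[0], p[1], q[0], q[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


# Hypothetical ids; coordinates taken from the first two rows of the response above.
docs = [
    {"id": "doc1", "loc_p": "42.87183530723629,76.74102353397778"},
    {"id": "doc2", "loc_p": "42.91372904094898,76.72874889228416"},
]
labels, b = latlon_vectors(docs, "loc_p")
print(labels, haversine_meters(b[0], b[1]))
```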