Accurate Distinct Count and Values from Elasticsearch.

Pratik Patil
5 min readJan 20, 2021

Need Accurate Distinct count of fields from Elasticsearch documents ? Cardinality won’t always work.

Elasticsearch is a popular choice for many Analytical Products as it supports a lot of aggregations and provides option to inject a script in your query which will process documents and return response as per your use case. But when it comes to providing distinct count of a field, Elasticsearch does not provide accuracy which is much needed for Analytics Product.

In this article, we will be seeing an approach to fetch Distinct Count as well as fetch those Distinct values from a field in Elasticsearch.

The Accuracy Problem

When finding Distinct count of values for a field, Cardinality is the direct aggregation which elasticsearch offers. But, Cardinality does not ensure accurate count. Cardinality in ES is equivalent to SQL Statement:

select count(distinct column) from table;

In this article, we will be running our queries through a dataset that I have prepared. You can download it from here. This dataset is a json file which has 100K records for a Product Index. Here are the details of each field in Product Index:

name : Name of the Product : String

productId : Unique UUID : String

brand : Brand of the Product : String

seqId : Unique sequence Id : long

sellerId : Id of the seller of Product : long

price : Price of the Product : float

You can load this file in your local Elasticsearch with below cURL command:

curl -H "Content-Type: application/json" -XPOST "localhost:9200/products/_bulk?pretty&refresh" --data-binary "@products.json"

Note: The queries in this article are run on ES version 7.9.2 on a 3 node cluster. We will be using the default shard and other settings and also let ES dynamically map the fields for our article to see how the solution works without any setting changes.

Now, to test accuracy of Cardinality, let’s run cardinality in comparison to value_count aggregation on “seqId”. Since seqId is a unique field, the total count should be 100K. Cardinality also has a “precision_threshold” option which improves accuracy and has a highest value of 40K. Let’s test it too.

GET products/_search?size=0
{
"aggs": {
"Value_Count": {
"value_count": {
"field": "seqId"
}
},
"Cardinality_Default": {
"cardinality": {
"field": "seqId"
}
},
"Cardinality_40K": {
"cardinality": {
"field": "seqId",
"precision_threshold": 40000
}
}
}
}

Output:

"aggregations" : {
"Value_Count" : {
"value" : 100000
},
"Cardinality_Default" : {
"value" : 99763
},
"Cardinality_40K" : {
"value" : 99995
}
}

As can be seen, cardinality even with highest precision threshold does not return accurate count.

Scripted Metric: The Savior — Distinct Count

Scripted Metric runs scripts in 4 stages which we will be using for our solution. For more details on stages of scripted_metric, please read through the official document. Now, let’s write our query to find distinct count of seqId.

Query to Fetch Distinct Count of a field:

GET products/_search?size=0
{
"aggs": {
"Distinct_Count": {
"scripted_metric": {
"params": {
"fieldName": "seqId"
},
"init_script": "state.list = []",
"map_script": """
if(doc[params.fieldName] != null)
state.list.add(doc[params.fieldName].value);
""",
"combine_script": "return state.list;",
"reduce_script": """
Map uniqueValueMap = new HashMap();
int count = 0;
for(shardList in states) {
if(shardList != null) {
for(key in shardList) {
if(!uniqueValueMap.containsKey(key)) {
count +=1;
uniqueValueMap.put(key, key);
}
}
}
}
return count;
"""
}
}
}
}

Output:

"aggregations" : {
"Distinct_Count" : {
"value" : 100000
}
}

The count here is accurate. We initialized our state list in “init_script”. In “map_script”, we collected the field value from each document. “combine_script” returned the list of field values from each shard. And finally, in “reduce_script” we fetched the unique count of field values.

Scripted Metric: The Savior — Distinct Values

How about fetching the distinct values of the field. Now, let’s modify the above query to find count of unique brands in the products index and also return the names of those unique brands:

Query to Fetch Distinct Count and Distinct Values of a field:

GET products/_search?size=0
{
"aggs": {
"Distinct_Count": {
"scripted_metric": {
"params": {
"fieldName": "brand"
},
"init_script": "state.list = []",
"map_script": """
if(params['_source'][params.fieldName] != null)
state.list.add(params['_source'][params.fieldName]);
""",
"combine_script": "return state.list;",
"reduce_script": """
Map uniqueValueMap = new HashMap();
int count = 0;
for(shardList in states) {
if(shardList != null) {
for(key in shardList) {
if(!uniqueValueMap.containsKey(key)) {
count +=1;
uniqueValueMap.put(key, key);
}
}
}
}
return count;
"""
}
},
"Distinct_Values": {
"scripted_metric": {
"params": {
"fieldName": "brand"
},
"init_script": "state.list = []",
"map_script": """
if(params['_source'][params.fieldName] != null)
state.list.add(params['_source'][params.fieldName]);
""",
"combine_script": "return state.list;",
"reduce_script": """
Map uniqueValueMap = new HashMap();
List uniqueValueList = new ArrayList();
for(shardList in states) {
if(shardList != null) {
for(key in shardList) {
if(!uniqueValueMap.containsKey(key)) {
uniqueValueList.add(key);
uniqueValueMap.put(key, key);
}
}
}
}
return uniqueValueList;
"""
}
}
}
}

Output:

"aggregations" : {
"Distinct_Count" : {
"value" : 10
},
"Distinct_Values" : {
"value" : [
"WROGN",
"Zivame",
"Go Colors",
"Rodamo",
"GAS",
"Adidas",
"H&M",
"Puma",
"Peter England",
"Roadster"
]
}
}

These scripts can be converted to Stored Scripts as we have made them generic enough to work on any field that you want to apply for. Using Stored Scripts will be a cleaner approach.

Accuracy Comparison

Cardinality does provide accurate count up-to a certain limit of documents. I tried a comparison of Cardinality, Cardinality with precision_threshold of 40K and our Scripted Metric solution. Here is an accuracy comparison:

Accuracy Comparison: Distinct Count

As can be observed, for small amount of distinct values, Cardinality returns accurate results. As the number of distinct values increase, Cardinality with a default precision_threshold loses its accuracy. With even more distinct values, even Cardinality with 40K precision_threshold loses its accuracy. Whereas our implementation of Distinct Count using scripted_metric always returns accurate count irrespective of the number of unique values.

Conclusion

Our Scripted Metric based solution always returns accurate results. But note that just like any other script aggregation, scripted_metric can result in slower search speeds. Also, it is memory intensive. Use it in cases where Accuracy is of utmost importance and the total Distinct values of a field are many or are expected to grow.

On the other hand, observe that Cardinality works well for smaller count of Unique documents. Note that Cardinality’s accuracy is affected by Total Distinct values of the field and not by Total Documents. So, Cardinality with its precision_threshold can work well in cases where the Distinct values of a field are limited.

--

--

Pratik Patil

Backend Tech Lead @ Tekion. #TechnologyEnthusiast #Traveller