Elasticsearch aggs returns wrong count numbers

aggregation, elasticsearch

I'm trying to run some aggregation queries and I'm running into an issue.

GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "group_by": {
      "terms": {
        "field": "category"
      }
    }
  }
}

This returns:

"hits": {
  "total": 180,
  "max_score": 0,
  "hits": []
 },
"aggregations": {
  "group_by": {
     "doc_count_error_upper_bound": 0,
     "sum_other_doc_count": 1,
     "buckets": [
        {
           "key": "pf_rd_m",
           "doc_count": 139
        },
        {
           "key": "other",
           "doc_count": 13
        },
        {
           "key": "_encoding",
           "doc_count": 12
        },
        {
           "key": "ie",
           "doc_count": 10
        },
        {
           "key": "cadeaux",
           "doc_count": 2
        },
        {
           "key": "cartes",
           "doc_count": 2
        },
        {
           "key": "cheques",
           "doc_count": 2
        },
        {
           "key": "home",
           "doc_count": 2
        },
        {
           "key": "nav_logo",
           "doc_count": 1
        },
        {
           "key": "ref",
           "doc_count": 1
        }
     ]
  }

}

As you can see, hits.total says there are 180 documents, but if I sum the doc_count of every bucket (plus sum_other_doc_count), I get more than that…
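A quick sanity check of the numbers in the response above (with an analyzed field, one document can be tokenized into several terms and so land in several buckets, which is how the sum can exceed the total):

```python
# doc_count values copied from the buckets in the response above
bucket_counts = [139, 13, 12, 10, 2, 2, 2, 2, 1, 1]
sum_other = 1  # sum_other_doc_count from the response

# One document can contribute to more than one bucket, so this
# can legitimately be larger than hits.total (180).
total_in_buckets = sum(bucket_counts) + sum_other
print(total_in_buckets)  # 185
```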

This is almost certainly due to Elasticsearch's tokenization mechanism (https://www.elastic.co/guide/en/elasticsearch/guide/current/aggregations-and-analysis.html).

So I tried the solution from that page, but it's still not working. Here is my mapping:

"properties":{
                            "status":{
                              "type":"integer",
                              "index":"analyzed"
                            },
                            "category":{
                              "type":"string",
                              "fields": {
                                "raw" : {
                                  "type": "string",
                                  "index": "not_analyzed"
                                }
                              }
                            },
                            "dynamic_templates": [
                                { "notanalyzed": {
                                      "match":              "*",
                                      "match_mapping_type": "string",
                                      "mapping": {
                                          "type":        "string",
                                          "index":       "not_analyzed"
                                      }
                                   }
                                }
                              ]
                          }

As you can see, I have a field named "category" and added "raw" as a not_analyzed string sub-field, but it still returns the wrong numbers.

When I try this:

GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "group_by": {
      "terms": {
        "field": "category.raw"
      }
    }
  }
}

This returns:

"hits": {
  "total": 180,
  "max_score": 0,
  "hits": []
},
"aggregations": {
  "group_by": {
     "doc_count_error_upper_bound": 0,
     "sum_other_doc_count": 0,
     "buckets": []
  }
}

It's very strange. Any help?

Best Answer

As described in the documentation,

the document counts (and the results of any sub aggregations) in the terms aggregation are not always accurate. This is because each shard provides its own view of what the ordered list of terms should be and these are combined to give a final view

To overcome this issue at the expense of extra resources, the shard_size parameter can be used.
Again, from the documentation:
Shard Size

The higher the requested size is, the more accurate the results will be, but also, the more expensive it will be to compute the final results (both due to bigger priority queues that are managed on a shard level and due to bigger data transfers between the nodes and the client). The shard_size parameter can be used to minimize the extra work that comes with bigger requested size. When defined, it will determine how many terms the coordinating node will request from each shard. Once all the shards responded, the coordinating node will then reduce them to a final result which will be based on the size parameter - this way, one can increase the accuracy of the returned terms and avoid the overhead of streaming a big list of buckets back to the client. If set to 0, the shard_size will be set to Integer.MAX_VALUE.
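The mechanism described in the quote can be sketched with a small simulation. The term counts below are hypothetical, not the data from the question: each shard reports only its own top-shard_size terms, so a term that is frequent overall but outside some shard's local top list loses counts in the merged result.

```python
from collections import Counter

# Hypothetical per-shard term counts (illustrative only).
shard1 = Counter({"a": 10, "b": 8, "c": 3, "d": 2})
shard2 = Counter({"c": 9, "d": 6, "a": 4, "b": 1})

def merge(shards, size, shard_size):
    """Each shard reports only its top-`shard_size` terms; the coordinating
    node sums those partial lists and keeps the overall top-`size`."""
    merged = Counter()
    for shard in shards:
        merged.update(dict(shard.most_common(shard_size)))
    return dict(merged.most_common(size))

# With shard_size=2, shard1 never reports "c" (count 3) and shard2 never
# reports "a" (count 4), so the merged counts come up short:
print(merge([shard1, shard2], size=3, shard_size=2))  # {'a': 10, 'c': 9, 'b': 8}
# With shard_size large enough to cover every term, the counts are exact:
print(merge([shard1, shard2], size=3, shard_size=4))  # {'a': 14, 'c': 12, 'b': 9}
```

Raising shard_size trades memory and network transfer for accuracy, exactly as the quoted documentation describes.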

If you add the shard_size parameter to the query:

GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "group_by": {
      "terms": {
        "field": "category.raw",
        "shard_size": 0
      }
    }
  }
}
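
A side note on the empty "category.raw" buckets in the question (a diagnosis sketch, not something the quoted documentation covers): two details of the posted mapping stand out. First, "dynamic_templates" is nested inside "properties", but it belongs at the root of the type mapping. Second, a multi-field like "raw" that is added after documents were already indexed stays empty until those documents are reindexed, which would explain why the aggregation on it returns no buckets. A corrected mapping might look like this (the type name "my_type" is a placeholder; "index": "analyzed" was dropped from "status" because it does not apply to integer fields):

```
PUT /my_index
{
  "mappings": {
    "my_type": {
      "dynamic_templates": [
        {
          "notanalyzed": {
            "match": "*",
            "match_mapping_type": "string",
            "mapping": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      ],
      "properties": {
        "status": {
          "type": "integer"
        },
        "category": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}
```

After recreating the index with this mapping, reindex the existing documents so that "category.raw" is actually populated.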