Elasticsearch aggs returns wrong count numbers

aggregation, elasticsearch

I'm trying to run some aggregation queries and I'm running into an issue.

GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "group_by": {
      "terms": {
        "field": "category"
      }
    }
  }
}

This returns:

"hits": {
  "total": 180,
  "max_score": 0,
  "hits": []
 },
"aggregations": {
  "group_by": {
     "doc_count_error_upper_bound": 0,
     "sum_other_doc_count": 1,
     "buckets": [
        {
           "key": "pf_rd_m",
           "doc_count": 139
        },
        {
           "key": "other",
           "doc_count": 13
        },
        {
           "key": "_encoding",
           "doc_count": 12
        },
        {
           "key": "ie",
           "doc_count": 10
        },
        {
           "key": "cadeaux",
           "doc_count": 2
        },
        {
           "key": "cartes",
           "doc_count": 2
        },
        {
           "key": "cheques",
           "doc_count": 2
        },
        {
           "key": "home",
           "doc_count": 2
        },
        {
           "key": "nav_logo",
           "doc_count": 1
        },
        {
           "key": "ref",
           "doc_count": 1
        }
     ]
  }

}

As you can see, hits.total says there are 180 documents, but if I sum the doc_count of every bucket (plus sum_other_doc_count), I get more than that…
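A quick sanity check of the numbers in the response above (with an analyzed field, one document can be tokenized into several terms and so land in several buckets, which is how the sum can exceed the total):

```python
# doc_count values copied from the buckets in the response above
bucket_counts = [139, 13, 12, 10, 2, 2, 2, 2, 1, 1]
sum_other = 1  # sum_other_doc_count from the response

# One document can contribute to more than one bucket, so this
# can legitimately be larger than hits.total (180).
total_in_buckets = sum(bucket_counts) + sum_other
print(total_in_buckets)  # 185
```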

This is almost certainly due to Elasticsearch's tokenization mechanism (https://www.elastic.co/guide/en/elasticsearch/guide/current/aggregations-and-analysis.html).

So I tried the solution from that page, but it's still not working. Here is my mapping:

"properties":{
                            "status":{
                              "type":"integer",
                              "index":"analyzed"
                            },
                            "category":{
                              "type":"string",
                              "fields": {
                                "raw" : {
                                  "type": "string",
                                  "index": "not_analyzed"
                                }
                              }
                            },
                            "dynamic_templates": [
                                { "notanalyzed": {
                                      "match":              "*",
                                      "match_mapping_type": "string",
                                      "mapping": {
                                          "type":        "string",
                                          "index":       "not_analyzed"
                                      }
                                   }
                                }
                              ]
                          }

As you can see, I have a field named "category" and added "raw" as a not_analyzed string sub-field, but it still returns the wrong numbers.

When I try this:

GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "group_by": {
      "terms": {
        "field": "category.raw"
      }
    }
  }
}

This returns:

"hits": {
  "total": 180,
  "max_score": 0,
  "hits": []
},
"aggregations": {
  "group_by": {
     "doc_count_error_upper_bound": 0,
     "sum_other_doc_count": 0,
     "buckets": []
  }
}

It's very strange. Any help?

Best Answer

As described in the documentation,

the document counts (and the results of any sub aggregations) in the terms aggregation are not always accurate. This is because each shard provides its own view of what the ordered list of terms should be and these are combined to give a final view

To overcome this issue at the expense of extra resources, the shard_size parameter can be used.
Again, from the documentation:
Shard Size

The higher the requested size is, the more accurate the results will be, but also, the more expensive it will be to compute the final results (both due to bigger priority queues that are managed on a shard level and due to bigger data transfers between the nodes and the client). The shard_size parameter can be used to minimize the extra work that comes with bigger requested size. When defined, it will determine how many terms the coordinating node will request from each shard. Once all the shards responded, the coordinating node will then reduce them to a final result which will be based on the size parameter - this way, one can increase the accuracy of the returned terms and avoid the overhead of streaming a big list of buckets back to the client. If set to 0, the shard_size will be set to Integer.MAX_VALUE.
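The mechanism described in the quote can be sketched with a small simulation. The term counts below are hypothetical, not the data from the question: each shard reports only its own top-shard_size terms, so a term that is frequent overall but outside some shard's local top list loses counts in the merged result.

```python
from collections import Counter

# Hypothetical per-shard term counts (illustrative only).
shard1 = Counter({"a": 10, "b": 8, "c": 3, "d": 2})
shard2 = Counter({"c": 9, "d": 6, "a": 4, "b": 1})

def merge(shards, size, shard_size):
    """Each shard reports only its top-`shard_size` terms; the coordinating
    node sums those partial lists and keeps the overall top-`size`."""
    merged = Counter()
    for shard in shards:
        merged.update(dict(shard.most_common(shard_size)))
    return dict(merged.most_common(size))

# With shard_size=2, shard1 never reports "c" (count 3) and shard2 never
# reports "a" (count 4), so the merged counts come up short:
print(merge([shard1, shard2], size=3, shard_size=2))  # {'a': 10, 'c': 9, 'b': 8}
# With shard_size large enough to cover every term, the counts are exact:
print(merge([shard1, shard2], size=3, shard_size=4))  # {'a': 14, 'c': 12, 'b': 9}
```

Raising shard_size trades memory and network transfer for accuracy, exactly as the quoted documentation describes.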

If you add the shard_size parameter to the query:

GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "group_by": {
      "terms": {
        "field": "category.raw",
        "shard_size": 0
      }
    }
  }
}
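
A side note on the empty "category.raw" buckets in the question (a diagnosis sketch, not something the quoted documentation covers): two details of the posted mapping stand out. First, "dynamic_templates" is nested inside "properties", but it belongs at the root of the type mapping. Second, a multi-field like "raw" that is added after documents were already indexed stays empty until those documents are reindexed, which would explain why the aggregation on it returns no buckets. A corrected mapping might look like this (the type name "my_type" is a placeholder; "index": "analyzed" was dropped from "status" because it does not apply to integer fields):

```
PUT /my_index
{
  "mappings": {
    "my_type": {
      "dynamic_templates": [
        {
          "notanalyzed": {
            "match": "*",
            "match_mapping_type": "string",
            "mapping": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      ],
      "properties": {
        "status": {
          "type": "integer"
        },
        "category": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}
```

After recreating the index with this mapping, reindex the existing documents so that "category.raw" is actually populated.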