Thursday 4 September 2014

Deduping entries with logstash and elasticsearch

There's one annoying little problem with logstash and file / block based ingestion. In certain scenarios the whole file gets reprocessed, so lines that have already been processed are processed again. These scenarios include:
  1. S3 files being re-uploaded with additional data appended.
    Logstash processes every file dated after the datetime recorded by the last processing run, so a re-uploaded file is read again from the start.
  2. Local file system files being overwritten (copy pasted over) with updated copies from other locations.
    Files are handled with descriptors, so they can be renamed or moved without affecting logstash tailing the file. However, when a file is overwritten, it is treated as a completely different file and all the data in it is reprocessed.
  3. Losing / deleting the sincedb file used to track progress.
  4. And I'm sure there are more.
The solution is actually surprisingly simple. Calculate a hash of the event message and use that as the document id for elasticsearch, so a reprocessed line overwrites the document it created earlier instead of adding a duplicate. Here's a sample config:

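input {
  # something. anything.
}

filter {
  # Copy the raw message into a new field that will hold the checksum.
  mutate {
    add_field => ["logstash_checksum", "%{message}"]
  }
  # Replace the copied message with its MD5 hash, keyed with "key".
  anonymize {
    fields => ["logstash_checksum"]
    algorithm => "MD5"
    key => "a"
  }
}

output {
  elasticsearch {
    host => "127.0.0.1"
    # Use the hash as the document id so a reprocessed line overwrites itself.
    document_id => '%{logstash_checksum}'
  }
}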
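
To see why this works, here's a minimal Python sketch of the idea rather than of logstash itself. It assumes the checksum is a keyed MD5 digest (HMAC) of the message, which is how I understand the anonymize filter to behave, and it uses a plain dict keyed by document id as a stand-in for an elasticsearch index. The names doc_id and index_event and the sample log lines are made up for illustration.

```python
import hashlib
import hmac


def doc_id(message, key="a"):
    """Keyed MD5 digest of the message (assumption: HMAC-MD5), used as the document id."""
    return hmac.new(key.encode("utf-8"), message.encode("utf-8"), hashlib.md5).hexdigest()


def index_event(index, message):
    """Store the event under its checksum; an existing id is overwritten, not duplicated."""
    index[doc_id(message)] = {"message": message}


index = {}  # stand-in for an elasticsearch index: document id -> document
lines = [
    "2014-09-04 10:00:01 GET /index.html 200",
    "2014-09-04 10:00:02 GET /about.html 404",
]

for line in lines:  # first pass over the file
    index_event(index, line)
for line in lines:  # the whole file gets reprocessed
    index_event(index, line)

print(len(index))  # 2
```

The second pass writes to the same two ids as the first, so the number of documents stays at two no matter how many times the file is replayed.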

Note that this works best with events that already contain a timestamp, such as web server access logs (IIS, Apache, etc.) and load balancer logs. It would be a bad idea to apply this technique to stream based log entries that rely on the timestamp assigned by logstash at the time of ingestion, because two separate events with identical message text would hash to the same document id and one would overwrite the other.
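
Here's a small Python sketch of that caveat, under the same assumption as before that the checksum is an HMAC-MD5 of the message text only. The doc_id helper and the sample messages are hypothetical.

```python
import hashlib
import hmac


def doc_id(message, key="a"):
    """Keyed MD5 digest of the message text only (assumption: HMAC-MD5)."""
    return hmac.new(key.encode("utf-8"), message.encode("utf-8"), hashlib.md5).hexdigest()


# Two separate events whose text carries no timestamp of its own.
stream_events = ["connection reset by peer", "connection reset by peer"]

# The same two events when each line already includes its own timestamp.
access_log = [
    "2014-09-04 10:00:01 connection reset by peer",
    "2014-09-04 10:00:07 connection reset by peer",
]

stream_ids = {doc_id(m) for m in stream_events}
access_ids = {doc_id(m) for m in access_log}

print(len(stream_ids))  # 1: the second event would overwrite the first
print(len(access_ids))  # 2: both events are kept
```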

Comments:

  1. This looks perfect, it'll write the message to a document_id that already exists, so you don't get a duplicate entry. One thing, is there a way to check if the document_id exists in the index before writing to the output?

    Replies
    1. Extremely late reply, but not that I'm aware of.