Tuesday 2 September 2014

Something subtle about logstash s3 input

There are two settings for the s3 input plugin for logstash whose criticality went under my radar: delete and backup_to_bucket. In some situations they make or break the feasibility of the plugin entirely, so much so that they ended up forcing me to significantly rework the architecture I had in mind for log processing.

People who have worked with the s3 input for logstash extensively will have experienced a few major issues with any bucket/key prefix combination that contains a large number of objects:
  1. Extremely long start-up time due to the registration of the s3 plugin within the logstash runtime.
  2. Long latency before new files in s3 are picked up by logstash for processing.
Both of these problems stem from a quirk of the AWS s3 API: the list objects call returns at most 1000 items per request. The problem is further amplified by the fact that the logstash s3 input does not deal with this intelligently; it queries everything upfront and then processes it sequentially, rather than processing the first 1000 items returned before moving on to the next batch.
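To put that quirk in concrete terms, here is a minimal sketch of what a full scan of a prefix costs, written in Python against the current AWS SDK purely for illustration (the plugin itself is Ruby, and the bucket and prefix names are made up):

    import boto3  # illustration only; the logstash s3 input itself uses the Ruby AWS SDK

    s3 = boto3.client("s3")

    def list_all_keys(bucket, prefix):
        """Walk an entire prefix, at most 1000 keys per request (the hard cap of the list API)."""
        keys, token = [], None
        while True:
            kwargs = {"Bucket": bucket, "Prefix": prefix, "MaxKeys": 1000}
            if token:
                kwargs["ContinuationToken"] = token
            resp = s3.list_objects_v2(**kwargs)
            keys.extend(obj["Key"] for obj in resp.get("Contents", []))
            if not resp.get("IsTruncated"):
                # ~86k objects means ~87 consecutive round trips before processing even starts
                return keys
            token = resp["NextContinuationToken"]

    # keys = list_all_keys("my-log-drop-zone", "iis/")  # hypothetical bucket and prefix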

The delete and backup_to_bucket settings allow a configuration where logstash collects files to process from a target drop-zone location, backs each one up to another location (backup_to_bucket) for archival, and then deletes the original. That is as opposed to dumping all the data into a single s3 location, pointing logstash at it, and telling it to poll for updates.
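To make the shape of that setup concrete, here is a minimal sketch, again in Python purely for illustration (this is not the plugin's code, and the bucket names are hypothetical): list the drop zone, process each object, archive it, then delete the original so the next listing only ever sees unprocessed files.

    import boto3  # sketch only; the real work happens inside the logstash s3 input

    s3 = boto3.client("s3")
    DROP_BUCKET = "log-drop-zone"    # hypothetical drop-zone bucket logstash polls
    ARCHIVE_BUCKET = "log-archive"   # hypothetical backup_to_bucket target

    def process(raw_bytes):
        """Placeholder for whatever actually consumes the log lines (logstash filters/outputs)."""
        pass

    def drain_drop_zone(prefix="iis/"):
        resp = s3.list_objects_v2(Bucket=DROP_BUCKET, Prefix=prefix)
        for obj in resp.get("Contents", []):
            key = obj["Key"]
            body = s3.get_object(Bucket=DROP_BUCKET, Key=key)["Body"].read()
            process(body)
            # archive, then remove the original so it never shows up in a listing again
            s3.copy_object(Bucket=ARCHIVE_BUCKET, Key=key,
                           CopySource={"Bucket": DROP_BUCKET, "Key": key})
            s3.delete_object(Bucket=DROP_BUCKET, Key=key)

Note that this sketch archives with a server-side copy; as mentioned further down, the plugin itself currently re-uploads its local copy instead.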

This approach also removes the reliance on the sincedb file the logstash s3 input uses to keep track of how far it has ingested. Logstash can simply crunch everything it finds in the target location. That means the logstash server can be put in an auto scaling group and thrown away at will, without having to worry about backing up, restoring, or losing the sincedb file.

Take my setup as an example. We ship all our logs to an S3 location and fan out from there into EMR, Redshift, Logstash, etc. With our IIS logs, having only 10 web servers shipping 1 log per site per hour, and assuming an average of 2 IIS sites per server, that comes to 10 servers x 2 sites x 24 hours x 30 days = 14,400 s3 objects per month. In just 6 months there would be around 86k objects in s3, which is roughly 86 consecutive HTTP requests just to scan the entire drop zone for changes; that makes it completely unfeasible. Just imagine what happens 3 years down the line.

Quite frankly, I would rather see the logstash s3 input intelligently iterate over and prioritise the yyyy/mm/dd key partitioning that most people use for archiving large numbers of objects in s3, scanning for and ingesting the latest entries first (a rough sketch of what I mean closes out this post). The combination of delete and backup_to_bucket, while it solves the problem, introduces new ones:
  1. Data cannot reach EMR and Redshift until logstash has finished processing it, since everything else now has to consume from the archive location rather than the drop zone.
  2. The logstash s3 input plugin code, as of the time of this post, uploads its local copy of the downloaded file to the backup location rather than using the s3 copy operation. That extra round trip can corrupt the file under unforeseen circumstances.
Though I doubt my wish will come true anytime soon, so for the time being I guess delete and backup_to_bucket are the best we have to work with.
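
For what it's worth, the newest-first scan over date-partitioned keys I have in mind would not be much code. A rough sketch, with a hypothetical bucket and prefix, again in Python purely for illustration:

    from datetime import date, timedelta

    import boto3

    s3 = boto3.client("s3")

    def newest_first_prefixes(base, days):
        """Yield yyyy/mm/dd key prefixes starting from today, freshest first."""
        for offset in range(days):
            d = date.today() - timedelta(days=offset)
            yield "{0}{1:%Y/%m/%d}/".format(base, d)

    def scan_latest(bucket, base="iis/", days=7):
        """Scan the most recent date partitions first instead of listing the whole bucket upfront."""
        for prefix in newest_first_prefixes(base, days):
            resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
            for obj in resp.get("Contents", []):
                yield obj["Key"]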
