Wednesday, 12 September 2018

Level up logs and ELK - ElasticSearch Replication Factor and Retention - VRR Estimation Strategy

Articles index:

  1. Introduction (Everyone)
  2. JSON as logs format (Everyone)
  3. Logging best practices with Logback (Targeting Java DEVs)
  4. Logging cutting-edge practices (Targeting Java DEVs)
  5. Contract first log generator (Targeting Java DEVs)
  6. ElasticSearch VRR Estimation Strategy (Targeting OPS)
  7. VRR Java + Logback configuration (Targeting OPS)
  8. VRR FileBeat configuration (Targeting OPS)
  9. VRR Logstash configuration and Index templates (Targeting OPS)
  10. VRR Curator configuration (Targeting OPS)
  11. Logstash Grok, JSON Filter and JSON Input performance comparison (Targeting OPS)

ElasticSearch VRR Estimation Strategy


Estimating how much storage ElasticSearch needs to store your logs mostly depends on three variables:

  • Replication factor: ElasticSearch can allocate copies of your logs across distributed nodes so it becomes fault tolerant. If a node dies, others can continue reading and writing data for that index/shard.
  • Retention: ElasticSearch cannot decide by itself how long data should be kept. However, other tools like Curator can help define this variable. Curator will be our option in this series of articles.
  • Volume: the logs to be stored; an obvious variable.

We are not going to consider other variables, but feel free to dig in anytime with this document straight from the horse's mouth: The true story behind Elasticsearch storage requirements.

Monolithic estimation strategy

In order to estimate the storage needed by an application across all ElasticSearch nodes, you'll need to make the following simple calculation:

replication factor * retention (days) * volume (per day) = total storage needed.

One of my clients had 4TB of logs per day (averaged). Let's make some numbers:
4TB/day * 365 days retention required * 3 copies = 4380TB storage required.
Someone suffered a heart attack in that meeting...
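
For reference, here is the same arithmetic as a tiny Python sketch (numbers taken from the example above):

# Monolithic estimation: every log line gets the same retention and the same
# number of copies, regardless of its importance.
def monolithic_storage_tb(volume_tb_per_day, retention_days, copies):
    return volume_tb_per_day * retention_days * copies

print(monolithic_storage_tb(4, 365, 3))  # 4380 (TB)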

Variable-retention-replication (VRR) estimation strategy

The previous calculation has an implicit assumption: all information from the same file shares the same importance, and therefore the same retention and replication factor. As a software developer and log generator, I know that's radically false.

The only way to solve this problem is to admit that not all the information in an application log file is equally important. If we consider three categories, "low importance", "important" and "critical", we would need all project managers to define their logs' importance as a matrix of "importance percentage", "retention" and "replication", almost line by line (the VRR matrix).

VRR (Variable Replication and Retention) Matrix for example application:

Being generous, most of the logs from the example application are useless after one or two weeks; they are only useful for investigating problems if they arise. This is purely debug information. In order to be conservative in the calculations, we'll say it's 89%, and it is required for two weeks retention with no replication (1 copy in total).

Around 10% of the example application logs could be considered important: user tracking/activity, application events and synthetic information about errors. We also build most of the dashboards from this information, and for that reason we need replication too (replication helps with search performance). This 10% of the logs is required for three months retention and single replication (2 copies in total).

Around 1% of the example application logs are critical: information like user audit, log-in activity and product hiring/purchase events. This 1% of the logs is required for 53 weeks retention and double replication (3 copies in total; it's important not to lose this information).

Example application VRR Matrix:
Importance   Percentage   Retention   Replication
Debug        89%          14 days     1 copy
Important    10%          90 days     2 copies
Critical     1%           371 days    3 copies

As an academic example, let's apply this matrix to the 4TB we mentioned before, knowing there would be a VRR matrix per application and that those 4TB belonged to dozens of them.
The hypothetical result would be:

4TB * (0.89 * 14 * 1 + 0.10 * 90 * 2 + 0.01 * 371 * 3 ) = 166TB

By these numbers, we need just 166TB to cover a whole year of logs; that's under 4% of the 4380TB we originally calculated with the monolithic strategy.
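
A small Python sketch of the same estimation, using the example VRR matrix above (replication expressed as total copies, as in the table):

# VRR estimation: each importance class has its own share of the volume,
# its own retention and its own number of copies.
vrr_matrix = [
    # (importance, share of volume, retention in days, total copies)
    ("debug",     0.89,  14, 1),
    ("important", 0.10,  90, 2),
    ("critical",  0.01, 371, 3),
]

def vrr_storage_tb(volume_tb_per_day, matrix):
    return volume_tb_per_day * sum(share * days * copies
                                   for _, share, days, copies in matrix)

print(round(vrr_storage_tb(4, vrr_matrix)))  # ~166 (TB)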

Maybe comparing against an entire year of 3 copies of everything is a bit unfair, so let's compare against other strategies:
  • 1 month, 3 copies, monolithic strategy -> 372TB, more than double
  • 1 year, 1 copy, monolithic strategy -> 1484TB, 9 times more (this is almost the 90% promised in the clickbait!)
You need to go as low as 2 weeks with 3 copies to find a comparable size -> 168TB. Now, what do you prefer?

a) 1 year, 3 copies for critical info; 3 months, 2 copies for important; 2 weeks, single copy for the rest, or...

b) 3 copies, 2 weeks for everything.

How does VRR translate to configuration?

Replication is an index property, which tells us we need a different index per "importance". We need to tell ElasticSearch what the replication policy is when we create the index (it could also be changed after creation, but doing it upfront is just more convenient).
See VRR Logstash configuration
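
The details are in the article above; purely as an illustration of the idea (not the configuration actually used in this series), an index template per importance could set the replicas like this, assuming hypothetical vrr-<service>-<importance>-<date> index names and the pre-7.x _template API:

# Sketch only: one legacy index template per importance class, each with its
# own number_of_replicas (ElasticSearch replicas = total copies - 1).
# Index pattern and template names are assumptions for illustration.
import requests

REPLICAS = {"debug": 0, "important": 1, "critical": 2}

for importance, replicas in REPLICAS.items():
    template = {
        "index_patterns": ["vrr-*-%s-*" % importance],
        "settings": {"number_of_replicas": replicas},
    }
    requests.put("http://localhost:9200/_template/vrr-%s" % importance,
                 json=template)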

Retention is a Curator policy that we will apply per index.
See VRR Curator configuration
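
The real retention setup is a Curator action file, covered in the article above; purely to illustrate the logic Curator applies, here is a sketch assuming the same hypothetical vrr-<service>-<importance>-YYYY.MM.dd index names:

# Sketch of the retention decision only: an index becomes deletable once it is
# older than the retention of its importance class.
from datetime import datetime, timedelta

RETENTION_DAYS = {"debug": 14, "important": 90, "critical": 371}

def expired(index_name, today):
    prefix, service, importance, date_str = index_name.split("-")
    index_date = datetime.strptime(date_str, "%Y.%m.%d")
    return today - index_date > timedelta(days=RETENTION_DAYS[importance])

print(expired("vrr-myservice-debug-2018.08.01", datetime(2018, 9, 12)))  # True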

Logstash will create the indices in ElasticSearch depending on the importance, using index templates. We need to put logs of different "importance" into different indices. Logstash can do that as long as these logs are "tagged" in a way Logstash understands (e.g. JSON fields in the logs).
See VRR Logstash configuration and VRR FileBeat configuration
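
The actual routing lives in the Logstash and FileBeat configuration referenced above; conceptually, the decision is as simple as this sketch (the "importance" field and the index naming are assumptions for illustration):

# Pick the target index from the event's importance tag; untagged events fall
# back to the lowest importance.
def target_index(event, service, date_str):
    importance = event.get("importance", "debug")
    return "vrr-%s-%s-%s" % (service, importance, date_str)

print(target_index({"message": "payment accepted", "importance": "critical"},
                   "myservice", "2018.09.12"))
# vrr-myservice-critical-2018.09.12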

Logs in JSON format can be easily tagged without extra OPS time; untagged logs will be treated as the lowest importance, so it's the developers' responsibility to tag them.
See VRR Java + Logback configuration
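
What a tagged JSON log line could look like (field names are hypothetical; the actual Logback setup is in the article above):

# A log event tagged with its importance, serialised as one JSON line.
import json

event = {
    "@timestamp": "2018-09-12T10:15:30.123Z",
    "level": "INFO",
    "logger": "com.example.payments.PaymentService",
    "message": "payment accepted",
    "importance": "critical",  # the tag the VRR routing relies on
}
print(json.dumps(event))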



Steps to implement VRR Strategy

  1. Developers need to tag all log lines with their "importance". As explained before, if they use JSON, it will be easier for everyone. Untagged logs will be considered of the lowest importance.
  2. Project managers and developers need to define VRR Matrix to estimate log storage requirements for OPS.
  3. For all applications implementing VRR, OPS need a single special entry in Logstash to create importance-dependent indices in ElasticSearch using the VRR matrix information. Index names will contain the service name, date and importance so Curator can distinguish them. They also need to change the index templates accordingly.
  4. Curator configuration needs to be aware of the VRR matrix so it can remove information as soon as allowed.
OPS will ask you to please have a common VRR policy for all applications, as it's easier to manage. It's not a crazy request, and you'd still be in a much better place than you used to be anyway.


Next: 7 - VRR Java + Logback configuration

