miércoles, 14 de noviembre de 2018

Level up logs and ELK - Logging Cutting-Edge Practices - Documenting and sharing your most important logged information

 

Articles index:

  1. Introduction (Everyone)
  2. JSON as logs format (Everyone)
  3. Logging best practices with Logback (Targeting Java DEVs)
  4. Logging cutting-edge practices (Targeting Java DEVs)
  5. Contract first log generator (Targeting Java DEVs)
  6. ElasticSearch VRR Estimation Strategy (Targeting OPS)
  7. VRR Java + Logback configuration (Targeting OPS)
  8. VRR FileBeat configuration (Targeting OPS)
  9. VRR Logstash configuration and Index templates (Targeting OPS)
  10. VRR Curator configuration (Targeting OPS)
  11. Logstash Grok, JSON Filter and JSON Input performance comparison (Targeting OPS)

Logging cutting-edge practices: Sharing your log strategy with other teams and departments



Once you have your ELK cluster in place and your team has one application (or several) logging into it, you still have a problem to solve.

Given either of the following scenarios:
  • A team newcomer is asked to investigate something that involves logs.
  • The CTO has granted Support access to ELK so they can get some insight into application activity or problems.
  • Another team is investigating a cross-application problem.

Assuming they know the service, host and other common OPS-dependent attributes...
  • What attributes can they query in ELK?
  • What do they mean?
  • What text comes with the data they need?
  • How do they know what data is correlated?

Bonus question
  • How do you enforce that the same attribute has the same name in all your logs? Moreover, how do you enforce that it always has the same type?

I'll be proposing a number of different solutions as an iterative evolution; pick the one you prefer.

Baseline:


You are using the latest log format and best practices from Level up logs and ELK - Logging best practices with Logback:
logger.info("New Market Status {} {}", kv("marketId", 1234), kv("newStatus", "ready"));

Cons:
  • Name or case could differ for the same concept in different log lines.
  • Type mismatch is possible: same name, different type.
  • Usage is enforced only as a policy.
  • No documentation available for sharing with other teams or for your own reference.

 

Iteration #1: Constant names


The first problem to solve is ensuring we are not using different keys for the same concept.
Initially there is nothing preventing this from happening:
logger.info("New Market Status {} {}", kv("marketId", 1234), kv("newStatus", "ready"));
[...] 
logger.info("Editing Market Status {} {}", kv("id", 1234), kv("newstatus", "ready"));

On one hand, no developer is going to manually check every existing name to enforce consistency. On the other, even a lower/upper case mismatch could ruin your queries, as values are stored in case-sensitive fields.

A first iteration would be a set of constant Strings or enums to help us deal with the variety of names. Documentation about meaning, usage and required type could be made explicit in comments.
public class LoggerUtils {
    public static final String MK_ID = "marketId"; //String - Market Id
    public static final String MK_NEW_STATUS = "newStatus"; //MarketStatus.java enum - New Market Status
}

Finally, the code would look like this (enforced by code review and good will only):
logger.info("New Market Status {} {}", kv(MK_ID, 1234), kv(MK_NEW_STATUS, "ready"));
[...] 
logger.info("Editing Market Status {} {}", kv(MK_ID, 1234), kv(MK_NEW_STATUS, "ready"));

Cons:
  • Name or case could differ for the same concept in different log lines.
  • Type mismatch is possible: same name, different type.
  • Usage is enforced only as a policy.
  • No documentation available for sharing with other teams or for your own reference.

Iteration #2: Enforcing types


This problem comes from accidentally reusing the same name for different kinds of items, "id" being a paradigmatic example.
logger.info("New Market Status {} {}", kv("id", 1234), kv("newStatus", "ready"));
[...] 
logger.info("User has signed up {}", kv("id", "email@forexample.com"));

Unless forced otherwise, the first occurrence of a field defines its type for that index. Therefore you will find an error like the following in Logstash's logs/output:

logstash_1       | [2018-11-15T22:08:29,392][WARN ][logstash.outputs.elasticsearch] 
Could not index event to Elasticsearch. {
:status=>400, 
:action=>[
  "index", {
    :_id=>nil, 
    :_index=>"vrr-leveluplogging-low-2018-11-15", 
    :_type=>"doc", 
    :_routing=>nil
  }, 
  #<LogStash::Event:0x9b41f3e>
], 
:response=>{
  "index"=>{
    "_index"=>
    "vrr-leveluplogging-low-2018-11-15", 
    "_type"=>"doc", 
    "_id"=>"hJZrGWcBSDbbRtN6HHw3", 
    "status"=>400, 
    "error"=>{
      "type"=>"illegal_argument_exception", 
      "reason"=>"mapper [id] of different type, current_type [long], merged_type [text]"
}}}}


If you check Kibana's Management -> Index Patterns -> vrr-* (in our example), you will see that field "id" has been defined as "number"; therefore, when Logstash tries to put a "string" into it, indexing fails.

Now that the problem has been introduced, let's look at a solution.


import net.logstash.logback.argument.StructuredArgument;
import static net.logstash.logback.argument.StructuredArguments.kv;

public class LoggerUtils {

    public static StructuredArgument logMarketId(Long id) {
        return kv("marketId", id);
    }

    public static StructuredArgument logCustId(String id) {
        return kv("custId", id);
    }

}

We could create a helper class that centralizes the responsibility of not reusing names and of keeping the name-type relationship.

It would be used like this now (old and new version):
logger.info("User has signed up {}", kv("id", "email@forexample.com"));
logger.info("User has signed up {}", logCustId("email@forexample.com"));

Yet you still need to enforce the creation of these functions and their usage.

Cons:
  • Name or case could differ for the same concept in different log lines.
  • Type mismatch is possible: same name, different type.
  • Usage is enforced only as a policy.
  • No documentation available for sharing with other teams or for your own reference.

Please notice that a field accepting a type, e.g. "id" accepting "number", also implies that it can take an array of the same type (and not only a single value); ElasticSearch handles this transparently. All values are indexed, so a search for any of the values in the array will return the same document.
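As a hedged illustration (field names follow the earlier examples; logstash-logback-encoder serializes arrays as JSON arrays), logging several values under one field could look like this:

import static net.logstash.logback.argument.StructuredArguments.kv;

// Both values end up in the same "marketId" field as a JSON array;
// searching for either 1234 or 5678 returns this same document.
logger.info("Bulk market update {} {}",
        kv("marketId", new long[]{1234L, 5678L}),
        kv("newStatus", "ready"));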


Iteration #3: Enforcing usage


How do we prevent a random developer from adding their own log lines, breaking all the rules above? Good and rogue developers below:

import net.logstash.logback.argument.StructuredArgument;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import static example.LoggerUtils.logCustId;
import static net.logstash.logback.argument.StructuredArguments.kv; 

[...] 

logger.info("Good developer using custom LoggerUtils {}", logCustId("email@forexample.com")); 

[...] 

logger.info("Roge developer not using custom LoggerUtils {}", kv("custId",1234L));


Fortunately for us, Checkstyle already allows controlling arbitrary imports, and also allows suppression of this rule for a given class. Let's put it together.

Checkstyle configuration forbidding usage of kv import in the project:

<module name="Checker">
    <property name="charset" value="UTF-8">

    <module name="SuppressionFilter">
        <property name="file" value="${config_loc}/suppressions.xml">
    </property></module>
    
    <module name="TreeWalker">
        <module name="IllegalImport">
            <property name="illegalClasses" value="net\.logstash\.logback\.argument\.StructuredArguments.*">
            <property name="regexp" value="true">
        </property></property></module>
    </module>
</property></module>

Making an exception with LoggerUtils class:

<suppressions>
    <suppress checks="IllegalImport" files="LoggerUtils\.java"></suppress>
</suppressions>

You could argue that there are ways to get around this prohibition; I agree, there always are.
The goal here is not to prevent, by all means, a rogue developer from logging anything they like. The goal is to avoid mistakes, and a programmatic bypass of these checks should really stand out in a code review.
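For illustration (a hypothetical snippet, not from the example repo), this is the kind of bypass I mean: a fully qualified call needs no import statement, so Checkstyle's IllegalImport rule will not catch it, but it is hard to miss in a review:

// No import of StructuredArguments here, so IllegalImport stays silent --
// the fully qualified name is exactly what should raise eyebrows in review.
logger.info("Bypassing the import check {}",
        net.logstash.logback.argument.StructuredArguments.kv("custId", 1234L));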

As a result of applying the above configuration in our favorite build tool (Maven, Gradle, etc.) you would get a broken build (if you configure it so, as I recommend) and an explanation as follows (Gradle output):


:compileJava
:processResources UP-TO-DATE
:classes
[ant:checkstyle] [ERROR] /home/alberto/code/leveluplogging/leveluplogging/src/main/java/example/ErrorType.java:9:1: 
Illegal import - net.logstash.logback.argument.StructuredArguments.kv. [IllegalImport]
:checkstyleMain FAILED

FAILURE: Build failed with an exception.


Cons:
  • Name or case could differ for the same concept in different log lines.
  • Type mismatch is possible: same name, different type.
  • Usage is enforced only as a policy.
  • No documentation available for sharing with other teams or for your own reference.

 

Iteration #4: Contract-first LoggerUtils code generation


A solution that keeps the goodness of the three previous iterations, but focuses on a "documentation first" approach (the analogy of contract-first for developers), would involve, in my opinion, the following elements:

  1. A description file that explains, in language as close to human as possible, what you are logging that could be of any interest to anyone.
  2. A code generator that takes the description file and produces an equivalent of LoggerUtils in your favorite language.
  3. A documentation generator that takes the description file and produces full human-readable documentation in your favorite markup / text format.
  4. A plugin to hook the previous two steps into your build chain (Gradle plugin, Maven plugin...).
This topic will take long enough to deserve its own article; see the "Next" link below.
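As a hypothetical taste of where this is going (a hand-written sketch, not real generator output; names are made up), the generated equivalent of LoggerUtils could look like this:

import net.logstash.logback.argument.StructuredArgument;
import static net.logstash.logback.argument.StructuredArguments.kv;

// Hypothetical generated class: produced from the description file by the
// code generator, never edited by hand. The javadoc would come from the
// descriptions, so it doubles as shareable documentation.
public final class GeneratedLogFields {

    private GeneratedLogFields() {
    }

    /** Market identifier. Type: long. Shared with Support dashboards. */
    public static StructuredArgument marketId(long marketId) {
        return kv("marketId", marketId);
    }

    /** Customer identifier, always an email address. Type: String. */
    public static StructuredArgument custId(String custId) {
        return kv("custId", custId);
    }
}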
 




Next: 5 - Contract first log generator

lunes, 1 de octubre de 2018

Level up logs and ELK - Logstash Grok, JSON Filter and JSON Input performance comparison

Articles index:

  1. Introduction (Everyone)
  2. JSON as logs format (Everyone)
  3. Logging best practices with Logback (Targeting Java DEVs)
  4. Logging cutting-edge practices (Targeting Java DEVs)
  5. Contract first log generator (Targeting Java DEVs)
  6. ElasticSearch VRR Estimation Strategy (Targeting OPS)
  7. VRR Java + Logback configuration (Targeting OPS)
  8. VRR FileBeat configuration (Targeting OPS)
  9. VRR Logstash configuration and Index templates (Targeting OPS)
  10. VRR Curator configuration (Targeting OPS)
  11. Logstash Grok, JSON Filter and JSON Input performance comparison (Targeting OPS)



Logstash Grok, JSON Filter and JSON Input performance comparison



As part of the overall VRR strategy, I've performed a little experiment to compare the performance of different configurations.

I keep using the FileBeat -> Logstash -> ElasticSearch <- Kibana stack, this time with everything updated to 6.4.1 using Docker.

The experiment consists of letting FileBeat read a file containing 3 million entries, generated in JSON (923000198 bytes) and plain text (492000055 bytes) by a modified version of VRR.java. Logstash reads from the beats input plugin and uses the information within to redirect these lines to the most suitable index (1M to critical, 1M to important, 1M to low).

Monitoring is enabled from Kibana and Logstash so all the information is collected and stored in ElasticSearch itself.

Given that everything is running in the same box, under the same concurrent load (none), the same CPU, memory and disk profile, and the same cluster configuration (a single ElasticSearch node), these results are not representative of throughput in any case; however, it should be a valid experiment for relative performance comparison.
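For reference, a minimal sketch of what such a generator might look like (the real modified VRR.java lives in the example repo and differs; whether the output is JSON or plain text depends on the Logback encoder in use):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import static net.logstash.logback.argument.StructuredArguments.kv;

public class VrrLoadGenerator {

    private static final Logger logger = LoggerFactory.getLogger(VrrLoadGenerator.class);
    private static final String[] IMPORTANCE = {"LOW", "IMP", "CRIT"};

    public static void main(String[] args) {
        // 3 million lines, evenly split across the three importance levels
        for (int i = 0; i < 3_000_000; i++) {
            logger.info("Generated test event {} {} {}",
                    kv("importance", IMPORTANCE[i % 3]),
                    kv("marketId", i),
                    kv("newStatus", "ready"));
        }
    }
}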


 Logstash + Grok Filter + KV Filter

This configuration takes input from the Beats input, uses Grok to split some fields apart (message, timestamp, rid, loglevel...), then applies KV to find key-value pairs in the text, and finally uses the extracted key-value information to choose the index.

input {
  beats {
    port => 5044
  }
}

filter {
    grok {
        match => [ "message" , "%{TIME:@timestamp} %{UUID:rid} \[%{WORD:thread}\] %{LOGLEVEL:loglevel}\s+%{WORD:classname} - %{GREEDYDATA:message}" ]
        overwrite => ["message"]
    }

    kv {
    }
}

output {
        if [importance] == "CRIT" { 
            elasticsearch { 
                hosts => "elasticsearch:9200"
                index => "vrr-%{[fields][service]}-crit-%{+YYYY-MM-dd}" 
                template => "templates/template-max.json" 
                template_overwrite => true
                template_name => "vrr-max"
            }
        } else if [importance] == "IMP" { 
            elasticsearch { 
                hosts => "elasticsearch:9200"
                index => "vrr-%{[fields][service]}-imp-%{+YYYY-MM-dd}" 
                template => "templates/template-mid.json" 
                template_overwrite => true
                template_name => "vrr-mid"
            }
        } else { 
            elasticsearch { 
                hosts => "elasticsearch:9200"
                index => "vrr-%{[fields][service]}-low-%{+YYYY-MM-dd}" 
                template => "templates/template-min.json" 
                template_overwrite => true
                template_name => "vrr-min"
            }
        }
        
  
}

 

Results:

  • 1 million entries per each VRR index, OK
  • Around 600 events per second went through Logstash, each one processed in approximately 1500 microseconds (just above, really)
  • CPU averages between 20% and 25%




 

Logstash + JSON Filter

In this case we are using a JSON log file with the same amount of information, which will be parsed using the JSON filter plugin in Logstash. After that, events are sent to the right index based on the extracted information.

input {
  beats {
    port => 5044
  }
}

filter {
    if [fields][logschema] == "vrr" {
        json {
            source => "message"
        }
    }
}

output {
     if [fields][logschema] == "vrr" { 
        if [importance] == "CRIT" { 
            elasticsearch { 
                hosts => "elasticsearch:9200"
                index => "vrr-%{[fields][service]}-crit-%{+YYYY-MM-dd}" 
                template => "templates/template-max.json" 
                template_overwrite => true
                template_name => "vrr-max"
            }
        } else if [importance] == "IMP" { 
            elasticsearch { 
                hosts => "elasticsearch:9200"
                index => "vrr-%{[fields][service]}-imp-%{+YYYY-MM-dd}" 
                template => "templates/template-mid.json" 
                template_overwrite => true
                template_name => "vrr-mid"
            }
        } else { 
            elasticsearch { 
                hosts => "elasticsearch:9200"
                index => "vrr-%{[fields][service]}-low-%{+YYYY-MM-dd}" 
                template => "templates/template-min.json" 
                template_overwrite => true
                template_name => "vrr-min"
            }
        }
        
    } else { 
        elasticsearch {
            hosts => "elasticsearch:9200"
            index => "logstash-classic-%{[fields][service]}-%{+YYYY-MM-dd-HH}" 
        }
    }
}

 

Results

  • 1 million entries per each VRR index, OK
  • Around 875 events per second went through Logstash, each one processed in approximately 1150 microseconds.
  • CPU averages between 15% and 20%




Logstash + JSON Input

The only difference from the previous test is that the JSON codec is applied at input level instead of filter level; for my own sake, I really wanted to know what the difference was.


input {
  beats {
    port => 5044
    codec => json
  }
}

output {
     if [fields][logschema] == "vrr" { 
        if [importance] == "CRIT" { 
            elasticsearch { 
                hosts => "elasticsearch:9200"
                index => "vrr-%{[fields][service]}-crit-%{+YYYY-MM-dd}" 
                template => "templates/template-max.json" 
                template_overwrite => true
                template_name => "vrr-max"
            }
        } else if [importance] == "IMP" { 
            elasticsearch { 
                hosts => "elasticsearch:9200"
                index => "vrr-%{[fields][service]}-imp-%{+YYYY-MM-dd}" 
                template => "templates/template-mid.json" 
                template_overwrite => true
                template_name => "vrr-mid"
            }
        } else { 
            elasticsearch { 
                hosts => "elasticsearch:9200"
                index => "vrr-%{[fields][service]}-low-%{+YYYY-MM-dd}" 
                template => "templates/template-min.json" 
                template_overwrite => true
                template_name => "vrr-min"
            }
        }
        
    } else { 
        elasticsearch {
            hosts => "elasticsearch:9200"
            index => "logstash-classic-%{[fields][service]}-%{+YYYY-MM-dd-HH}" 
        }
    }
}


Results:

  • 1 million entries per each VRR index, OK
  • Around 1000 events per second went through Logstash, each one processed in approximately 1000 microseconds.
  • CPU averages around 15%




 

Conclusions:

Using Grok with KV is the slowest option.
Using JSON as input is the fastest option, but reduces the flexibility of having several formats in the same input. This can be mitigated with different pipelines and ports.
Using JSON as a filter provides the same flexibility as Grok without the big performance penalty. It is slower than JSON as input, though.


miércoles, 12 de septiembre de 2018

Level up logs and ELK - ElasticSearch Replication Factor and Retention - VRR Estimation Strategy

Articles index:

  1. Introduction (Everyone)
  2. JSON as logs format (Everyone)
  3. Logging best practices with Logback (Targeting Java DEVs)
  4. Logging cutting-edge practices (Targeting Java DEVs)
  5. Contract first log generator (Targeting Java DEVs)
  6. ElasticSearch VRR Estimation Strategy (Targeting OPS)
  7. VRR Java + Logback configuration (Targeting OPS)
  8. VRR FileBeat configuration (Targeting OPS)
  9. VRR Logstash configuration and Index templates (Targeting OPS)
  10. VRR Curator configuration (Targeting OPS)
  11. Logstash Grok, JSON Filter and JSON Input performance comparison (Targeting OPS)

ElasticSearch VRR Estimation Strategy


Estimating how much storage ElasticSearch needs to store your logs depends mostly on three variables.

  • Replication factor: ElasticSearch can allocate copies of your logs in distributed nodes so it becomes fault tolerant. If a node dies, others can continue reading and writing info for that index/shard.
  • Retention: ElasticSearch cannot decide by itself for how long data should be kept. However, other tools like Curator can help define this variable; Curator will be our option in this series of articles.
  • Volume: the amount of logs to be stored; an obvious variable.

We are not going to consider other variables, but feel free to dig in anytime with this document straight from the horse's mouth: The true story behind Elasticsearch storage requirements.

Monolithic estimation strategy

In order to estimate the storage needed by an application across all ElasticSearch nodes, you'll need to make the following simple calculation:

replication factor * retention (days) * volume (per day) = total storage needed.

One of my clients had 4TB of logs per day (on average); let's run some numbers:
4TB/day * 365 days retention required * 3 copies = 4380TB storage required.
Someone suffered a heart attack in that meeting...

Variable-retention-replication (VRR) estimation strategy

The previous calculation has an implicit assumption: all information from the same file shares the same importance, and therefore the same retention and replication factor. As a software developer and log generator, I know that's radically false.

The only way to solve this problem is to admit that not all the information in an application log file is equally important. If we consider three categories, "low importance", "important", "critical", we would need all project managers to define their logs' importance as a matrix of "importance percentage", "retention" and "replication" almost line by line (VRR matrix).

VRR (Variable Replication and Retention) Matrix for example application:

Being generous, most of the logs from the example application are useless after one or two weeks; they are only useful to investigate problems if they arise. This is purely debug information. To be conservative in the calculations, we'll say it's 89%, required for two weeks of retention with no replication (1 copy in total).

Around 10% of the example application logs could be considered important: user tracking/activity, application events and synthetic information about errors. From this information we also create most of the dashboards, and for that reason we need replication too (replication helps with search performance). These 10% of the logs are required for three months of retention and single replication (2 copies in total).

Around 1% of the example application logs are critical: information like user audit, log-in activity and product hiring / purchase events. This 1% of the logs is required for 53 weeks of retention and double replication (3 copies in total; it's important not to lose this information).

Example application VRR Matrix:
Importance    Percentage    Retention    Replication
Debug         89%           14 days      1 copy
Important     10%           90 days      2 copies
Critical      1%            371 days     3 copies

As an academic example, let's apply this matrix to the 4TB we mentioned before, keeping in mind that there would be a VRR matrix per application and that those 4TB belong to dozens of them.
The hypothetical result would be:

4TB * (0.89 * 14 * 1 + 0.10 * 90 * 2 + 0.01 * 371 * 3 ) = 166TB

By these numbers, we need just 166TB to cover a whole year of logs; that's less than 4% of the 4380TB we originally estimated with the monolithic strategy.

Maybe comparing against an entire year of 3 copies of everything is a bit unfair, so let's compare against other strategies:
  • 1 month, 3 copies, monolithic strategy -> 372TB, more than double
  • 1 year, 1 copy, monolithic strategy -> 1484TB, 9 times more (this is almost the 90% saving promised in the clickbait!)
You need to go as low as 2 weeks with 3 copies to find a comparable size -> 168TB. Now, which do you prefer?

a) 1 Year 3 Copies for critical info, 3 months 2 copies for important, 2 weeks single copy for the rest or..

b) 3 copies 2 weeks for everything.
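If you want to double-check the arithmetic, a quick sketch (numbers taken from the matrix above; the monolithic figure uses 365 days, as in the original estimate):

public class VrrEstimation {

    public static void main(String[] args) {
        double dailyVolumeTb = 4.0;

        // Monolithic: replication factor * retention (days) * volume (per day)
        double monolithic = 3 * 365 * dailyVolumeTb;                                    // 4380 TB

        // VRR: sum of (percentage * retention * copies) per importance level
        double vrr = dailyVolumeTb * (0.89 * 14 * 1 + 0.10 * 90 * 2 + 0.01 * 371 * 3);  // ~166 TB

        double oneMonthThreeCopies = 3 * 31 * dailyVolumeTb;                            // 372 TB
        double oneYearOneCopy = 1 * 371 * dailyVolumeTb;                                // 1484 TB
        double twoWeeksThreeCopies = 3 * 14 * dailyVolumeTb;                            // 168 TB

        System.out.printf("monolithic=%.0f vrr=%.0f month3=%.0f year1=%.0f weeks2x3=%.0f%n",
                monolithic, vrr, oneMonthThreeCopies, oneYearOneCopy, twoWeeksThreeCopies);
    }
}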

How does VRR translate to configuration?

Replication is an index property; this tells us we need a different index per "importance". We need to tell ElasticSearch what the replication policy is when we create the index (it could be changed after creation, but doing it at creation time is just more convenient).
See VRR Logstash configuration

Retention is a curator policy that we will apply per-index.
See VRR Curator configuration

Logstash will create the indices in ElasticSearch depending on the importance using index-templates. We need to put different "importance" logs in different indices. Logstash can do that as long as these logs are "tagged" in a way Logstash understands (e.g. JSON fields in the logs).
See VRR Logstash configuration and VRR FileBeat configuration

Logs in JSON format can be easily tagged without extra OPS time; untagged logs will be treated as the lowest importance, so it's the developers' responsibility to tag them.
See VRR Java + Logback configuration



Steps to implement VRR Strategy

  1. Developers need to tag all lines of logs with the "importance" on them. As explained before, if they used JSON, it would be easier for everyone. Untagged logs will be considered of the lowest importance.
  2. Project managers and developers need to define VRR Matrix to estimate log storage requirements for OPS.
  3. For all applications implementing VRR, OPS need a single special entry in Logstash to create importance-dependent indices in ElasticSearch using the VRR matrix information. The index name will contain service name, date and importance, so Curator can distinguish them. They also need to change the index templates accordingly.
  4. Curator configuration needs to be aware of the VRR matrix to remove information as soon as allowed.
OPS will ask you to please have a common VRR policy for all applications, as it's easier to manage. It's not a crazy request, and you'd still be in a much better place than you used to be anyway.


Next: 7 - VRR Java + Logback configuration


Level up logs and ELK - VRR Curator configuration

Articles index:

  1. Introduction (Everyone)
  2. JSON as logs format (Everyone)
  3. Logging best practices with Logback (Targeting Java DEVs)
  4. Logging cutting-edge practices (Targeting Java DEVs)
  5. Contract first log generator (Targeting Java DEVs)
  6. ElasticSearch VRR Estimation Strategy (Targeting OPS)
  7. VRR Java + Logback configuration (Targeting OPS)
  8. VRR FileBeat configuration (Targeting OPS)
  9. VRR Logstash configuration and Index templates (Targeting OPS)
  10. VRR Curator configuration (Targeting OPS)
  11. Logstash Grok, JSON Filter and JSON Input performance comparison (Targeting OPS)

VRR Curator configuration

 

Curator configuration

The last piece of the puzzle: Curator will take care of deleting information from ElasticSearch once it has expired according to our policies.

There are two ways of deleting information from ElasticSearch using its API:
  1. By deleting documents through a query that uses the timestamp -> SLOW; deleting a document requires modifying indices and caches.
  2. By deleting indices as a whole -> FAST, but you cannot choose what to delete inside an index; it all goes down.
The best way of deleting whole indices while only removing old information is to create an index per day (or another granularity) and let Curator parse the date from the index name.
Then Curator can delete the oldest indices while keeping the newest.

Finally, Curator is not a daemon; it needs to be executed periodically. Normally once a day is enough (really, as often as your smallest granularity).

Configuration example:

File available here

---
actions:
  1: <- Sets an order of execution
    action: delete_indices <- action
    description: >-
      Delete indices older than 371 days (based on index name), for vrr-*-crit-
      prefixed indices. Application/Service name is irrelevant, but this only 
      applies to "crit" indices. Ignore the error if the filter does not result in an
      actionable list of indices (ignore_empty_list) and exit cleanly.
    options:
      ignore_empty_list: True
      timeout_override:
      continue_if_exception: False
      disable_action: False <- otherwise it wouldn't execute
    filters:
    - filtertype: pattern
      kind: regex
      value: vrr-.*-crit- <- Applies to all VRR critical indices, no matter the application.
                              This is only valid if you agree on retention being the same for
                              all applications for the same importance.
      exclude:
    - filtertype: age <- Second filter, we are filtering by age
      source: name
      direction: older
      timestring: '%Y-%m-%d' <- date format in the index (set in logstash!)
      unit: days
      unit_count: 371 <- delete if older than 371 days ("days" defined in unit, just above)
      exclude:
  2:
    action: delete_indices
    description: >-
      Delete indices older than 90 days (based on index name), for vrr-*-imp-
      prefixed indices. Application/Service name is irrelevant, but this only 
      applies to "imp" indices. Ignore the error if the filter does not result in an
      actionable list of indices (ignore_empty_list) and exit cleanly.
    options:
      ignore_empty_list: True
      timeout_override:
      continue_if_exception: False
      disable_action: False
    filters:
    - filtertype: pattern
      kind: regex
      value: vrr-.*-imp- <- Applies to all VRR important indices
      exclude:
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y-%m-%d'
      unit: days
      unit_count: 90 <- delete if older than 90 days
      exclude:
  3:
    action: delete_indices
    description: >-
      Delete indices older than 14 days (based on index name), for vrr-*-low-
      prefixed indices. Application/Service name is irrelevant, but this only 
      applies to "low" indices. Ignore the error if the filter does not result in an
      actionable list of indices (ignore_empty_list) and exit cleanly.
    options:
      ignore_empty_list: True
      timeout_override:
      continue_if_exception: False
      disable_action: False
    filters:
    - filtertype: pattern
      kind: regex
      value: vrr-.*-low- <- Applies to all VRR low indices
      exclude:
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y-%m-%d'
      unit: days
      unit_count: 14 <- delete if older than 14 days
      exclude:


Next: 11 - Logstash Grok, JSON Filter and JSON Input performance comparison

Level up logs and ELK - VRR FileBeat configuration

Articles index:

  1. Introduction (Everyone)
  2. JSON as logs format (Everyone)
  3. Logging best practices with Logback (Targeting Java DEVs)
  4. Logging cutting-edge practices (Targeting Java DEVs)
  5. Contract first log generator (Targeting Java DEVs)
  6. ElasticSearch VRR Estimation Strategy (Targeting OPS)
  7. VRR Java + Logback configuration (Targeting OPS)
  8. VRR FileBeat configuration (Targeting OPS)
  9. VRR Logstash configuration and Index templates (Targeting OPS)
  10. VRR Curator configuration (Targeting OPS)
  11. Logstash Grok, JSON Filter and JSON Input performance comparison (Targeting OPS)

VRR FileBeat configuration



Filebeat doesn't need much configuration for JSON log files, just our typical agreement between parties:
  • DEVs agree to 
    • use JSON for logs, 
    • VRR as log retention strategy, 
    • "imp" JSON field for VRR "importance" fields with values LOW, IMP, CRIT
    • no "imp" field means LOW importance
  • OPS agree to
    • take this file and use retention and replication depending on those fields
    • add "service" in filebeat for application name
    • add "environment" in filebeat where applicable
    • add "logschema":"vrr" to distinguish a common approach for logs.

As the contract is mostly the same for all applications, the Filebeat configuration is very reusable: one entry per application and box.

This file, in a working example, can be found here.

- type: log
  enabled: true <- important ;)
  paths:
    - /path/to/logFile.json 
  encoding: utf-8
  fields:
    logschema: vrr <- this value will be reused in Logstash configuration
    service: leveluplogging <- application / service name
    environment: production <- optional, very.


Next: 9 - VRR Logstash configuration and Index templates

Level up logs and ELK - VRR Logstash configuration

Articles index:

  1. Introduction (Everyone)
  2. JSON as logs format (Everyone)
  3. Logging best practices with Logback (Targeting Java DEVs)
  4. Logging cutting-edge practices (Targeting Java DEVs)
  5. Contract first log generator (Targeting Java DEVs)
  6. ElasticSearch VRR Estimation Strategy (Targeting OPS)
  7. VRR Java + Logback configuration (Targeting OPS)
  8. VRR FileBeat configuration (Targeting OPS)
  9. VRR Logstash configuration and Index templates (Targeting OPS)
  10. VRR Curator configuration (Targeting OPS)
  11. Logstash Grok, JSON Filter and JSON Input performance comparison (Targeting OPS)

VRR Logstash configuration and Index templates

Logstash configuration from example can be found here.

input {
  beats {
    port => 5044
    codec => json
  }
}

output {
     if [fields][logschema] == "vrr" { //for ALL VRR applications
        if [importance] == "CRIT" { //for ALL CRITICAL LINES
            elasticsearch { //send to SERVICE-LOGSCHEMA-IMP-DATE index (vrr-loggingup-crit-2018-09-10) with template template.max
                hosts => "localhost:9200"
                index => "vrr-%{[fields][service]}-crit-%{+YYYY-MM-dd}" //ONE INDEX PER APPLICATION AND DAY AND IMPORTANCE
                template => "/path/to/templates/template-max.json" //USING THIS TEMPLATE, NEXT CHAPTER!
                template_overwrite => true
                template_name => "vrr-max"
            }
        } else if [importance] == "IMP" { //for ALL IMPORTANT LINES
            elasticsearch { //send to SERVICE-LOGSCHEMA-IMP-DATE index (vrr-loggingup-imp-2018-09-10) with template template.mid
                hosts => "localhost:9200"
                index => "vrr-%{[fields][service]}-imp-%{+YYYY-MM-dd}" //ONE INDEX PER APPLICATION AND DAY AND IMPORTANCE
                template => "/path/to/templates/template-mid.json" //USING THIS TEMPLATE, NEXT CHAPTER!
                template_overwrite => true
                template_name => "vrr-mid"
            }
        } else { //FOR BOTH "LOW" AND NO-EXPLICIT TAGGING
            elasticsearch { //send to SERVICE-LOGSCHEMA-IMP-DATE index (vrr-loggingup-low-2018-09-10) with template template.min
                hosts => "localhost:9200"
                index => "vrr-%{[fields][service]}-low-%{+YYYY-MM-dd}" //ONE INDEX PER APPLICATION AND DAY AND IMPORTANCE
                template => "/path/to/templates/template-min.json" //USING THIS TEMPLATE, NEXT CHAPTER!
                template_overwrite => true
                template_name => "vrr-min"
            }
        }
        
    } else { //OTHER NON-VRR APPLICATIONS
        elasticsearch {
            hosts => "localhost:9200"
            index => "logstash-classic-%{[fields][service]}-%{+YYYY-MM-dd-HH}" //STILL ONE SEPARATE INDEX PER APPLICATION AND DAY
        }
    }
}

Template template-max.json (here)

{
  "index_patterns": ["vrr-*-crit-*"], //FOR ALL INDICES THAT MATCH THIS EXPRESSION 
  "order" : 1, <- Overrides order 0 settings (default values like number of shards or mappings)
  "settings": {
    "number_of_replicas": 2 //WE WANT 2 EXTRA COPIES + MASTER
  }
}

Template template-mid.json (here)

{
  "index_patterns": ["vrr-*-imp-*"], //FOR ALL INDICES THAT MATCH THIS EXPRESSION 
  "order" : 1, <- Overrides order 0 settings (default values like number of shards or mappings)
  "settings": {
    "number_of_replicas": 1 //WE WANT AN EXTRA COPY + MASTER
  }
}

Template template-min.json (here)

{
  "index_patterns": ["vrr-*-low-*"], //FOR ALL INDICES THAT MATCH THIS EXPRESSION 
  "order" : 1, <- Overrides order 0 settings (default values like number of shards or mappings)
  "settings": {
    "number_of_replicas": 0 //WE DON'T WANT EXTRA COPIES, JUST MASTER
  }
}


Next: 10 - VRR Curator configuration


Level up logs and ELK - VRR Java + Logback configuration

Articles index:

  1. Introduction (Everyone)
  2. JSON as logs format (Everyone)
  3. Logging best practices with Logback (Targeting Java DEVs)
  4. Logging cutting-edge practices (Targeting Java DEVs)
  5. Contract first log generator (Targeting Java DEVs)
  6. ElasticSearch VRR Estimation Strategy (Targeting OPS)
  7. VRR Java + Logback configuration (Targeting OPS)
  8. VRR FileBeat configuration (Targeting OPS)
  9. VRR Logstash configuration and Index templates (Targeting OPS)
  10. VRR Curator configuration (Targeting OPS)
  11. Logstash Grok, JSON Filter and JSON Input performance comparison (Targeting OPS)

VRR Java + Logback configuration



Applying VRR to Java Logback application.


Example application defining importance per line using Structured Arguments

Product Owner, OPS and Developers have agreed to use the tag/flag/mark "importance" with possible values "LOW" for lowest importance, "IMP" for mid importance and "CRIT" for critical importance.

Example code without comments available here.


public class VRR {

    private static final String IMPORTANCE = "importance";
    private static final StructuredArgument LOW = kv(IMPORTANCE, "LOW"); //CREATING OBJECTS
    private static final StructuredArgument IMP = kv(IMPORTANCE, "IMP"); //TO REUSE AND
    private static final StructuredArgument CRIT = kv(IMPORTANCE, "CRIT"); //AVOID REWRITING

    private static final Logger logger = LoggerFactory.getLogger(VRR.class);

    public static void main(String[] args) {
        MDC.put("rid", UUID.randomUUID().toString()); //SAME MDC USAGE AVAILABLE
        try {
            long startTime = currentTimeMillis();
            someFunction();
            logger.info("important message, useful to so some metrics {} {}",
                    kv("elapsedmillis", currentTimeMillis() - startTime),
                    IMP); //IMPORTANT MESSAGE
        } catch (Exception e) {
            logger.error("This is a low importance message as it won't have value after few weeks", 
                          e); //THIS IS A LOW IMPORTANCE MESSAGE AS IT'S NOT TAGGED
        }
    }

    static void someFunction() throws Exception {
        logger.info("low importance message, helps to trace errors, begin someFunction {} {} {}",
                kv("user","anavarro"),
                kv("action","file-create"),
                LOW); //LOW IMPORTANCE TAGGED MESSAGE, SLIGHTLY REDUNDANT, SAME AS UNTAGGED

        Thread.sleep(500L); //some work

        logger.info("critical message, audit trail for user action {} {} {}",
                kv("user","anavarro"),
                kv("action","file-create"),
                CRIT); //CRITICAL MESSAGE
    }
}

Previously mentioned logback.xml configuration

<configuration>
    <appender name="stash" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>logFile.json</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <fileNamePattern>file.log.%d{yyyy-MM-dd}</fileNamePattern>
            <maxHistory>30</maxHistory>
        </rollingPolicy>
        <encoder class="net.logstash.logback.encoder.LoggingEventCompositeJsonEncoder">
            <providers>
                <timestamp/>
                <threadName/>
                <mdc/>
                <loggerName/>
                <message/>
                <logLevel/>
                <arguments/>
                <stackTrace/>
                <stackHash/>
            </providers>
        </encoder>
    </appender>

    <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
        </encoder>
    </appender>

    <root level="all">
        <appender-ref ref="stash"/>
        <appender-ref ref="STDOUT"/>
    </root>
</configuration>

Putting code and configuration together, we get JSON log files like the following:


{"@timestamp":"2018-09-11T00:05:11.746+02:00","thread_name":"main",
 "rid":"7fac3070-0d7e-40a6-a3e8-246ec95e86e7","logger_name":"VRR", 
"message":"low importance message, helps to trace errors, begin someFunction 
user=anavarro action=file-create importance=LOW","level":"INFO","user":"anavarro", 
"action":"file-create","importance":"LOW"}

{"@timestamp":"2018-09-11T00:05:12.271+02:00","thread_name":"main",
 "rid":"7fac3070-0d7e-40a6-a3e8-246ec95e86e7","logger_name":"VRR",
 "message":"critical message, audit trail for user action 
user=anavarro action=file-create importance=CRIT","level":"INFO","user":"anavarro", 
"action":"file-create","importance":"CRIT"}

{"@timestamp":"2018-09-11T00:05:12.274+02:00","thread_name":"main", 
"rid":"7fac3070-0d7e-40a6-a3e8-246ec95e86e7","logger_name":"VRR",
 "message":"important message, useful to so some metrics elapsedmillis=528 importance=IMP", 
"level":"INFO","elapsedmillis":528,"importance":"IMP"}
 

Next: 8 - VRR FileBeat configuration


domingo, 9 de septiembre de 2018

Level up logs and ELK - Logging best practices with Logback

Articles index:

  1. Introduction (Everyone)
  2. JSON as logs format (Everyone)
  3. Logging best practices with Logback (Targeting Java DEVs)
  4. Logging cutting-edge practices (Targeting Java DEVs)
  5. Contract first log generator (Targeting Java DEVs)
  6. ElasticSearch VRR Estimation Strategy (Targeting OPS)
  7. VRR Java + Logback configuration (Targeting OPS)
  8. VRR FileBeat configuration (Targeting OPS)
  9. VRR Logstash configuration and Index templates (Targeting OPS)
  10. VRR Curator configuration (Targeting OPS)
  11. Logstash Grok, JSON Filter and JSON Input performance comparison (Targeting OPS)

Logging best practices with Logback

 

Logs are the product of (so far, and in my experience only) a very personal view of how software must be built, and of the number of nights on call the developer has suffered.
In this chapter I am going to recommend some ways of logging information in order to make it more useful when it comes to indexing, searching and filtering.

All examples expressed in Java + SLF4J.
Github repo: https://github.com/albertonavarro/leveluplogging/tree/master/leveluplogging

Simple Message:

Most of the free log information will come as message text, from "Application is ready to listen for messages" to "Unknown error trying to upload file". These messages carry little structured information; I actually struggled to come up with examples of this kind of message that don't have any parameter.

Example code in GitHub

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Main {
    private static final Logger logger = LoggerFactory.getLogger(Main.class);

    public static void main(String[] args) {
        logger.info("Starting main - without arguments");
    }
}

This could be the output for a simple Message in plain text:
20:15:52.106 [main] INFO  Main1 - Starting main - without arguments

You would still need some help to make that log line useful, like Grok or Splunk.
They would expect your format to be constant and immutable, and would parse it with something like this:

input {
  beats {
    port => 5044
  }
}

filter {
    grok {
        match => [ "message" , "%{TIME:@timestamp} \[%{WORD:thread}\] 
%{LOGLEVEL:loglevel}\s+%{WORD:classname} - %{GREEDYDATA:message}" ]
        overwrite => ["message"]
    }
}

output {
    elasticsearch {
        hosts => "localhost:9200"
        index => "logstash-classic"
    }
}

Later on, these "messages" are translated to an identical message field in the ElasticSearch document that you may search by exact or partial match.




Notice that, thanks to Grok, we have extracted "classname" and "loglevel", as long as that format never changes.

You might even count, compare, and dashboard the number of events of one type vs. the number of events of a different type, correlate by time, etc.

This is what the same message would look like in JSON:

{"@timestamp":"2018-09-06T20:15:52.106+02:00","message":"Starting main - without arguments",
"level":"INFO","level_value":20000}
 
Fields are already split and named; no extra parsing is needed to get the same benefits.

Arguments: 

Example code in GitHub

At some point we find it necessary to add dynamic information to our logs; the flavor can go from free text to a semi-structured format. This could be my own evolution:

Initially I wrote them like this:
logger.info("Market {} is ready", marketId)

Then I tried to give it some more structure:
logger.info("Market marketId={} is ready", marketId)

I ended up like this, in order to improve the query format:
logger.info("New Market Status marketId={} newStatus={}", marketId, "ready")
The reason for accumulating free text on the left and data on the right is to simplify the queries. In Kibana it is easier and more performant to query the fixed prefix "New Market Status" than "Market" AND "is ready".

Output for this last test would look like:
23:26:52.164 [main] INFO  Main2 - New Market Status marketId=1234 newStatus=ready 

Reusing previous Grok expression, we could extract message="New Market Status marketId=1234 newStatus=ready", but we would need to run something else on top of it to get ElasticSearch to understand there is extra information there that it can index separately.

Logstash's KV plugin (Elastic KV plugin) will extract key/value pairs from a given input, so in this case the output would look like this (notice the new fields marketId and newStatus):
 


Now ElasticSearch can filter by "marketId", by "status", and go deeper and wider in data analysis.

The only weak point of KV is that it can have "false positives" when detecting key/value pairs. Until recently, only single-character separators were supported; now character sequences defined by regex are supported as well (finally). By default the separator is the "=" character, which you can easily find in base64 strings or URLs. That would pollute your index and damage its performance rather quickly.
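A hedged example of the kind of line that causes trouble (made-up URL; the exact damage depends on your KV settings):

// The "=" inside the query string and the base64 padding are fair game for KV,
// which may extract bogus keys/values and pollute the index mapping.
logger.info("Redirecting user to url={}", "https://example.com/cb?token=dGVzdA==&session=abc123");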

JSON output would have the same problem: we would still need to extract the information from within the message text.

Structured Arguments

Example code in GitHub

The amazing library logstash-logback-encoder also provides another way of expressing arguments in a more structured way. Nicely wrapped in the StructuredArguments class, it offers a number of convenient methods to parameterize Key/Values, Lists and Maps.

Syntax is quite similar, just a little bit more verbose:
logger.info("New Market Status {} {}", kv("marketId", 1234), kv("newStatus", "ready"));

It produces the following console output, as it automatically formats the key/value pairs. As with the previous method, the information can be extracted using the KV plugin.
20:20:02.037 [main] INFO  Main3 - New Market Status marketId=1234 newStatus=ready

However this is way more valuable when the output is JSON instead of plain text, as it not only formats the message the same way, but it also introduces at root level the new pair of attributes, marketId and newStatus, as independent fields.

{"@timestamp":"2018-09-06T20:20:02.037+02:00","message":"New Market Status marketId=1234 
newStatus=ready","level":"INFO","level_value":20000,"marketId":1234,"newStatus":"ready"}

This is very important, because this document, whether sent directly to ElasticSearch or passed through Logstash, will be indexed with "marketId" and "newStatus" as independently indexed fields, without any extra effort at all (no need for Grok or KV).
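Beyond kv(), StructuredArguments also ships helpers for maps and arrays; a short sketch (field names invented for illustration):

import java.util.LinkedHashMap;
import java.util.Map;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import static net.logstash.logback.argument.StructuredArguments.array;
import static net.logstash.logback.argument.StructuredArguments.entries;

public class Main3b {
    private static final Logger logger = LoggerFactory.getLogger(Main3b.class);

    public static void main(String[] args) {
        Map<String, Object> order = new LinkedHashMap<>();
        order.put("orderId", 42);
        order.put("currency", "EUR");

        // entries() flattens every map entry into a top-level JSON field
        logger.info("Order accepted {}", entries(order));

        // array() emits a JSON array field: "tags":["vip","beta"]
        logger.info("User segments {}", array("tags", "vip", "beta"));
    }
}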

Compare required Logstash configuration with previous one:

input {
  beats {
    port => 5044
    codec => json
  }
  
}

output {
    elasticsearch {
        hosts => "localhost:9200"
        index => "logstash-classic"
    }
}

Finally resulting in this output in ElasticSearch



Notice (again) that "marketId" and "newStatus" are recognized without Grok or KV plugins configured at all in Logstash. Moreover, there won't be false-positives (I talked about this in the KV section some lines above). Some more extra fields pop-up coming from logstash-logback-encoder.

MDC

Example code in GitHub

One of the most important and useful tools in modern log frameworks is the Mapped Diagnostic Context (MDC).

The most common scenario where MDC shines is log correlation within a service and across services; MDC uses a ThreadLocal to store a map of key/values for each thread that uses the logger.

If you create an aspect or a filter in your service that assigns a well-known key in that map (e.g. "rid" for "request id") with a new UUID value per request, all logs produced by that request will have the same "rid". This value can also be picked up by HTTP/JMS clients to propagate the "rid" to other services that we own, keeping the same id across all services.
This is a widespread industry pattern to trace calls across distributed services.

In any service that uses multiple threads to execute a given piece of code in parallel, this is at least as good as using the thread name to distinguish actions from one thread or another. However, the thread name won't help you if a task is started by one thread and finished by another, or if you want to know what happened with that request in the next service it invoked.
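A sketch of the kind of filter mentioned above (assuming a Servlet 4.0+ API, where init() and destroy() have default implementations; names are mine, not from the example repo):

import java.io.IOException;
import java.util.UUID;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

import org.slf4j.MDC;

public class RequestIdFilter implements Filter {

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        // New "rid" per request; removed afterwards so the pooled thread
        // does not leak it into the next request it serves.
        MDC.put("rid", UUID.randomUUID().toString());
        try {
            chain.doFilter(request, response);
        } finally {
            MDC.remove("rid");
        }
    }
}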

Simulating such behavior with the following code:

public static void main(String[] args) {
    MDC.put("rid", UUID.randomUUID().toString());
    logger.info("New Market Status {} {}", kv("marketId", 1234), kv("newStatus", "ready"));
    logger.info("Some more logging");
}

Console output for MDC requires some extra configuration per field, so don't get too excited adding fields there.

<encoder>
    <pattern>%d{HH:mm:ss.SSS} %X{rid} [%thread] %-5level %logger{36} - %msg%n</pattern>
</encoder>

So now we are able to see it in the console/file

20:27:28.007 b5fab580-12f4-4266-8a97-927d03d1b01a [main] INFO  Main4 - New Market Status 
marketId=1234 newStatus=ready
20:27:28.043 b5fab580-12f4-4266-8a97-927d03d1b01a [main] INFO  Main4 - Some more logging

And, given the strict format, we can read it with Grok like any other field (notice the extra configuration needed per new MDC field):

grok {
    match => [ "message" , "%{TIME:@timestamp} %{UUID:rid} \[%{WORD:thread}\] %{LOGLEVEL:loglevel}\s+%{WORD:classname} - %{GREEDYDATA:message}" ]
    overwrite => ["message"]
}

Once again, with the right Logstash configuration

input {
  beats {
    port => 5044
    codec => json
  }
  
}

output {
    elasticsearch {
        hosts => "localhost:9200"
        index => "logstash-classic"
    }
}

and the right Logback configuration (below) we can get MDC mapped in JSON with no effort for new fields at all:

{"@timestamp":"2018-09-06T20:27:28.007+02:00","thread_name":"main",
"rid":"b5fab580-12f4-4266-8a97-927d03d1b01a","logger_name":"Main4",
"message":"New Market Status marketId=1234 newStatus=ready","level":"INFO",
"marketId":1234,"newStatus":"ready"}
 
{"@timestamp":"2018-09-06T20:27:28.043+02:00","thread_name":"main",
"rid":"b5fab580-12f4-4266-8a97-927d03d1b01a","logger_name":"Main4",
"message":"Some more logging","level":"INFO"}

Once again, "rid" appears outside the message and as independent field, so it is automatically accepted by Logstash and indexed by ElasticSearch. No extra Logstash configuration either.


Logback Configuration for JSON using logstash-logback-encoder

It's not the purpose of this article to go deep into configuration details; I'm just mentioning that this is the configuration I'm using for these examples. The best guide to get this right is (GitHub) logstash-logback-encoder.

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <appender name="stash" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>logFile.json</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <fileNamePattern>file.log.%d{yyyy-MM-dd}</fileNamePattern>
            <maxHistory>30</maxHistory>
        </rollingPolicy>
        <encoder class="net.logstash.logback.encoder.LoggingEventCompositeJsonEncoder">
            <providers>
                <timestamp/>
                <threadName/>
                <mdc/>
                <loggerName/>
                <message/>
                <logLevel/>
                <arguments/>
                <stackTrace/>
                <stackHash/>
            </providers>
        </encoder>
    </appender>

    <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
        </encoder>
    </appender>
    
    <root level="all">
        <appender-ref ref="stash" />
        <appender-ref ref="STDOUT" />
    </root>

</configuration>


Next: 4 - Logging cutting-edge practices (experimental)


jueves, 6 de septiembre de 2018

Level up logs and ELK - JSON logs

Articles index:

  1. Introduction (Everyone)
  2. JSON as logs format (Everyone)
  3. Logging best practices with Logback (Targeting Java DEVs)
  4. Logging cutting-edge practices (Targeting Java DEVs)
  5. Contract first log generator (Targeting Java DEVs)
  6. ElasticSearch VRR Estimation Strategy (Targeting OPS)
  7. VRR Java + Logback configuration (Targeting OPS)
  8. VRR FileBeat configuration (Targeting OPS)
  9. VRR Logstash configuration and Index templates (Targeting OPS)
  10. VRR Curator configuration (Targeting OPS)
  11. Logstash Grok, JSON Filter and JSON Input performance comparison (Targeting OPS)

JSON as logs format

 

JSON Logs

JSON is not the only solution, but it's one of them, and the one I am advocating for until I find something better.

Based on my personal experience across almost a dozen companies, the log lifecycle often looks like this:

Bad log lifecycle:
  1. Logs are created as plain text, using natural language, unstructured.
  2. Lines are batched together if they belong to an exception stacktrace, otherwise treated as individual messages (regex work).
  3. Parsed using Grok, message and timestamp are extracted from parts of the parsing (more regex).
  4. If logs are structured, or unstructured fixed-format, some useful information can be extracted by using Grok (e.g. Apache or Nginx logs, they are always the same, easy).
  5. The OPS team sees how Grok consumes all the CPU it can find and refuses to add more regexes to the expression.
  6. Developers want to get information from the logs to index and plot, so they ask OPS to please add some lines to Grok. Three OPS commit suicide and another two quit the company. Developers finally get what they want.
  7. The DEV team changes the logs without telling OPS, so previously useful information stops flowing in; dashboards are now empty and nobody cares.
It doesn't really matter if your company doesn't match all the previous bullet points. As long as you are using Grok, you will struggle to squeeze 50+ different regexes into a single Grok expression (or a series of them) in order to extract all the information you want; and, as long as your log-writing team is not your Logstash-maintaining team, your Grok configuration will become outdated for good, and it will happen soon.

However, if your application produces JSON, that's it: all fields go through Logstash and end up in ElasticSearch without any OPS intervention at all.

Alternative steps using JSON + Logback + Logstash + ElasticSearch:
  1. Logs are created in JSON; it's the developers' responsibility to choose what extra metrics need to be extracted from the code itself. Even exceptions with stack traces are single-line JSON documents in the log file (see the sketch right after this list).
  2. Log files are taken by FileBeat and sent to Logstash line by line. This configuration is written once and won't change much after that.
  3. Logstash takes these lines and sends them to their index in ElasticSearch without any other processing; again, written once (for all applications, not even once per application).
  4. ElasticSearch takes this information, indexed per application, day and priority; it keeps the extra fields that developers put in the logs in the first place.
  5. When developers want to expose more fields, they don't need to bother anyone: if it's in the logs, it will be in ElasticSearch. (Maybe asking for a reindexing in ES every now and then, but not much more.)
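To illustrate point 1, a hedged sketch: with the JSON encoder and its stackTrace provider from chapter 3, the whole exception becomes fields of a single JSON line, so FileBeat never has to stitch multiline events back together.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class FailingJob {

    private static final Logger logger = LoggerFactory.getLogger(FailingJob.class);

    public static void main(String[] args) {
        try {
            throw new IllegalStateException("downstream service unavailable");
        } catch (IllegalStateException e) {
            // Message, level and stack trace all end up in ONE JSON document
            logger.error("Scheduled job failed", e);
        }
    }
}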

Using JSON as your log format is, by all means, part of the solution I am presenting here.

I haven't yet explored other topologies, like using Fluentd instead of Logstash, or FileBeat sending to ElasticSearch directly.


Next: 3 - Logging best practices with Logback


miércoles, 5 de septiembre de 2018

Level up logs and ELK - Introduction

Articles index:

    1. Introduction (Everyone)
    2. JSON as logs format (Everyone)
    3. Logging best practices with Logback (Targeting Java DEVs)
    4. Logging cutting-edge practices (Targeting Java DEVs)
    5. Contract first log generator (Targeting Java DEVs)
    6. ElasticSearch VRR Estimation Strategy (Targeting OPS)
    7. VRR Java + Logback configuration (Targeting OPS)
    8. VRR FileBeat configuration (Targeting OPS)
    9. VRR Logstash configuration and Index templates (Targeting OPS)
    10. VRR Curator configuration (Targeting OPS)
    11. Logstash Grok, JSON Filter and JSON Input performance comparison (Targeting OPS)

       

      Introduction

       

      Why this? Why now?

      This is the result of many years as a developer knowing that there was something called "logs":
      A log is something super important that you cannot change because someone reads them; you cannot read them either, because you don't have ssh access to the boxes they are generated on. Write them, but not too much, the disk may fill up.
      Then I learned how to write them, then suffered reading them (grep), until I learned there was a super expensive tool that could collect, sort, query and present them for you.
      Then I loved them; I always thought of them as the ultimate audit tool, but still too many colleagues preferred to use the database for that sort of functionality.

      I got better at logging, as if it were a nice story happening in my application, and also learned how to correlate logs across multiple services, but I barely managed to create good dashboards. It happened that Splunk was too expensive to buy, and ElasticSearch too expensive to maintain. Infamous years without managing logs happened again.

      Finally I got a job that involved architecture-level monitoring decisions and got the opportunity to develop a logging strategy for ElasticSearch (Splunk and other managed platforms didn't require that much hard thinking, as they provide the know-how and setup time). The strategy I will be developing in the next few articles came as a solution to many restrictions common to all the companies I've been around in the last decade.

      It is a long story; I will try to make it concise. Bear with me and, if you belong to that huge 95% of companies that use ElasticSearch as a super-grep, you'll raise your game.

      Objectives of this series of articles:

      1. Save up to 90% disk space based on VRR (Variable Replication factor and Retention) estimations by playing with replication, retention and custom classification.
          • Differentiate important from redundant information and apply different policies to them.
      2. Log useful information, for once, that you will be able to filter, query, plot and alert on.
        • We are covering parameters, structured arguments, and how to avoid needing Grok to parse them.
      3. Save tons of OPS time by using the right tools to empower DEVs to be responsible for their logs.
        •  Let's avoid bothering our heroes with each change in a log line. Minimizing OPS time is paramount.

      Some assumptions:

      • All my examples will orbit around Java applications using SLF4J log framework, backed by Logback.
      • Logs are dumped to files, read by FileBeat, sent to Logstash.
      • Logstash receives the log lines and sends them to ElasticSearch after some processing.
      • Kibana as ElasticSearch UI.

      Even if your stack is not 100% identical, I am sure you can apply some bits from here.


      Next:  2 - JSON as logs format