grok

Milestone: 3

Parse arbitrary text and structure it.

Grok is currently the best way in logstash to parse crappy unstructured log data into something structured and queryable.

This tool is perfect for syslog logs, apache and other webserver logs, mysql logs, and in general, any log format that is generally written for humans and not computer consumption.

Logstash ships with about 120 patterns by default. You can find them here: https://github.com/logstash/logstash/tree/v1.4.2/patterns. You can add your own trivially. (See the patterns_dir setting)

If you need help building patterns to match your logs, you will find the http://grokdebug.herokuapp.com too quite useful!

Grok Basics

Grok works by combining text patterns into something that matches your logs.

The syntax for a grok pattern is %{SYNTAX:SEMANTIC}

The SYNTAX is the name of the pattern that will match your text. For example, “3.44” will be matched by the NUMBER pattern and “55.3.244.1” will be matched by the IP pattern. The syntax is how you match.

The SEMANTIC is the identifier you give to the piece of text being matched. For example, “3.44” could be the duration of an event, so you could call it simply ‘duration’. Further, a string “55.3.244.1” might identify the ‘client’ making a request.

For the above example, your grok filter would look something like this:

%{NUMBER:duration} %{IP:client}

Optionally you can add a data type conversion to your grok pattern. By default all semantics are saved as strings. If you wish to convert a semantic’s data type, for example change a string to an integer then suffix it with the target data type. For example %{NUMBER:num:int} which converts the ‘num’ semantic from a string to an integer. Currently the only supported conversions are int and float.

Example

With that idea of a syntax and semantic, we can pull out useful fields from a sample log like this fictional http request log:

55.3.244.1 GET /index.html 15824 0.043

The pattern for this could be:

%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}

A more realistic example, let’s read these logs from a file:

input {
  file {
    path => "/var/log/http.log"
  }
}
filter {
  grok {
    match => [ "message", "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}" ]
  }
}

After the grok filter, the event will have a few extra fields in it:

client: 55.3.244.1
method: GET
request: /index.html
bytes: 15824
duration: 0.043

Regular Expressions

Grok sits on top of regular expressions, so any regular expressions are valid in grok as well. The regular expression library is Oniguruma, and you can see the full supported regexp syntax on the Onigiruma site.

Custom Patterns

Sometimes logstash doesn’t have a pattern you need. For this, you have a few options.

First, you can use the Oniguruma syntax for ‘named capture’ which will let you match a piece of text and save it as a field:

(?<field_name>the pattern here)

For example, postfix logs have a ‘queue id’ that is an 10 or 11-character hexadecimal value. I can capture that easily like this:

(?<queue_id>[0-9A-F]{10,11})

Alternately, you can create a custom patterns file.

Create a directory called patterns with a file in it called extra (the file name doesn’t matter, but name it meaningfully for yourself)
In that file, write the pattern you need as the pattern name, a space, then the regexp for that pattern.

For example, doing the postfix queue id example as above:

# contents of ./patterns/postfix:
POSTFIX_QUEUEID [0-9A-F]{10,11}

Then use the patterns_dir setting in this plugin to tell logstash where your custom patterns directory is. Here’s a full example with a sample log:

Jan  1 06:25:43 mailserver14 postfix/cleanup[21403]: BEF25A72965: message-id=<20130101142543.5828399CCAF@mailserver14.example.com>

filter {
  grok {
    patterns_dir => "./patterns"
    match => [ "message", "%{SYSLOGBASE} %{POSTFIX_QUEUEID:queue_id}: %{GREEDYDATA:syslog_message}" ]
  }
}

The above will match and result in the following fields:

timestamp: Jan 1 06:25:43
logsource: mailserver14
program: postfix/cleanup
pid: 21403
queue_id: BEF25A72965
syslog_message: message-id=20130101142543.5828399CCAF@mailserver14.example.com

The timestamp, logsource, program, and pid fields come from the SYSLOGBASE pattern which itself is defined by other patterns.

Synopsis

This is what it might look like in your config file:

filter {
  grok {
    add_field => ... # hash (optional), default: {}
    add_tag => ... # array (optional), default: []
    break_on_match => ... # boolean (optional), default: true
    drop_if_match => ... # boolean (optional), default: false
    keep_empty_captures => ... # boolean (optional), default: false
    match => ... # hash (optional), default: {}
    named_captures_only => ... # boolean (optional), default: true
    overwrite => ... # array (optional), default: []
    patterns_dir => ... # array (optional), default: []
    remove_field => ... # array (optional), default: []
    remove_tag => ... # array (optional), default: []
    tag_on_failure => ... # array (optional), default: ["_grokparsefailure"]
  }
}

Details

add_field

Value type is hash
Default value is {}

If this filter is successful, add any arbitrary fields to this event. Field names can be dynamic and include parts of the event using the %{field} Example:

filter {
  grok {
    add_field => { "foo_%{somefield}" => "Hello world, from %{host}" }
  }
}

# You can also add multiple fields at once:

filter {
  grok {
    add_field => { 
      "foo_%{somefield}" => "Hello world, from %{host}"
      "new_field" => "new_static_value"
    }
  }
}

If the event has field “somefield” == “hello” this filter, on success, would add field “foo_hello” if it is present, with the value above and the %{host} piece replaced with that value from the event. The second example would also add a hardcoded field.

add_tag

Value type is array
Default value is []

If this filter is successful, add arbitrary tags to the event. Tags can be dynamic and include parts of the event using the %{field} syntax. Example:

filter {
  grok {
    add_tag => [ "foo_%{somefield}" ]
  }
}

# You can also add multiple tags at once:
filter {
  grok {
    add_tag => [ "foo_%{somefield}", "taggedy_tag"]
  }
}

If the event has field “somefield” == “hello” this filter, on success, would add a tag “foo_hello” (and the second example would of course add a “taggedy_tag” tag).

break_on_match

Value type is boolean
Default value is true

Break on first match. The first successful match by grok will result in the filter being finished. If you want grok to try all patterns (maybe you are parsing different things), then set this to false.

drop_if_match

Value type is boolean
Default value is false

Drop if matched. Note, this feature may not stay. It is preferable to combine grok + grep filters to do parsing + dropping.

exclude_tags DEPRECATED

DEPRECATED WARNING: This config item is deprecated. It may be removed in a further version.
Value type is array
Default value is []

Only handle events without all/any (controlled by exclude_any config option) of these tags. Optional.

keep_empty_captures

Value type is boolean
Default value is false

If true, keep empty captures as event fields.

match

Value type is hash
Default value is {}

A hash of matches of field => value

For example:

filter {
  grok {
    match => [ "message", "Duration: %{NUMBER:duration}" ]
  }
}

named_captures_only

Value type is boolean
Default value is true

If true, only store named captures from grok.

overwrite

Value type is array
Default value is []

The fields to overwrite.

This allows you to overwrite a value in a field that already exists.

For example, if you have a syslog line in the ‘message’ field, you can overwrite the ‘message’ field with part of the match like so:

filter {
  grok {
    match => [
      "message",
      "%{SYSLOGBASE} %{DATA:message}"
    ]
    overwrite => [ "message" ]
  }
}

In this case, a line like “May 29 16:37:11 sadness logger: hello world” will be parsed and ‘hello world’ will overwrite the original message.

pattern DEPRECATED

DEPRECATED WARNING: This config item is deprecated. It may be removed in a further version.
Value type is array
There is no default value for this setting.

Specify a pattern to parse with. This will match the ‘message’ field.

If you want to match other fields than message, use the ‘match’ setting. Multiple patterns is fine.

patterns_dir

Value type is array
Default value is []

logstash ships by default with a bunch of patterns, so you don’t necessarily need to define this yourself unless you are adding additional patterns.

Pattern files are plain text with format:

NAME PATTERN

For example:

NUMBER \d+

remove_field

Value type is array
Default value is []

If this filter is successful, remove arbitrary fields from this event. Fields names can be dynamic and include parts of the event using the %{field} Example:

filter {
  grok {
    remove_field => [ "foo_%{somefield}" ]
  }
}

# You can also remove multiple fields at once:

filter {
  grok {
    remove_field => [ "foo_%{somefield}" "my_extraneous_field" ]
  }
}

If the event has field “somefield” == “hello” this filter, on success, would remove the field with name “foo_hello” if it is present. The second example would remove an additional, non-dynamic field.

remove_tag

Value type is array
Default value is []

If this filter is successful, remove arbitrary tags from the event. Tags can be dynamic and include parts of the event using the %{field} syntax. Example:

filter {
  grok {
    remove_tag => [ "foo_%{somefield}" ]
  }
}

# You can also remove multiple tags at once:

filter {
  grok {
    remove_tag => [ "foo_%{somefield}", "sad_unwanted_tag"]
  }
}

If the event has field “somefield” == “hello” this filter, on success, would remove the tag “foo_hello” if it is present. The second example would remove a sad, unwanted tag as well.

singles DEPRECATED

DEPRECATED WARNING: This config item is deprecated. It may be removed in a further version.
Value type is boolean
Default value is true

If true, make single-value fields simply that value, not an array containing that one value.

tag_on_failure

Value type is array
Default value is ["_grokparsefailure"]

Append values to the ‘tags’ field when there has been no successful match

tags DEPRECATED

DEPRECATED WARNING: This config item is deprecated. It may be removed in a further version.
Value type is array
Default value is []

Only handle events with all/any (controlled by include_any config option) of these tags. Optional.

type DEPRECATED

DEPRECATED WARNING: This config item is deprecated. It may be removed in a further version.
Value type is string
Default value is ""

Note that all of the specified routing options (type,tags.exclude_tags,include_fields,exclude_fields) must be met in order for the event to be handled by the filter. The type to act on. If a type is given, then this filter will only act on messages with the same type. See any input plugin’s “type” attribute for more. Optional.

This is documentation from lib/logstash/filters/grok.rb