Hadoop and HBase are complicated pieces of software with lots of moving parts inside. It's essential to keep an eye on them to be sure your installation is healthy and not going to break in the next five minutes.
To do this, Hadoop has a subsystem called metrics, which provides a way to send values outside. Usually such a value is just a number representing something: a count of requests performed, the size of some buffer, etc.
The most common setup is to feed these values into Ganglia to build nice charts out of them. There are lots of tutorials on the net on how to do this: just google for «ganglia hadoop».
This article describes a less known feature, filtering of metrics, which becomes unavoidable for mid-range HBase installations.
I assume some prior knowledge of Hadoop and Ganglia.
Some time ago, HBase developers decided that emitting metrics about regionservers was not enough and started to produce metrics for the regions themselves. As regions tend to migrate from node to node, a new Ganglia chart will eventually be created for every region on every machine in the cluster.
Let's do the math. There are about 30 metrics per region. If you have 1000 regions live on 100 machines (pretty moderate), migration will eventually spread each region's charts over every host: 30 × 1000 × 100 = 3 million fresh, completely useless RRD files on your Ganglia servers. They are useless because on every region migration the partly-built chart moves to another RRD file.
I faced this problem right after the CDH4 migration, and it was disastrous, as we had a much larger cluster than in this toy example (300 machines, 20k regions). At that time, the only way to resolve it was to patch HBase to stop emitting these values. Since HBase 0.94 there is a better solution: metrics2 filters.
To start filtering events, several decisions must be made:
- filter class
- level of filtering
- what to filter
There are two classes implemented which provide the actual filtering, with different pattern syntax: org.apache.hadoop.metrics2.filter.GlobFilter (shell-style glob patterns) and org.apache.hadoop.metrics2.filter.RegexFilter (Java regular expressions).
The filtering rule has the following syntax: subsystem.[sink|source].sink_name.[source|record|metric].filter.[include|exclude]=pattern, where:
- subsystem – the kind of daemon: hbase, yarn, hdfs, etc.
- sink|source – whether the rule is applied when metrics are emitted to a sink or collected from a source; I just used sink and it works
- sink_name – the arbitrary name of the sink used
- source|record|metric – the level at which the filter operates
- include|exclude – whether the rule includes or excludes matching metrics. If all rules are exclude, blacklist logic is used; if all are include, whitelist logic.
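Putting the parts together, a rule is a single line in the metrics2 properties file. A sketch (the sink name «ganglia» and the pattern are assumptions from my setup, not required values):

```properties
# a filter implementation must be selected for the level you filter at
*.record.filter.class=org.apache.hadoop.metrics2.filter.GlobFilter
# subsystem.[sink|source].sink_name.[source|record|metric].filter.[include|exclude]=pattern
# e.g. exclude every record whose name contains "WAL" from the ganglia sink (illustrative):
hbase.sink.ganglia.record.filter.exclude=*WAL*
```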
There are three levels at which filtering can be performed:
- source – a large group of metrics, usually a whole subsystem (see below)
- record – a set of metrics grouped together. By default, the class name is taken as the record name
- metric – the name of an emitted metric, for example blockCacheHitCount (note that this is the short name, not the full metric name that appears in Ganglia; the filter sees 'blockCacheHitCount', not 'regionserver.Server.blockCacheHitCount')
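A metric-level rule therefore matches on the short name only. A sketch (again, the sink name «ganglia» is an assumption):

```properties
*.metric.filter.class=org.apache.hadoop.metrics2.filter.GlobFilter
# matches the short name "blockCacheHitCount", not "regionserver.Server.blockCacheHitCount"
hbase.sink.ganglia.metric.filter.exclude=blockCacheHitCount
```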
Names to filter
It's a bit tricky to get the list of metric groups to filter on, as they are hardcoded in the sources. The simplest way to find all metrics provided by a daemon is the 'Metrics dump' tab in the web interface of the master or a regionserver. It returns JSON with all metrics, their groups and values. For example, here is a small part of the master's dump:
"name" : "Hadoop:service=HBase,name=Master,sub=AssignmentManger",
"modelerType" : "Master,sub=AssignmentManger",
"tag.Context" : "master",
"tag.Hostname" : "dhcp-21-64",
"ritOldestAge" : 0,
"ritCount" : 0,
"BulkAssign_num_ops" : 1,
"BulkAssign_min" : 232,
"BulkAssign_max" : 232,
"BulkAssign_mean" : 232.0,
"BulkAssign_median" : 232.0,
"BulkAssign_75th_percentile" : 232.0,
"BulkAssign_95th_percentile" : 232.0,
"BulkAssign_99th_percentile" : 232.0,
"ritCountOverThreshold" : 0,
"Assign_num_ops" : 1,
"Assign_min" : 82,
"Assign_max" : 82,
"Assign_mean" : 82.0,
"Assign_median" : 82.0,
"Assign_75th_percentile" : 82.0,
"Assign_95th_percentile" : 82.0,
"Assign_99th_percentile" : 82.0
Under the key 'name' we get the source and record of this set of metrics (the master's assignment manager, which performs region assignment). 'Master' is the source (top level) and 'AssignmentManger' (note the typo) is the record. The final metric name is a dot-combination of these parts (with a somewhat random lower-case transform): «master.AssignmentManger.Assign_max».
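The name composition described above can be sketched in a few lines of Python (the lower-casing of the source part is an observation from my charts, not a documented rule, and the helper name is mine):

```python
# Sketch: derive the final Ganglia metric name from a 'Metrics dump' bean name.
def ganglia_name(bean_name: str, metric: str) -> str:
    # bean_name looks like "Hadoop:service=HBase,name=Master,sub=AssignmentManger"
    attrs = dict(kv.split("=", 1) for kv in bean_name.split(":", 1)[1].split(","))
    source = attrs["name"]             # top-level source, e.g. "Master"
    record = attrs.get("sub", source)  # record, e.g. "AssignmentManger" (typo is upstream)
    # source is lower-cased in the final name, record and metric are kept as-is
    return f"{source.lower()}.{record}.{metric}"

print(ganglia_name("Hadoop:service=HBase,name=Master,sub=AssignmentManger", "Assign_max"))
# master.AssignmentManger.Assign_max
```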
So, to filter AssignmentManager metrics out of Ganglia, you can write something like this:
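A record-level sketch (the sink name «ganglia» is an assumption from my setup):

```properties
*.record.filter.class=org.apache.hadoop.metrics2.filter.GlobFilter
# the record name keeps the upstream typo: AssignmentManger
hbase.sink.ganglia.record.filter.exclude=AssignmentManger
```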
Rules to grab
These rules filter out metrics that are not very interesting or just plain useless (from my point of view):
# Warning: this must be an address of a gmond mentioned in gmetad's sources directive
hbase.sink.ganglia.servers=gmond.example.com:8649
# select glob filter for everything
*.source.filter.class=org.apache.hadoop.metrics2.filter.GlobFilter
*.record.filter.class=org.apache.hadoop.metrics2.filter.GlobFilter
*.metric.filter.class=org.apache.hadoop.metrics2.filter.GlobFilter
# remove these messy useless pseudo-statistical metrics (the pattern is an example, tune to taste)
hbase.sink.ganglia.metric.filter.exclude=*_percentile
# filter out regions metrics completely, as Ganglia has no idea how to separate them from hosts
hbase.sink.ganglia.record.filter.exclude=*Regions*
There are several things which must be kept in mind when you configure filters:
- in hadoop-2.3.0, metrics2 supports only one filter expression per include/exclude rule at each filtering level: it takes the first one and ignores the rest. This is not clear from the documentation (maybe it is already fixed), so it's a bit confusing.
- there is no good documentation or schema definition for the config file, so typos in the config do not produce warning messages in the log. Type carefully and check twice; I personally wasted about three days on one small typo that produced no warnings.
- in the sink address you should put the address of one of the gmonds your gmetad collects data from.