Analysis of the change in Ganglia's running processes before and after the HDFS patch
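How does Ganglia compute running processes? It takes the value from /proc/loadavg (the same data that cat /proc/loadavg prints).
A typical reading looks like this: 0.00 0.28 0.61 1/591 2993. Here 1 is the number of running processes and 591 is the total number of processes.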
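The field layout above can be made explicit with a small parser. This is a minimal sketch, not Ganglia's own code: the names LoadAvg, parse_loadavg and read_loadavg are made up for illustration. The first three fields are the 1, 5 and 15 minute load averages, and the last field is the most recently created PID.

```python
from typing import NamedTuple


class LoadAvg(NamedTuple):
    load1: float    # 1-minute load average
    load5: float    # 5-minute load average
    load15: float   # 15-minute load average
    running: int    # currently runnable scheduling entities (threads count too)
    total: int      # total scheduling entities on the system
    last_pid: int   # PID of the most recently created process


def parse_loadavg(text: str) -> LoadAvg:
    """Parse one line in /proc/loadavg format, e.g. '0.00 0.28 0.61 1/591 2993'."""
    load1, load5, load15, entities, last_pid = text.split()
    running, total = entities.split("/")
    return LoadAvg(float(load1), float(load5), float(load15),
                   int(running), int(total), int(last_pid))


def read_loadavg(path: str = "/proc/loadavg") -> LoadAvg:
    """Read and parse the live value (Linux only)."""
    with open(path) as f:
        return parse_loadavg(f.read())
```

For the sample line above, parse_loadavg returns running=1 and total=591.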
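To track down the 14 running processes that suddenly appeared on the Ganglia graph, I investigated further. The total number of processes listed by ps Haxh matches the total in /proc/loadavg, and the entries whose state is R are the running processes. I then wrote a script that, while the namenode was starting up, printed /proc/loadavg and the ps Haxh data every 0.5 seconds, to see when the 14 running processes showed up.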
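The original script is not reproduced in this note. A sampling loop along the same lines might look like the sketch below. The function names count_states, sample_once and monitor are hypothetical, and the code assumes that ps Haxh prints its default BSD columns (PID TTY STAT TIME COMMAND) with no header line.

```python
import subprocess
import time
from collections import Counter


def count_states(ps_output: str) -> Counter:
    """Count threads by the first letter of the STAT column of `ps Haxh` output."""
    states = Counter()
    for line in ps_output.splitlines():
        fields = line.split(None, 4)  # PID TTY STAT TIME COMMAND
        if len(fields) >= 3:
            states[fields[2][0]] += 1  # "Rl" and "R+" both count as "R"
    return states


def sample_once():
    """Take one snapshot of /proc/loadavg and the per-thread state counts."""
    with open("/proc/loadavg") as f:
        loadavg = f.read().strip()
    ps_output = subprocess.run(["ps", "Haxh"], capture_output=True,
                               text=True, check=True).stdout
    return loadavg, count_states(ps_output)


def monitor(duration_s: float = 60.0, interval_s: float = 0.5) -> None:
    """Print a snapshot every interval_s seconds for duration_s seconds."""
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        loadavg, states = sample_once()
        print(time.strftime("%H:%M:%S"), loadavg,
              "R=%d total=%d" % (states["R"], sum(states.values())), flush=True)
        time.sleep(interval_s)
```

Calling monitor() while the namenode starts prints one line per sample. Note that the ps process itself is normally in state R at the moment it is sampled, so the R count includes it.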
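In the end I found that, with or without the patch, running processes can rise from 1 to 6, or even to 14. These processes are threads of the namenode. In some situations their state is Rl, where R means running and l means multi-threaded, and /proc/loadavg counts them as running processes. These threads run only for a very short time, and Ganglia collects data once a minute, so it can easily miss them.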
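To get a feel for how easily a once-a-minute sample misses a short spike, here is a back-of-the-envelope sketch. It assumes a single burst of runnable threads, an instantaneous sample at a fixed period, and a random offset between the two. The function name catch_probability and the example burst lengths are hypothetical and were not measured in these tests.

```python
def catch_probability(burst_s: float, period_s: float = 60.0) -> float:
    """Chance that a periodic instantaneous sample lands inside one burst.

    Assumes the burst starts at a uniformly random offset relative to the
    sampling schedule; a burst at least one period long is always seen.
    """
    if burst_s < 0 or period_s <= 0:
        raise ValueError("burst_s must be >= 0 and period_s must be > 0")
    return min(1.0, burst_s / period_s)


# Hypothetical burst lengths, in seconds, with a 60-second sampling period.
for burst in (0.5, 3.0, 60.0):
    print("burst %.1fs -> %.1f%% chance of being sampled"
          % (burst, 100 * catch_probability(burst)))
```

Under these assumptions a burst of a few seconds is seen only a few percent of the time, which fits a spike that shows up on some graphs and not on others.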
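So the difference I saw on the Ganglia graphs during my earlier tests was only a coincidence, and it led me to believe that adding the patch caused a problem. After several more experiments, I saw that running processes also rise on the Ganglia graph when the patch is not applied.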
PROCESS STATE CODES (from the ps man page)
Here are the different values that the s, stat and state output specifiers (header "STAT" or "S") will display to describe the state of a process.
D Uninterruptible sleep (usually IO)
R Running or runnable (on run queue)
S Interruptible sleep (waiting for an event to complete)
T Stopped, either by a job control signal or because it is being traced.
W paging (not valid since the 2.6.xx kernel)
X dead (should never be seen)
Z Defunct ("zombie") process, terminated but not reaped by its parent.
For BSD formats and when the stat keyword is used, additional characters may be displayed:
< high-priority (not nice to other users)
N low-priority (nice to other users)
L has pages locked into memory (for real-time and custom IO)
s is a session leader
l is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)
+ is in the foreground process group
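The two lists above can be turned into a small lookup to decode a STAT value such as Rl. This is an illustrative sketch: the names PRIMARY_STATES, STAT_FLAGS and decode_stat are made up here, and the tables contain only the codes listed above.

```python
# First character of STAT: the process state.
PRIMARY_STATES = {
    "D": "uninterruptible sleep (usually IO)",
    "R": "running or runnable (on run queue)",
    "S": "interruptible sleep (waiting for an event to complete)",
    "T": "stopped, either by a job control signal or because it is being traced",
    "W": "paging (not valid since the 2.6.xx kernel)",
    "X": "dead (should never be seen)",
    "Z": "defunct (zombie) process, terminated but not reaped by its parent",
}

# Remaining characters of STAT: additional BSD-format flags.
STAT_FLAGS = {
    "<": "high-priority (not nice to other users)",
    "N": "low-priority (nice to other users)",
    "L": "has pages locked into memory",
    "s": "session leader",
    "l": "multi-threaded (using CLONE_THREAD, like NPTL pthreads do)",
    "+": "in the foreground process group",
}


def decode_stat(stat: str) -> list:
    """Return the description of the state letter followed by each flag."""
    if not stat or stat[0] not in PRIMARY_STATES:
        raise ValueError("unknown process state: %r" % stat)
    parts = [PRIMARY_STATES[stat[0]]]
    for flag in stat[1:]:
        parts.append(STAT_FLAGS.get(flag, "unknown flag %r" % flag))
    return parts


print(decode_stat("Rl"))
```

For Rl this gives the "running or runnable" state plus the "multi-threaded" flag, which is the combination seen on the namenode threads.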