Symptom: a routine inspection of the HDP platform found that port 16030 on one HBase RegionServer was consistently unreachable.
Log in to that machine to investigate.
## Check whether the port is being listened on -- it is not
netstat -tunlp | grep 16030
## Confirm that port 16030 is not held by any other program either
ss -an | grep 16030
## Check whether the process is running -- it is
[root@example05 hbase]# ps aux | grep -i regionserve
root 386 0.0 0.0 112712 988 pts/0 S+ 08:38 0:00 grep --color=auto -i regionserve
hbase 61101 0.0 0.0 113188 1544 ? S 08:30 0:00 bash /usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh --config /usr/hdp/current/hbase-regionserver/conf foreground_start regionserver
hbase 61115 98.4 6.5 44891440 17169096 ? Sl 08:30 7:48 /usr/local/jdk1.8.0_112/bin/java -Dproc_regionserver -XX:OnOutOfMemoryError=kill -9 %p -Dhdp.version=2.6.5.0-292 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hbase/hs_err_pid%p.log -Djava.security.auth.login.config=/usr/hdp/current/hbase-regionserver/conf/hbase_client_jaas.conf -Djava.io.tmpdir=/tmp -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/hbase/gc.log-202201070830 -Xmn6144m -XX:CMSInitiatingOccupancyFraction=50 -XX:+UseCMSInitiatingOccupancyOnly -Xms30720m -Xmx30720m -Djava.security.auth.login.config=/usr/hdp/current/hbase-regionserver/conf/hbase_regionserver_jaas.conf -Djavax.security.auth.useSubjectCredsOnly=false -XX:MaxDirectMemorySize=12288m -Dhbase.log.dir=/var/log/hbase -Dhbase.log.file=hbase-hbase-regionserver-example05.hdpprd.example.com.log -Dhbase.home.dir=/usr/hdp/current/hbase-regionserver/bin/.. -Dhbase.id.str=hbase -Dhbase.root.logger=INFO,RFA -Djava.library.path=:/usr/hdp/2.6.5.0-292/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.6.5.0-292/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.6.5.0-292/hadoop/lib/native -Dhbase.security.logger=INFO,RFAS org.apache.hadoop.hbase.regionserver.HRegionServer start
So far we know the process is running, but the port is not being listened on.
## Check whether the JVM is stuck in garbage collection -- GC looks normal
[root@example05 hbase]# /usr/local/jdk1.8.0_112/bin/jstat -gcutil <pid> 1s
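The GC check above can be scripted end to end; a minimal sketch, assuming the RegionServer JVM is identifiable by its `-Dproc_regionserver` flag (its PID was 61115 in the `ps` output above) and that the JDK lives at the path used on this cluster:

```shell
#!/bin/sh
# Locate the RegionServer JVM by its -Dproc_regionserver flag and sample GC
# utilization once per second. The jstat path is this cluster's; adjust it
# for your environment.
JSTAT=/usr/local/jdk1.8.0_112/bin/jstat
# The [D] bracket trick stops pgrep -f from matching its own command line.
PID=$(pgrep -f '[D]proc_regionserver' | head -n 1)
if [ -n "$PID" ] && [ -x "$JSTAT" ]; then
  # 10 one-second samples: survivor/eden/old/metaspace usage and GC times
  "$JSTAT" -gcutil "$PID" 1s 10
else
  echo "RegionServer JVM or jstat not found on this host"
fi
```

If the old-generation column (`O`) were pinned near 100% with back-to-back full GCs, a stalled JVM would explain an unresponsive port; here the numbers were healthy, so the investigation moved on to the logs.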
Check the RegionServer log:
[root@example05 ~]# tail -f /var/log/hbase/hbase-hbase-regionserver-example05.hdpprd.example.com.log
## ... part of the output omitted
2022-01-07 07:29:59,378 INFO [main-SendThread(hdpprdm03.hdpprd.example.com:2181)] zookeeper.ClientCnxn: Opening socket connection to server hdpprdm03.hdpprd.example.com/192.168.0.96:2181. Will attempt to SASL-authenticate using Login Context section 'Client'
2022-01-07 07:29:59,379 INFO [main-SendThread(hdpprdm03.hdpprd.example.com:2181)] zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.0.37:37802, server: hdpprdm03.hdpprd.example.com/192.168.0.96:2181
2022-01-07 07:29:59,379 INFO [main-SendThread(hdpprdm03.hdpprd.example.com:2181)] zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
2022-01-07 07:30:00,161 INFO [main-SendThread(hdpprdm01.hdpprd.example.com:2181)] client.ZooKeeperSaslClient: Client will use GSSAPI as SASL mechanism.
2022-01-07 07:30:00,161 INFO [main-SendThread(hdpprdm01.hdpprd.example.com:2181)] zookeeper.ClientCnxn: Opening socket connection to server hdpprdm01.hdpprd.example.com/192.168.0.92:2181. Will attempt to SASL-authenticate using Login Context section 'Client'
2022-01-07 07:30:00,162 INFO [main-SendThread(hdpprdm01.hdpprd.example.com:2181)] zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.0.37:39804, server: hdpprdm01.hdpprd.example.com/192.168.0.92:2181
2022-01-07 07:30:00,162 INFO [main-SendThread(hdpprdm01.hdpprd.example.com:2181)] zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
The log shows a flood of failed ZooKeeper connection attempts, retried over and over.
Restarting the RegionServer on this node made no difference.
Next, check the ZooKeeper service itself (including ZooKeeper's GC; the details are omitted here -- nothing abnormal was found).
### Connect to ZooKeeper with the command-line client to see whether it is reachable -- it still connects fine
[root@hdpprdm01 ~]# /usr/hdp/current/zookeeper-client/bin/zkCli.sh -server hdpprdm01.hdpprd.example.com:2181
[root@hdpprdm01 ~]# /usr/hdp/current/zookeeper-client/bin/zkCli.sh -server hdpprdm02.hdpprd.example.com:2181
[root@hdpprdm01 ~]# /usr/hdp/current/zookeeper-client/bin/zkCli.sh -server hdpprdm03.hdpprd.example.com:2181
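Besides zkCli, ZooKeeper's four-letter commands can break the picture down per client. A sketch of aggregating `cons` output into a per-IP connection count -- live usage would be `echo cons | nc hdpprdm01.hdpprd.example.com 2181`; the sample lines below are hypothetical but follow the format `cons` emits:

```shell
#!/bin/sh
# Turn ZooKeeper "cons" output into a per-client-IP connection count.
# Live usage: echo cons | nc hdpprdm01.hdpprd.example.com 2181 | <this pipeline>
# A hypothetical captured sample is inlined so the pipeline runs standalone.
printf '%s\n' \
  ' /192.168.0.37:37802[1](queued=0,recved=12,sent=12)' \
  ' /192.168.0.37:39804[1](queued=0,recved=3,sent=3)' \
  ' /192.168.0.180:41200[1](queued=0,recved=8,sent=8)' \
  | sed -n 's/^ *\/\([0-9.]*\):.*/\1/p' \
  | sort | uniq -c | sort -rn
```

An IP sitting at or near the `maxClientCnxns` cap in this count is the client whose new sessions the server will refuse.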
Check the ZooKeeper log:
[root@hdpprdm01 ~]# tail -f /var/log/zookeeper/zookeeper.log
# ... part of the output omitted
2022-01-07 08:26:13,225 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too many connections from /192.168.0.37 - max is 240
2022-01-07 08:26:13,226 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too many connections from /192.168.0.37 - max is 240
2022-01-07 08:26:13,233 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too many connections from /192.168.0.180 - max is 240
2022-01-07 08:26:13,237 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too many connections from /192.168.0.180 - max is 240
2022-01-07 08:26:13,238 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too many connections from /192.168.0.180 - max is 240
2022-01-07 08:26:13,241 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too many connections from /192.168.0.37 - max is 240
2022-01-07 08:26:13,251 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too many connections from /192.168.0.246 - max is 240
2022-01-07 08:26:13,257 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too many connections from /192.168.0.180 - max is 240
2022-01-07 08:26:13,258 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too many connections from /192.168.0.37 - max is 240
2022-01-07 08:26:13,259 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too many connections from /192.168.0.37 - max is 240
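To see which clients are burning through the limit, the warnings can be aggregated by source IP. A sketch -- in practice the input is /var/log/zookeeper/zookeeper.log as tailed above; a trimmed sample is inlined here so the pipeline runs standalone:

```shell
#!/bin/sh
# Count "Too many connections" warnings per client IP.
# Real input: /var/log/zookeeper/zookeeper.log (tailed above).
sed -n 's/.*Too many connections from \/\([0-9.]*\).*/\1/p' <<'EOF' | sort | uniq -c | sort -rn
2022-01-07 08:26:13,225 - WARN - Too many connections from /192.168.0.37 - max is 240
2022-01-07 08:26:13,233 - WARN - Too many connections from /192.168.0.180 - max is 240
2022-01-07 08:26:13,241 - WARN - Too many connections from /192.168.0.37 - max is 240
EOF
```

On the sample above this prints `2 192.168.0.37` followed by `1 192.168.0.180`; run over the full log it ranks the offending hosts, 192.168.0.37 being the RegionServer under investigation.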
So the client connection count is exceeding the maximum. In the HDP cluster's ZooKeeper configuration, change maxClientCnxns in Custom zoo.cfg from 240 to 500. (While there, we also raised Zookeeper Server Maximum Memory from 1024 to 3072 to head off future ZooKeeper memory trouble; this was not the cause of this incident, and that change is optional.)
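In plain zoo.cfg terms, the fix amounts to the entry below. maxClientCnxns is a standard ZooKeeper server property capping concurrent connections per source IP; on HDP it is pushed to every server via Ambari's Custom zoo.cfg section rather than edited by hand:

```
# zoo.cfg -- per-IP cap on concurrent client connections (was 240)
maxClientCnxns=500
```

Note this is a per-client-IP limit, not a global one, which is why the warnings name individual source addresses.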
Then restart the ZooKeeper nodes one at a time by hand (a rolling restart, so the ensemble keeps serving throughout). Afterwards the ZooKeeper logs showed no further errors, and the HBase RegionServer recovered on its own.
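As a follow-up check from the RegionServer host, the established connections to ZooKeeper can be counted directly; after the fix this should sit well below the new 500 per-IP cap. A sketch:

```shell
#!/bin/sh
# Count established TCP connections from this host to any ZooKeeper server
# (port 2181). In `ss -tan` output, column 1 is the state and column 5 the
# peer address.
ss -tan 2>/dev/null | awk '$1 == "ESTAB" && $5 ~ /:2181$/' | wc -l
```

Watching this number over time would also have caught the leak before it hit the cap.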