
HBase RegionServer fails to start: port not listening

Created: 2022-01-07

Symptom: a routine inspection of the HDP platform found that port 16030 on one HBase RegionServer was consistently unreachable.
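
To reproduce the symptom by hand from another host, you can probe the port directly (a minimal sketch; 16030 is the default HBase RegionServer info/web UI port, so a plain HTTP probe is enough):

## Probe the RegionServer web UI port from any other machine; 3-second connect timeout
curl -sv --connect-timeout 3 -o /dev/null http://example05.hdpprd.example.com:16030/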

Log in to that machine to investigate.

## Check listening ports -- nothing is listening on 16030
netstat -tunlp | grep 16030
## Confirm that port 16030 is not occupied by some other program either
ss -an | grep 16030
## Check whether the process is running -- it is already up
[root@example05 hbase]# ps aux | grep -i regionserve
root       386  0.0  0.0 112712   988 pts/0    S+   08:38   0:00 grep --color=auto -i regionserve
hbase    61101  0.0  0.0 113188  1544 ?        S    08:30   0:00 bash /usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh --config /usr/hdp/current/hbase-regionserver/conf foreground_start regionserver
hbase    61115 98.4  6.5 44891440 17169096 ?   Sl   08:30   7:48 /usr/local/jdk1.8.0_112/bin/java -Dproc_regionserver -XX:OnOutOfMemoryError=kill -9 %p -Dhdp.version=2.6.5.0-292 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hbase/hs_err_pid%p.log -Djava.security.auth.login.config=/usr/hdp/current/hbase-regionserver/conf/hbase_client_jaas.conf -Djava.io.tmpdir=/tmp -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/hbase/gc.log-202201070830 -Xmn6144m -XX:CMSInitiatingOccupancyFraction=50 -XX:+UseCMSInitiatingOccupancyOnly -Xms30720m -Xmx30720m -Djava.security.auth.login.config=/usr/hdp/current/hbase-regionserver/conf/hbase_regionserver_jaas.conf -Djavax.security.auth.useSubjectCredsOnly=false -XX:MaxDirectMemorySize=12288m -Dhbase.log.dir=/var/log/hbase -Dhbase.log.file=hbase-hbase-regionserver-example05.hdpprd.example.com.log -Dhbase.home.dir=/usr/hdp/current/hbase-regionserver/bin/.. -Dhbase.id.str=hbase -Dhbase.root.logger=INFO,RFA -Djava.library.path=:/usr/hdp/2.6.5.0-292/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.6.5.0-292/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.6.5.0-292/hadoop/lib/native -Dhbase.security.logger=INFO,RFAS org.apache.hadoop.hbase.regionserver.HRegionServer start

At this point we know the process is running but is not listening on its port.
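
To double-check this from the process side, list the ports the JVM has actually bound (a sketch; 61115 is the RegionServer PID from the ps output above, and the grep keys on the pid= field in the ss -tnlp process column):

## List listening TCP sockets owned by the RegionServer JVM (PID 61115);
## empty output confirms the JVM has not bound any listening port yet
ss -tnlp | grep 61115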

## Check whether the process has GC trouble -- garbage collection looks normal
[root@example05 hbase]# /usr/local/jdk1.8.0_112/bin/jstat -gcutil <pid> 1s
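
If GC is clean but the process still has not bound its ports, a thread dump can reveal what startup is blocked on (a sketch using the same JDK path; run it as the hbase user so jstack can attach to the JVM):

## Dump thread stacks; look for threads stuck in ZooKeeper connection code
sudo -u hbase /usr/local/jdk1.8.0_112/bin/jstack <pid> | less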

Check the RegionServer log:

[root@example05 ~]# tail -f /var/log/hbase/hbase-hbase-regionserver-example05.hdpprd.example.com.log 
## ... partial output omitted
2022-01-07 07:29:59,378 INFO  [main-SendThread(hdpprdm03.hdpprd.example.com:2181)] zookeeper.ClientCnxn: Opening socket connection to server hdpprdm03.hdpprd.example.com/192.168.0.96:2181. Will attempt to SASL-authenticate using Login Context section 'Client'
2022-01-07 07:29:59,379 INFO  [main-SendThread(hdpprdm03.hdpprd.example.com:2181)] zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.0.37:37802, server: hdpprdm03.hdpprd.example.com/192.168.0.96:2181
2022-01-07 07:29:59,379 INFO  [main-SendThread(hdpprdm03.hdpprd.example.com:2181)] zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
2022-01-07 07:30:00,161 INFO  [main-SendThread(hdpprdm01.hdpprd.example.com:2181)] client.ZooKeeperSaslClient: Client will use GSSAPI as SASL mechanism.
2022-01-07 07:30:00,161 INFO  [main-SendThread(hdpprdm01.hdpprd.example.com:2181)] zookeeper.ClientCnxn: Opening socket connection to server hdpprdm01.hdpprd.example.com/192.168.0.92:2181. Will attempt to SASL-authenticate using Login Context section 'Client'
2022-01-07 07:30:00,162 INFO  [main-SendThread(hdpprdm01.hdpprd.example.com:2181)] zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.0.37:39804, server: hdpprdm01.hdpprd.example.com/192.168.0.92:2181
2022-01-07 07:30:00,162 INFO  [main-SendThread(hdpprdm01.hdpprd.example.com:2181)] zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect

The log is full of failed ZooKeeper connection attempts, repeating endlessly.
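
Since each failed attempt can leave sockets behind, it is worth counting how many connections this node currently holds toward each ZooKeeper server (a sketch; 2181 is the standard ZooKeeper client port, and column 5 of netstat -tn is the remote address):

## Count connections from this node to each ZooKeeper server
netstat -tn | grep ':2181' | awk '{print $5}' | sort | uniq -c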

Restarting the RegionServer on this node changed nothing.

Check the ZooKeeper service (and ZooKeeper's GC; the steps are omitted here, and nothing abnormal turned up).

## Connect to ZooKeeper with the command-line client to see whether it is healthy -- all three servers still accept connections
[root@hdpprdm01 ~]#  /usr/hdp/current/zookeeper-client/bin/zkCli.sh -server hdpprdm01.hdpprd.example.com:2181
[root@hdpprdm01 ~]#  /usr/hdp/current/zookeeper-client/bin/zkCli.sh -server hdpprdm02.hdpprd.example.com:2181
[root@hdpprdm01 ~]#  /usr/hdp/current/zookeeper-client/bin/zkCli.sh -server hdpprdm03.hdpprd.example.com:2181
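
zkCli.sh only proves that one more connection can be opened; ZooKeeper's four-letter-word commands give per-server health and connection statistics (a sketch; ruok, stat, and cons are standard in the ZooKeeper 3.4 line shipped with HDP 2.6, though newer ZooKeeper versions require whitelisting them):

## "imok" means the server is up and serving
echo ruok | nc hdpprdm01.hdpprd.example.com 2181
## Server stats, including the current connection count
echo stat | nc hdpprdm01.hdpprd.example.com 2181
## List every client connection, grouped and counted by source IP
echo cons | nc hdpprdm01.hdpprd.example.com 2181 | awk -F'[/:]' '/\// {print $2}' | sort | uniq -c | sort -rn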

Check the ZooKeeper log:

[root@hdpprdm01 ~]# tail -f /var/log/zookeeper/zookeeper.log
# ... partial output omitted
2022-01-07 08:26:13,225 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too many connections from /192.168.0.37 - max is 240
2022-01-07 08:26:13,226 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too many connections from /192.168.0.37 - max is 240
2022-01-07 08:26:13,233 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too many connections from /192.168.0.180 - max is 240
2022-01-07 08:26:13,237 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too many connections from /192.168.0.180 - max is 240
2022-01-07 08:26:13,238 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too many connections from /192.168.0.180 - max is 240
2022-01-07 08:26:13,241 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too many connections from /192.168.0.37 - max is 240
2022-01-07 08:26:13,251 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too many connections from /192.168.0.246 - max is 240
2022-01-07 08:26:13,257 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too many connections from /192.168.0.180 - max is 240
2022-01-07 08:26:13,258 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too many connections from /192.168.0.37 - max is 240
2022-01-07 08:26:13,259 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too many connections from /192.168.0.37 - max is 240

The connection count is exceeding the per-client maximum. In the HDP cluster's ZooKeeper configuration, change maxClientCnxns in Custom zoo.cfg to 500. (While we were at it, we also raised ZooKeeper Server Maximum Memory from 1024 to 3072 MB to head off future ZooKeeper memory problems; this was not the cause of this incident and is optional.)
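
On disk the Ambari-managed configuration renders as a plain key=value line (a sketch of the relevant zoo.cfg fragment; the path is the usual HDP location, and maxClientCnxns limits concurrent connections per client IP per server -- the log above shows the old limit of 240):

## /etc/zookeeper/conf/zoo.cfg (managed by Ambari -- change it in the UI, not by hand)
## Max concurrent connections a single client IP may hold to one server
maxClientCnxns=500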

Next, manually restart the ZooKeeper nodes one at a time (a rolling restart, so ZooKeeper keeps serving throughout). A second pass over the ZooKeeper logs showed no anomalies, and the HBase RegionServer recovered on its own.
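
To confirm the fix, the earlier checks can be re-run (a sketch combining the commands used above):

## Per-server connection count should now sit well below the new 500 limit
echo stat | nc hdpprdm01.hdpprd.example.com 2181 | grep -i connections
## Back on example05, the RegionServer info port should be listening again
netstat -tunlp | grep 16030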
