A Greenplum installation failure caused by the firewall: diagnosis and fix

I. Symptoms

20180201:15:06:25:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:----------------------------------------
20180201:15:06:25:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Greenplum Primary Segment Configuration
20180201:15:06:25:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:----------------------------------------
20180201:15:06:25:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-sdw1-1 /home/primary/gpseg0 40000 2 0 41000
20180201:15:06:25:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-sdw1-1 /home/primary/gpseg1 40001 3 1 41001
20180201:15:06:25:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-sdw1-2 /home/primary/gpseg2 40000 4 2 41000
20180201:15:06:25:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-sdw1-2 /home/primary/gpseg3 40001 5 3 41001
20180201:15:06:25:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:---------------------------------------
20180201:15:06:25:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Greenplum Mirror Segment Configuration
20180201:15:06:25:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:---------------------------------------
20180201:15:06:25:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-sdw1-2 /home/mirror/gpseg0 50000 6 0 51000
20180201:15:06:25:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-sdw1-2 /home/mirror/gpseg1 50001 7 1 51001
20180201:15:06:25:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-sdw1-1 /home/mirror/gpseg2 50000 8 2 51000
20180201:15:06:25:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-sdw1-1 /home/mirror/gpseg3 50001 9 3 51001
Continue with Greenplum creation Yy/Nn>
y
20180201:15:06:28:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Building the Master instance database, please wait...
20180201:15:06:38:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Starting the Master in admin mode
20180201:15:06:46:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Commencing parallel build of primary segment instances
20180201:15:06:46:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Spawning parallel processes batch [1], please wait...
....
20180201:15:06:46:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Waiting for parallel processes batch [1], please wait...
........................
20180201:15:07:10:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:------------------------------------------------
20180201:15:07:10:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Parallel process exit status
20180201:15:07:10:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:------------------------------------------------
20180201:15:07:10:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Total processes marked as completed = 4
20180201:15:07:10:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Total processes marked as killed = 0
20180201:15:07:10:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Total processes marked as failed = 0
20180201:15:07:10:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:------------------------------------------------
20180201:15:07:10:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Commencing parallel build of mirror segment instances
20180201:15:07:10:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Spawning parallel processes batch [1], please wait...
....
20180201:15:07:11:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Waiting for parallel processes batch [1], please wait...
....
20180201:15:07:15:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:------------------------------------------------
20180201:15:07:15:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Parallel process exit status
20180201:15:07:15:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:------------------------------------------------
20180201:15:07:15:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Total processes marked as completed = 0
20180201:15:07:15:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Total processes marked as killed = 0
20180201:15:07:15:028653 gpinitsystem:sdw1-2:gpadmin-[WARN]:-Total processes marked as failed = 4 <<<<<
20180201:15:07:15:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:------------------------------------------------
20180201:15:07:15:028653 gpinitsystem:sdw1-2:gpadmin-[FATAL]:-Errors generated from parallel processes
20180201:15:07:15:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Dumped contents of status file to the log file
20180201:15:07:15:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Building composite backout file
20180201:15:07:15:gpinitsystem:sdw1-2:gpadmin-[FATAL]:-Failures detected, see log file /home/gpadmin/gpAdminLogs/gpinitsystem_20180201.log for more detail Script Exiting!
20180201:15:07:15:028653 gpinitsystem:sdw1-2:gpadmin-[WARN]:-Script has left Greenplum Database in an incomplete state
20180201:15:07:15:028653 gpinitsystem:sdw1-2:gpadmin-[WARN]:-Run command /bin/bash /home/gpadmin/gpAdminLogs/backout_gpinitsystem_gpadmin_20180201_150615 to remove these changes
20180201:15:07:15:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Start Function BACKOUT_COMMAND
20180201:15:07:15:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-End Function BACKOUT_COMMAND

During cluster initialization, the parallel build of the mirror segments failed (all four marked as failed) and gpinitsystem aborted.

Checking the files and logs on all segment hosts turned up no obvious errors.

II. Diagnosis

1. Check the master log

As the output suggests, inspect /home/gpadmin/gpAdminLogs/gpinitsystem_20180201.log:

20180201:15:07:12:015183 gpcreateseg.sh:sdw1-2:gpadmin-[INFO]:-End Function BACKOUT_COMMAND
20180201:15:07:12:015183 gpcreateseg.sh:sdw1-2:gpadmin-[INFO][3]:-Completed to start segment instance database sdw1-1 /home/mirror/gpseg3
20180201:15:07:12:015183 gpcreateseg.sh:sdw1-2:gpadmin-[INFO]:-Copying data for mirror on sdw1-1 using remote copy from primary sdw1-2 ...
20180201:15:07:12:015183 gpcreateseg.sh:sdw1-2:gpadmin-[INFO]:-Start Function RUN_COMMAND_REMOTE
20180201:15:07:12:015183 gpcreateseg.sh:sdw1-2:gpadmin-[INFO]:-Commencing remote /bin/ssh sdw1-2 export GPHOME=/usr/local/gpdb; . /usr/local/gpdb/greenplum_path.sh; /usr/local/gpdb/bin/lib/pysync.py -x pg_log -x postgresql.conf -x postmaster.pid /home/primary/gpseg3 \[sdw1-1\]:/home/mirror/gpseg3
Killed by signal 1.
Killed by signal 1.
Killed by signal 1.
Traceback (most recent call last):
File "/usr/local/gpdb/bin/lib/pysync.py", line 669, in <module>
sys.exit(LocalPysync(sys.argv, progressTimestamp=True).run())
File "/usr/local/gpdb/bin/lib/pysync.py", line 647, in run
code = self.work()
File "/usr/local/gpdb/bin/lib/pysync.py", line 611, in work
self.socket.connect(self.connectAddress)
File "/usr/lib64/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
socket.error: [Errno 113] No route to host
20180201:15:07:15:014991 gpcreateseg.sh:sdw1-2:gpadmin-[FATAL]:- Command export GPHOME=/usr/local/gpdb; . /usr/local/gpdb/greenplum_path.sh; /usr/local/gpdb/bin/lib/pysync.py -x pg_log -x postgresql.conf -x postmaster.pid /home/primary/gpseg2 \[sdw1-1\]:/home/mirror/gpseg2 on sdw1-2 failed with error status 1
20180201:15:07:15:014991 gpcreateseg.sh:sdw1-2:gpadmin-[INFO]:-End Function RUN_COMMAND_REMOTE
20180201:15:07:15:014991 gpcreateseg.sh:sdw1-2:gpadmin-[FATAL][2]:-Failed remote copy of segment data directory from sdw1-2 to sdw1-1
Killed by signal 1.
Traceback (most recent call last):
File "/usr/local/gpdb/bin/lib/pysync.py", line 669, in <module>
sys.exit(LocalPysync(sys.argv, progressTimestamp=True).run())
File "/usr/local/gpdb/bin/lib/pysync.py", line 647, in run
code = self.work()
File "/usr/local/gpdb/bin/lib/pysync.py", line 611, in work
self.socket.connect(self.connectAddress)
File "/usr/lib64/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
socket.error: [Errno 113] No route to host
Traceback (most recent call last):
File "/usr/local/gpdb/bin/lib/pysync.py", line 669, in <module>
sys.exit(LocalPysync(sys.argv, progressTimestamp=True).run())
File "/usr/local/gpdb/bin/lib/pysync.py", line 647, in run
code = self.work()
File "/usr/local/gpdb/bin/lib/pysync.py", line 611, in work
self.socket.connect(self.connectAddress)
File "/usr/lib64/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
socket.error: [Errno 113] No route to host

The key message here is: No route to host

Taken at face value, this suggests that some machines in the cluster cannot reach each other, so I checked:

  • whether all hosts can reach one another
  • whether the hosts file on every machine is configured correctly
  • whether passwordless SSH is set up between all hosts

All of these checks came back normal (a sketch of them follows below).
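A minimal sketch of these checks, run from any node. The hostnames come from the log above; the expectation is that every command succeeds without prompting:

for h in sdw1-1 sdw1-2; do
    ping -c 1 -W 2 "$h" >/dev/null && echo "$h: reachable" || echo "$h: UNREACHABLE"   # basic reachability
    getent hosts "$h"                                                                  # name resolution via /etc/hosts
    ssh -o BatchMode=yes "$h" hostname || echo "$h: passwordless SSH broken"           # must not ask for a password
done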

2. Check the segment directories

Commencing remote /bin/ssh sdw1-2 export GPHOME=/usr/local/gpdb; . /usr/local/gpdb/greenplum_path.sh; /usr/local/gpdb/bin/lib/pysync.py -x pg_log -x postgresql.conf -x postmaster.pid /home/primary/gpseg3 \[sdw1-1\]:/home/mirror/gpseg3

Based on this failing command, locate the primary and mirror directories on sdw1-1 and sdw1-2 and check:

  • whether the directories were created properly
  • whether their contents are complete
  • whether the permissions are correct (the data directories should be owned by the database administrator account, gpadmin)

All of these checks also came out clean (a quick way to confirm this is shown below).
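A one-liner from any host, using the directory layout from the configuration printed at the top; every directory should exist and be owned by gpadmin:

for h in sdw1-1 sdw1-2; do
    echo "== $h =="
    ssh "$h" 'ls -ld /home/primary/gpseg* /home/mirror/gpseg* 2>/dev/null'   # owner should be gpadmin, mode drwx------
done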

3. Check the segment log files

After going through all of the relevant segment log files, the segments themselves had started normally; the only notable entries were:
2018-02-01 15:07:07.854785 CST,,,p9642,th708450368,,,,0,,,seg-1,,,,,"LOG","00000","database system is ready to accept connections","PostgreSQL 8.3.23 (Greenplum Database 4.3.99.00 build dev) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4) compiled on Jan 18 2018 15:33:53 (with assert checking)",,,,,,0,,"postmaster.c",4337,
2018-02-01 15:07:08.853415 CST,,,p9642,th708450368,,,,0,,,seg-1,,,,,"LOG","00000","received smart shutdown request",,,,,,,0,,"postmaster.c",4075,
2018-02-01 15:07:08.855196 CST,,,p9664,th708450368,,,,0,,,seg-1,,,,,"LOG","00000","shutting down",,,,,,,0,,"xlog.c",8616,
2018-02-01 15:07:08.863175 CST,,,p9664,th708450368,,,,0,,,seg-1,,,,,"LOG","00000","database system is shut down",,,,,,,0,,"xlog.c",8632,

This shows that the cluster never finished initializing, but also that the problem was not on the segment side: each instance came up ("ready to accept connections") and was then cleanly shut down, presumably by gpinitsystem's cleanup.

At this point there was little left to go on. Going back to the official documentation, Chapter 3 of https://www.emc.com/collateral/TechnicalDocument/docu51071.pdf notes that the firewall must be disabled, so I checked the firewall state with:

systemctl status firewalld.service

The firewall turned out to be running.
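To check every node at once instead of logging in one by one, Greenplum's own gpssh can broadcast the command; all_hosts is an assumed file listing every hostname in the cluster:

gpssh -f all_hosts -e 'systemctl is-active firewalld'   # prints "active" on any node where the firewall is running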

III. Solution

1. Roll back the installation

The installation log provides the following command to roll back the partial install:

/bin/bash /home/gpadmin/gpAdminLogs/backout_gpinitsystem_gpadmin_20180201_150615

2. Configure the firewall

systemctl stop firewalld.service
systemctl disable firewalld.service    # prevent firewalld from starting again at boot

Note: the commands above apply to CentOS 7.
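For reference, on CentOS 6, which manages iptables through SysV init rather than firewalld under systemd, the equivalents would be:

service iptables stop
chkconfig iptables off    # prevent iptables from starting at boot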

3. Re-run the installation steps
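That is, after the backout script has run and the firewall is stopped on every host, invoke gpinitsystem again with the same configuration as the first attempt; the file names here are placeholders for whatever was used originally:

gpinitsystem -c gpinitsystem_config -h hostfile_segments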

Open questions: I have not yet dug into exactly why the firewall must be disabled, and I have also not yet managed to deploy the cluster with the firewall left enabled.
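One likely explanation for the error text itself: firewalld's default zone rejects unsolicited packets with an icmp-host-prohibited message, and a TCP client receiving that ICMP error reports errno 113, "No route to host", even though routing is perfectly fine, which matches the pysync traceback above. For anyone who wants to experiment with keeping the firewall on, an untested sketch of opening the port ranges from the configuration at the top (run on every host) might look like this; note that pysync data transfers and the interconnect can use ports outside these ranges, which may be why opening them alone has not been enough:

firewall-cmd --permanent --add-port=5432/tcp          # master (default port, assumed; not shown in the log above)
firewall-cmd --permanent --add-port=40000-40001/tcp   # primary segments
firewall-cmd --permanent --add-port=41000-41001/tcp   # primary replication ports
firewall-cmd --permanent --add-port=50000-50001/tcp   # mirror segments
firewall-cmd --permanent --add-port=51000-51001/tcp   # mirror replication ports
firewall-cmd --reload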
