Redis High-Availability Cluster in Practice

I. Environment

Redis version: Redis-5.0.8
OS version: Red Hat 6.8

Hostname     IP
redispri1    110.50.1.55
redispri2    110.50.1.56
redispri3    110.50.1.57
redisbck1    110.50.1.58
redisbck2    110.50.1.59
redisbck3    110.50.1.60

II. Building the Redis high-availability cluster

  1. On all nodes, stop NetworkManager and disable SELinux
    # /etc/init.d/NetworkManager stop
    # chkconfig NetworkManager off
    # sed -i "s/SELINUX=.*/SELINUX=disabled/g" /etc/selinux/config && setenforce 0
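    #Optional sanity check (getenforce and chkconfig are standard on RHEL 6):
    # getenforce                        #should print Permissive now, Disabled after a reboot
    # chkconfig --list NetworkManager   #all runlevels should show off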
  2. On all nodes, install Redis and the build dependencies gcc, gcc-c++, libstdc++-devel
    # tar xvf redis-5.0.8.tar.gz -C /usr/local/
    # mv /usr/local/redis-5.0.8/ /usr/local/redis/
    # yum install  -y gcc gcc-c++ libstdc++-devel
    # cd /usr/local/redis/src && make MALLOC=libc && make install prefix=/usr/local/redis
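    #Optionally confirm the build; make install places the binaries in /usr/local/bin by default:
    # /usr/local/bin/redis-server --version   #should report version 5.0.8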
  3. On all nodes, configure iptables to open ports 6379 and 16379
    # iptables -I INPUT -p TCP --dport 6379 -j ACCEPT
    # iptables -I INPUT -p TCP --dport 16379 -j ACCEPT
    # /etc/init.d/iptables save
    # chkconfig iptables on
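    #Optionally confirm the rules are in place:
    # iptables -nL INPUT | grep -E '6379|16379'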
  4. Cluster configuration
    #Create directories for backup, data, log, and AOF file storage
    # mkdir -p /appdata/redis/{db,file,log,aof}
    #Adjust the configuration file parameters
    # cp /usr/local/redis/redis.conf /usr/local/redis/redis.conf.bak
    # vim /usr/local/redis/redis_6379.conf
    bind <IP of the current node>
    port 6379
    daemonize yes
    pidfile /var/run/redis_6379.pid
    #Use debug while validating; use notice in production
    loglevel debug       
    logfile /appdata/redis/log/redis.log
    save 900 1
    save 300 10
    save 60 10000
    dbfilename dump.rdb
    dir /appdata/redis/file/
    requirepass 202004
    maxclients 10000
    maxmemory 100M
    appendonly yes
    appendfilename appendonly.aof
    appendfsync everysec
    auto-aof-rewrite-percentage 100
    auto-aof-rewrite-min-size 64mb
    cluster-enabled yes
    cluster-config-file nodes-6379.conf
    cluster-node-timeout 5000
    #Disable FLUSHALL (wiping all databases)
    rename-command FLUSHALL ""
    #Disable FLUSHDB (wiping the current database)
    rename-command FLUSHDB ""
    #Optionally disable CONFIG so connected clients cannot reconfigure the server
    #rename-command CONFIG ""
    #Disable KEYS so connected clients cannot list every key
    rename-command KEYS ""
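    #The same file can then be pushed to the other nodes, changing only the bind address.
    #A minimal sketch, assuming root SSH access between the nodes:
    # for ip in 110.50.1.56 110.50.1.57 110.50.1.58 110.50.1.59 110.50.1.60; do sed "s/^bind .*/bind $ip/" /usr/local/redis/redis_6379.conf | ssh $ip "cat > /usr/local/redis/redis_6379.conf"; done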
  5. Start Redis on each node
    # /usr/local/bin/redis-server /usr/local/redis/redis_6379.conf
    #Files installed under /usr/local/bin
    redis-benchmark  #performance testing tool
    redis-check-aof  #checks/repairs AOF persistence files (AOF appends every write as it happens)
    redis-check-rdb  #checks RDB persistence files (RDB snapshots the dataset at intervals)
    redis-cli        #command-line client for connecting to Redis
    redis-sentinel -> redis-server  #symlink to redis-server; runs Redis Sentinel for monitoring/failover
    redis-server     #starts the Redis server
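    #Optionally confirm every node answers; a sketch assuming the requirepass value from step 4
    #(drop -a 202004 if no password is set yet, see Problem 3 in section IV):
    # for ip in 110.50.1.55 110.50.1.56 110.50.1.57 110.50.1.58 110.50.1.59 110.50.1.60; do redis-cli -h $ip -p 6379 -a 202004 ping; done
    #Each node should reply PONG.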
  6. Create the cluster
    # redis-cli --cluster create --cluster-replicas 1 110.50.1.55:6379 110.50.1.56:6379 110.50.1.57:6379 110.50.1.58:6379 110.50.1.59:6379 110.50.1.60:6379
    >>> Performing hash slots allocation on 6 nodes...
    Master[0] -> Slots 0 - 5460
    Master[1] -> Slots 5461 - 10922
    Master[2] -> Slots 10923 - 16383
    Adding replica 110.50.1.59:6379 to 110.50.1.55:6379
    Adding replica 110.50.1.60:6379 to 110.50.1.56:6379
    Adding replica 110.50.1.58:6379 to 110.50.1.57:6379
    M: d851fcaf25e413e9e58d0ad96c91ac2d7b4ae515 110.50.1.55:6379
    slots:[0-5460] (5461 slots) master
    M: ec79401490c99d77c878c9a8b7cfb08fe237b949 110.50.1.56:6379
    slots:[5461-10922] (5462 slots) master
    M: 6a75af4717e1890a632ffffcad4d1e8b98cebcab 110.50.1.57:6379
    slots:[10923-16383] (5461 slots) master
    S: b38d96efe145a810dd48e3349002794c98649d72 110.50.1.58:6379
    replicates 6a75af4717e1890a632ffffcad4d1e8b98cebcab
    S: 00d4fdf753d8097637d3c96b1da6abe43ebb7e59 110.50.1.59:6379
    replicates d851fcaf25e413e9e58d0ad96c91ac2d7b4ae515
    S: f66e8c1f3e7e3cd35a86b11dd0506b1fc24691fe 110.50.1.60:6379
    replicates ec79401490c99d77c878c9a8b7cfb08fe237b949
    Can I set the above configuration? (type 'yes' to accept): yes
    >>> Nodes configuration updated
    >>> Assign a different config epoch to each node
    >>> Sending CLUSTER MEET messages to join the cluster
    Waiting for the cluster to join
    ..
    >>> Performing Cluster Check (using node 110.50.1.55:6379)
    M: d851fcaf25e413e9e58d0ad96c91ac2d7b4ae515 110.50.1.55:6379
    slots:[0-5460] (5461 slots) master
    1 additional replica(s)
    S: b38d96efe145a810dd48e3349002794c98649d72 110.50.1.58:6379
    slots: (0 slots) slave
    replicates 6a75af4717e1890a632ffffcad4d1e8b98cebcab
    S: 00d4fdf753d8097637d3c96b1da6abe43ebb7e59 110.50.1.59:6379
    slots: (0 slots) slave
    replicates d851fcaf25e413e9e58d0ad96c91ac2d7b4ae515
    M: 6a75af4717e1890a632ffffcad4d1e8b98cebcab 110.50.1.57:6379
    slots:[10923-16383] (5461 slots) master
    1 additional replica(s)
    S: f66e8c1f3e7e3cd35a86b11dd0506b1fc24691fe 110.50.1.60:6379
    slots: (0 slots) slave
    replicates ec79401490c99d77c878c9a8b7cfb08fe237b949
    M: ec79401490c99d77c878c9a8b7cfb08fe237b949 110.50.1.56:6379
    slots:[5461-10922] (5462 slots) master
    1 additional replica(s)
    [OK] All nodes agree about slots configuration.
    >>> Check for open slots...
    >>> Check slots coverage...
    [OK] All 16384 slots covered.
  7. Check the cluster status
    # redis-cli -c -h 110.50.1.55 -p 6379
    # List cluster nodes
    110.50.1.55:6379> cluster nodes
    b38d96efe145a810dd48e3349002794c98649d72 110.50.1.58: slave 6a75af4717e1890a632ffffcad4d1e8b98cebcab 0 1587694145212 4 connected
    00d4fdf753d8097637d3c96b1da6abe43ebb7e59 110.50.1.59: slave d851fcaf25e413e9e58d0ad96c91ac2d7b4ae515 0 1587694145000 5 connected
    6a75af4717e1890a632ffffcad4d1e8b98cebcab 110.50.1.57: master - 0 1587694145000 3 connected 10923-16383
    f66e8c1f3e7e3cd35a86b11dd0506b1fc24691fe 110.50.1.60: slave ec79401490c99d77c878c9a8b7cfb08fe237b949 0 1587694146623 6 connected
    d851fcaf25e413e9e58d0ad96c91ac2d7b4ae515 110.50.1.55: myself,master - 0 1587694146000 1 connected 0-5460
    ec79401490c99d77c878c9a8b7cfb08fe237b949 110.50.1.56: master - 0 1587694146219 2 connected 5461-10922
    # Show cluster info
    110.50.1.55:6379> cluster info
    cluster_state:ok
    cluster_slots_assigned:16384
    cluster_slots_ok:16384
    cluster_slots_pfail:0
    cluster_slots_fail:0
    cluster_known_nodes:6
    cluster_size:3
    cluster_current_epoch:6
    cluster_my_epoch:1
    cluster_stats_messages_ping_sent:828
    cluster_stats_messages_pong_sent:826
    cluster_stats_messages_sent:1654
    cluster_stats_messages_ping_received:821
    cluster_stats_messages_pong_received:828
    cluster_stats_messages_meet_received:5
    cluster_stats_messages_received:1654
    # Show cluster slot assignments
    110.50.1.55:6379> cluster slots
    1) 1) (integer) 10923
       2) (integer) 16383
       3) 1) "110.50.1.57"
          2) (integer) 6379
          3) "6a75af4717e1890a632ffffcad4d1e8b98cebcab"
       4) 1) "110.50.1.58"
          2) (integer) 6379
          3) "b38d96efe145a810dd48e3349002794c98649d72"
    2) 1) (integer) 0
       2) (integer) 5460
       3) 1) "110.50.1.55"
          2) (integer) 6379
          3) "d851fcaf25e413e9e58d0ad96c91ac2d7b4ae515"
       4) 1) "110.50.1.59"
          2) (integer) 6379
          3) "00d4fdf753d8097637d3c96b1da6abe43ebb7e59"
    3) 1) (integer) 5461
       2) (integer) 10922
       3) 1) "110.50.1.56"
          2) (integer) 6379
          3) "ec79401490c99d77c878c9a8b7cfb08fe237b949"
       4) 1) "110.50.1.60"
          2) (integer) 6379
          3) "f66e8c1f3e7e3cd35a86b11dd0506b1fc24691fe"
  8. Set the cluster password
    requirepass: the password that external services and clients use to connect to Redis.
    masterauth: the password a Redis replica uses when connecting to its Redis master.
    In other words, if requirepass is set on the master, masterauth must be set on the replica to the same value; otherwise the replica can no longer replicate from the master.
    Method 1: add the directives to the configuration file (requires a restart)
    requirepass "passwd"
    masterauth "passwd"
    Method 2: log in to Redis and run the following commands (no service restart required)
    config rewrite persists the config set changes into the Redis configuration file
    config set requirepass "passwd"
    config set masterauth "passwd"
    config rewrite
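    A minimal sketch for applying method 2 to every node in one pass (assuming no password is set yet and the nodes/port from section I):
    # for ip in 110.50.1.55 110.50.1.56 110.50.1.57 110.50.1.58 110.50.1.59 110.50.1.60; do
    #   redis-cli -h $ip -p 6379 config set requirepass "passwd"
    #   redis-cli -h $ip -p 6379 -a "passwd" config set masterauth "passwd"
    #   redis-cli -h $ip -p 6379 -a "passwd" config rewrite
    # done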
  9. Failover tests and conclusions

Test 1: randomly shut down one master node. Its replica is promoted to master and the cluster keeps serving requests normally.

#Healthy cluster
110.50.1.55:6379> cluster nodes
d851fcaf25e413e9e58d0ad96c91ac2d7b4ae515 110.50.1.55: myself,master - 0 1587699843000 9 connected 0-5460
00d4fdf753d8097637d3c96b1da6abe43ebb7e59 110.50.1.59: slave d851fcaf25e413e9e58d0ad96c91ac2d7b4ae515 0 1587699845503 9 connected
b38d96efe145a810dd48e3349002794c98649d72 110.50.1.58: slave 6a75af4717e1890a632ffffcad4d1e8b98cebcab 0 1587699844597 4 connected
ec79401490c99d77c878c9a8b7cfb08fe237b949 110.50.1.56: master - 0 1587699845504 10 connected 5461-10922
f66e8c1f3e7e3cd35a86b11dd0506b1fc24691fe 110.50.1.60: slave ec79401490c99d77c878c9a8b7cfb08fe237b949 0 1587699844902 10 connected
6a75af4717e1890a632ffffcad4d1e8b98cebcab 110.50.1.57: master - 0 1587699844000 3 connected 10923-16383

#Shut down master node 110.50.1.57
#Its state becomes master,fail and its former replica 110.50.1.58 is promoted to master
110.50.1.55:6379> cluster nodes
d851fcaf25e413e9e58d0ad96c91ac2d7b4ae515 110.50.1.55: myself,master - 0 1587699968000 9 connected 0-5460
00d4fdf753d8097637d3c96b1da6abe43ebb7e59 110.50.1.59: slave d851fcaf25e413e9e58d0ad96c91ac2d7b4ae515 0 1587699967723 9 connected
b38d96efe145a810dd48e3349002794c98649d72 110.50.1.58: master - 0 1587699969535 12 connected 10923-16383
ec79401490c99d77c878c9a8b7cfb08fe237b949 110.50.1.56: master - 0 1587699967519 10 connected 5461-10922
f66e8c1f3e7e3cd35a86b11dd0506b1fc24691fe 110.50.1.60: slave ec79401490c99d77c878c9a8b7cfb08fe237b949 0 1587699969739 10 connected
6a75af4717e1890a632ffffcad4d1e8b98cebcab 110.50.1.57: master,fail - 1587699941783 1587699941000 3 disconnected

#Set a key and read it from random nodes; the cluster is still available

110.50.1.55:6379> set helloworld 111
OK
110.50.1.58:6379> get helloworld
-> Redirected to slot [2739] located at 110.50.1.55:6379
"111"
110.50.1.59:6379> get helloworld
-> Redirected to slot [2739] located at 110.50.1.55:6379
"111"

Test 2: randomly shut down one replica node by hand. The cluster keeps serving requests normally.

#Shut down replica node 110.50.1.59
#Its state becomes slave,fail
110.50.1.56:6379> cluster nodes
00d4fdf753d8097637d3c96b1da6abe43ebb7e59 110.50.1.59: slave,fail d851fcaf25e413e9e58d0ad96c91ac2d7b4ae515 1587703268981 1587703268000 9 disconnected
f66e8c1f3e7e3cd35a86b11dd0506b1fc24691fe 110.50.1.60: slave ec79401490c99d77c878c9a8b7cfb08fe237b949 0 1587703278821 10 connected
b38d96efe145a810dd48e3349002794c98649d72 110.50.1.58: slave 6a75af4717e1890a632ffffcad4d1e8b98cebcab 0 1587703278000 13 connected
ec79401490c99d77c878c9a8b7cfb08fe237b949 110.50.1.56: myself,master - 0 1587703277000 10 connected 5461-10922
d851fcaf25e413e9e58d0ad96c91ac2d7b4ae515 110.50.1.55: master - 0 1587703277819 9 connected 0-5460
6a75af4717e1890a632ffffcad4d1e8b98cebcab 110.50.1.57: master - 0 1587703278000 13 connected 10923-16383

#Set a key and read it from random nodes; the cluster is still available

110.50.1.56:6379> set helloworld 222
-> Redirected to slot [2739] located at 110.50.1.55:6379
OK
110.50.1.58:6379> get helloworld
-> Redirected to slot [2739] located at 110.50.1.55:6379
"222"

Test 3: shut down a master and its replica together.

#Shut down the master/replica pair 110.50.1.57 and 110.50.1.58
#Writes fail with (error) CLUSTERDOWN The cluster is down; the cluster is no longer available
110.50.1.55:6379> cluster nodes
d851fcaf25e413e9e58d0ad96c91ac2d7b4ae515 110.50.1.55: myself,master - 0 1587703705000 9 connected 0-5460
00d4fdf753d8097637d3c96b1da6abe43ebb7e59 110.50.1.59: slave d851fcaf25e413e9e58d0ad96c91ac2d7b4ae515 0 1587703706000 9 connected
b38d96efe145a810dd48e3349002794c98649d72 110.50.1.58: slave,fail 6a75af4717e1890a632ffffcad4d1e8b98cebcab 1587703697645 1587703697546 13 disconnected
ec79401490c99d77c878c9a8b7cfb08fe237b949 110.50.1.56: master - 0 1587703705557 10 connected 5461-10922
f66e8c1f3e7e3cd35a86b11dd0506b1fc24691fe 110.50.1.60: slave ec79401490c99d77c878c9a8b7cfb08fe237b949 0 1587703706165 10 connected
6a75af4717e1890a632ffffcad4d1e8b98cebcab 110.50.1.57: master,fail - 1587703692596 1587703691995 13 disconnected 10923-16383
110.50.1.55:6379> set helloworld 333
(error) CLUSTERDOWN The cluster is down

Test 4: shut down two master nodes.

#Shut down the two master nodes 110.50.1.57 and 110.50.1.56
#Writes fail with (error) CLUSTERDOWN The cluster is down; the cluster is no longer available
110.50.1.55:6379> cluster nodes
d851fcaf25e413e9e58d0ad96c91ac2d7b4ae515 110.50.1.55: myself,master - 0 1587704575000 9 connected 0-5460
00d4fdf753d8097637d3c96b1da6abe43ebb7e59 110.50.1.59: slave d851fcaf25e413e9e58d0ad96c91ac2d7b4ae515 0 1587704575494 9 connected
b38d96efe145a810dd48e3349002794c98649d72 110.50.1.58: slave 6a75af4717e1890a632ffffcad4d1e8b98cebcab 0 1587704573466 13 connected
ec79401490c99d77c878c9a8b7cfb08fe237b949 110.50.1.56: master,fail? - 1587704561741 1587704560000 10 disconnected 5461-10922
f66e8c1f3e7e3cd35a86b11dd0506b1fc24691fe 110.50.1.60: slave ec79401490c99d77c878c9a8b7cfb08fe237b949 0 1587704575496 10 connected
6a75af4717e1890a632ffffcad4d1e8b98cebcab 110.50.1.57: master,fail? - 1587704566981 1587704566000 13 disconnected 10923-16383
110.50.1.55:6379> set helloworld 444
(error) CLUSTERDOWN The cluster is down

Test conclusions:
1) How the cluster decides that a node is down and elects a replacement
Every node in a Redis cluster stores information about all of the cluster's masters and replicas, and the nodes probe each other with ping/pong messages. If more than half of the nodes get no response when pinging a node, the cluster marks that node as failed, then contacts its replica and promotes it to be the new master.
2) When the cluster becomes unavailable
a. If any master goes down and that master has no slave, the cluster is unavailable.
b. If more than half of the masters go down, the cluster is unavailable whether or not they have slaves.
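To watch this behaviour while running the tests above, a simple loop against any surviving node is enough (a sketch):

# while true; do redis-cli -c -h 110.50.1.55 -p 6379 cluster info | grep cluster_state; sleep 1; done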

III. Redis's built-in benchmark tool

# redis-benchmark -h 110.50.1.59 -p 6382 -t set,get -c 10 -n 1000000
====== SET ======
  1000000 requests completed in 45.44 seconds
  10 parallel clients
  3 bytes payload
  keep alive: 1

89.95% <= 1 milliseconds
99.99% <= 2 milliseconds
100.00% <= 3 milliseconds
100.00% <= 3 milliseconds
22005.59 requests per second

====== GET ======
  1000000 requests completed in 45.68 seconds
  10 parallel clients
  3 bytes payload
  keep alive: 1

90.05% <= 1 milliseconds
99.99% <= 2 milliseconds
100.00% <= 3 milliseconds
100.00% <= 5 milliseconds
100.00% <= 7 milliseconds
21891.90 requests per second
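redis-benchmark in this version is not cluster-aware, so it only exercises the single node it connects to; it can be pointed at any cluster node, adding -a if a password has been set. A sketch against one of the masters built above:

# redis-benchmark -h 110.50.1.55 -p 6379 -a 202004 -t set,get -c 10 -n 100000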

IV. Problems encountered during setup and their solutions

Problem 1: compile error

# make && make install prefix=/usr/local/redis/
make[1]: Entering directory `/usr/local/redis/src'
    CC adlist.o
In file included from adlist.c:34:
zmalloc.h:50:31: error: jemalloc/jemalloc.h: No such file or directory
zmalloc.h:55:2: error: #error "Newer version of jemalloc required"
make[1]: *** [adlist.o] Error 1
make[1]: Leaving directory `/usr/local/redis/src'
make: *** [all] Error 2

1) The src directory contains the sources for the redis server and redis cli binaries; the build must be run from this directory.
2) The README contains the following passage:

Allocator
Selecting a non-default memory allocator when building Redis is done by setting
the MALLOC environment variable. Redis is compiled and linked against libc
malloc by default, with the exception of jemalloc being the default on Linux systems. This default was picked because jemalloc has proven to have fewer
fragmentation problems than libc malloc.
To force compiling against libc malloc, use:
% make MALLOC=libc
To compile against jemalloc on Mac OS X systems, use:
% make MALLOC=jemalloc

Allocator: if the MALLOC environment variable is set, it is used to build Redis. The default allocator on Linux is jemalloc, because jemalloc has fewer fragmentation problems than libc malloc. The build fails here because the system only has libc and no jemalloc.

Solutions:
(1) Pass the MALLOC parameter:
make MALLOC=libc
(2) Download and install jemalloc:
https://github.com/jemalloc/jemalloc/releases
./configure && make && make install
Problem 2: startup errors

# dbfilename and appendfilename must not contain a path
# /usr/local/bin/redis-server /usr/local/redis/redis_6379.conf 
*** FATAL CONFIG FILE ERROR ***
Reading the configuration file, at line 253
>>> 'dbfilename "/appdata/redis/db/dump.rdb"'
dbfilename can't be a path, just a filename

*** FATAL CONFIG FILE ERROR ***
Reading the configuration file, at line 706
>>> 'appendfilename "/appdata/redis/aof/appendonly.aof"'
appendfilename can't be a path, just a filename

# A comment must not follow a rename-command directive on the same line
# /usr/local/bin/redis-server /usr/local/redis/redis_6379.conf
*** FATAL CONFIG FILE ERROR ***
Reading the configuration file, at line 325
>>> 'rename-command FLUSHALL ""   # disable FLUSHALL'
Bad directive or wrong number of arguments
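
Both startup errors are avoided by keeping a bare filename in dbfilename/appendfilename (the directory comes from dir) and by putting comments on their own line, as in the configuration in section II:

dir /appdata/redis/file/
dbfilename dump.rdb
appendfilename appendonly.aof
# Disable FLUSHALL
rename-command FLUSHALL ""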

Problem 3: connection error when creating the cluster

Whether the cluster is created with redis-cli from Redis 5.x or with the older Ruby-based tool, a Redis password must not be configured at the cluster-creation step;
if a password is set, redis-cli --cluster create fails with an authentication error:
 # redis-cli --cluster create --cluster-replicas 1 110.50.1.55:6379 110.50.1.56:6379 110.50.1.57:6379 110.50.1.58:6379 110.50.1.59:6379 110.50.1.60:6379
[ERR] Node 110.50.1.55:6379 NOAUTH Authentication required.
##################################################
Solution
Remove the password from every Redis node before running redis-cli --cluster create.
Once the cluster has been created, set the password dynamically on each node with config set (no Redis restart needed, and after config rewrite the setting survives restarts):
redis-cli -h 127.0.0.1 -p 6379 -c
127.0.0.1:6379> config set requirepass 'password'   // set the client password
127.0.0.1:6379> config set masterauth 'password'    // set the replication password
127.0.0.1:6379> config rewrite                      // write the config set changes into the configuration file
Connect to the cluster after the password has been set:
redis-cli -h 127.0.0.1 -p 6379 -c -a password
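
Once the password is in place, the --cluster subcommands can still be used by passing -a as well, for example (a sketch; assumes this redis-cli build passes -a through to cluster-manager mode):
redis-cli --cluster check 110.50.1.55:6379 -a password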

V. redis-cli --cluster help

redis-cli --cluster help
Cluster Manager Commands:
  create         host1:port1 ... hostN:portN   #create a cluster
                 --cluster-replicas <arg>      #number of replicas per master
  check          host:port                     #check the cluster
                 --cluster-search-multiple-owners #check whether any slot is assigned to more than one node
  info           host:port                     #show cluster status
  fix            host:port                     #repair the cluster
                 --cluster-search-multiple-owners #fix slots that are assigned to more than one node
  reshard        host:port                     #migrate slots (re-shard); any node in the cluster can be specified
                 --cluster-from <arg>          #source node IDs to migrate slots from, comma separated; pass --from all to use every node as a source; if omitted, you are prompted during the migration
                 --cluster-to <arg>            #node ID of the destination node (only one may be given); if omitted, you are prompted during the migration
                 --cluster-slots <arg>         #number of slots to migrate; if omitted, you are prompted during the migration
                 --cluster-yes                 #answer the migration confirmation prompt automatically
                 --cluster-timeout <arg>       #timeout for the migrate command
                 --cluster-pipeline <arg>      #number of keys cluster getkeysinslot fetches at a time (default 10)
                 --cluster-replace             #use replace when migrating keys to the destination node
  rebalance      host:port                                      #rebalance the slot counts across nodes; any node in the cluster can be specified
                 --cluster-weight <node1=w1...nodeN=wN>         #assign weights to cluster nodes
                 --cluster-use-empty-masters                    #allow masters without any slots to take part (not allowed by default)
                 --cluster-timeout <arg>                        #timeout for the migrate command
                 --cluster-simulate                             #simulate the rebalance without actually migrating anything
                 --cluster-pipeline <arg>                       #number of keys cluster getkeysinslot fetches at a time (default 10)
                 --cluster-threshold <arg>                      #only rebalance when the slot imbalance exceeds this threshold
                 --cluster-replace                              #use replace when migrating keys to the destination node
  add-node       new_host:new_port existing_host:existing_port  #add a node to the given cluster; it joins as a master by default
                 --cluster-slave                                #add the new node as a replica of a randomly chosen master
                 --cluster-master-id <arg>                      #specify the master for the new replica
  del-node       host:port node_id                              #remove the given node and shut it down once removed
  call           host:port command arg arg .. arg               #run a command on every node in the cluster
  set-timeout    host:port milliseconds                         #set cluster-node-timeout
  import         host:port                                      #import data from an external Redis instance into the cluster
                 --cluster-from <arg>                           #instance to import the data from
                 --cluster-copy                                 #use copy when migrating
                 --cluster-replace                              #use replace when migrating
  help           

For check, fix, reshard, del-node, set-timeout you can specify the host and port of any working node in the cluster.
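
As a concrete illustration of the reshard options above, moving 100 slots from all masters to one target node could look like this (a sketch; <target-node-id> is a placeholder for an ID taken from cluster nodes):

redis-cli --cluster reshard 110.50.1.55:6379 --cluster-from all --cluster-to <target-node-id> --cluster-slots 100 --cluster-yes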
