k8s Practice 17: Deploying, Configuring, and Testing the Monitoring Tool Prometheus with Helm
1. Deploying Helm
Grafana and Prometheus will be deployed with Helm later, so Helm must be installed first and verified to work properly.
The Helm client installation process is shown below:
[ helm]# curl https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 > get_helm.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  6617  100  6617    0     0   5189      0  0:00:01  0:00:01 --:--:--  5193
[ helm]# ls
get_helm.sh
[ helm]# chmod 700 get_helm.sh
[ helm]# ./get_helm.sh
Downloading https://get.helm.sh/helm-v3.0.2-linux-amd64.tar.gz
Preparing to install helm into /usr/local/bin
helm installed into /usr/local/bin/helm
[ helm]# helm version
version.BuildInfo{Version:"v3.0.2", GitCommit:"19e47ee3283ae98139d98460de796c1be1e3975f", GitTreeState:"clean", GoVersion:"go1.13.5"}
Add the stable chart repository:
[ helm]# helm repo add stable https://kubernetes-charts.storage.googleapis.com/
"stable" has been added to your repositories
Search the repository to confirm the charts are available:
[ helm]# helm search repo stable |grep grafana
stable/grafana                            4.2.2     6.5.2     The leading tool for querying and visualizing t...
[ helm]# helm search repo stable |grep prometheus
stable/helm-exporter                      0.3.1     0.4.0     Exports helm release stats to prometheus
stable/prometheus                         9.7.2     2.13.1    Prometheus is a monitoring system and time seri...
stable/prometheus-adapter                 1.4.0     v0.5.0    A Helm chart for k8s prometheus adapter
stable/prometheus-blackbox-exporter       1.6.0     0.15.1    Prometheus Blackbox Exporter
stable/prometheus-cloudwatch-exporter     0.5.0     0.6.0     A Helm chart for prometheus cloudwatch-exporter
stable/prometheus-consul-exporter         0.1.4     0.4.0     A Helm chart for the Prometheus Consul Exporter
stable/prometheus-couchdb-exporter        0.1.1     1.0       A Helm chart to export the metrics from couchdb...
stable/prometheus-mongodb-exporter        2.4.0     v0.10.0   A Prometheus exporter for MongoDB metrics
stable/prometheus-mysql-exporter          0.5.2     v0.11.0   A Helm chart for prometheus mysql exporter with...
stable/prometheus-nats-exporter           2.3.0     0.6.0     A Helm chart for prometheus-nats-exporter
stable/prometheus-node-exporter           1.8.1     0.18.1    A Helm chart for prometheus node-exporter
stable/prometheus-operator                8.5.0     0.34.0    Provides easy monitoring definitions for Kubern...
stable/prometheus-postgres-exporter       1.1.1     0.5.1     A Helm chart for prometheus postgres-exporter
stable/prometheus-pushgateway             1.2.10    1.0.1     A Helm chart for prometheus pushgateway
stable/prometheus-rabbitmq-exporter       0.5.5     v0.29.0   Rabbitmq metrics exporter for prometheus
stable/prometheus-redis-exporter          3.2.0     1.0.4     Prometheus exporter for Redis metrics
stable/prometheus-snmp-exporter           0.0.4     0.14.0    Prometheus SNMP Exporter
stable/prometheus-to-sd                   0.3.0     0.5.2     Scrape metrics stored in prometheus format and ...
Deploy an application to test that Helm works:
[ helm]# helm install stable/nginx-ingress --generate-name
NAME: nginx-ingress-1577092943
LAST DEPLOYED: Mon Dec 23 17:22:26 2019
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
The nginx-ingress controller has been installed.
It may take a few minutes for the LoadBalancer IP to be available.
You can watch the status by running 'kubectl --namespace default get services -o wide -w nginx-ingress-1577092943-controller'
[ helm]# helm ls
NAME                       NAMESPACE   REVISION   UPDATED                                   STATUS     CHART                  APP VERSION
nginx-ingress-1577092943   default     1          2019-12-23 17:22:26.230661264 +0800 CST   deployed   nginx-ingress-1.27.0   0.26.1
Everything is up, as shown below:
[ helm]# kubectl get all |grep nginx
pod/nginx-ingress-1577092943-controller-8468884448-9wszl        1/1     Running   0          4m49s
pod/nginx-ingress-1577092943-default-backend-74c4db5b5b-clc2s   1/1     Running   0          4m49s
service/nginx-ingress-1577092943-controller        LoadBalancer   10.254.229.168   <pending>   80:8691/TCP,443:8569/TCP   4m49s
service/nginx-ingress-1577092943-default-backend   ClusterIP      10.254.37.89     <none>      80/TCP                     4m49s
deployment.apps/nginx-ingress-1577092943-controller        1/1     1     1     4m49s
deployment.apps/nginx-ingress-1577092943-default-backend   1/1     1     1     4m49s
replicaset.apps/nginx-ingress-1577092943-controller-8468884448        1     1     1     4m49s
replicaset.apps/nginx-ingress-1577092943-default-backend-74c4db5b5b   1     1     1     4m49s
The deployment works and the test passes, so remove the application that was just installed:
[ helm]# helm ls
NAME                       NAMESPACE   REVISION   UPDATED                                   STATUS     CHART                  APP VERSION
nginx-ingress-1577092943   default     1          2019-12-23 17:22:26.230661264 +0800 CST   deployed   nginx-ingress-1.27.0   0.26.1
[ helm]# helm uninstall nginx-ingress-1577092943
release "nginx-ingress-1577092943" uninstalled
2. Deploying Prometheus with Helm
2.1. Starting the Deployment
[ ~]# helm install stable/prometheus --generate-name
NAME: prometheus-1577239571
LAST DEPLOYED: Wed Dec 25 10:06:14 2019
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
The Prometheus server can be accessed via port 80 on the following DNS name from within your cluster:
prometheus-1577239571-server.default.svc.cluster.local
2.2. Problems Encountered
Check the Services and Pods that were started:
[ ~]# kubectl get svc,pod -o wide|grep prometheus
service/prometheus-1577239571-alertmanager         ClusterIP   10.254.251.30    <none>   80/TCP     2m26s   app=prometheus,component=alertmanager,release=prometheus-1577239571
service/prometheus-1577239571-kube-state-metrics   ClusterIP   None             <none>   80/TCP     2m26s   app=prometheus,component=kube-state-metrics,release=prometheus-1577239571
service/prometheus-1577239571-node-exporter        ClusterIP   None             <none>   9100/TCP   2m26s   app=prometheus,component=node-exporter,release=prometheus-1577239571
service/prometheus-1577239571-pushgateway          ClusterIP   10.254.188.166   <none>   9091/TCP   2m26s   app=prometheus,component=pushgateway,release=prometheus-1577239571
service/prometheus-1577239571-server               ClusterIP   10.254.128.74    <none>   80/TCP     2m26s   app=prometheus,component=server,release=prometheus-1577239571
pod/prometheus-1577239571-alertmanager-67b967b8c7-lmjf7          0/2   Pending   0   2m25s   <none>            <none>      <none>   <none>
pod/prometheus-1577239571-kube-state-metrics-6d86bf588b-w7hrq    1/1   Running   0   2m25s   172.30.4.7        k8s-node1   <none>   <none>
pod/prometheus-1577239571-node-exporter-k9bsf                    1/1   Running   0   2m25s   192.168.174.130   k8s-node3   <none>   <none>
pod/prometheus-1577239571-node-exporter-rv9k8                    1/1   Running   0   2m25s   192.168.174.129   k8s-node2   <none>   <none>
pod/prometheus-1577239571-node-exporter-xc8f2                    1/1   Running   0   2m25s   192.168.174.128   k8s-node1   <none>   <none>
pod/prometheus-1577239571-pushgateway-d9b4cb944-zppfm            1/1   Running   0   2m25s   172.30.26.7       k8s-node3   <none>   <none>
pod/prometheus-1577239571-server-c5d4dffbf-gzk9n                 0/2   Pending   0   2m25s   <none>            <none>      <none>   <none>
Two Pods stay in Pending state. To find the cause, run kubectl describe on them, which shows the following error:
Warning FailedScheduling 25s (x5 over 4m27s) default-scheduler pod has unbound immediate PersistentVolumeClaims (repeated 3 times)
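For reference, the describe commands were along these lines (a sketch; the pod names are taken from the listing above):

kubectl describe pod prometheus-1577239571-alertmanager-67b967b8c7-lmjf7
kubectl describe pod prometheus-1577239571-server-c5d4dffbf-gzk9n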
The error concerns PVCs, so check the PVCs:
[ templates]# kubectl get pvc |grep prometheus
prometheus-1577239571-alertmanager   Pending                                      21m
prometheus-1577239571-server         Pending                                      21m
Running describe on the PVCs shows the details: no PV is available and no storage class is set, so the PVCs cannot be bound. The error is below:
Normal FailedBinding 16s (x82 over 20m) persistentvolume-controller no persistent volumes available for this claim and no storage class is set
What to do? This cluster already provisions PVCs dynamically from NFS storage, so can the chart be changed to use that NFS storage?
For setting up the NFS backend, refer to the earlier article; the StorageClass name is shown below:
[ templates]# kubectl get storageclass
NAME                  PROVISIONER      AGE
managed-nfs-storage   fuseim.pri/ifs   5d17h
2.3. Connecting the Storage and Fixing the Error
Inspect the variables of stable/prometheus to find the PV-related settings behind the error. The reference command is:
helm show values stable/prometheus
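For example, the output can be redirected to a file so it is easier to search (a trivial sketch; the file name is arbitrary):

helm show values stable/prometheus > prometheus-default-values.yaml
grep -n storageClass prometheus-default-values.yaml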
Since the values need to be inspected and then modified, download the chart, then search and edit it locally:
[ prometheus-grafana]# helm pull stable/prometheus
[ prometheus-grafana]# ls
prometheus-9.7.2.tgz
[ prometheus-grafana]# tar zxvf prometheus-9.7.2.tgz --warning=no-timestamp
[ prometheus-grafana]# ls
prometheus  prometheus-9.7.2.tgz
[ prometheus-grafana]# tree prometheus
prometheus
├── Chart.yaml
├── README.md
├── templates
│   ├── alertmanager-clusterrolebinding.yaml
│   ├── alertmanager-clusterrole.yaml
│   ├── alertmanager-configmap.yaml
│   ├── alertmanager-deployment.yaml
│   ├── alertmanager-ingress.yaml
│   ├── alertmanager-networkpolicy.yaml
│   ├── alertmanager-pdb.yaml
│   ├── alertmanager-podsecuritypolicy.yaml
│   ├── alertmanager-pvc.yaml
│   ├── alertmanager-serviceaccount.yaml
│   ├── alertmanager-service-headless.yaml
│   ├── alertmanager-service.yaml
│   ├── alertmanager-statefulset.yaml
│   ├── _helpers.tpl
│   ├── kube-state-metrics-clusterrolebinding.yaml
│   ├── kube-state-metrics-clusterrole.yaml
│   ├── kube-state-metrics-deployment.yaml
│   ├── kube-state-metrics-networkpolicy.yaml
│   ├── kube-state-metrics-pdb.yaml
│   ├── kube-state-metrics-podsecuritypolicy.yaml
│   ├── kube-state-metrics-serviceaccount.yaml
│   ├── kube-state-metrics-svc.yaml
│   ├── node-exporter-daemonset.yaml
│   ├── node-exporter-podsecuritypolicy.yaml
│   ├── node-exporter-rolebinding.yaml
│   ├── node-exporter-role.yaml
│   ├── node-exporter-serviceaccount.yaml
│   ├── node-exporter-service.yaml
│   ├── NOTES.txt
│   ├── pushgateway-clusterrolebinding.yaml
│   ├── pushgateway-clusterrole.yaml
│   ├── pushgateway-deployment.yaml
│   ├── pushgateway-ingress.yaml
│   ├── pushgateway-networkpolicy.yaml
│   ├── pushgateway-pdb.yaml
│   ├── pushgateway-podsecuritypolicy.yaml
│   ├── pushgateway-pvc.yaml
│   ├── pushgateway-serviceaccount.yaml
│   ├── pushgateway-service.yaml
│   ├── server-clusterrolebinding.yaml
│   ├── server-clusterrole.yaml
│   ├── server-configmap.yaml
│   ├── server-deployment.yaml
│   ├── server-ingress.yaml
│   ├── server-networkpolicy.yaml
│   ├── server-pdb.yaml
│   ├── server-podsecuritypolicy.yaml
│   ├── server-pvc.yaml
│   ├── server-serviceaccount.yaml
│   ├── server-service-headless.yaml
│   ├── server-service.yaml
│   ├── server-statefulset.yaml
│   └── server-vpa.yaml
└── values.yaml

1 directory, 56 files
All variables are defined in values.yaml, so examine that file.
It contains a great deal and has to be gone through item by item. One of the PV-related configuration blocks (this one for alertmanager) is:
persistentVolume:
  ## If true, alertmanager will create/use a Persistent Volume Claim
  ## If false, use emptyDir
  ##
  enabled: true

  ## alertmanager data Persistent Volume access modes
  ## Must match those of existing PV or dynamic provisioner
  ## Ref: http://kubernetes.io/docs/user-guide/persistent-volumes/
  ##
  accessModes:
    - ReadWriteOnce

  ## alertmanager data Persistent Volume Claim annotations
  ##
  annotations: {}

  ## alertmanager data Persistent Volume existing claim name
  ## Requires alertmanager.persistentVolume.enabled: true
  ## If defined, PVC must be created manually before volume will be bound
  existingClaim: ""

  ## alertmanager data Persistent Volume mount root path
  ##
  mountPath: /data

  ## alertmanager data Persistent Volume size
  ##
  size: 2Gi

  ## alertmanager data Persistent Volume Storage Class
  ## If defined, storageClassName: <storageClass>
  ## If set to "-", storageClassName: "", which disables dynamic provisioning
  ## If undefined (the default) or set to null, no storageClassName spec is
  ##   set, choosing the default provisioner.  (gp2 on AWS, standard on
  ##   GKE, AWS & OpenStack)
  ##
  # storageClass: "-"

  ## alertmanager data Persistent Volume Binding Mode
  ## If defined, volumeBindingMode: <volumeBindingMode>
  ## If undefined (the default) or set to null, no volumeBindingMode spec is
  ##   set, choosing the default mode.
  ##
From the comments above, the chart defines a 2Gi PVC for alertmanager, and the parameter that binds the PVC to dynamically provisioned storage is storageClass, which is commented out by default. Enable this parameter to use the cluster's StorageClass.
Change # storageClass: "-" to storageClass: managed-nfs-storage (managed-nfs-storage is the name of the StorageClass configured in this cluster; the change needs to be made in three places: alertmanager, server, and pushgateway).
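As an aside (a sketch only, not what was done here), the same three overrides could instead be passed on the command line with --set, using the alertmanager/server/pushgateway persistentVolume keys that values.yaml defines:

helm install stable/prometheus --generate-name \
  --set alertmanager.persistentVolume.storageClass=managed-nfs-storage \
  --set server.persistentVolume.storageClass=managed-nfs-storage \
  --set pushgateway.persistentVolume.storageClass=managed-nfs-storage

In this case values.yaml was edited directly; the grep below confirms the three changes: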
[ prometheus-grafana]# cat prometheus/values.yaml |grep -B 8 managed
  ## alertmanager data Persistent Volume Storage Class
  ## If defined, storageClassName: <storageClass>
  ## If set to "-", storageClassName: "", which disables dynamic provisioning
  ## If undefined (the default) or set to null, no storageClassName spec is
  ##   set, choosing the default provisioner.  (gp2 on AWS, standard on
  ##   GKE, AWS & OpenStack)
  ##
  # storageClass: "-"
  storageClass: managed-nfs-storage
--
  ## Prometheus server data Persistent Volume Storage Class
  ## If defined, storageClassName: <storageClass>
  ## If set to "-", storageClassName: "", which disables dynamic provisioning
  ## If undefined (the default) or set to null, no storageClassName spec is
  ##   set, choosing the default provisioner.  (gp2 on AWS, standard on
  ##   GKE, AWS & OpenStack)
  ##
  # storageClass: "-"
  storageClass: managed-nfs-storage
--
  ## pushgateway data Persistent Volume Storage Class
  ## If defined, storageClassName: <storageClass>
  ## If set to "-", storageClassName: "", which disables dynamic provisioning
  ## If undefined (the default) or set to null, no storageClassName spec is
  ##   set, choosing the default provisioner.  (gp2 on AWS, standard on
  ##   GKE, AWS & OpenStack)
  ##
  # storageClass: "-"
  storageClass: managed-nfs-storage
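To reinstall from the modified local chart, a command along these lines can be used (a sketch; --generate-name produces a timestamped release name such as the prometheus-1577263826 seen below):

helm install ./prometheus --generate-name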
It works. After changing the storage parameters, the installation succeeds, as shown below:
[ prometheus-grafana]# kubectl get svc,pod -o wide |grep prometheus
service/prometheus-1577263826-alertmanager         ClusterIP   10.254.112.105   <none>   80/TCP     4m6s   app=prometheus,component=alertmanager,release=prometheus-1577263826
service/prometheus-1577263826-kube-state-metrics   ClusterIP   None             <none>   80/TCP     4m6s   app=prometheus,component=kube-state-metrics,release=prometheus-1577263826
service/prometheus-1577263826-node-exporter        ClusterIP   None             <none>   9100/TCP   4m6s   app=prometheus,component=node-exporter,release=prometheus-1577263826
service/prometheus-1577263826-pushgateway          ClusterIP   10.254.185.145   <none>   9091/TCP   4m6s   app=prometheus,component=pushgateway,release=prometheus-1577263826
service/prometheus-1577263826-server               ClusterIP   10.254.132.104   <none>   80/TCP     4m6s   app=prometheus,component=server,release=prometheus-1577263826
pod/prometheus-1577263826-alertmanager-5cfccc55b7-6hdqn          2/2   Running   0   4m5s   172.30.26.8       k8s-node3   <none>   <none>
pod/prometheus-1577263826-kube-state-metrics-697db589d4-d5rmm    1/1   Running   0   4m5s   172.30.26.7       k8s-node3   <none>   <none>
pod/prometheus-1577263826-node-exporter-5gcc2                    1/1   Running   0   4m5s   192.168.174.129   k8s-node2   <none>   <none>
pod/prometheus-1577263826-node-exporter-b569p                    1/1   Running   0   4m5s   192.168.174.130   k8s-node3   <none>   <none>
pod/prometheus-1577263826-node-exporter-mft6l                    1/1   Running   0   4m5s   192.168.174.128   k8s-node1   <none>   <none>
pod/prometheus-1577263826-pushgateway-95c67bd5d-28p25            1/1   Running   0   4m5s   172.30.4.7        k8s-node1   <none>   <none>
pod/prometheus-1577263826-server-88fbdfc47-p2bfm                 2/2   Running   0   4m5s   172.30.4.8        k8s-node1   <none>   <none>
2.4. Prometheus Basic Concepts
The roles of the Prometheus components:
prometheus server
Prometheus Server is the core component of Prometheus, responsible for collecting, storing, and querying monitoring data.
Prometheus Server has a built-in Expression Browser UI, through which data can be queried with PromQL and visualized directly.
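For example (a sketch of typical queries, not tied to this particular deployment), expressions like these can be entered in the Expression Browser:

# which scrape targets are currently up (1) or down (0)
up

# per-node CPU usage over the last 5 minutes, excluding idle time (node-exporter metric)
sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))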
node-exporter
An Exporter exposes a metrics collection endpoint to Prometheus Server as an HTTP service; by scraping that endpoint, Prometheus Server obtains the monitoring data it needs.
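As a quick check (a sketch; 192.168.174.128 is k8s-node1 from the listing above and 9100 is the node-exporter port), the endpoint can be queried directly:

# print the first few metrics exposed by the node-exporter on k8s-node1
curl -s http://192.168.174.128:9100/metrics | head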
alertmanager
Prometheus Server supports alerting rules defined with PromQL; when a rule's expression is satisfied, an alert is fired, and the subsequent handling of that alert is managed by AlertManager. AlertManager integrates with built-in notification channels such as email and Slack, and custom handling can be added via webhooks. AlertManager is the alert-handling hub of the Prometheus ecosystem.
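For illustration only (a minimal sketch; this rule is not part of the chart's default configuration), a Prometheus alerting rule in a rules file looks roughly like this:

groups:
  - name: example
    rules:
      - alert: InstanceDown
        # fire when a scrape target has been unreachable for 5 minutes
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"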
pushgateway
Because Prometheus collects data with a pull model, the network must allow Prometheus Server to reach each Exporter directly. When that requirement cannot be met, PushGateway can act as an intermediary: components on the internal network push their monitoring data to the gateway, and Prometheus Server then pulls the data from PushGateway in the usual way.
It is not needed in this environment.
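For reference only (a sketch, since Pushgateway is not used here; the ClusterIP 10.254.185.145 and port 9091 come from the service listing above, and the metric and job names are made up), pushing a metric looks like this:

# push one gauge value under job "some_job" (hypothetical metric and job names)
echo "some_job_duration_seconds 42" | curl --data-binary @- http://10.254.185.145:9091/metrics/job/some_job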
kube-state-metrics
The basic idea: kube-state-metrics polls the Kubernetes API and converts structured information about Kubernetes objects into metrics, for example how many replicas an RC requests and how many are currently available, or how many Jobs are running.
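A couple of the standard metrics it exposes (a sketch; these can be queried with PromQL once Prometheus scrapes kube-state-metrics):

# desired vs. currently available replicas for each Deployment
kube_deployment_spec_replicas
kube_deployment_status_replicas_available

# number of pods in each phase (Pending/Running/Succeeded/Failed/Unknown)
sum by (phase) (kube_pod_status_phase)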
2.5. Configuring Web Access to Prometheus Server and kube-state-metrics
Traefik was deployed earlier in this environment, so all that is needed is to add Ingress resources, shown below:
prometheus server
[ prometheus-grafana]# cat prometheus-server-ingress.yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: prometheus-server
  namespace: default
spec:
  rules:
  - host: prometheus-server
    http:
      paths:
      - path: /
        backend:
          serviceName: prometheus-1577263826-server
          servicePort: 80
kube-state-metrics
[ prometheus-grafana]# cat kube-state-ingress.yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: kube-state
  namespace: default
spec:
  rules:
  - host: kube-state
    http:
      paths:
      - path: /
        backend:
          serviceName: prometheus-1577263826-kube-state-metrics
          servicePort: 80
Add host resolution for the two hostnames and they can be accessed normally; note that both servers are accessed over HTTPS here. For how to configure Traefik, refer to the earlier Traefik configuration article.
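The host resolution can be as simple as /etc/hosts entries on the client machine pointing the two Ingress hostnames at a node running Traefik (a sketch; 192.168.174.128 is assumed here to be one of the nodes fronted by Traefik):

192.168.174.128  prometheus-server
192.168.174.128  kube-state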