After a change to the company network and a server restart, Prometheus showed three node targets as down. The node servers themselves were healthy, but Prometheus marks a node_exporter target as down when its request to http://IP:9100/metrics returns no data within 10 seconds.
On each node server, use curl to measure how long the endpoint takes to return data:
curl -o /dev/null -s -w '%{time_connect}:%{time_starttransfer}:%{time_total}\n' 'http://NodeIP:9100/metrics'
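time_connect: connection time, measured from the start until the TCP three-way handshake completes. It includes DNS resolution, so to get the pure connect time, subtract the lookup time (which curl reports as time_namelookup);
time_starttransfer: time to first byte, measured from the start of the request until the first byte of the server's response arrives;
time_total: total time for the whole request.
time_connect was very fast on every node, which rules out a network problem. Apart from node 167, whose response time was normal, all the other nodes were slow to respond.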
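The three timings can be turned into per-phase durations to see where the time goes. The following is a minimal sketch, assuming the colon-separated output produced by the curl command above; the helper name split_timings and the sample values are made up for illustration.

```shell
#!/bin/sh
# Split curl's "connect:starttransfer:total" output into per-phase durations.
# split_timings is a hypothetical helper name; the input format matches the
# -w string used in the curl command above.
split_timings() {
    echo "$1" | awk -F: '{
        # wait     = time between TCP connect and the first response byte
        # transfer = time spent receiving the rest of the response body
        printf "connect=%.3f wait=%.3f transfer=%.3f\n", $1, $2 - $1, $3 - $2
    }'
}

# Sample values (invented): a fast connect followed by a slow first byte.
split_timings "0.25:10.75:11.00"
# prints: connect=0.250 wait=10.500 transfer=0.250
```

A small connect value next to a large wait value suggests the delay is inside node_exporter itself rather than on the network path.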
Check the pod logs:
ts=2023-01-11T06:11:08.189Z caller=stdlib.go:105 level=error caller="error encoding and sending metric family: write tcp 163:9100" msg="->167:40743: write: broken pipe"
ts=2023-01-11T06:11:08.189Z caller=stdlib.go:105 level=error caller="error encoding and sending metric family: write tcp 163:9100" msg="->.167:40743: write: broken pipe"
The log contains a large number of these entries.
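To get a rough idea of how often the error occurs, the matching lines can be counted. This is only a sketch: count_broken_pipe is a hypothetical helper name, and the sample lines below are shortened stand-ins for real log output.

```shell
#!/bin/sh
# Count "broken pipe" write errors in log text read from standard input.
# grep -c exits non-zero when nothing matches, so "|| true" keeps the
# function's exit status clean while grep still prints 0.
count_broken_pipe() {
    grep -c 'write: broken pipe' || true
}

# Shortened, invented sample lines; real input would come from the pod logs.
printf '%s\n' \
    'level=error msg="write: broken pipe"' \
    'level=info msg="ok"' \
    'level=error msg="write: broken pipe"' | count_broken_pipe
# prints: 2
```

Against a live cluster the input would be piped in from kubectl logs for the node_exporter pod on the affected node.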
References: https://asktug.com/t/topic/153284/36 and https://blog.csdn.net/lyf0327/article/details/99971590
Some people suggest increasing scrape_timeout, but that does not address the root cause.
Others suggest raising the memory and CPU limits in the node_exporter YAML, but my node_exporter has no resource requests or limits configured, so that was not the problem either. In the end I based my changes on someone else's configuration file:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "5"
    prometheus.io/scrape: "true"
  generation: 5
  labels:
    app: node-exporter
  name: node-exporter
  namespace: monitoring
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: node-exporter
      name: node-exporter
    spec:
      containers:
      - args:
        - --web.listen-address=$(HOSTIP):9100
        - --path.procfs=/host/proc
        - --path.sysfs=/host/sys
        - --path.rootfs=/host # changed; was /host/root
        - --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+)($|/)
        - --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$
        - --log.level=debug
        env:
        - name: HOSTIP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        image: prom/node-exporter:latest
        imagePullPolicy: IfNotPresent
        name: node-exporter
        ports:
        - containerPort: 9100
          hostPort: 9100
          protocol: TCP
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/proc
          name: proc
        - mountPath: /host/sys
          name: sys
        - mountPath: /host # changed; was /rootfs
          name: rootfs
      dnsPolicy: ClusterFirst
      hostIPC: true
      hostNetwork: true
      hostPID: true
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      volumes:
      - hostPath:
          path: /proc
        name: proc
      - hostPath:
          path: /sys
        name: sys
      - hostPath:
          path: /
        name: rootfs
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
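I then watched it for a day.
The problem persisted after these changes, and one of the nodes reported insufficient permissions, so I set runAsUser: 0 in the pod-level securityContext: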
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "5"
    prometheus.io/scrape: "true"
  generation: 5
  labels:
    app: node-exporter
  name: node-exporter
  namespace: monitoring
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: node-exporter
      name: node-exporter
    spec:
      containers:
      - args:
        - --web.listen-address=$(HOSTIP):9100
        - --path.procfs=/host/proc
        - --path.sysfs=/host/sys
        - --path.rootfs=/host
        - --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+)($|/)
        - --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$
        - --log.level=debug
        env:
        - name: HOSTIP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        image: prom/node-exporter:latest
        imagePullPolicy: IfNotPresent
        name: node-exporter
        ports:
        - containerPort: 9100
          hostPort: 9100
          protocol: TCP
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/proc
          name: proc
        - mountPath: /host/sys
          name: sys
        - mountPath: /host
          name: rootfs
      dnsPolicy: ClusterFirst
      hostIPC: true
      hostNetwork: true
      hostPID: true
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        runAsUser: 0 # added; the default is the nobody user (UID 65534)
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      volumes:
      - hostPath:
          path: /proc
        name: proc
      - hostPath:
          path: /sys
        name: sys
      - hostPath:
          path: /
        name: rootfs
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
After watching for another day, the original problem came back.
The log was again full of: error encoding and sending metric family: write tcp 10.37.0.163:9100" msg="->10.37.0.167:50266: write: broken pipe
I went through a lot of material, and the last source I found was the most useful.
Link: https://github.com/prometheus/node_exporter/issues/2500
In my cluster, three nodes run kernel 5.4.0-132 and all three show the problem, while the nodes on the newer 5.15.0 kernel are fine.
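A quick way to spot nodes on the older kernel is to compare version strings. This is a sketch only: kernel_older_than is a hypothetical helper name, it relies on GNU sort -V, and 5.15.0 is simply the version that worked in this cluster, not a confirmed boundary for the fix.

```shell
#!/bin/sh
# Succeed (exit 0) if kernel version $1 is strictly older than version $2.
# Uses GNU sort -V (version sort) to order the two strings.
kernel_older_than() {
    [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Versions taken from the observation above.
if kernel_older_than "5.4.0-132" "5.15.0"; then
    echo "older kernel"
fi
# prints: older kernel
```

On a node, the first argument would be the output of uname -r.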
Final conclusion: this is a kernel version issue.
Temporary workaround: add the environment variable GOMAXPROCS=1 to node_exporter.
Permanent fix: upgrade the kernel.
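One way to apply the GOMAXPROCS=1 workaround without editing the manifest by hand is kubectl set env. The sketch below only builds and prints the command so it can be reviewed first; the namespace and DaemonSet name are taken from the manifest above, and set_gomaxprocs_cmd is a hypothetical helper name.

```shell
#!/bin/sh
# Namespace and DaemonSet name as used in the manifest above.
NAMESPACE=monitoring
DAEMONSET=node-exporter

# Print the kubectl command that sets GOMAXPROCS on the DaemonSet's containers.
# Nothing is changed in the cluster until the printed command is actually run.
set_gomaxprocs_cmd() {
    echo "kubectl -n $NAMESPACE set env daemonset/$DAEMONSET GOMAXPROCS=$1"
}

set_gomaxprocs_cmd 1
# prints: kubectl -n monitoring set env daemonset/node-exporter GOMAXPROCS=1
```

The same result can be had by adding an entry named GOMAXPROCS with the value "1" next to HOSTIP in the container's env list. Either way the pod template changes, so the DaemonSet replaces its pods according to its updateStrategy.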