我使用AWS Kubernetes Quickstart在VPC和私有子网中创建Kubernetes集群:https://aws-quickstart.s3.amazonaws.com/quickstart-heptio/doc/heptio-kubernetes-on-the-aws-cloud.pdf . 它运行良好了一段时间 . 我在我的Kubernetes集群上安装了Calico . 我有两个节点和一个主节点 . 主服务器上的calico pod正常运行,节点上的calico pod处于crashloopbackoff状态:
NAME READY STATUS RESTARTS AGE
calico-etcd-ztwjj 1/1 Running 1 55d
calico-kube-controllers-685755779f-ftm92 1/1 Running 2 55d
calico-node-gkjgl 1/2 CrashLoopBackOff 270 22h
calico-node-jxkvx 2/2 Running 4 55d
calico-node-mxhc5 1/2 CrashLoopBackOff 9 25m
描述其中一个崩溃的 beans 荚:
ubuntu@ip-10-0-1-133:~$ kubectl describe pod calico-node-gkjgl -n kube-system
Name: calico-node-gkjgl
Namespace: kube-system
Node: ip-10-0-0-237.us-east-2.compute.internal/10.0.0.237
Start Time: Mon, 17 Sep 2018 16:56:41 +0000
Labels: controller-revision-hash=185957727
k8s-app=calico-node
pod-template-generation=1
Annotations: scheduler.alpha.kubernetes.io/critical-pod=
Status: Running
IP: 10.0.0.237
Controlled By: DaemonSet/calico-node
Containers:
calico-node:
Container ID: docker://d89979ba963c33470139fd2093a5427b13c6d44f4c6bb546c9acdb1a63cd4f28
Image: quay.io/calico/node:v3.1.1
Image ID: docker-pullable://quay.io/calico/node@sha256:19fdccdd4a90c4eb0301b280b50389a56e737e2349828d06c7ab397311638d29
Port: <none>
Host Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Tue, 18 Sep 2018 15:14:44 +0000
Finished: Tue, 18 Sep 2018 15:14:44 +0000
Ready: False
Restart Count: 270
Requests:
cpu: 250m
Liveness: http-get http://:9099/liveness delay=10s timeout=1s period=10s #success=1 #failure=6
Readiness: http-get http://:9099/readiness delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
ETCD_ENDPOINTS: <set to the key 'etcd_endpoints' of config map 'calico-config'> Optional: false
CALICO_NETWORKING_BACKEND: <set to the key 'calico_backend' of config map 'calico-config'> Optional: false
CLUSTER_TYPE: kubeadm,bgp
CALICO_DISABLE_FILE_LOGGING: true
CALICO_K8S_NODE_REF: (v1:spec.nodeName)
FELIX_DEFAULTENDPOINTTOHOSTACTION: ACCEPT
CALICO_IPV4POOL_CIDR: 192.168.0.0/16
CALICO_IPV4POOL_IPIP: Always
FELIX_IPV6SUPPORT: false
FELIX_IPINIPMTU: 1440
FELIX_LOGSEVERITYSCREEN: info
IP: autodetect
FELIX_HEALTHENABLED: true
Mounts:
/lib/modules from lib-modules (ro)
/var/lib/calico from var-lib-calico (rw)
/var/run/calico from var-run-calico (rw)
/var/run/secrets/kubernetes.io/serviceaccount from calico-cni-plugin-token-b7sfl (ro)
install-cni:
Container ID: docker://b37e0ec7eba690473a4999a31d9f766f7adfa65f800a7b2dc8e23ead7520252d
Image: quay.io/calico/cni:v3.1.1
Image ID: docker-pullable://quay.io/calico/cni@sha256:dc345458d136ad9b4d01864705895e26692d2356de5c96197abff0030bf033eb
Port: <none>
Host Port: <none>
Command:
/install-cni.sh
State: Running
Started: Mon, 17 Sep 2018 17:11:52 +0000
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 17 Sep 2018 16:56:43 +0000
Finished: Mon, 17 Sep 2018 17:10:53 +0000
Ready: True
Restart Count: 1
Environment:
CNI_CONF_NAME: 10-calico.conflist
ETCD_ENDPOINTS: <set to the key 'etcd_endpoints' of config map 'calico-config'> Optional: false
CNI_NETWORK_CONFIG: <set to the key 'cni_network_config' of config map 'calico-config'> Optional: false
Mounts:
/host/etc/cni/net.d from cni-net-dir (rw)
/host/opt/cni/bin from cni-bin-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from calico-cni-plugin-token-b7sfl (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
lib-modules:
Type: HostPath (bare host directory volume)
Path: /lib/modules
HostPathType:
var-run-calico:
Type: HostPath (bare host directory volume)
Path: /var/run/calico
HostPathType:
var-lib-calico:
Type: HostPath (bare host directory volume)
Path: /var/lib/calico
HostPathType:
cni-bin-dir:
Type: HostPath (bare host directory volume)
Path: /opt/cni/bin
HostPathType:
cni-net-dir:
Type: HostPath (bare host directory volume)
Path: /etc/cni/net.d
HostPathType:
calico-cni-plugin-token-b7sfl:
Type: Secret (a volume populated by a Secret)
SecretName: calico-cni-plugin-token-b7sfl
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: :NoSchedule
:NoExecute
:NoSchedule
:NoExecute
CriticalAddonsOnly
node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/unreachable:NoExecute
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 4m (x6072 over 22h) kubelet, ip-10-0-0-237.us-east-2.compute.internal Back-off restarting failed container
同一个pod的日志:
ubuntu@ip-10-0-1-133:~$ kubectl logs calico-node-gkjgl -n kube-system -c calico-node
2018-09-18 15:14:44.605 [INFO][8] startup.go 251: Early log level set to info
2018-09-18 15:14:44.605 [INFO][8] startup.go 269: Using stored node name from /var/lib/calico/nodename
2018-09-18 15:14:44.605 [INFO][8] startup.go 279: Determined node name: ip-10-0-0-237.us-east-2.compute.internal
2018-09-18 15:14:44.609 [INFO][8] startup.go 101: Skipping datastore connection test
2018-09-18 15:14:44.610 [INFO][8] startup.go 352: Building new node resource Name="ip-10-0-0-237.us-east-2.compute.internal"
2018-09-18 15:14:44.610 [INFO][8] startup.go 367: Initialize BGP data
2018-09-18 15:14:44.614 [INFO][8] startup.go 564: Using autodetected IPv4 address on interface ens3: 10.0.0.237/19
2018-09-18 15:14:44.614 [INFO][8] startup.go 432: Node IPv4 changed, will check for conflicts
2018-09-18 15:14:44.618 [WARNING][8] startup.go 861: Calico node 'ip-10-0-0-237' is already using the IPv4 address 10.0.0.237.
2018-09-18 15:14:44.618 [WARNING][8] startup.go 1058: Terminating
Calico node failed to start
因此,似乎找到了节点IP地址的冲突,或者Calico似乎认为IP已经分配给另一个节点 . 快速搜索我找到了这个帖子:https://github.com/projectcalico/calico/issues/1628 . 我看到这应该通过将IP_AUTODETECTION_METHOD设置为can-reach = DESTINATION来解决,我假设它是"can-reach=10.0.0.237" . 此配置是在calico / node容器上设置的环境变量 . 我一直试图将shell装入容器本身,但是kubectl告诉我没有找到容器:
ubuntu@ip-10-0-1-133:~$ kubectl exec calico-node-gkjgl --stdin --tty /bin/sh -c calico-node -n kube-system
error: unable to upgrade connection: container not found ("calico-node")
我怀疑这是因为Calico无法分配IP . 所以我登录主机并尝试使用docker封装容器:
root@ip-10-0-0-237:~# docker exec -it k8s_POD_calico-node-gkjgl_kube-system_a6998e98-ba9a-11e8-a9fa-0a97f5a48ef4_1 /bin/bash
rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "exec: \"/bin/bash\": stat /bin/bash: no such file or directory"
所以我想在容器中没有shell可以执行 . 有理由为什么Kubernetes无法执行 . 我尝试在外部运行命令列出环境变量,但我找不到任何,我可能会错误地运行这些命令:
root@ip-10-0-0-237:~# docker inspect -f '{{range $index, $value := .Config.Env}}{{$value}} {{end}}' k8s_POD_calico-node-gkjgl_kube-system_a6998e98-ba9a-11e8-a9fa-0a97f5a48ef4_1
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
root@ip-10-0-0-237:~# docker exec -it k8s_POD_calico-node-gkjgl_kube-system_a6998e98-ba9a-11e8-a9fa-0a97f5a48ef4_1 printenv IP_AUTODETECTION_METHOD
rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "exec: \"printenv\": executable file not found in $PATH"
root@ip-10-0-0-237:~# docker exec -it k8s_POD_calico-node-gkjgl_kube-system_a6998e98-ba9a-11e8-a9fa-0a97f5a48ef4_1 /bin/env
rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "exec: \"/bin/env\": stat /bin/env: no such file or directory"
好吧,也许我会以错误的方式解决这个问题 . 我应该尝试使用Kubernetes更改Calico配置文件并重新部署吗?我在哪里可以找到我的系统?我一直无法找到设置环境变量的位置 .
1 回答
如果你看Calico docs
IP_AUTODETECTION_METHOD
已经默认为first-round
.我的猜测是先前的'run'印花布没有发布某些东西或IP地址,或者仅仅是印花布的
v3.1.1
版本中的一个错误 .尝试:
您的pod将被重新创建并希望初始化 .
将Calico升级至
v3.1.3
或最新 . 请关注这些docs我的猜测是Heptio的Calico安装正在使用etcd数据存储 .尝试了解Heptio的AWS AMI如何工作,并查看它们是否存在任何问题 . 这可能需要一些时间,因此您也可以联系他们的支持 .
尝试使用其他方法在Calico上安装Kubernetes . 在https://kubernetes.io上有详细记载