I have installed the api-server, controller, scheduler, proxy, and kubelet on the master node, and the kubelet and proxy on three other worker nodes, along with flanneld, using the systemd service files provided in the contrib/init k8s project.

When the cluster starts, everything works fine. I can deploy the dashboard and some deployments of my own (a Consul client/server, nginx, etc.), and they all run well. However, if I leave the cluster running for a few hours and come back, every pod is in CrashLoopBackOff, having restarted many times. The only way I have found to fix it is to restart the kubelet on every machine; the problem immediately disappears and everything returns to normal.
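For reference, the symptom check and the restart-kubelet workaround described above amount to the following commands (the systemd unit name `kubelet` is an assumption; adjust it to match your install):

```shell
# Confirm the symptom: every pod shows CrashLoopBackOff with a high restart count
kubectl get pods --all-namespaces -o wide

# Inspect recent kubelet logs on an affected node (unit name assumed to be "kubelet")
journalctl -u kubelet --since "2 hours ago" --no-pager | tail -n 100

# Workaround: restart the kubelet on each node; pods recover immediately afterwards
systemctl restart kubelet
```

This only clears the state; it does not explain why the kubelet degrades after a few hours.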

Logs from the kubelet after it has entered the bad state:

Sep 10 19:09:06 k8-app-2.example.com kubelet[1025]: , failed to "StartContainer" for "nginx-server" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=nginx-server pod=nginx-deployment-617048525-mgf0v_default(f6dff9f2-95db-11e7-b533-02c75fb65df0)"
Sep 10 19:09:06 k8-app-2.example.com kubelet[1025]: ]
Sep 10 19:09:07 k8-app-2.example.com kubelet[1025]: I0910 19:09:07.286367    1025 kuberuntime_manager.go:457] Container {Name:nginx-server Image:nginx Command:[] Args:[] WorkingDir: Ports:[{Name:http HostPort:0 ContainerPort:80 Protocol:TCP HostIP:}] EnvFrom:[] Env:[{Name:NODE_IP Value: ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:status.hostIP,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}} {Name:POD_IP Value: ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:status.podIP,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}}] Resources:{Limits:map[] Requests:map[]} VolumeMounts:[] LivenessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/,Port:80,Host:,Scheme:HTTP,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:10,TimeoutSeconds:1,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,} ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:Always SecurityContext:nil Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
Sep 10 19:09:07 k8-app-2.example.com kubelet[1025]: I0910 19:09:07.286795    1025 kuberuntime_manager.go:457] Container {Name:regup Image:registry.hub.docker.com/spunon/regup:latest Command:[] Args:[] WorkingDir: Ports:[] EnvFrom:[] Env:[{Name:SERVICE_NAME Value:nginx ValueFrom:nil} {Name:SERVICE_PORT Value:80 ValueFrom:nil} {Name:NODE_IP Value: ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:status.hostIP,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}} {Name:POD_IP Value: ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:status.podIP,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}}] Resources:{Limits:map[] Requests:map[]} VolumeMounts:[] LivenessProbe:nil ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:Always SecurityContext:nil Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
Sep 10 19:09:07 k8-app-2.example.com kubelet[1025]: I0910 19:09:07.287071    1025 kuberuntime_manager.go:741] checking backoff for container "nginx-server" in pod "nginx-deployment-617048525-mgf0v_default(f6dff9f2-95db-11e7-b533-02c75fb65df0)"
Sep 10 19:09:07 k8-app-2.example.com kubelet[1025]: I0910 19:09:07.287376    1025 kuberuntime_manager.go:751] Back-off 5m0s restarting failed container=nginx-server pod=nginx-deployment-617048525-mgf0v_default(f6dff9f2-95db-11e7-b533-02c75fb65df0)
Sep 10 19:09:07 k8-app-2.example.com kubelet[1025]: I0910 19:09:07.287601    1025 kuberuntime_manager.go:741] checking backoff for container "regup" in pod "nginx-deployment-617048525-mgf0v_default(f6dff9f2-95db-11e7-b533-02c75fb65df0)"
Sep 10 19:09:07 k8-app-2.example.com kubelet[1025]: I0910 19:09:07.287863    1025 kuberuntime_manager.go:751] Back-off 5m0s restarting failed container=regup pod=nginx-deployment-617048525-mgf0v_default(f6dff9f2-95db-11e7-b533-02c75fb65df0)
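To make the flattened container spec in the log above easier to read, here is the same `nginx-server` container expressed as a pod-spec fragment. Field values are taken directly from the log line; everything not present in the log is an assumption:

```yaml
# Reconstruction of the nginx-server container from the kuberuntime_manager.go log entry.
containers:
- name: nginx-server
  image: nginx
  imagePullPolicy: Always
  ports:
  - name: http
    containerPort: 80
    protocol: TCP
  env:
  - name: NODE_IP
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
  - name: POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
  livenessProbe:
    httpGet:
      path: /
      port: 80
      scheme: HTTP
    initialDelaySeconds: 10
    timeoutSeconds: 1
    periodSeconds: 10
    successThreshold: 1
    failureThreshold: 3
```

Note the liveness probe: if the kubelet's probing degrades over time, this probe failing three times in a row would kill the container and eventually trigger the 5m0s CrashLoopBackOff seen in the log.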

EDIT: Here are the logs from the kubelet when the issue seems to start