Kubeflow Installation Tutorial
1. Install Docker
1.1 Disable swap

```shell
# disable swap space
```
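The swap-disabling commands themselves were lost from the block above; on a typical Ubuntu host they would look like this (a sketch, assuming swap is configured in /etc/fstab):

```shell
# turn off all active swap immediately
sudo swapoff -a
# comment out swap entries in /etc/fstab so swap stays off after reboot
sudo sed -i '/\sswap\s/ s/^/#/' /etc/fstab
```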
1.2 Verify Docker
```shell
docker version
```
1.3 Configure a registry mirror
```shell
nano /etc/docker/daemon.json
```
Add the following mirror configuration to the file:

```json
{
  "registry-mirrors": [
    "https://registry.docker-cn.com"
  ]
}
```
1.4 Save the file and restart Docker
```shell
sudo systemctl restart docker
```
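To confirm the mirror actually took effect after the restart, the daemon's runtime configuration can be inspected:

```shell
# the configured mirror should appear under "Registry Mirrors"
docker info | grep -A 1 "Registry Mirrors"
```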
2. Install kubeadm
```shell
apt-get update && apt-get install -y apt-transport-https
```
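The rest of the installation (apt key, package source, and the kubeadm/kubelet/kubectl packages) was truncated from the block above. A sketch of the usual approach from mainland China, using the Aliyun apt mirror — the repository URL and pinned versions here are assumptions chosen to match the v1.14.1 used later in this tutorial:

```shell
# add the Aliyun Kubernetes apt repository (assumed mirror)
curl -fsSL https://mirrors.aliyun.com/kubernetes/apt/doc/apt-key.gpg | apt-key add -
cat <<EOF > /etc/apt/sources.list.d/kubernetes.list
deb https://mirrors.aliyun.com/kubernetes/apt/ kubernetes-xenial main
EOF

# install the toolchain, pinned to the version used by kubeadm.yml below
apt-get update
apt-get install -y kubelet=1.14.1-00 kubeadm=1.14.1-00 kubectl=1.14.1-00
```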
2.2 Edit kubeadm.yml
```shell
nano kubeadm.yml
```
Change the configuration to the following:

```yaml
apiVersion: kubeadm.k8s.io/v1beta1
bootstrapTokens:
- groups:
  - system:bootstrappers:kubeadm:default-node-token
  token: abcdef.0123456789abcdef
  ttl: 24h0m0s
  usages:
  - signing
  - authentication
kind: InitConfiguration
localAPIEndpoint:
  # change to the master node's internal IP
  advertiseAddress: 172.17.166.73
  bindPort: 6443
nodeRegistration:
  criSocket: /var/run/dockershim.sock
  name: kubernetes-master
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
---
apiServer:
  timeoutForControlPlane: 4m0s
apiVersion: kubeadm.k8s.io/v1beta1
certificatesDir: /etc/kubernetes/pki
clusterName: kubernetes
controlPlaneEndpoint: ""
controllerManager: {}
dns:
  type: CoreDNS
etcd:
  local:
    dataDir: /var/lib/etcd
# Google's registry is not reachable from mainland China; use the Aliyun mirror
imageRepository: registry.aliyuncs.com/google_containers
kind: ClusterConfiguration
# set the Kubernetes version
kubernetesVersion: v1.14.1
networking:
  dnsDomain: cluster.local
  # set to Calico's default pod subnet
  podSubnet: "192.168.0.0/16"
  serviceSubnet: 10.96.0.0/12
scheduler: {}
---
# enable IPVS mode
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
featureGates:
  SupportIPVSProxyMode: true
mode: ipvs
```
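Before running the actual init, the file can be sanity-checked and the control-plane images it references pre-pulled from the Aliyun mirror, which surfaces registry problems early:

```shell
# list the images the config resolves to, then pull them ahead of init
kubeadm config images list --config kubeadm.yml
kubeadm config images pull --config kubeadm.yml
```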
2.3 Initialize the cluster with kubeadm
```shell
kubeadm init --config=kubeadm.yml | tee kubeadm-init.log
```
Output like the following indicates success:

```
Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:
```
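The worker join command is printed at the end of that log (and saved in kubeadm-init.log). If it is lost, or the token expires after its 24h TTL, a fresh one can be generated on the master:

```shell
# prints a complete "kubeadm join <ip>:6443 --token ... --discovery-token-ca-cert-hash ..." line
kubeadm token create --print-join-command
```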
2.4 Configure kubectl
```shell
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
```
Check the node status:

```shell
kubectl get nodes
# a node listing like this indicates success
NAME                STATUS     ROLES    AGE     VERSION
kubernetes-master   NotReady   master   8m40s   v1.14.1
```
2.5 Configure the network
```shell
# the CNI (Calico) version here, 3.7, must match kubeadm 1.14.1, or errors will occur
```
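The apply command itself was truncated from the block above. Installing Calico as the CNI typically comes down to applying its manifest; the URL below is an assumption based on the Calico v3.7 documentation layout of that era — verify it against the Calico docs for the version you use:

```shell
# apply the Calico 3.7 manifest (matches podSubnet 192.168.0.0/16 in kubeadm.yml)
kubectl apply -f https://docs.projectcalico.org/v3.7/manifests/calico.yaml
```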
2.6 Change the master node's state so the master can also run application pods
```shell
# check node status
```
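The commands behind this step would be along these lines: check that the node is Ready, then remove the NoSchedule taint that kubeadm.yml placed on the master so ordinary pods can be scheduled there:

```shell
# check node status
kubectl get nodes
# remove the master NoSchedule taint from all nodes that carry it
kubectl taint nodes --all node-role.kubernetes.io/master-
```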
3. Pull the Kubeflow images
```shell
# pull script
```
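The pull script itself is not reproduced above. Its job is to pull each Kubeflow image from a reachable mirror and retag it with the gcr.io name the manifests reference. A minimal sketch — the only image pair listed is the Jupyter one used later in this tutorial; extend the list with whatever images your deployment fails to pull:

```shell
#!/usr/bin/env bash
set -euo pipefail

# "mirror-image target-image" pairs; add more as needed
pairs=(
  "andreyneverov/tensorflow-1.13.1-notebook-cpu:v0.5.0 gcr.io/kubeflow-images-public/tensorflow-1.13.1-notebook-cpu:v0.5.0"
)

for p in "${pairs[@]}"; do
  src=${p%% *}   # mirror image to pull
  dst=${p##* }   # gcr.io name kubeflow expects
  docker pull "$src"
  docker tag "$src" "$dst"
done
```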
Connecting to the external network
Create a new screen session; screen ensures the program keeps running after you disconnect from the remote host.
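In practice that looks like:

```shell
screen -S kubeflow   # start a named session and run the long download inside it
# detach with Ctrl-A d; the process keeps running after you log out
screen -r kubeflow   # reattach to the session later
```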
Restarting kubeadm
If some pods are stuck in ImagePullBackOff or CrashLoopBackOff, reset the whole cluster:

```shell
rm -r $HOME/.kube
kubeadm reset
# then repeat the steps from section 2.3 (kubeadm init) onward
```
View the kubelet logs:

```shell
journalctl -f -u kubelet
```
Check the nodes:

```shell
kubectl get nodes
```
Pulling the Jupyter image

```shell
docker pull andreyneverov/tensorflow-1.13.1-notebook-cpu:v0.5.0
```
Retag the image to the name Kubeflow expects:

```shell
docker tag andreyneverov/tensorflow-1.13.1-notebook-cpu:v0.5.0 gcr.io/kubeflow-images-public/tensorflow-1.13.1-notebook-cpu:v0.5.0
```
For Katib, inspect the experiment to see which images its pods are trying to pull:

```shell
kubectl -n kubeflow describe experiment random-example
```

If the metrics-collector image `registry.cn-shenzhen.aliyuncs.com/shikanon/kubeflow-images-public.katib.v1alpha2.metrics-collector:v0.1.2-alpha-289-g14dad8b` cannot be pulled, tag a local `katib/metrics-collector` build with that name:

```shell
docker tag katib/metrics-collector:latest registry.cn-shenzhen.aliyuncs.com/shikanon/kubeflow-images-public.katib.v1alpha2.metrics-collector:v0.1.2-alpha-289-g14dad8b
```