I’ve been working with Kubernetes for almost a year now, and I’ve realized a few things. The first is that “on-premise” clusters are not so common.

I’ve noticed that most blog posts, official presentations and communications rely on the public cloud, and to be honest, most of the time on Google Cloud or AWS. But what if you want to build a Kubernetes cluster on your own without these clouds? Even worse, what if you want to use a cloud provider that people don’t talk about in Kubernetes communities? That is my case. I’m a huge fan of the Scaleway cloud (a French cloud owned by online.net): it’s cheap, easy and fast. Everything you want is there to test and hack for a few euros or dollars. (Check their prices; compared to AWS or Google it’s affordable even if your budget is not so big, and the quality of service is awesome. Kudos to Scaleway. You rock.)

The second thing I noticed is that almost nobody talks about what I call “HA” Kubernetes, or “multi-master” Kubernetes. Most people who do talk about it rely on an external load balancer service, or a cloud load balancer service. What if you cannot use one of these solutions?

With this in mind, I wanted to build a fully redundant and automated Kubernetes cluster on the Scaleway cloud without having to rely on external tools (like Kubespray, Kubicorn, or Kops). As I also didn’t want to reinvent the wheel (we have this expression in French; I’m not sure you have it in English, but I hope you get the meaning), I decided to work with kubeadm. Unfortunately kubeadm is not yet ready for multi-master Kubernetes, but you’ll see that there are ways to use it even in this case.

For those who want to do everything from scratch, I recommend “Kubernetes the Hard Way”, which is a fantastic tutorial. I think everyone interested in Kubernetes should follow it at least once. (On my side, I have re-implemented a Kubernetes cluster creation with Ansible by following “Kubernetes the Hard Way”.)

To sum up, this blog post will tell you about my journey creating this highly redundant Kubernetes cluster on Scaleway. Please keep in mind that this is my way of seeing things, and I assume there are better or different ways to do it. Don’t hesitate to drop a line in the comments if you have questions or anything else.

Everything you’ll read in this blog post is “translated” into an Ansible playbook that you can use to bootstrap a highly available Kubernetes cluster on Scaleway. This is the result of almost two months of night work (meaning after work, so please be indulgent). A good way not to think about life when you come back home.

All the resources needed to bootstrap the cluster can be found on my GitHub account:

Let’s talk about the details! :-)

Architecture

Before going into the “deep” details, I wanted to talk about the architecture of the cluster.

About the mesh VPN

To get the cheapest price for our cluster, it is possible on Scaleway to create machines without a public ip. The idea is to have a few nodes (that I’ll call proxy nodes) in charge of “routing” traffic to the cluster. These nodes own a public ip address and the ingress controller(s). All the other nodes only have a private ip (not reachable from the internet; in the Scaleway case, these private ips also have no access to the internet). These private ips are delivered to the nodes using DHCP, which means that if you reboot a node, you may get a different ip address assigned to its network interface. That’s a problem for our Kubernetes cluster, because we can’t rely on ips that change. A solution to this problem is to use a mesh VPN: it creates a network between all your nodes and assigns them private ips that you choose and that will never change. With this we can build the Kubernetes cluster on top of the mesh VPN and be sure that everything survives a reboot. The tool I chose for this is called tinc.
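To give an idea of what this looks like, here is a minimal tinc setup for one node. This is an illustrative sketch: the net name “k8s”, the node names, and the ip are examples; the playbook templates the real values.

```
# /etc/tinc/k8s/tinc.conf on node master0
Name = master0
Mode = switch        # required later so Keepalived can add a floating ip alias
Interface = tun0
ConnectTo = proxy0   # connect to at least one peer; tinc meshes from there
```

The fixed private ip that survives reboots is assigned by the tinc-up script, which tincd runs when the network comes up:

```
#!/bin/sh
# /etc/tinc/k8s/tinc-up: $INTERFACE is set by tincd to the tun device
ip link set "$INTERFACE" up
ip addr add 192.168.66.4/24 dev "$INTERFACE"
```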

Types of nodes, persistent storage, and high availability

Now let’s differentiate each category of nodes:

To sum up, we have:

Tooling

The two main tools I’m using to create the cluster are Terraform and Ansible. Terraform is used to create the machines and Ansible to bootstrap the cluster. Check below for all the detailed information.

Terraforming the machines!

I’m using Terraform to create the machines on the cloud. Terraform is a fantastic tool to manage your cloud infrastructure as code, and it comes with a Scaleway provider. Pretty cool. I prefer using Terraform over Ansible for server creation because Terraform keeps the state of your environment in a file; this way, Terraform is capable of creating, modifying, but also deleting a cloud infrastructure using the exact same Terraform plan:

In the terraform directory of the scaleway-k8s project you will find my Terraform plan. The variables.tf file allows you to choose the number of nodes that you want for each type:

# cat variables.tf 
variable "region" {
  default = "par1"
}

[..]

variable "image" {
  default = "Ubuntu Zesty"
}

variable "master_instance_type" {
  default = "VC1S"
}

variable "master_instance_count" {
  default = 3
}

variable "proxy_instance_type" {
  default = "VC1S"
}

variable "worker_instance_type" {
  default = "VC1M"
}

[..]

variable "worker_instance_count" {
  default = 3
}

The sl.tf file describes the infrastructure:

# cat sl.tf
[..]
resource "scaleway_server" "worker" {
  count  = "${var.worker_instance_count}"
  name   = "worker${count.index}"
  image  = "${data.scaleway_image.ubuntu.id}"
  type   = "${var.worker_instance_type}"
  state  = "started"
  tags   = ["workers"]

  volume {
    size_in_gb = 50
    type       = "l_ssd"
  }
}

resource "scaleway_server" "master" {
  count  = "${var.master_instance_count}"
  name   = "master${count.index}"
  image  = "${data.scaleway_image.ubuntu.id}"
  type   = "${var.master_instance_type}"
  state  = "started"
  tags   = ["masters"]
}

resource "scaleway_server" "proxy0" {
  count = 1 
  name  = "proxy0"
  image = "${data.scaleway_image.ubuntu.id}"
  type  = "${var.proxy_instance_type}"
  public_ip  = "${element(scaleway_ip.public_ip.*.ip, count.index)}"
  state  = "started"
  tags  = ["proxy","primary"]
}
[..]

You now just have to run the “plan” and “apply” commands to create the infrastructure, and that’s it. This plan adds tags to each server type. These tags will be used by the Ansible dynamic inventory to assign each server to its group, and the Ansible groups created from the Scaleway tags will be used by the playbook to differentiate each node role (i.e. master, worker, or proxy (primary or secondary)).

Let’s create the nodes:

# terraform apply
scaleway_ip.public_ip: Refreshing state... (ID: 612c8678-cf66-4cfc-9117-653f11dce2fd)
data.scaleway_image.ubuntu: Refreshing state...
scaleway_server.master[1]: Refreshing state... (ID: aa9a153d-bd01-4a54-a963-6c2884161c3f)
scaleway_server.master[2]: Refreshing state... (ID: 720f2661-0541-4848-a497-caf2a058d264)
scaleway_server.master[0]: Refreshing state... (ID: bc7e5357-410c-475a-afa7-542f47d78d23)
scaleway_server.worker[0]: Refreshing state... (ID: afdccf54-728b-47a9-afb6-22334a395151)
scaleway_server.worker[1]: Refreshing state... (ID: c677107d-8beb-469c-87ca-9b6b1cb0fc2a)
scaleway_server.worker[2]: Refreshing state... (ID: b0fc22bf-2e90-4ede-9d0c-055f8bf0c1e5)
scaleway_server.proxy1: Refreshing state... (ID: 826740ad-2614-420e-b5e0-500558cd5164)
scaleway_server.proxy0: Refreshing state... (ID: ea7aa7f8-eb1f-454b-a904-5f4d4d972afa)
scaleway_server.master[0]: Modifying... (ID: bc7e5357-410c-475a-afa7-542f47d78d23)
  state: "running" => "started"
scaleway_server.proxy0: Modifying... (ID: ea7aa7f8-eb1f-454b-a904-5f4d4d972afa)
  state: "running" => "started"
scaleway_server.worker[1]: Modifying... (ID: c677107d-8beb-469c-87ca-9b6b1cb0fc2a)
  state: "running" => "started"
scaleway_server.worker[0]: Modifying... (ID: afdccf54-728b-47a9-afb6-22334a395151)
  state: "running" => "started"
scaleway_server.proxy1: Modifying... (ID: 826740ad-2614-420e-b5e0-500558cd5164)
  state: "running" => "started"
scaleway_server.worker[2]: Modifying... (ID: b0fc22bf-2e90-4ede-9d0c-055f8bf0c1e5)
  state: "running" => "started"
scaleway_server.master[1]: Modifying... (ID: aa9a153d-bd01-4a54-a963-6c2884161c3f)
  state: "running" => "started"
scaleway_server.master[2]: Modifying... (ID: 720f2661-0541-4848-a497-caf2a058d264)
  state: "running" => "started"
scaleway_server.master[0]: Modifications complete after 1s (ID: bc7e5357-410c-475a-afa7-542f47d78d23)
scaleway_server.worker[2]: Modifications complete after 1s (ID: b0fc22bf-2e90-4ede-9d0c-055f8bf0c1e5)
scaleway_server.worker[0]: Modifications complete after 1s (ID: afdccf54-728b-47a9-afb6-22334a395151)
scaleway_server.proxy1: Modifications complete after 2s (ID: 826740ad-2614-420e-b5e0-500558cd5164)
scaleway_server.proxy0: Modifications complete after 2s (ID: ea7aa7f8-eb1f-454b-a904-5f4d4d972afa)
scaleway_server.worker[1]: Modifications complete after 3s (ID: c677107d-8beb-469c-87ca-9b6b1cb0fc2a)
scaleway_server.master[2]: Modifications complete after 3s (ID: 720f2661-0541-4848-a497-caf2a058d264)
scaleway_server.master[1]: Modifications complete after 3s (ID: aa9a153d-bd01-4a54-a963-6c2884161c3f)

Apply complete! Resources: 0 added, 8 changed, 0 destroyed.

Outputs:

public_ip = [
    51.15.132.63
]

When everything is finished, you can check that all the servers were successfully created and tagged as we wanted (you can see the tags and the different server types). If you look at the Terraform plan, you can also notice that the worker nodes have additional volumes. These volumes will be used to create our GlusterFS persistent storage.

If you want to destroy the infrastructure, just run the ‘destroy’ command:

# terraform destroy

During the development of the Ansible playbook, creating and deleting the infrastructure on demand was mandatory. I think I easily created and destroyed almost a hundred nodes. Terraform is the perfect tool for this.

Ansible: One tool to rule them all

No need to explain here why I’m using Ansible for all my provisioning tasks; everybody does, that’s it. Let’s look in detail at how to create the cluster once the machines are created by the previous Terraform step.

We now have all our master, worker and proxy nodes up. Each one is tagged (indicating whether the node is a proxy, a master or a worker).

As we only have one public ip, we must tell Ansible to use the node holding the public ip as an ssh bastion to access the other nodes, which only have private ips. For each node, we add a variable called “ansible_ssh_common_args” telling Ansible that this node is reached by first going through the one that has the public ip. A static inventory may look like this:

# cat k8s.inv
# where the ingress controller(s) runs
[proxy]
proxy0 ansible_host=51.15.214.245 ansible_user=root vpn_ip=192.168.66.1
proxy1 ansible_host=10.3.110.11 ansible_user=root vpn_ip=192.168.66.2 ansible_ssh_common_args='-o ProxyCommand="ssh -W %h:%p -q root@51.15.214.245"'
# where the workers node(s) runs
[workers]
worker0 ansible_host=10.2.95.73 ansible_user=root vpn_ip=192.168.66.6
worker1 ansible_host=10.3.31.7 ansible_user=root vpn_ip=192.168.66.7
worker3 ansible_host=10.4.131.199 ansible_user=root vpn_ip=192.168.66.8
# where the master node(s) runs
[masters]
master0 ansible_host=10.4.147.77 ansible_user=root vpn_ip=192.168.66.4
master1 ansible_host=10.2.202.15 ansible_user=root vpn_ip=192.168.66.5

As only one of the proxy nodes has a public ip, we must tell the other ones to go through it. For the masters and the workers, we can use a group_var here, because every one of these nodes needs to be accessed through the ssh bastion.

# cat group_vars/workers.yml 
ansible_ssh_common_args: '-o ProxyCommand="ssh -W %h:%p -q root@51.15.214.245"'
# cat group_vars/masters.yml 
ansible_ssh_common_args: '-o ProxyCommand="ssh -W %h:%p -q root@51.15.214.245"'

Note that the vpn_ip is the ip address that will be used by tinc on each node.

Dynamic Inventory

As I didn’t want to rewrite this inventory each time I create a new cluster, I wrote an Ansible dynamic inventory. You can find it on my GitHub account here. It’s forked from someone else’s work, which can be found here.

An Ansible dynamic inventory is pretty simple to create. You must write a program that is able to take two kinds of arguments: --list, which returns the whole inventory as JSON, and --host followed by a hostname, which returns the variables of that single host.
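As an illustration, this contract can be sketched with a tiny shell script returning hard-coded data. The real program queries the Scaleway API instead; the hosts and ips below are invented:

```shell
#!/bin/sh
# Minimal Ansible dynamic inventory sketch with hard-coded data.
# The real Go program builds the same JSON shapes from the Scaleway tags.
inventory() {
  case "$1" in
    --list)
      # the groups (normally derived from the Scaleway server tags)
      printf '{"masters":["master0"],"workers":["worker0"],"proxy":["proxy0"]}\n'
      ;;
    --host)
      # the variables of one host, e.g. the tinc ip used later by the playbook
      printf '{"ansible_host":"10.2.95.73","ansible_user":"root","vpn_ip":"192.168.66.6"}\n'
      ;;
  esac
}

# demo: dump the whole inventory
inventory --list
```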

As my program needs the Scaleway organization and the Scaleway key, I wrapped it in a shell script for security reasons:

# go install github.com/chmod666org/scaleway-dynamic-inventory
#!/bin/bash

export SCALEWAY_ORG_TOKEN='abcd'
export SCALEWAY_TOKEN='efgh'

if [[ "$1" == "--list" ]]; then
  /home/ben/go/bin/scaleway-dynamic-inventory --list
elif [[ "$1" == "--host" ]]; then
  /home/ben/go/bin/scaleway-dynamic-inventory --host $2
fi

To test it, run a --list and then a --host with the name of a host:

# ~/bin/scaleway-dynamic-inventory.sh  --list
{"masters":["master0"],"primary":["proxy0"],"proxy":["proxy0"],"workers":["worker1","worker0","worker2"]}
# ~/bin/scaleway-dynamic-inventory.sh --host worker2
{"ansible_host":"10.3.11.141","ansible_ssh_common_args":"-o ProxyCommand=\"ssh -W %h:%p -q root@51.15.214.245\"","ansible_user":"root","vpn_ip":"192.168.66.4"}

You can see here that the Ansible groups match the Scaleway tags, and that each host returns its connection variables (ansible_host, the ssh bastion ProxyCommand, and the vpn_ip used by tinc).

Let’s now use the dynamic inventory to launch an Ansible task to verify that everything is working:

# ansible -i /home/ben/bin/scaleway-dynamic-inventory.sh all -a "date"
proxy0 | SUCCESS | rc=0 >>
Sun Nov 19 17:36:55 UTC 2017

worker0 | SUCCESS | rc=0 >>
Sun Nov 19 17:36:55 UTC 2017

worker2 | SUCCESS | rc=0 >>
Sun Nov 19 17:36:55 UTC 2017

worker1 | SUCCESS | rc=0 >>
Sun Nov 19 17:36:55 UTC 2017

master0 | SUCCESS | rc=0 >>
Sun Nov 19 17:36:55 UTC 2017
# ansible -i /home/ben/bin/scaleway-dynamic-inventory.sh masters -a "kubectl get nodes"
master0 | SUCCESS | rc=0 >>
NAME      STATUS    ROLES     AGE       VERSION
master0   Ready     master    5d        v1.8.3
proxy0    Ready     <none>    5d        v1.8.3
worker0   Ready     <none>    5d        v1.8.3
worker1   Ready     <none>    5d        v1.8.3
worker2   Ready     <none>    5d        v1.8.3

The cluster installation

I will not detail here all the specifics of the playbook; I think Ansible playbooks are easily readable. For all the details, go into the roles and check what you want (this is a big piece of work). The playbook is made of different roles, with some variables that can be adjusted to fit your needs:

Some extra vars need to be added to run the playbook. This is sensitive data that can’t be stored as variables in the Ansible playbook: the Scaleway token and organization, and a basic auth user and password.

Then run the playbook like this:

#ANSIBLE_HOST_KEY_CHECKING=false ansible-playbook -e "scaleway_token=abcd scaleway_orga=efgh basic_auth_user=k8s basic_auth_password=k8siscool666" -i /home/ben/bin/scaleway-dynamic-inventory.sh k8s.yml
PLAY RECAP ****************************************************************************************
master0                    : ok=79   changed=60   unreachable=0    failed=0   
master1                    : ok=79   changed=60   unreachable=0    failed=0   
master2                    : ok=140  changed=109  unreachable=0    failed=0   
proxy0                     : ok=77   changed=54   unreachable=0    failed=0   
proxy1                     : ok=81   changed=58   unreachable=0    failed=0   
worker0                    : ok=62   changed=45   unreachable=0    failed=0   
worker1                    : ok=62   changed=45   unreachable=0    failed=0   
worker2                    : ok=62   changed=45   unreachable=0    failed=0   

A total of 642 tasks for an 8-node cluster, composed of two proxies, three masters and three workers. The execution takes around 30 minutes to get a fresh Kubernetes cluster with the nginx-ingress-controller, kube-lego, the dashboard, GlusterFS and Heketi. If you have time to waste, look at this huge gif. :-)

When everything is finished you should be able to see a stable cluster with a lot of things running on it:

root@master2:~# kubectl get nodes
NAME      STATUS    ROLES     AGE       VERSION
master0   Ready     master    59m       v1.8.3
master1   Ready     master    58m       v1.8.3
master2   Ready     master    1h        v1.8.3
proxy0    Ready     <none>    58m       v1.8.3
proxy1    Ready     <none>    58m       v1.8.3
worker0   Ready     <none>    58m       v1.8.3
worker1   Ready     <none>    58m       v1.8.3
worker2   Ready     <none>    58m       v1.8.3
root@master2:~# kubectl get pods --all-namespaces
NAMESPACE     NAME                                        READY     STATUS    RESTARTS   AGE
default       glusterfs-744r7                             1/1       Running   0          53m
default       glusterfs-q4wgz                             1/1       Running   0          53m
default       glusterfs-tvpdx                             1/1       Running   0          53m
default       heketi-598d8b795b-n52lz                     1/1       Running   0          40m
kube-system   default-http-backend-7f46f4cdd7-tq4v6       1/1       Running   0          57m
kube-system   heapster-54df6c4847-5pwr5                   2/2       Running   0          55m
kube-system   keepalived-proxy0                           1/1       Running   0          57m
kube-system   keepalived-proxy1                           1/1       Running   0          57m
kube-system   kube-apiserver-master0                      1/1       Running   0          59m
kube-system   kube-apiserver-master1                      1/1       Running   0          58m
kube-system   kube-apiserver-master2                      1/1       Running   0          1h
kube-system   kube-controller-manager-master0             1/1       Running   0          59m
kube-system   kube-controller-manager-master1             1/1       Running   0          58m
kube-system   kube-controller-manager-master2             1/1       Running   0          1h
kube-system   kube-dns-545bc4bfd4-szjlj                   3/3       Running   0          1h
kube-system   kube-flannel-ds-4fsd6                       1/1       Running   1          58m
kube-system   kube-flannel-ds-4x9hr                       1/1       Running   0          1h
kube-system   kube-flannel-ds-jxcvx                       1/1       Running   2          59m
kube-system   kube-flannel-ds-t64rq                       1/1       Running   1          58m
kube-system   kube-flannel-ds-whxj2                       1/1       Running   1          58m
kube-system   kube-flannel-ds-xrgnt                       1/1       Running   0          58m
kube-system   kube-flannel-ds-zbwbd                       1/1       Running   2          59m
kube-system   kube-flannel-ds-zcxwf                       1/1       Running   0          58m
kube-system   kube-lego-5777cf6897-gnqbl                  1/1       Running   0          56m
kube-system   kube-proxy-7bw96                            1/1       Running   0          58m
kube-system   kube-proxy-7sp9t                            1/1       Running   0          1h
kube-system   kube-proxy-cbkl8                            1/1       Running   0          59m
kube-system   kube-proxy-jv8jx                            1/1       Running   0          58m
kube-system   kube-proxy-l9mln                            1/1       Running   0          58m
kube-system   kube-proxy-q8ck8                            1/1       Running   0          58m
kube-system   kube-proxy-v8l5c                            1/1       Running   0          59m
kube-system   kube-proxy-ztv2t                            1/1       Running   0          58m
kube-system   kube-scheduler-master0                      1/1       Running   0          59m
kube-system   kube-scheduler-master1                      1/1       Running   0          58m
kube-system   kube-scheduler-master2                      1/1       Running   0          1h
kube-system   kubernetes-dashboard-69c5c78645-bbslh       1/1       Running   0          56m
kube-system   nginx-ingress-controller-7cc69f8f67-pnfzt   1/1       Running   0          57m
kube-system   nginx-ingress-controller-7cc69f8f67-qxjmp   1/1       Running   0          57m

At this point you should be able to use the cluster. The kubectl command can be run from every master node. The dashboard is also available; here are a few screenshots to prove that everything is working ok (note that Heapster was installed too, to generate nice graphs):

kubeadm and multiple masters

I wanted to detail here exactly what I do when creating a multi-master Kubernetes cluster with kubeadm.

First, we need to rely on an external etcd; this is done by the etcd Ansible role. I faced a nasty Ansible bug here that forced me to put the command on different lines:

    - name: Running etcd container on masters nodes
      docker_container:
        name: etcd
        image: "quay.io/coreos/etcd"
        state: started
        detach: True
        ports:
          - "0.0.0.0:2380:2380"
          - "0.0.0.0:2379:2379"
        command: [
          "etcd",
          "--name ",
          "--initial-advertise-peer-urls ",
          "--listen-peer-urls ",
          "--advertise-client-urls ",
          "--listen-client-urls ",
          "--initial-cluster ",
          "--initial-cluster-state ",
          "--initial-cluster-token "
        ]
        network_mode: host
        restart_policy: always
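The flag values were stripped from the snippet above by the blog rendering. For reference, the rendered etcd command line typically looks something like this (an illustrative sketch, with example node names, tinc ips, and cluster token):

```
etcd --name master0 \
  --initial-advertise-peer-urls http://192.168.66.4:2380 \
  --listen-peer-urls http://0.0.0.0:2380 \
  --advertise-client-urls http://192.168.66.4:2379 \
  --listen-client-urls http://0.0.0.0:2379 \
  --initial-cluster master0=http://192.168.66.4:2380,master1=http://192.168.66.5:2380,master2=http://192.168.66.3:2380 \
  --initial-cluster-state new \
  --initial-cluster-token etcd-k8s-cluster
```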

After this piece of the playbook is executed, you should see a three-node etcd cluster running. You can check this by running the following command on one of the masters (please note that etcd is using the tinc ip):

root@master2:~# docker ps | grep etcd
625c19b6eb08        quay.io/coreos/etcd                                      "etcd --name etcd_..."   22 hours ago        Up 22 minutes                           etcd
root@master2:~# docker exec -it 625c19b6eb08 etcdctl cluster-health
member 762273195df0d85d is healthy: got healthy result from http://192.168.66.2:2379
member 8ba61affa3b02c35 is healthy: got healthy result from http://192.168.66.8:2379
member 9310b87cee311b9f is healthy: got healthy result from http://192.168.66.3:2379
cluster is healthy

We next create a Keepalived component based on Docker on each master node to load balance the apiserver:

#ansible -i /home/ben/bin/scaleway-dynamic-inventory.sh masters -m shell -a "docker ps|grep keepalived_api"

master1 | SUCCESS | rc=0 >>
a7fad9351919        chmod666/keepalived:latest                               "/container/tool/run"    23 hours ago        Up About an hour                        keepalived_api
master2 | SUCCESS | rc=0 >>
d5fe6f4b92a5        chmod666/keepalived:latest                               "/container/tool/run"    23 hours ago        Up 44 minutes                           keepalived_api
master0 | SUCCESS | rc=0 >>
7b700c86fdd9        chmod666/keepalived:latest                               "/container/tool/run"    23 hours ago        Up About an hour                        keepalived_api

You can see in the output below that the ip address 192.168.66.253 is currently held by one master of the cluster. This ip will move to another master if that one fails. This address will be used as the Kubernetes apiserver ip in the configuration below:

#ansible -i /home/ben/bin/scaleway-dynamic-inventory.sh masters -m shell -a "ip a s tun0"
master1 | SUCCESS | rc=0 >>
6: tun0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN group default qlen 1000
    link/ether 26:1c:7b:6c:d8:c2 brd ff:ff:ff:ff:ff:ff
    inet 192.168.66.3/24 brd 192.168.66.255 scope global tun0
       valid_lft forever preferred_lft forever
    inet6 fe80::241c:7bff:fe6c:d8c2/64 scope link 
       valid_lft forever preferred_lft forever

master2 | SUCCESS | rc=0 >>
6: tun0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN group default qlen 1000
    link/ether 12:5c:62:bb:85:41 brd ff:ff:ff:ff:ff:ff
    inet 192.168.66.2/24 brd 192.168.66.255 scope global tun0
       valid_lft forever preferred_lft forever
    inet 192.168.66.253/24 scope global secondary tun0
       valid_lft forever preferred_lft forever
    inet6 fe80::105c:62ff:febb:8541/64 scope link 
       valid_lft forever preferred_lft forever

master0 | SUCCESS | rc=0 >>
6: tun0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN group default qlen 1000
    link/ether d2:52:5d:c2:33:59 brd ff:ff:ff:ff:ff:ff
    inet 192.168.66.8/24 brd 192.168.66.255 scope global tun0
       valid_lft forever preferred_lft forever
    inet6 fe80::d052:5dff:fec2:3359/64 scope link 
       valid_lft forever preferred_lft forever

Warning: tinc must be configured in “switch” mode to accept alias configuration. This is what allows Keepalived to put the floating ip on the tun interface. This piece of the Ansible template configures the load balancing between the masters:

virtual_server   {
    delay_loop 10
    protocol TCP
    lb_algo rr
    lb_kind DR
    persistence_timeout 7200

    
}
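The template values were stripped from the block above by the rendering. Filled in, the section would look something like this sketch: the virtual ip and the apiserver port match the outputs shown in this post, while the real_server entries (the masters’ tinc ips) and the TCP health check are assumptions about what the template renders:

```
virtual_server 192.168.66.253 6443 {
    delay_loop 10
    protocol TCP
    lb_algo rr
    lb_kind DR
    persistence_timeout 7200

    real_server 192.168.66.2 6443 {
        TCP_CHECK {
            connect_timeout 3
        }
    }
    real_server 192.168.66.3 6443 {
        TCP_CHECK {
            connect_timeout 3
        }
    }
    real_server 192.168.66.8 6443 {
        TCP_CHECK {
            connect_timeout 3
        }
    }
}
```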

Instead of running “kubeadm init” directly like everyone else does, we create a configuration file on each master node that allows us to use the external etcd, and to use the load-balanced, highly available ip as the apiserver advertise address:

# cat /tmp/kubeadm_config
apiVersion: kubeadm.k8s.io/v1alpha1
kind: MasterConfiguration
api:
  advertiseAddress: 192.168.66.253
etcd:
  endpoints:
  - "http://192.168.66.2:2379"
  - "http://192.168.66.3:2379"
  - "http://192.168.66.8:2379"
networking:
  podSubnet: "10.244.0.0/16"
kubernetesVersion: "v1.9.0-alpha.1"
apiServerCertSANs:
- "192.168.66.2"
- "192.168.66.3"
- "192.168.66.8"
- "192.168.66.254"
- "127.0.0.1"
token: "016883.b40246c071ee8c73"
tokenTTL: "0s"

Then we run the “kubeadm init” command on the first master:

master0# kubeadm init --config /tmp/kubeadm_config

After this node is bootstrapped and a network plugin (flannel, calico, weave, or anything you want) is running ok, we copy the content of /etc/kubernetes/pki to the other masters and run kubeadm init on them:

master1# kubeadm init --config /tmp/kubeadm_config
master2# kubeadm init --config /tmp/kubeadm_config
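The certificate copy can be sketched as follows. This is a local simulation with fake paths so it can run anywhere; on the real cluster, the archive is copied with scp and the unpack step runs over ssh on master1 and master2, with the files coming from master0’s real /etc/kubernetes/pki:

```shell
# Simulate copying the pki directory from the first master to another one.
SRC=/tmp/demo-master0/etc/kubernetes   # stands in for master0:/etc/kubernetes
DST=/tmp/demo-master1/etc/kubernetes   # stands in for master1:/etc/kubernetes
mkdir -p "$SRC/pki" "$DST"
echo "fake-ca" > "$SRC/pki/ca.crt"     # kubeadm stores ca.crt, ca.key, sa.key, ... here

# bundle the pki directory and unpack it on the next master
# (on a real cluster: scp the archive, then run the second tar over ssh)
tar czf /tmp/pki.tar.gz -C "$SRC" pki
tar xzf /tmp/pki.tar.gz -C "$DST"

ls "$DST/pki"   # the shared CA is now in place, so kubeadm reuses it
```

Sharing the same CA across the masters is what makes the certificates generated by each kubeadm init mutually trusted.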

When the initialization of these nodes is finished, don’t forget to point the kubectl configuration file to the load-balanced address (all of this is done by the Ansible playbook):

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUN5RENDQWJDZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFWTVJNd0VRWURWUVFERXdwcmRXSmwKY201bGRHVnpNQjRYRFRFM01URXhPVEl5TlRVd05Gb1hEVEkzTVRFeE56SXlOVFV3TkZvd0ZURVRNQkVHQT
FVRQpBeE1LYTNWaVpYSnVaWFJsY3pDQ0FTSXdEUVlKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBTVhDCmgrN0pkcEpBOVpRK2hZYzBXVnB6YTlHcjdwbUdsb2xOaWMvQ09wY0JncG43Ym1DRytYQWJkWlhBa1hJbDBFVGEKTTZSdDFENVRZQlhHOWpJR0NWa3FMOC9SSkYySWZ6VGREKzZZMkg2eUJON21y
TEMwQTJmVFNNM0tycHJURW95Qgo1S1ZiWUE0emRMdDlRRkJMWWFuWHlKM2NKS2xQWlVZS2xRZWRJeXJ0Wnc2aE0vcEVwMDZEdjJZUnNZYTlMcXZTCkNScnNJZGgrcER1eG1JK3Y2Y2E1c0lWMHkxcU
[..]
    server: https://192.168.66.253:6443
  name: kubernetes

Then initialize the other nodes (proxies and workers) by specifying the api ip address (the Keepalived one):

worker1# kubeadm join --token=016883.b40246c071ee8c73 192.168.66.253:6443

Once everything is ready, you should be able to browse the etcd cluster and check that it contains Kubernetes values:

# docker exec -e ETCDCTL_API=3 -it 625c19b6eb08 etcdctl --endpoints=http://localhost:2379 get /registry --prefix --keys-only | tail -10
/registry/services/specs/kube-system/kube-dns

/registry/services/specs/kube-system/kube-lego-nginx

/registry/services/specs/kube-system/kubernetes-dashboard

/registry/services/specs/kube-system/nginx-ingress

/registry/storageclasses/gluster

You’re done: you now have a multi-master Kubernetes cluster. \o/

nginx-ingress-controller and kube-lego

The controller

The nginx-ingress-controller is deployed on all the proxy nodes in hostNetwork mode (allowing the pod to listen directly on the host). To move the Scaleway public ip address from one node to another, a Keepalived instance is also running on the proxy hosts. When one of the proxies fails, a custom script is called to move the ip to the healthy node.

You can see that we have another Keepalived instance running on the proxy nodes:

# ANSIBLE_HOST_KEY_CHECKING=false ansible -i /home/ben/bin/scaleway-dynamic-inventory.sh proxy -m shell -a "docker ps | grep -i keepalive"
proxy0 | SUCCESS | rc=0 >>
00831451285b        chmod666/keepalived                                 "/container/tool/run"    2 hours ago         Up 2 hours                              k8s_keepalived_keepalived-proxy0_kube-system_543a6e63166b2abee151693ba3214c09_1
a81e24df58ba        gcr.io/google_containers/pause-amd64:3.0            "/pause"                 2 hours ago         Up 2 hours                              k8s_POD_keepalived-proxy0_kube-system_543a6e63166b2abee151693ba3214c09_1

proxy1 | SUCCESS | rc=0 >>
ee45eb26ef66        chmod666/keepalived                                 "/container/tool/run"    2 hours ago         Up 2 hours                              k8s_keepalived_keepalived-proxy1_kube-system_665dda7a97e17443554447f76d4d0370_1
2059832a4094        gcr.io/google_containers/pause-amd64:3.0            "/pause"                 2 hours ago         Up 2 hours                              k8s_POD_keepalived-proxy1_kube-system_665dda7a97e17443554447f76d4d0370_1

The configuration is unusual: it is done via environment variables. A volume is also mounted with a custom script that performs this ip move (the /mnt/notify.sh in the output below, mounted from /usr/local/bin/scaleway-ipmove on the host):

# docker inspect ee45eb26ef66
            {
                "Type": "bind",
                "Source": "/usr/local/bin/scaleway-ipmove",
                "Destination": "/mnt",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            },
[..]
            "Env": [
                "KEEPALIVED_INTERFACE=tun0",
                "KEEPALIVED_UNICAST_PEERS=#PYTHON2BASH:['192.168.66.1', '192.168.66.5']",
                "KEEPALIVED_VIRTUAL_IPS=#PYTHON2BASH:['192.168.66.254']",
                "KEEPALIVED_PRIORITY=2",
                "KEEPALIVED_NOTIFY=/mnt/notify.sh",
                "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "LANG=en_US.UTF-8",
                "LANGUAGE=en_US:en",
                "LC_ALL=en_US.UTF-8"
            ],
# cat /usr/local/bin/scaleway-ipmove/notify.sh
[..]
        "MASTER") echo "I'm the MASTER! Whup whup." > /proc/1/fd/1
                  echo "Here is the master"
                  # this put the public ip on the master using the scaleway api
                  /mnt/scaleway-ipmove.py token proxy0 proxy1 51.15.214.245 voight-kampff.org SCALEWAY_ORGA
                  exit 0

Obviously, these ingress controllers are created with hostNetwork: true to listen on ports 443 and 80 on the proxy nodes:

# ANSIBLE_HOST_KEY_CHECKING=false ansible -i /home/ben/bin/scaleway-dynamic-inventory.sh proxy -m shell -a "netstat -lnpt | grep nginx | grep -v tcp6"
proxy0 | SUCCESS | rc=0 >>
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      10194/nginx: master 
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      10194/nginx: master 
tcp        0      0 0.0.0.0:443             0.0.0.0:*               LISTEN      10194/nginx: master 
tcp        0      0 0.0.0.0:443             0.0.0.0:*               LISTEN      10194/nginx: master 
tcp        0      0 127.0.0.1:18080         0.0.0.0:*               LISTEN      10194/nginx: master 
tcp        0      0 127.0.0.1:18080         0.0.0.0:*               LISTEN      10194/nginx: master 

proxy1 | SUCCESS | rc=0 >>
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      10100/nginx: master 
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      10100/nginx: master 
tcp        0      0 0.0.0.0:443             0.0.0.0:*               LISTEN      10100/nginx: master 
tcp        0      0 0.0.0.0:443             0.0.0.0:*               LISTEN      10100/nginx: master 
tcp        0      0 127.0.0.1:18080         0.0.0.0:*               LISTEN      10100/nginx: master 
tcp        0      0 127.0.0.1:18080         0.0.0.0:*               LISTEN      10100/nginx: master 
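
For reference, here is a minimal sketch of the hostNetwork part of such an ingress-controller DaemonSet (the image tag, labels and nodeSelector are assumptions, not the exact manifest used in this cluster):

```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: nginx-ingress-controller
  namespace: kube-system
spec:
  template:
    metadata:
      labels:
        app: nginx-ingress
    spec:
      # Bind nginx directly on the node's ports 80/443 (no NodePort or LoadBalancer)
      hostNetwork: true
      nodeSelector:
        node-role: proxy        # assumed label restricting the pods to the proxy nodes
      containers:
      - name: nginx-ingress-controller
        image: gcr.io/google_containers/nginx-ingress-controller:0.9.0
        args:
        - /nginx-ingress-controller
        - --default-backend-service=$(POD_NAMESPACE)/default-http-backend
        env:
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        ports:
        - containerPort: 80
          hostPort: 80
        - containerPort: 443
          hostPort: 443
```

With hostNetwork: true the nginx processes bind on the node’s own interfaces, which is why they show up in the netstat output above.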

You can access the status page by creating an ssh tunnel:

# ssh -L 18080:localhost:18080 root@51.15.214.245 -N

About kube-lego

To verify that kube-lego and the ingress controller are working, we are going to create a simple echoserver deployment, a service and an ingress. The goal here is not to teach you how to do that; I assume you have some experience with Kubernetes. Here is the manifest doing the job:

---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: echoserver
  namespace: default
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: echoserver
    spec:
      containers:
      - image: gcr.io/google_containers/echoserver:1.0
        imagePullPolicy: Always
        name: echoserver
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: echoserver
  namespace: default
spec:
  ports:
  - port: 80
    targetPort: 8080
    protocol: TCP
  selector:
    app: echoserver
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: echoserver
  namespace: default
  annotations:
    kubernetes.io/tls-acme: "true"
    kubernetes.io/ingress.class: "nginx"
spec:
  tls:
  - hosts:
    - echo.voight-kampff.org
    secretName: echoserver-tls
  rules:
  - host: echo.voight-kampff.org
    http:
      paths:
      - path: /
        backend:
          serviceName: echoserver
          servicePort: 80

We then create it:

# kubectl apply -f echo.yml
deployment "echoserver" created
service "echoserver" created
ingress "echoserver" created
# kubectl scale deployment echoserver --replicas=10
deployment "echoserver" scaled

We can notice a few things. The ingress is created with two annotations (kubernetes.io/tls-acme: "true" and kubernetes.io/ingress.class: "nginx"). The first tells kube-lego to generate a certificate; the ingress.class annotation selects which controller will handle the ingress object if you run multiple ingress controllers. We can verify this by checking the kube-lego logs:

# kubectl logs -f kube-lego-5777cf6897-86s45 -n=kube-system
time="2017-11-20T22:36:23Z" level=info msg="Attempting to create new secret" context=secret name=echoserver-tls namespace=default 
time="2017-11-20T22:36:23Z" level=info msg="no cert associated with ingress" context="ingress_tls" name=echoserver namespace=default 
time="2017-11-20T22:36:23Z" level=info msg="requesting certificate for echo.voight-kampff.org" context="ingress_tls" name=echoserver namespace=default 
time="2017-11-20T22:36:27Z" level=info msg="authorization successful" context=acme domain=echo.voight-kampff.org 
time="2017-11-20T22:36:29Z" level=info msg="successfully got certificate: domains=[echo.voight-kampff.org] url=https://acme-v01.api.letsencrypt.org/acme/cert/043e0030bb157c6a445ddd1ce4a31412d7bd" context=acme 
time="2017-11-20T22:36:29Z" level=info msg="Attempting to create new secret" context=secret name=echoserver-tls namespace=default 
time="2017-11-20T22:36:29Z" level=info msg="Secret successfully stored" context=secret name=echoserver-tls namespace=default 
time="2017-11-20T22:36:29Z" level=info msg="cert expires in 89.9 days, no renewal needed" context="ingress_tls" expire_time=2018-02-18 19:52:10 +0000 UTC name=kubernetes-dashboard namespace=kube-system 
time="2017-11-20T22:36:29Z" level=info msg="no cert request needed" context="ingress_tls" name=kubernetes-dashboard namespace=kube-system

The certificate has been stored in a Kubernetes secret:

# kubectl describe secret echoserver-tls
Name:         echoserver-tls
Namespace:    default
Labels:       <none>
Annotations:  kubernetes.io/tls-acme=true

Type:  kubernetes.io/tls

Data
====
tls.crt:  3461 bytes
tls.key:  1675 bytes
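
If you want to look at the certificate itself, you can decode tls.crt from the secret and feed it to openssl. A sketch (the kubectl command is shown as a comment; the runnable part fabricates a throwaway self-signed certificate so the openssl steps work outside the cluster too):

```shell
# On the real cluster you would extract the PEM with:
#   kubectl get secret echoserver-tls -o jsonpath='{.data.tls\.crt}' | base64 -d
# Here we generate a stand-in certificate so the commands can run anywhere:
openssl req -x509 -newkey rsa:2048 -nodes -days 90 \
  -subj "/CN=echo.voight-kampff.org" \
  -keyout /tmp/tls.key -out /tmp/tls.crt 2>/dev/null

# Secrets store the PEM base64-encoded; decode it and print subject and expiry
base64 < /tmp/tls.crt | base64 -d | openssl x509 -noout -subject -enddate
```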

The page is now served over HTTPS with a valid Let’s Encrypt certificate \o/.

Glusterfs and Heketi for dynamic storage

To provide persistent storage across the cluster, a GlusterFS cluster is created on the worker nodes. Each worker node has an additional disk dedicated to it. When running the Heketi role, a Gluster DaemonSet is created on all the worker nodes. The pods of this DaemonSet run GlusterFS and create an LVM volume group (vg) on the specified device. GlusterFS volumes will then be created on top of this vg.

After the deployment is finished you will notice that a storage class called “gluster” has been created. This storage class uses the Heketi API URL (in this case the service ClusterIP, as both GlusterFS and Heketi are containerized).
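
The storage class itself is a small manifest; here is a sketch matching the last-applied-configuration shown below (the resturl is this cluster’s Heketi ClusterIP):

```yaml
apiVersion: storage.k8s.io/v1beta1
kind: StorageClass
metadata:
  name: gluster
provisioner: kubernetes.io/glusterfs
parameters:
  # Heketi REST endpoint; the provisioner calls it to create Gluster volumes
  resturl: "http://10.98.42.5:8080"
```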

# kubectl get storageclass
NAME      PROVISIONER
gluster   kubernetes.io/glusterfs
root@master2:~# kubectl describe storageclass gluster
Name:            gluster
IsDefaultClass:  No
Annotations:     kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"storage.k8s.io/v1beta1","kind":"StorageClass","metadata":{"annotations":{},"name":"gluster","namespace":""},"parameters":{"resturl":"http://10.98.42.5:8080"},"provisioner":"kubernetes.io/glusterfs"}

Provisioner:  kubernetes.io/glusterfs
Parameters:   resturl=http://10.98.42.5:8080
Events:       <none>

Gluster pods are running on each worker node, and the Heketi API is ready to work:

root@master2:~# kubectl get pods
NAME                      READY     STATUS    RESTARTS   AGE
glusterfs-8vsqb           1/1       Running   0          20m
glusterfs-9vdwk           1/1       Running   0          20m
glusterfs-gqnqz           1/1       Running   0          20m
heketi-598d8b795b-qc2wj   1/1       Running   0          11m

If the deployment was successful you’ll also notice that a volume has already been created on the Gluster cluster. It is used by the Heketi deployment to store the Heketi database.

# kubectl exec glusterfs-9vdwk -- gluster volume info all
Volume Name: heketidbstorage
Type: Replicate
Volume ID: 283cce11-d311-4d5f-98f7-12f4393d7e72
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.66.6:/var/lib/heketi/mounts/vg_aa2a614147fb84048bf1d448081ccf81/brick_5e958487b6711ec451ce76b6360a651b/brick
Brick2: 192.168.66.4:/var/lib/heketi/mounts/vg_34134cc5a5ef4f209459a0ca092d5f7f/brick_8d7ac0bd050b66fefb3219f4ddc8a1bc/brick
Brick3: 192.168.66.7:/var/lib/heketi/mounts/vg_ca3e80e6015c84e2d8a09465bbc2e820/brick_46da63a6bf18087b917f18f87056b6a9/brick
Options Reconfigured:
transport.address-family: inet
nfs.disable: on

You can now create an application using a persistent volume (my advice is to use a StatefulSet, even if that’s not what is done below):

---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nginx-pv
spec:
  storageClassName: gluster
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 3Gi
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nginx
  namespace: default
spec:
  replicas: 5
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        imagePullPolicy: Always
        name: nginx
        ports:
        - containerPort: 80
        volumeMounts:
        - mountPath: "/usr/share/nginx/html"
          name: datadir
      volumes:
      - name: datadir
        persistentVolumeClaim:
          claimName: nginx-pv
---
apiVersion: v1
kind: Service
metadata:
  name: nginxsvc
  namespace: default
spec:
  ports:
  - port: 80
    protocol: TCP
  selector:
    app: nginx
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: nginx
  namespace: default
  annotations:
    kubernetes.io/tls-acme: "true"
    kubernetes.io/ingress.class: "nginx"
spec:
  tls:
  - hosts:
    - nginx.voight-kampff.org
    secretName: nginx-tls
  rules:
  - host: nginx.voight-kampff.org
    http:
      paths:
      - path: /
        backend:
          serviceName: nginxsvc
          servicePort: 80

We create it:

# kubectl create -f nginx-pv.yml 
persistentvolumeclaim "nginx-pv" created
deployment "nginx" created
service "nginx" created
ingress "nginx" created

After this creation, a PV and a PVC were automatically created, backed by a new volume on the Gluster cluster. \o/

# kubectl get pvc
NAME       STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
nginx-pv   Bound     pvc-a5779800-ce4d-11e7-9e37-de194813b005   4Gi        RWX            gluster        14s
# kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS    CLAIM              STORAGECLASS   REASON    AGE
pvc-a5779800-ce4d-11e7-9e37-de194813b005   4Gi        RWX            Delete           Bound     default/nginx-pv   gluster                  1m

This volume can be seen by using the gluster command:

# kubectl exec glusterfs-9vdwk -- gluster volume info
 
[..]
 
Volume Name: vol_7944fb5803e23b247d6e6932fe748998
Type: Replicate
Volume ID: a6bf86e0-45ae-4490-9a3d-3d2d109e27f6
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.66.6:/var/lib/heketi/mounts/vg_aa2a614147fb84048bf1d448081ccf81/brick_534cdbeeccc5dd23e1680255ccc47d40/brick
Brick2: 192.168.66.7:/var/lib/heketi/mounts/vg_ca3e80e6015c84e2d8a09465bbc2e820/brick_21db80ebabc0e1ffaaedc935cb83bdaa/brick
Brick3: 192.168.66.4:/var/lib/heketi/mounts/vg_34134cc5a5ef4f209459a0ca092d5f7f/brick_9e8e986857448eac4a65b0bd71dd27ca/brick
Options Reconfigured:
transport.address-family: inet
nfs.disable: on

On each nginx pod the volume is correctly mounted and we can write to it:

# for i in nginx-b875-47vlp nginx-b875-9k25w nginx-b875-fmz68 nginx-b875-jv2fk nginx-b875-tbqpj ; do kubectl exec -it $i mount | grep gluster ; done
192.168.66.4:vol_4d2c17e9ae9a2ff73163c6544bf45e7a on /usr/share/nginx/html type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
192.168.66.4:vol_4d2c17e9ae9a2ff73163c6544bf45e7a on /usr/share/nginx/html type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
192.168.66.4:vol_4d2c17e9ae9a2ff73163c6544bf45e7a on /usr/share/nginx/html type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
192.168.66.4:vol_4d2c17e9ae9a2ff73163c6544bf45e7a on /usr/share/nginx/html type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
192.168.66.4:vol_4d2c17e9ae9a2ff73163c6544bf45e7a on /usr/share/nginx/html type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
# kubectl exec -it nginx-b875-47vlp /bin/bash
root@nginx-b875-47vlp:/# echo "K8S" > /usr/share/nginx/html/index.html
root@nginx-b875-47vlp:/# exit
exit
# kubectl exec -it nginx-b875-tbqpj /bin/bash
root@nginx-b875-tbqpj:/# echo " is cool" >> /usr/share/nginx/html/index.html
root@nginx-b875-tbqpj:/# exit
exit
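
As an aside, when several pods write to the same file on a shared volume, the usual shell redirection rules apply: > truncates the file, while >> appends. A quick local illustration (plain files, no cluster needed):

```shell
f=/tmp/index.html
echo "K8S" > "$f"          # first writer: truncate and write
echo " is cool" >> "$f"    # second writer: append, the first line survives
cat "$f"                   # prints "K8S" then " is cool"
```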

You can verify in a browser that the page is served correctly.

Conclusion

The more I use Kubernetes the better I get at it, and I keep discovering things that, I realize, few people understand. If you want all the details of the implementation, just read the Ansible playbook: you will find all the tips and tricks behind what is explained above. I have detailed the tricky parts, but this playbook is so large that I can’t cover everything, and to be honest, copying and pasting a playbook into a blog post is not that interesting; go to the GitHub repository and check it out for yourself. The goal of this blog post was to prove that it is possible to create a highly available Kubernetes cluster from scratch, without any external components, while understanding every piece of it. I have been running a couple of services on it day and night, and I have to admit it is stable and powerful. I will later add Prometheus, Elasticsearch and Kibana for the monitoring and logging parts. Keep an eye on the GitHub repositories; pull requests are welcome. One quick call before ending: if anybody has tips that would let me access GKE or AWS for less than my monthly paycheck, I would really appreciate it. I’d really like to work on, and blog about, GKE.

From even the greatest of horrors irony is seldom absent

I hope this long-awaited blog post was worth the wait. I am living through some harsh times that are not going to end soon: health problems that never stop, career problems, questions about myself and others. Stuff you don’t care about (and that, once more, will result in some kind remarks from my fellow colleagues saying that I talk too much on this blog… but not everything is about technical stuff; I believe empathy still exists, and these few lines will not hurt you -haters gonna hate-). I should stop being cynical and produce more blog posts instead of whining on Twitter. But one thing I am sure about is that I don’t want to stop this blog; it’s the only thing that keeps my motivation and never-ending passion alive. Ph’nglui mglw’nafh Cthulhu R’lyeh wgah’nagl fhtagn.