Welcome to openshift-cn’s documentation!

OpenShift Cluster Installation

Installation Planning

Before installing your OpenShift cluster, plan where each core component will be deployed based on the cluster size and available resources. With three or more machines, a highly available setup becomes an option. Master, Node, etcd, Router, ES, Prometheus, Grafana, and the other components all support multi-instance deployment. If resources allow, the etcd cluster is best deployed on dedicated hosts; otherwise it can be co-located with the masters. Starting with OKD 3.10, components are installed as RPMs on RHEL/CentOS and as container images on RHEL Atomic Host.

System Requirements

All Hosts

  • Hosts can reach one another and have Internet access. Hosts that will run the Router also need a wildcard DNS entry.
  • SELinux must be enabled.
  • DNS and NetworkManager must be enabled.
  • iptables is enabled by default; ports 53, 4789, 8443, 10250, 2379, 2380, etc. must be open (see the official documentation for the full list, and the sketch after this list for opening them manually).
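
If a port has to be opened by hand for verification (the prerequisites playbook and the installer normally manage these rules themselves), a minimal sketch in the iptables style used elsewhere in this document:

    # iptables -I INPUT -p tcp -m state --state NEW -m multiport --dports 8443,10250,2379,2380 -j ACCEPT
    # iptables -I INPUT -p udp -m udp --dport 4789 -j ACCEPT    # VXLAN (SDN)
    # iptables -I INPUT -p udp -m udp --dport 53 -j ACCEPT      # DNS
    # service iptables save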

Master

  • Operating system: Fedora 21, CentOS 7.4, Red Hat Enterprise Linux (RHEL) 7.4 or later
  • At least 4 vCPUs and 16 GB of RAM
  • 40 GB of disk space under /var

Node

  • Operating system: Fedora 21, CentOS 7.4, Red Hat Enterprise Linux (RHEL) 7.4 or later
  • At least 1 vCPU and 8 GB of RAM
  • 15 GB of disk space under /var

External etcd

  • 20 GB of disk space under /var/lib/etcd

Preparing the Host Environment

1. After provisioning cloud hosts on Alibaba Cloud, configure the hostnames, generate SSH keys, and set up passwordless SSH between the hosts from the master

    # cat /etc/hosts
    172.26.7.167	node01-inner
    172.26.100.176	node02-inner
    172.26.7.168	node03-inner

    # ssh-keygen  (all hosts)
    # for host in node01-inner node02-inner node03-inner; do ssh-copy-id -i ~/.ssh/id_rsa.pub $host; done

2. Install the base RPMs (all hosts)

    # yum install wget git net-tools bind-utils yum-utils iptables-services bridge-utils bash-completion kexec-tools sos psacct java-1.8.0-openjdk-headless python-passlib
    # yum update
    # reboot

3. Install Ansible (master)

    # yum -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
    # sed -i -e "s/^enabled=1/enabled=0/" /etc/yum.repos.d/epel.repo
    # yum -y --enablerepo=epel install ansible pyOpenSSL
    
    # cd ~
    # wget https://github.com/openshift/openshift-ansible/archive/openshift-ansible-3.11.100-1.tar.gz
    # tar xzvf openshift-ansible-3.11.100-1.tar.gz
    # cd openshift-ansible-openshift-ansible-3.11.100-1/

4. Install Docker; the default configuration is sufficient. If custom options are needed, define them in the Ansible hosts file (see the inventory snippet after the commands below).

    # yum install docker-1.13.1
    # rpm -V docker-1.13.1
    # docker version
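
To customize Docker through the inventory instead of editing each host, standard openshift-ansible variables can be set; the variable names below are real inventory options, while the values are only illustrative:

    # vi /etc/ansible/hosts
    openshift_docker_options="--log-driver=json-file --log-opt max-size=50m --log-opt max-file=3"
    openshift_docker_insecure_registries=172.30.0.0/16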

5. In the cloud console, open ports 2379 (etcd) and 8443 (web console) in the security group; see the official documentation

_images/port_2379.png (Open port 2379)
_images/port_8443.png (Open port 8443)


Running the Installation

1. Prepare the Ansible inventory file according to the official documentation and save it as /etc/ansible/hosts on the master (a minimal sketch follows).
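
A minimal inventory sketch for the three hosts listed above. The master/infra/compute split, cluster hostname, and subdomain are assumptions based on values used later in this document; the real deployment adds many more options (Docker, logging, monitoring), shown in later sections:

    [OSEv3:children]
    masters
    nodes
    etcd

    [OSEv3:vars]
    ansible_ssh_user=root
    openshift_deployment_type=origin
    openshift_release=v3.11
    openshift_master_cluster_hostname=portal.openshift.net.cn
    openshift_master_default_subdomain=apps.openshift.net.cn
    openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider'}]

    [masters]
    node01-inner

    [etcd]
    node01-inner
    node02-inner
    node03-inner

    [nodes]
    node01-inner openshift_node_group_name='node-config-master-infra'
    node02-inner openshift_node_group_name='node-config-compute'
    node03-inner openshift_node_group_name='node-config-compute'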

2. Run the prerequisite checks

    # ansible-playbook playbooks/prerequisites.yml |tee ../prerequisites.log

3. Run the installation. If it fails partway through, fix the problem and re-run the playbook.

    # ansible-playbook -vvv playbooks/deploy_cluster.yml |tee ../deploy_cluster.log

4. Post-install project initialization

  • Add a super administrator. Do not hand this account out; replace the admin password as needed. (Every newly added user must log in through oc once so the user record is synced to etcd.)
    # htpasswd -b /etc/origin/master/htpasswd admin {admin password}
    # oc login -u system:admin https://<admin portal>:8443
    # oc adm policy add-cluster-role-to-user cluster-admin admin
  • Prevent regular users from creating their own projects
    # oc patch clusterrolebinding.rbac self-provisioners -p '{"subjects": null}'
  • Schedule hawkular heapster and cassandra onto the infra nodes
    # oc project openshift-infra
    # oc patch rc heapster -p '{"spec": {"template": {"spec": {"nodeSelector": {"node-role.kubernetes.io/infra": "true"}}}}}'
    # oc patch rc hawkular-cassandra-1 -p '{"spec": {"template": {"spec": {"nodeSelector": {"node-role.kubernetes.io/infra": "true"}}}}}'
  • Add a global read-only user, replacing the password as needed:
    # htpasswd -b /etc/origin/master/htpasswd readonly {password}
  • Add a project administrator user, replacing the password as needed:
    # htpasswd -b /etc/origin/master/htpasswd hyperion {password}
  • Create the project, grant user permissions, and make its network globally accessible
    # oc adm new-project hyperion --admin='hyperion' --description='Hyperion微服务中间层' --display-name='微服务中间层'
    # oc adm policy add-role-to-user view readonly -n hyperion
    # oc adm pod-network make-projects-global hyperion
  • Open the web console to verify the installation: https://portal.openshift.net.cn:8443/

5. Post-install cluster tuning

  • Edit node-config-compute to add kube-reserved and system-reserved resource reservations
    # oc project openshift-node
    # oc edit cm node-config-compute
    kubeletArguments:
      kube-reserved:
      - "cpu=200m,memory=512Mi"
      system-reserved:
      - "cpu=200m,memory=512Mi"
  • Change the master-api pod's resource allocation to the Burstable QoS class so the pod keeps working when resources are tight
    # vi /etc/origin/node/pods/apiserver.yaml
    resources:
      requests:
        cpu: 300m
        memory: 500Mi
    # master-restart api api
  • To make sure iptables rules are not lost after a restart, change the following setting
    # sed -i 's/IPTABLES_SAVE_ON_STOP="no"/IPTABLES_SAVE_ON_STOP="yes"/g' /etc/sysconfig/iptables-config

Cluster Uninstallation

  • From the same openshift-ansible directory used for the installation, run
    # ansible-playbook playbooks/adhoc/uninstall.yml

OpenShift Cluster Configuration Management

Mainly used for DevOps deployment best practices. It contains:

  • openshift: operations scripts for the cloud platform
  • system: single-host Docker deployment configurations for the middleware that test environments depend on, organized into one directory per environment
  • tools: ops-related utilities

Deploying TLS SNI Route Certificates for Applications

Server Name Indication (SNI) is an extension of TLS by which the client indicates, at the start of the handshake, the hostname of the server it is trying to reach. This lets a server present multiple certificates on the same IP address and TCP port, and therefore serve multiple secure (HTTPS) websites (or any other TLS-based service) from the same IP address without requiring them all to share one certificate.

1. Obtain TLS certificates. Alibaba Cloud offers free ones, valid for one year only, with a quota of 30 per domain.

2. Perform the following steps, following the official documentation

  # oc project openshift-console
  # oc export route console -o yaml > console.backup.yml
  # oc delete route console
  # oc create route reencrypt console-custom -n openshift-console \
  --hostname console.apps.openshift.net.cn --key console.apps.openshift.net.cn.key \
  --cert console.apps.openshift.net.cn.crt --ca-cert console.apps.openshift.net.cn.ca \
  --service console
  
  # oc project openshift-logging
  # oc export route logging-kibana -o yaml > route-logging-kibana.yml
  # oc delete route logging-kibana
  # oc create route reencrypt logging-kibana -n openshift-logging \
  --hostname kibana.apps.openshift.net.cn --key kibana.apps.openshift.net.cn.key \
  --cert kibana.apps.openshift.net.cn.crt --ca-cert kibana.apps.openshift.net.cn.ca \
  --service logging-kibana

Configuring a Host Directory for docker-registry

The OpenShift docker registry is installed with an empty volume by default, so image data does not survive container restarts. By mounting a host directory, images are kept on the host filesystem and remain after a restart.

  • On the host where the registry container runs, create a directory to hold the images
    # mkdir -p /diskb/registry
    # chmod 777 -R /diskb/registry
  • Scale the docker registry down
    # oc project default
    # oc scale dc docker-registry --replicas=0
  • Elevate the container's permissions so it can access the host directory
    # oc patch dc/docker-registry -p '{"spec":{"template":{"spec":{"containers":[{"name":"registry","securityContext":{"privileged": false}}]}}}}'
    # oc adm policy add-scc-to-user hostmount-anyuid -z registry
  • Set the host directory as the registry storage
    # oc set volume dc/docker-registry --add --overwrite --name=registry-storage --type=hostPath --path=/diskb/registry
  • Bring the registry back up
    # oc scale dc docker-registry --replicas=1

Logging Module

The OpenShift logging module integrates Elasticsearch, Fluentd, and Kibana (EFK). It supports multi-instance high availability and several external storage backends such as hostPath and Ceph. The steps below deploy a cluster with three ES instances and collect logs from pods, containers, systemd, and Java applications.

Installation Steps

See the official documentation.

  • Download the openshift-ansible installer and adjust the Fluentd template as shown (the detect_exceptions match block groups multi-line exception stack traces into single events; the numbers indicate the line positions in fluent.conf.j2)
    # wget https://github.com/openshift/openshift-ansible/archive/openshift-ansible-3.11.100-1.tar.gz
    # tar xzvf openshift-ansible-3.11.100-1.tar.gz
    # cd openshift-ansible-openshift-ansible-3.11.100-1/

    # vi roles/openshift_logging_fluentd/templates/fluent.conf.j2
    39 <label @INGRESS>
    40 {% if deploy_type in ['hosted', 'secure-host'] %}
    41 <match time.**>
    42 @type detect_exceptions
    43 @label @INGRESS
    44 remove_tag_prefix time
    45 message log
    46 languages time
    47 multiline_flush_interval 0.1
    48 </match>
    49 ## filters
  • Label the nodes that will host the logging components
    # oc label node node01-inner region/logging=true
    # oc label node node02-inner region/logging=true
    # oc label node node03-inner region/logging=true
    # oc label node node01-inner "region/logging-node"="1"
    # oc label node node02-inner "region/logging-node"="2"
    # oc label node node03-inner "region/logging-node"="3"
  • Edit the Ansible hosts file and add the logging-related parameters
    # vi /etc/ansible/hosts.3.11

    # Install logging (not installed by default)
    # Increase the ES cluster size and set node selectors for curator/es/kibana
    openshift_logging_install_logging=true
    openshift_logging_curator_nodeselector={'region/logging':'true'}
    openshift_logging_es_nodeselector={'region/logging':'true'}
    openshift_logging_kibana_nodeselector={'region/logging':'true'}
    openshift_logging_curator_run_timezone=Asia/Shanghai
    openshift_logging_es_memory_limit=4Gi
    openshift_logging_es_cluster_size=3
    openshift_logging_es_number_of_replicas=1
    openshift_logging_es_number_of_shards=3
    openshift_logging_kibana_replicas=1
    
    
    # fluentd uses a custom image repackaged by xpmotors by default; also pin the es and kibana images
    openshift_logging_elasticsearch_image=openshift/origin-logging-elasticsearch:v3.11.0
    openshift_logging_kibana_image=xpmotors/origin-logging-kibana:v3.11.2
    openshift_logging_kibana_proxy_image=openshift/oauth-proxy:v1.0.0
    openshift_logging_fluentd_image=xpmotors/origin-logging-fluentd:v3.9.2
  • Create the log storage directory on every ES host (logs are stored via hostPath)
    # ansible logging -m shell -a "mkdir -p /diskb/hyperion/es"
    # ansible logging -m shell -a "chmod -R 777 /diskb/hyperion/es"
  • Tune the kernel parameter required by Elasticsearch
    # ansible logging -m shell -a "sysctl -w vm.max_map_count=262144"
    # ansible logging -m shell -a "echo 'vm.max_map_count = 262144' >> /etc/sysctl.conf"
  • Run the installation
    # ansible-playbook -i /etc/ansible/hosts.3.11 playbooks/openshift-logging/config.yml
  • Grant the ES service account the privileged SCC so it can mount local volumes
   # oc adm policy add-scc-to-user privileged system:serviceaccount:openshift-logging:aggregated-logging-elasticsearch
  • Make the logging project's network globally accessible
   # oc adm pod-network make-projects-global openshift-logging
  • Grant specific users access to the logging project
    # oc adm policy add-role-to-user edit hyperion -n openshift-logging
  • Patch every ES data/master deployment config: allow privileged mode and pin each instance to its node via nodeSelector
    for dc in $(oc get deploymentconfig --selector component=es -o name); do
        oc scale $dc --replicas=0
        oc patch $dc \
           -p '{"spec":{"template":{"spec":{"containers":[{"name":"elasticsearch","securityContext":{"privileged": true}}]}}}}'
    done
      
    deploy=$(oc get deploymentconfig --selector component=es -o name)
    deploy1=$(echo $deploy | cut -d " " -f 1)
    deploy2=$(echo $deploy | cut -d " " -f 2)
    deploy3=$(echo $deploy | cut -d " " -f 3)
     
    oc patch $deploy1 -p '{"spec":{"template":{"spec":{"nodeSelector":{"region/logging": "true","region/logging-node":"1"}}}}}'
    oc patch $deploy2 -p '{"spec":{"template":{"spec":{"nodeSelector":{"region/logging": "true","region/logging-node":"2"}}}}}'
    oc patch $deploy3 -p '{"spec":{"template":{"spec":{"nodeSelector":{"region/logging": "true","region/logging-node":"3"}}}}}'
  • Apply the local mount to every replica (assuming the storage is mounted at the same directory on every node)
    for dc in $(oc get deploymentconfig --selector component=es -o name); do
        oc set volume $dc --add --overwrite --name=elasticsearch-storage --type=hostPath --path=/diskb/hyperion/es
        oc rollout latest $dc
        oc scale $dc --replicas=1
    done
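
To verify that the three Elasticsearch instances actually formed a single cluster, a quick check can be run inside one of the ES pods; the certificate paths below assume the default secret mount used by the origin-aggregated-logging images:

    # oc project openshift-logging
    # espod=$(oc get pods -l component=es -o jsonpath='{.items[0].metadata.name}')
    # oc exec -c elasticsearch $espod -- curl -s \
        --cert /etc/elasticsearch/secret/admin-cert \
        --key /etc/elasticsearch/secret/admin-key \
        --cacert /etc/elasticsearch/secret/admin-ca \
        "https://localhost:9200/_cluster/health?pretty"

A healthy three-instance deployment should report number_of_nodes as 3.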

Monitoring Module

OpenShift v3.11 cluster monitoring manages Prometheus, Grafana, and Alertmanager centrally through an operator. However, with the cluster monitoring operator acting as the top-level owner, only a limited set of API object changes is allowed, which puts severe limits on further customization. For example, adding a volumeClaim to Grafana by editing the deployment spec directly does not work: the operator detects the change and reverts it, and since the operator does not expose that setting as a parameter, there is simply no supported way to change it.

Prometheus can only consume storage through a storageClass interface, so some rework is needed here to support plain NFS conveniently. Grafana, for now, can only use an emptyDir for storage, so its plugin updates cannot be persisted.

_images/cluster-monitoring.png (OpenShift cluster monitoring architecture)

Installation Steps

  • Configure the NFS server (node02-inner)
    # vi /etc/exports
    /diskb/export/prometheus-001 172.26.7.0/8(rw,sync,all_squash)
    /diskb/export/prometheus-002 172.26.7.0/8(rw,sync,all_squash)
    /diskb/export/alertmanager-001 172.26.7.0/8(rw,sync,all_squash)
    /diskb/export/alertmanager-002 172.26.7.0/8(rw,sync,all_squash)
    /diskb/export/alertmanager-003 172.26.7.0/8(rw,sync,all_squash)
    /diskb/export/grafana-001 172.26.7.0/8(rw,sync,all_squash)
    
    # systemctl restart nfs
    # iptables -A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 2049 -j ACCEPT
  • Label the monitoring nodes
    # oc label node node01-inner region/monitor=true
    # oc label node node02-inner region/monitor=true
  • Edit the Ansible hosts file and add the monitoring options. The storage_class_name defined here does not actually exist; it is set this way to make later modification easier.
    # vi /etc/ansible/hosts

    # Install the Prometheus operator
    #
    # Cluster monitoring is installed by default; set this to false to disable it
    openshift_cluster_monitoring_operator_install=true
    #
    # Cluster monitoring configuration variables allow setting the amount of
    # storage and storageclass requested through PersistentVolumeClaims.
    #
    openshift_cluster_monitoring_operator_prometheus_storage_enabled=true
    openshift_cluster_monitoring_operator_alertmanager_storage_enabled=true
    
    openshift_cluster_monitoring_operator_prometheus_storage_capacity="2Gi"
    openshift_cluster_monitoring_operator_alertmanager_storage_capacity="1Gi"
    
    openshift_cluster_monitoring_operator_node_selector={'region/monitor':'true'}
    
    # external NFS support refer to Using Storage Classes for Existing Legacy Storage
    openshift_cluster_monitoring_operator_prometheus_storage_class_name="nfs"
    openshift_cluster_monitoring_operator_alertmanager_storage_class_name="nfs"
  • Modify the following operator config template in the playbook; this is the hack that lets Prometheus use NFS
    # vi roles/openshift_cluster_monitoring_operator/templates/cluster-monitoring-operator-config.j2
    
    Line 28
    {% if openshift_cluster_monitoring_operator_prometheus_storage_enabled | bool %}
      volumeClaimTemplate:
        spec:
          selector:
            matchLabels:
              volume/type: pv-prometheus
          resources:
            requests:
              storage: {{ openshift_cluster_monitoring_operator_prometheus_storage_capacity }}
    {% endif %}

    Line 46
    {% if openshift_cluster_monitoring_operator_alertmanager_storage_enabled | bool %}
          volumeClaimTemplate:
            spec:
              selector:
                matchLabels:
                  volume/type: pv-alertmanager
              resources:
                requests:
                  storage: {{ openshift_cluster_monitoring_operator_alertmanager_storage_capacity }}
    {% endif %}
  • Create the alertmanager, prometheus, and grafana PVs/PVCs (a sketch of one PV file appears after this list)
    # oc create -f prometheus-pv-nfs-001.yml
    # oc create -f prometheus-pv-nfs-002.yml
    # oc create -f grafana-pv-pvc-nfs.yml
  • Run the installation
    # ansible-playbook playbooks/openshift-monitoring/config.yml
  • Open the monitoring entry page (Grafana): https://grafana-openshift-monitoring.apps.openshift.net.cn
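
The PV definition files referenced above are not reproduced in this document. A sketch of what prometheus-pv-nfs-001.yml might look like, assuming the NFS export created earlier on node02-inner and matching the volume/type: pv-prometheus selector injected by the template hack (the PV name is arbitrary):

    # cat <<-EOF > prometheus-pv-nfs-001.yml
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: prometheus-pv-nfs-001
      labels:
        volume/type: pv-prometheus
    spec:
      capacity:
        storage: 2Gi
      accessModes:
      - ReadWriteOnce
      persistentVolumeReclaimPolicy: Retain
      storageClassName: nfs
      nfs:
        server: node02-inner
        path: /diskb/export/prometheus-001
    EOF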

Configuring the etcd Monitoring Target

Wherever etcd is installed, the configuration work amounts to adding a scrape target to the Prometheus configuration and mounting the etcd client certificates into Prometheus so that it can read them.

Under the operator framework, however, only a small set of options is exposed, and etcd monitoring is one of them. What would otherwise be a simple scrape-target change now means editing the cluster monitoring config; the cluster operator then adds an etcd ServiceMonitor object and syncs it into secret/prometheus-k8s, and the prometheus-config-reloader inside the Prometheus pod detects the file change and generates the real configuration for Prometheus.

This configuration sync chain is long, which makes customization considerably harder and the configuration less flexible.

Reference documentation

  • Edit the cluster-monitoring-config ConfigMap and add the IP addresses of the etcd monitoring targets (a YAML sketch appears at the end of this section)

_images/promethues-monitor-etcd-config01.png (Configure the etcd IPs)

  • Create etcd-cert-secret.yaml, containing the etcd client certificates
    # cat <<-EOF > etcd-cert-secret.yaml
    apiVersion: v1
    data:
      etcd-client-ca.crt: "$(cat /etc/origin/master/master.etcd-ca.crt | base64 --wrap=0)"
      etcd-client.crt: "$(cat /etc/origin/master/master.etcd-client.crt | base64 --wrap=0)"
      etcd-client.key: "$(cat /etc/origin/master/master.etcd-client.key | base64 --wrap=0)"
    kind: Secret
    metadata:
      name: kube-etcd-client-certs
      namespace: openshift-monitoring
    type: Opaque
    EOF
  • Create the new secret object
    # oc apply -f etcd-cert-secret.yaml
  • In theory, Prometheus is restarted for you after a new scrape target is added so the configuration takes effect; if it does not, restart it manually.
    # oc scale statefulset prometheus-k8s --replicas=0
    # oc scale statefulset prometheus-k8s --replicas=2

_images/promethues-monitor-etcd-config02.png (etcd monitoring working)
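
The exact ConfigMap change is only shown as a screenshot above. The sketch below assumes the 3.11 cluster-monitoring-operator layout (etcd.targets.ips) described in the referenced documentation; the IP addresses are simply the three hosts from the /etc/hosts example earlier, so substitute the real etcd member addresses:

    # oc -n openshift-monitoring edit configmap cluster-monitoring-config
    data:
      config.yaml: |+
        etcd:
          targets:
            ips:
            - "172.26.7.167"
            - "172.26.100.176"
            - "172.26.7.168"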


Configuring Router Monitoring

The router's monitoring port is 1936, and requests are authenticated with Basic Auth, so the following information has to be configured in the ServiceMonitor.

  • Get the router's Basic Auth username and password
    # oc export dc router -n default |grep -A 1 STATS
        - name: STATS_PASSWORD
          value: wDMpjeGV1P
        - name: STATS_PORT
          value: "1936"
        - name: STATS_USERNAME
          value: admin
  • Base64-encode the username and password (use -n so that no trailing newline ends up in the secret)
    # echo -n 'admin' |base64
    YWRtaW4=
    # echo -n 'wDMpjeGV1P' |base64
    d0RNcGplR1YxUA==
  • Create the router Basic Auth secret and the ServiceMonitor (a sketch of the secret follows the figure below)
    # oc project openshift-monitoring
    # oc create -f router-basic-auth-secret.yml
    # oc create -f router-monitor.yml

_images/router-scrape-config.png (Router monitoring working)
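
Neither router-basic-auth-secret.yml nor router-monitor.yml is reproduced in this document. A sketch of what the secret might contain, using the base64 values obtained above; the secret name and key names here are assumptions and simply have to match what the ServiceMonitor's endpoints[].basicAuth section references:

    # cat <<-EOF > router-basic-auth-secret.yml
    apiVersion: v1
    kind: Secret
    metadata:
      name: router-stats-basic-auth
      namespace: openshift-monitoring
    type: Opaque
    data:
      username: YWRtaW4=
      password: d0RNcGplR1YxUA==
    EOF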


Example: Monitoring a Third-Party Application

The following steps show how to monitor a Go application that exposes metrics on port 8080 at /metrics. The code comes from the openshift cluster monitoring repository.

  • Create the application template
    # oc create -f prometheus-example-app-template.yml -n hyperion
  • Inject the environment variable and create the application, service, and route from the template
    # oc process prometheus-example-app-template -p ENV=test |oc create -f -
    # oc get dc
    NAME                     REVISION   DESIRED   CURRENT   TRIGGERED BY
    prometheus-example-app   1          1         1         config
    # oc get svc
    NAME                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)            AGE
    prometheus-example-app    ClusterIP   172.30.8.240    <none>        8080/TCP           19m
  • Label the application's namespace so that it is allowed to be monitored
    # oc patch namespace hyperion -p '{"metadata": {"labels": {"openshift.io/cluster-monitoring": "true"}}}'
  • Grant the prometheus-k8s service account permission to view objects in the project (mainly the services)
    # oc adm policy add-role-to-user view system:serviceaccount:openshift-monitoring:prometheus-k8s -n hyperion
  • Create a ServiceMonitor object for the application (a sketch follows the figures below). Note: the port value under endpoints must match the port name in the corresponding service
    # oc create -f sericemonitor-prometheus-example-app.yml -n openshift-monitoring
  • Once configured, the application's scrape configuration and target are visible

_images/prometheus-example-app-scrape-config.png (Third-party application scrape config)

_images/prometheus-example-app-scrape-target.png (Scrape target)
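
A sketch of what sericemonitor-prometheus-example-app.yml might contain; the service label selector and the port name (8080-tcp) are assumptions and must match the Service actually created by the template:

    # cat <<-EOF > sericemonitor-prometheus-example-app.yml
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: prometheus-example-app
      namespace: openshift-monitoring
      labels:
        k8s-app: prometheus-example-app
    spec:
      namespaceSelector:
        matchNames:
        - hyperion
      selector:
        matchLabels:
          app: prometheus-example-app
      endpoints:
      - port: 8080-tcp
        path: /metrics
        interval: 30s
    EOF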


Configuring Alert Rules

Custom alert rules are managed by adding prometheusrules CRD objects. Each new object adds a group to cm/prometheus-k8s-rulefiles-0, and Prometheus is reloaded automatically, so operators can manage alert rules and their grouping simply by managing these objects.

  • Prepare the alert rule file; note that the kind is PrometheusRule. For example:
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      labels:
        prometheus: k8s
        role: alert-rules
      name: prometheus-openshift-rules
      namespace: openshift-monitoring
    spec:
      groups:
      - name: 'Openshift 云平台告警'
        rules:
        - alert: 'openshift-01-容器重启'
          expr: changes(container_start_time_seconds{id!~"/(system|user).slice.*|/kubepods.slice/kubepods-burstable.slice/.*.slice", pod_name!~"^.*-deploy$"}[5m]) > 1
          labels:
            level: '警示'
            callbackUrl: 'https://prometheus-k8s-openshift-monitoring.apps.openshift.net.cn/graph?g0.range_input=1h&g0.expr=container_start_time_seconds&g0.tab=1'
          annotations:
            description: '{{ $labels.instance }}实例在过去5分钟内出现容器重启的现象'
  • Create the new prometheusrules object
    # oc create -f alert-rules/prometheusrules-openshift.yml
  • Rule injection is then triggered automatically and Prometheus is reloaded.

_images/prometheusrules-01.png (Alert rule created successfully)

Integrating Business Applications with Prometheus Monitoring

The following uses a Spring Boot 2 project as an example to show how to hook an application into the Prometheus monitoring stack with Actuator and Micrometer.

  • Edit pom.xml and add the dependencies
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-actuator</artifactId>
        </dependency>
        <dependency>
            <groupId>io.micrometer</groupId>
            <artifactId>micrometer-registry-prometheus</artifactId>
            <version>1.0.4</version>
        </dependency>
  • Edit resources/application-dev.properties to expose the Prometheus monitoring endpoint
    management.endpoints.web.exposure.include=health,info,metrics,prometheus
    management.endpoint.prometheus.enabled=true
    management.metrics.web.server.auto-time-requests=true
    management.metrics.export.prometheus.enabled=true
    management.security.enabled=false
  • In Controller.java, import the required packages.
    import java.util.Random;
    import java.util.concurrent.atomic.AtomicInteger;
    
    import org.springframework.http.ResponseEntity;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RequestMapping;
    
    import io.swagger.annotations.ApiOperation;
    import io.micrometer.core.annotation.Timed;
    import io.micrometer.core.instrument.Counter;
    import io.micrometer.core.instrument.MeterRegistry;
  • General approach: add the @Timed annotation to each API to collect timing statistics such as total count, average rate per second, and maximum rate. Tags and histogram statistics can also be configured.
    @ApiOperation(value="正常Hi", notes="参数:name")
    @RequestMapping("/hi")
    @Timed( value = "greeting_hi",
            histogram = false,
            extraTags = {"demo", "true"}
    )
    public String hi() {
        return "Hi!";
    }
  • For custom metrics, build a counter or gauge with the MeterRegistry and call it from the business methods.
    Counter myCounter;
    AtomicInteger myGauge;
    
    public GreetingController(MeterRegistry registry){
        // register a counter named my_counter to measure the /hi api
        this.myCounter = registry.counter("my_counter"); 
        // register a gauge named my_gauge that tracks the value of myGauge
        this.myGauge = registry.gauge("my_gauge", new AtomicInteger(0));
    }
    
    @ApiOperation(value="正常Hi", notes="参数:name")
    @RequestMapping("/hi")
    public String hi() {
        // increment the myCounter counter
        this.myCounter.increment();
        return "Hi!";
    }
    
    @ApiOperation(value="随机返回码Hi", notes="")
    @GetMapping(value = "/randomStatusHi")
    public ResponseEntity<String> randomStatusHi() {
        int code = new Random().nextInt(100);
        Integer result = randomStatus + code;
        // raise the probability of returning the configured status code
        if (code % 10 == 0) {
            result = randomStatus;
        }
        // set the myGauge gauge
        this.myGauge.set(code);
        return ResponseEntity.status(result).body("Random Status " + String.valueOf(result) + " Hi!");
    }
  • Build with Maven and run the jar so that the Spring Boot RestController is up
    # mvn clean package
    # java -jar target/gs-rest-service-0.1.0.jar --spring.profiles.active=dev
  • Call the /greeting/hi and /greeting/randomStatusHi APIs to simulate external requests
    # curl http://localhost:8800/greeting/hi
    # curl http://localhost:8800/greeting/randomStatusHi
  • Check the Prometheus metrics for the two APIs above at http://localhost:8800/actuator/prometheus

_images/prometheus_greeting_hi.png (greeting hi metrics)
_images/prometheus_randomstatus_hi.png (random status hi gauge)

CI/CD Module

Jenkins Installation

To integrate Jenkins, OpenShift provides several Jenkins plugins so that users can sign in to Jenkins with their OpenShift account in one step. These plugins include (link):

  • OpenShift Client Plugin: acts as an oc client against the API server to manipulate pipelines and other objects
  • OpenShift Sync Plugin: syncs build objects such as BuildConfigs from OpenShift into Jenkins
  • OpenShift Login Plugin: unified login to Jenkins with OpenShift accounts

The installation below uses NFS storage to persist the Jenkins runtime data.

  • Configure the NFS server (node02-inner)
    # vi /etc/exports
    /diskb/export/jenkins-001 172.26.7.0/8(rw,sync,all_squash)
    
    # systemctl restart nfs
  • Create the Jenkins PV; this volume is mounted at /var/lib/jenkins inside the Jenkins container. The directory must already exist on the NFS server. (A sketch of the PV file appears after this list.)
    # oc create -f jenkins-pv-nfs.yml
  • Create the deployment config from the jenkins-persistent template
    # oc project hyperion
    # oc process jenkins-persistent -n openshift \
    -v JENKINS_SERVICE_NAME=jenkins-persistant,JNLP_SERVICE_NAME=jenkins-jnlp-persistant \
    | oc create -f -
  • Open the Jenkins page: https://jenkins-persistant-hyperion.apps.openshift.net.cn
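
jenkins-pv-nfs.yml is not shown in this document; a minimal sketch, assuming the NFS export configured above (the PV name and capacity are assumptions and should be sized to your Jenkins data):

    # cat <<-EOF > jenkins-pv-nfs.yml
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: jenkins-pv-nfs-001
    spec:
      capacity:
        storage: 10Gi
      accessModes:
      - ReadWriteOnce
      persistentVolumeReclaimPolicy: Retain
      nfs:
        server: node02-inner
        path: /diskb/export/jenkins-001
    EOF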

Adapting Jenkins to Use a Pipeline Docker Agent

The default Jenkins image contains neither the docker binary nor a Docker runtime. The following changes, commonly called dind (Docker in Docker), are needed to make such pipelines usable:

  • Repackage the openshift/jenkins-2-centos7:v3.11 image
  • Mount the host's /var/run/docker.sock into the container
  • Build and push the new image
    # docker build -t kennethye/jenkins-2-centos7:v3.11.1 -f Dockerfile.jenkins.repack .
    # docker push kennethye/jenkins-2-centos7:v3.11.1
  • Update the deployment config
    # oc scale dc jenkins-persistant --replicas=0
    # oc adm policy add-scc-to-user hostmount-anyuid -z jenkins-persistant
    # oc set volume dc/jenkins-persistant --add --overwrite --name=var-run-docker --type=hostPath --path=/var/run/docker.sock
    # oc patch dc/jenkins-persistant -p '{"spec":{"template":{"spec":{"containers":[{"name":"jenkins","image": "kennethye/jenkins-2-centos7:v3.11.1", "volumeMounts": [{"name": "var-run-docker", "mountPath": "/var/run/docker.sock"}] }]}}}}'
    # oc scale dc jenkins-persistant --replicas=1

Adapting jenkins-agent-maven to Run Docker in Docker (dind)

The default jenkins-agent-maven image has no Docker client. To let the containerized Jenkins agent build images, it has to be repackaged.

  • Build and push the new image
    # docker build -t kennethye/jenkins-agent-maven-35-centos7:v3.11.1 -f ./Dockerfile-jenkins-agent-maven-3.5 .
    # docker push kennethye/jenkins-agent-maven-35-centos7:v3.11.1

Adding a Jenkins Slave to the Jenkins Cluster

Run the Jenkins slave as a container and join it to the master so that together they form a complete Jenkins cluster.

  • Edit the jenkins-persistant service and add the JNLP port

_images/service-port-jnlp.png (JNLP port)

  • Following the tutorial, add a node on the Jenkins master

_images/jenkins-slave-new-node.png (New node)

_images/jenkins-slave-new-node-secret.png (Get the new node's secret)

  • After the node is created, record the following
    JENKINS_URL=http://jenkins-persistant.hyperion.svc (the master URL; port 50000 must be open)
    JENKINS_SECRET=f6cxxxxxx    (the secret returned when the node was created in the previous step)
    JENKINS_NAME=maven-slaves   (the name of the node created in the previous step)
  • Start the container on the slave node with docker, sizing CPU and memory as appropriate. Note that each node has its own secret and name, and the node must already be registered on the master for the agent to connect.
    # docker run -d --restart always --name jenkins-agent-maven \
    -v /var/run/docker.sock:/var/run/docker.sock:rw \
    --cpu-shares 1024 --memory 2G -e 'JENKINS_URL=http://jenkins-persistant.hyperion.svc' \
    -e 'JENKINS_SECRET=f6cxxxxxx' -e 'JENKINS_NAME=maven-slaves' \
    kennethye/jenkins-agent-maven-35-centos7:v3.11.1

Building the Application Pipeline

The pipeline takes a business (Java) application from source through unit tests, Docker image build and push, template creation/update, and deployment to the OpenShift platform. The example below stores images in the built-in OpenShift registry; it can be adapted to third-party registries such as Harbor.

  • Create a jenkins user for pushing and pulling Docker images. Log in once so that the user record is synced to etcd.
    # htpasswd -b /etc/origin/master/htpasswd jenkins <password>
    # oc login -u jenkins -p <password> https://portal.openshift.net.cn:8443
    # oc logout
  • Grant the user access to the image registry. The registry-editor role must be bound in each project so that the jenkins user can push/pull that project's images.
    # oc policy add-role-to-user registry-editor jenkins -n hyperion
  • Create a Jenkins job with the following parameters
    BUILD_NODE_LABEL
    PROJECT_NAME
    GIT_REPO
    APPLICATION_TYPE
    BRANCH
    OC_DEV_USER
    OC_DEV_PASS
    REGISTRY_USER
    REGISTRY_PASSWORD
    ENV
    VERSION
    SKIP_BUILD
    SKIP_TEST
    APPLICATION_INIT

_images/pipeline-test003-01.pngPipeline Parameters

  • Import the Groovy script cicd/jenkinsfile-all-in-one.groovy into the Jenkins job
  • Taking ft-rest-service as an example, an ordinary Spring Boot application needs a Dockerfile for containerization and an OpenShift-compatible template before it can run on the platform. The corresponding changes:
    Dockerfile:
    1. Extract JAVA_OPTIONS, APP_OPTS, JMX_OPTS, GCLOG_OPTS so that applications can configure them as needed
    2. Expose the service port
    3. Support passing the runtime environment (dev/test/prod) through the SPRING_PROFILES_ACTIVE environment variable
    
    Openshift.yml template:
    1. Support setting CPU and memory request/limit
    2. Provide sensible defaults for the current application, such as APPLICATION_NAME, IMAGE, SUB_DOMAIN
    3. Add the Prometheus monitoring annotations to the Service
  • Running Build with Parameters in Jenkins then completes the whole flow: build from source, unit tests, container image build, image push, OpenShift template creation, and deployment of the application to the platform.

_images/build-ship-run.png (Build, Ship, Run!)

Known Issues and Workarounds

Alibaba Cloud images do not support SELinux enforcing

The CentOS 7.6 image on Alibaba Cloud has SELinux disabled by default; setting it to enforcing leaves the host unreachable after a reboot. The workaround is to set it to permissive.

    # cat /etc/selinux/config
      SELINUX=permissive
    # reboot

HTTPS services unreachable from the browser, error ERR_CONNECTION_RESET

Because of an Alibaba Cloud restriction, port 443 cannot be reached directly. A temporary workaround is to install a VNC server on one of the nodes and use it for remote access.

  • Install a VNC server as documented. Note that the VNC server on CentOS 7.6 has a bug: GNOME must be installed as well or it will not start.
    # yum groupinstall 'GNOME Desktop'
    # systemctl start vncserver@:1
    # iptables -I INPUT -p tcp -m state --state NEW -m tcp --dport 5901 -j ACCEPT
    # open TCP port 5901 in the Alibaba Cloud security group rules

Spring Metrics fails to build

Integrating Prometheus the way the official documentation describes, mvn build never succeeds. The error is as follows.

    2018-07-26 16:06:12.312 ERROR 7582 --- [           main] o.s.boot.SpringApplication               : Application run failed
    org.springframework.beans.factory.BeanDefinitionStoreException: Failed to process import candidates for configuration class [com.example.demo.DemoApplication]; nested exception is java.lang.IllegalStateException: Failed to introspect annotated methods on class io.prometheus.client.spring.boot.PrometheusEndpointConfiguration

It turns out this is a bug and the feature simply does not work. Related issue:

    [SpringBoot2] Cannot get SpringBoot 2 to work with Prometheus #405
    https://github.com/prometheus/client_java/issues/405

Indices and tables