Note
This document presents the infrastructure and operation of containerized component deployment for the Vera C. Rubin Observatory.
1 Introduction
Many of the control system components, though not all, have been developed for deployment via Docker containers. Initially we leveraged docker-compose as a mechanism for prototyping container deployment to host services. With the proliferation of Kubernetes (k8s) clusters around the Vera C. Rubin Observatory and its associated sites, a new style of deployment and configuration is needed. We will leverage the experience SQuaRE has gained deploying the EFD and Nublado services and use a similar system based on Helm charts and ArgoCD for configuration and deployment. The following sections discuss the two key components and will be expanded to cover operational aspects as experience on that front is gained.
2 Helm Charts
The CSC deployment starts with a Helm chart. We are currently adopting version 1.4 of ArgoCD, which works with Helm version 2. The code for the charts is kept in the Helm chart Github repository. The following sections will discuss each chart in detail. For a description of the APIs used, consult the Kubernetes documentation reference. The chart sections will not go into great detail on the content of each API delivered. Each chart section lists all of the configuration options that the chart delivers, but full use of that configuration is left to the ArgoCD Configuration section, where examples will be provided. For the CSC deployment, we will run a single container per pod on Kubernetes. The Kafka producers will follow the same pattern.
2.1 Cluster Configuration Chart
This chart (cluster-config) is responsible for setting up all of the namespaces for a Kubernetes cluster by using Namespace from the Kubernetes Cluster API. Its only configuration is a list of namespaces that the cluster will use.
YAML Key | Description |
---|---|
namespaces | This holds a list of namespaces for the cluster |
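To picture how the chart consumes that list, the sketch below shows a Helm template that emits one Namespace resource per configured entry. This is an illustrative sketch only, not necessarily the chart's actual template; the template file name is an assumption.

# templates/namespaces.yaml (hypothetical file name): one Namespace per configured entry
{{- range .Values.namespaces }}
---
apiVersion: v1
kind: Namespace
metadata:
  name: {{ . }}
{{- end }}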
2.2 OSPL Configuration Chart
This chart (ospl-config) is responsible for ensuring that OpenSplice DDS communication listens on the proper network interface on a Kubernetes cluster. The multus CNI provides the multicast interface on the Kubernetes cluster for the pods. The rest of the options configure the shared memory setup used by the control system. The chart uses ConfigMap from the Kubernetes Config and Storage API to provide the ospl.xml file to all of the cluster's namespaces.
YAML Key | Description |
---|---|
namespaces | This holds a list of namespaces for the cluster |
networkInterface | The name of the multus CNI interface |
domainId | This sets the domain ID for the DDS communication |
shmemSize | The size in bytes of the shared memory database |
maxSamplesWarnAt | The maximum number of samples at which the system warns of resource issues |
schedulingClass | The thread scheduling class that will be used by a daemon service |
schedulingPriority | The scheduling priority that will be used by a daemon service |
monitorStackSize | The stack size in bytes for the daemon service |
waterMarksWhcHigh | This sets the size of the high watermark. Units must be explicitly used |
deliveryQueueMaxSamples | This controls the maximum size of the delivery queue in samples |
dsGracePeriod | This sets the discovery time grace period. Time units must be specified |
squashParticipants | This controls whether one virtual (true) or all (false) domain participants are shown at discovery time |
namespacePolicyAlignee | This determines how the durability service manages the data that matches the namespace |
domainReportEnabled | This enables reporting at the Domain level |
ddsi2TracingEnabled | This enables tracing for the DDSI2 service |
ddsi2TracingVerbosity | This sets the level of information for the DDSI2 tracing |
ddsi2TracingLogfile | This specifies the location and name of the DDSI2 tracing log |
durabilityServiceTracingEnabled | This enables tracing for the Durability service |
durabilityServiceTracingVerbosity | This sets the level of information for the Durability tracing |
durabilityServiceTracingLogfile | This specifies the location and name of the Durability tracing log |
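As an illustration of the ConfigMap usage described above, the sketch below shows how a ConfigMap carrying ospl.xml could be rendered into each configured namespace. This is a hedged sketch, not the chart's actual template; the ConfigMap name is an assumption and the ospl.xml contents are omitted.

{{- range .Values.namespaces }}
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ospl-config            # assumed ConfigMap name
  namespace: {{ . }}
data:
  ospl.xml: |
    <!-- OpenSplice configuration rendered from the chart values; contents omitted here -->
{{- end }}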
2.3 OSPL Daemon Chart
This chart (ospl-daemon) handles deploying the OSPL daemon service for the shared memory configuration. This daemon takes over communication startup, handling and teardown from the individual CSC applications. The chart uses a DaemonSet from the Kubernetes Workloads API since it is designed to run on every node of a Kubernetes cluster.
YAML Key | Description |
---|---|
image | This section holds the configuration of the container image |
image.repository | The Docker registry name of the container image to use for the daemon |
image.tag | The tag of the container image to use for the daemon |
image.pullPolicy | The policy to apply when pulling an image for deployment |
image.nexus3 | The tag name for the Nexus3 Docker repository secrets if private images need to be pulled |
namespace | This is the namespace in which the daemon will be placed |
env | This section holds a set of key, value pairs for environmental variables |
osplVersion | This is the version of the commercial OpenSplice library to run. It is used to set the location of the OSPL configuration file |
shmemDir | This is the path to the Kubernetes local store where the shared memory database will be written |
useHostIpc | This sets the use of the host inter-process communication system. Defaults to true |
useHostPid | This sets the use of the host process ID system. Defaults to true |
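To make the host-level options concrete, the sketch below shows how a DaemonSet might wire them up: useHostIpc and useHostPid map to the pod's hostIPC and hostPID flags, and shmemDir becomes a hostPath volume for the shared memory database. This is an illustrative sketch under those assumptions, not the chart's actual template; the resource name and labels are placeholders.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ospl-daemon                       # assumed name
  namespace: {{ .Values.namespace }}
spec:
  selector:
    matchLabels:
      app: ospl-daemon
  template:
    metadata:
      labels:
        app: ospl-daemon
    spec:
      hostIPC: {{ .Values.useHostIpc }}   # share the host IPC namespace
      hostPID: {{ .Values.useHostPid }}   # share the host PID namespace
      containers:
        - name: ospl-daemon
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          volumeMounts:
            - name: shmem
              mountPath: {{ .Values.shmemDir }}
      volumes:
        - name: shmem
          hostPath:
            path: {{ .Values.shmemDir }}  # node-local store for the shared memory database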
2.4 Kafka Producer Chart
While not a true control component, the Kafka producers are nevertheless an important part of the control system landscape. They have the capability to convert the SAL messages into Kafka messages that are then ingested into the Engineering Facilities Database (EFD). See [Fau20] for more details.
The chart consists of a single Kubernetes Workloads API: Deployment. The Deployment API allows for restarts if a particular pod dies, which helps keep the producers up and running at all times. For each producer specified in the configuration, a Deployment will be created. We will now cover the configuration options for the chart.
YAML Key | Description |
---|---|
image | This section holds the configuration of the container image |
image.repository | The Docker registry name of the container image to use for the producers |
image.tag | The tag of the container image to use for the producers |
image.pullPolicy | The policy to apply when pulling an image for deployment |
image.nexus3 | The tag name for the Nexus3 Docker repository secrets if private images need to be pulled |
env | This section holds environment configuration for the producer container |
env.lsstDdsPartitionPrefix | The LSST_DDS_PARTITION_PREFIX name applied to all producer containers |
env.brokerIp | The URI for the Kafka broker that receives the generated Kafka messages |
env.brokerPort | The port associated with the Kafka broker specified in brokerIp |
env.registryAddr | The URL for the Kafka broker associated schema registry |
env.partitions | The number of partitions that the producers are supporting |
env.replication | The number of replications available to the producers |
env.waitAck | The number of Kafka brokers to wait for an ack from |
env.logLevel | This value determines the logging level for the producers |
env.extras | This section holds a set of key, value pairs for environmental variables |
producers | This section holds the configuration of the individual producers [1] |
producers.name | This key gives a name to the producer deployment and can be repeated |
producers.name.cscs [2] | The list of CSCs that the named producer will monitor |
producers.name.image | This section provides optional override of the default image section |
producers.name.image.repository | The Docker registry container image name to use for the named producer |
producers.name.image.tag | The container image tag to use for the named producer |
producers.name.image.pullPolicy | The policy to apply when pulling an image for named producer deployment |
producers.name.env | This section provides optional override of the defaults env section |
producers.name.env.lsstDdsPartitionPrefix | The LSST_DDS_PARTITION_PREFIX name applied to the named producer container |
producers.name.env.partitions | The number of partitions that the named producer is supporting |
producers.name.env.replication | The number of replications available to the named producer |
producers.name.env.waitAck | The number of Kafka brokers to wait for an ack from for the named producer |
producers.name.env.logLevel | This value determines the logging level for the named producer |
producers.name.env.extras | This section holds a set of key, value pairs for environmental variables for the named producer |
namespace | This is the namespace in which the producers will be placed |
osplVersion | This is the version of the commercial OpenSplice library to run. It is used to set the location of the OSPL configuration file |
shmemDir | This is the path to the Kubernetes local store where the shared memory database will be written |
useHostIpc | This sets the use of the host inter-process communication system. Defaults to true |
useHostPid | This sets the use of the host process ID system. Defaults to true |
[1] | A given producer is given a name key that is used to identify that producer (e.g. auxtel). |
[2] | The characters >- are used after the key so that the CSCs can be specified in a list |
Note
The brokerIp, brokerPort and registryAddr of the env section are not overrideable in the producers.name.env section. The nexus3 of the image section is not overrideable in the producers.name.image section. Control of those items is on a site basis. All producers at a given site will always use the same information.
2.5 CSC Chart
Instead of having a chart for every CSC, we employ a single chart that describes all the different CSC variants. There are four main variants that the chart supports:
- simple
- A CSC that requires no special interventions and uses only environment variables for configuration
- entrypoint
- A CSC that uses an override script for the container entrypoint.
- imagePullSecrets
- A CSC that requires the use of the Nexus3 repository and needs access credentials for pulling the associated image
- volumeMount
- A CSC that requires access to a physical disk store in order to transfer information into the running container
The chart consists of the Job Kubernetes Workloads API and the ConfigMap and PersistentVolumeClaim Kubernetes Config and Storage APIs. The Job API is used to provide the correct behavior when a CSC is sent to the OFFLINE state: the pod should not restart. If the CSC dies for an unknown reason, one not caught by a FAULT state transition, a new pod will be started and the CSC will then come up in its lowest control state. The old pod will remain in a failed state, but stays available for interrogation about the problem. The other APIs are used to support the non-simple CSC variants. They will be mentioned in the configuration description, which we turn to next.
YAML Key | Description |
---|---|
image | This section holds the configuration of the CSC container image |
image.repository | The Docker registry name of the container image to use for the CSC |
image.tag | The tag of the container image to use for the CSC |
image.pullPolicy | The policy to apply when pulling an image for deployment |
image.nexus3 | The tag name for the Nexus3 Docker repository secrets if private images need to be pulled |
namespace | This is the namespace in which the CSC will be placed |
env | This section holds a set of key, value pairs for environmental variables |
entrypoint | This key allows specification of a script to override the entrypoint |
mountpoint | This section holds the information necessary to create a volume mount for the container. |
mountpoint.name | A label identifier for the mountpoint |
mountpoint.path | The path inside the container to mount |
mountpoint.accessMode [3] | This sets the required access mode for the volume mount. |
mountpoint.ids | This section contains UID and GID overrides |
mountpoint.ids.uid | An alternative UID for mounting |
mountpoint.ids.gid | An alternative GID for mounting |
mountpoint.claimSize | The requested physical disk space size for the volume mount |
osplVersion | This is the version of the commercial OpenSplice library to run. It is used to set the location of the OSPL configuration file |
shmemDir | This is the path to the Kubernetes local store where the shared memory database will be written |
useHostIpc | This sets the use of the host inter-process communication system. Defaults to true |
useHostPid | This sets the use of the host process ID system. Defaults to true |
[3] | Definitions can be found here. |
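The pod-replacement behavior described above is characteristic of a Job whose pod template uses restartPolicy: Never: a failed pod is left in place for inspection while the Job controller starts a replacement, up to its backoff limit. The snippet below is a minimal illustration of that Kubernetes mechanism with assumed names and values; it is not the chart's actual template.

apiVersion: batch/v1
kind: Job
metadata:
  name: csc-example             # assumed name
spec:
  backoffLimit: 3               # assumed; number of replacement pods to try
  template:
    spec:
      restartPolicy: Never      # failed pods are kept and a new pod is created
      containers:
        - name: csc
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"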
Note
The configurations that are associated with each chart do not represent the full range of component coverage. The ArgoCD Configuration handles that.
2.6 Packaging and Deploying Charts
The Github repository has a README that contains information on how to package a new chart for deployment to the chart repository. First, ensure that the chart version has been updated in the Chart.yaml file. The step for creating/updating the index file needs one more flag for completeness:
helm repo index --url=https://lsst-ts.github.io/charts .
Once the version number is updated, the chart packaged, and the index file updated, the changes can be collected into a single commit and pushed to master. That push to master will trigger the installation of the new chart into the chart repository.
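Putting those steps together, a packaging pass might look like the following shell sketch; the chart name here is a placeholder and the authoritative sequence is the one in the repository README.

# Package the updated chart (creates <chart>-<version>.tgz); "csc" is a placeholder chart name
helm package csc

# Regenerate the repository index so it points at the published URL
helm repo index --url=https://lsst-ts.github.io/charts .

# Collect the new package and index into one commit and push to master
git add .
git commit -m "Release updated chart"
git push origin master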
3 ArgoCD Configuration
The configuration and subsequent deployment of the control components is handled by the ArgoCD system. The code for the ArgoCD configuration is kept in the ArgoCD Github repository. The deployment methodologies will be handled in forthcoming sections. ArgoCD uses the concept of an app of apps (or chart of charts). Each app requires a specific chart or charts to use in order to deploy.
Each component has its own directory within the top-level apps directory. This includes the cluster configuration, OSPL configuration, OSPL daemon, Kafka producers and the CSCs. There are a few special apps which collect the main CSC component apps into a group (referred to below as collector apps). Those will be discussed later. Some applications have extra support included that is not present in the application chart. That extra support will be explained within the appropriate section.
Each application directory has roughly the same contents:
- Chart.yaml
- This file specifies the name of the application with that key in the file.
- requirements.yaml
- This file specifies the Helm chart to use, including the version (a sketch is shown after this list).
- values.yaml
- This file contains base information that will apply to all site specific configuration. May not be present in all applications.
- values-<site tag>.yaml
- This file contains site specific information for the configuration. It may override configuration provided in the values.yaml file. The supported sites are listed in the following table; not all applications will support every site.
Site Tag | Site Location |
---|---|
base | La Serena Base Data Center |
ncsa-teststand | NCSA Test Stand |
ncsa-lsp-int | LSP-int at NCSA |
summit | Cerro Pachon Control System Infrastructure |
tucson-teststand | Tucson Test Stand |
- templates
- This directory may not appear in all configurations. It will contain other Kubernetes or ArgoCD APIs to support deployment.
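As an illustration, the two chart-related files for a hypothetical CSC application might look like the sketch below. The application name, chart name and versions are assumptions; the repository URL is the chart repository used for packaging.

# Chart.yaml -- names the application (hypothetical example)
apiVersion: v1
name: mtptg
version: 1.0.0

# requirements.yaml -- selects the Helm chart and version to deploy
dependencies:
  - name: csc
    version: 1.0.0              # assumed chart version
    repository: https://lsst-ts.github.io/charts/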
All values*.yaml files start with the name of the chart the application uses as the top-level key. Further keys are specified in the same manner as in the Helm chart configuration. Examples will be provided below.
3.1 Cluster Configuration
This application (cluster-config) contains a VaultSecret Vault API in the templates directory that handles setting up the access credential secrets for image pulling into each defined namespace. This requires a configuration parameter that is outside the chart level configuration.
YAML Key | Description |
---|---|
deployEnv | The site tag to use when setting up the namespace secrets |
The default configuration contains the following five namespaces, which are used in the OSPL daemon, Kafka producer and CSC applications.
- auxtel
- maintel
- obssys
- kafka-producers
- ospl-daemon
3.2 Collector Apps
Within the ArgoCD Github repository, there are currently three collector applications: auxtel, maintel and obssys. The layout for these apps is different and is explained here.
- Chart.yaml
- This file contains the specification of a new chart that will deploy a group of CSCs.
- values.yaml
- This file contains configuration parameters to fill out the application deployment. The keys will be discussed below.
- templates/<collector app name>.yaml
- This file contains the ArgoCD Application API used to deploy the associated CSCs specified by the collector app configuration. One Application is generated for each CSC listed in the configuration (a sketch is shown after the table below).
YAML Key | Description |
---|---|
spec | This section defines elements for cluster setup and ArgoCD location |
spec.destination | This section defines information for the deployment destination |
spec.destination.server | The name of the Kubernetes resource to deploy on |
spec.source | This section defines the ArgoCD setup to use |
spec.source.repoURL | The repository housing the ArgoCD configuration |
spec.source.targetRevision | The branch on the ArgoCD repository to use for the configuration |
env | This key sets the site tag for the deployment. Must be in quotes. |
cscs | This key holds a list of CSCs that are associated with the app |
noSim | This key holds a list of CSCs that do not have a simulator capability |
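To make the generation step concrete, the collector app template can be pictured as a loop that stamps out one ArgoCD Application per configured CSC, wiring in the spec and env values described above. This is a hedged sketch, not the repository's actual template; the ArgoCD project name, the apps/<csc> path layout and the value file names are assumptions.

{{- range .Values.cscs }}
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: {{ . }}
spec:
  project: default                                    # assumed ArgoCD project
  destination:
    server: {{ $.Values.spec.destination.server }}
  source:
    repoURL: {{ $.Values.spec.source.repoURL }}
    targetRevision: {{ $.Values.spec.source.targetRevision }}
    path: apps/{{ . }}                                # assumed directory layout
    helm:
      valueFiles:
        - values.yaml
        - values-{{ $.Values.env }}.yaml              # site specific overrides
{{- end }}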
3.3 CSCs with Special Support
This section will detail any CSC applications that require special support that is outside the supplied chart.
3.3.1 Hexapodsim
The athexapod application requires the use of a simulated hexapod low-level controller when running in simulation mode. This simulator (hexapodsim) is accessed via a specific IP address and port. The hexapodsim app uses a Service from the Kubernetes Service APIs to set up the port. Kubernetes combines that with the deployed pod's IP address in an environment variable: HEXAPOD_SIM_SERVICE_HOST. The ATHexapod CSC code uses that variable to set the proper connection information.
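For reference, a minimal Service of the kind described might look like the sketch below. The service name follows the Kubernetes convention that yields the HEXAPOD_SIM_SERVICE_HOST variable; the namespace, selector and port number are assumptions.

apiVersion: v1
kind: Service
metadata:
  name: hexapod-sim            # yields HEXAPOD_SIM_SERVICE_HOST in dependent pods
  namespace: auxtel            # assumed namespace
spec:
  selector:
    app: hexapodsim            # assumed pod label
  ports:
    - port: 50000              # assumed simulator port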
Note
hexapodsim uses version 0.4.1 of the csc Helm chart. This older chart is used because that application is not an OSPL client and therefore does not need any of the new shared memory support. This may be changed to a chart that contains no OSPL features in the future.
3.4 Examples
ArgoCD level configuration files follow this general format.
chart-name:
  chart-key1: values
  chart-key2: values
  ...
If a given application uses extra APIs for deployment, those configurations will look like the following.
api-key1: values
api-key2: values
...
Refer to the appropriate Helm Chart section for chart level key descriptions. API key descriptions are in this section.
3.4.1 Cluster Configuration
The main values.yaml file looks like:
cluster-config:
  namespaces:
    - auxtel
    - maintel
    - obssys
    - kafka-producers
    - ospl-daemon
This sets the namespaces for all sites. This configuration can be overridden on a per site basis, but it is not recommended for production environments such as the summit, base and NCSA test stand.
The site specific configuration files only need to contain the deployEnv keyword. The values-ncsa-teststand.yaml is shown as an example.
deployEnv: ncsa-teststand
If one does want to override the list of namespaces for a particular site, this is how it would be done for a site specific file.
cluster-config:
  namespaces:
    - test1
    - myspace
    - home
deployEnv: tucson-teststand
3.4.2 OSPL Configuration
This is the ospl-config directory within the ArgoCD repository. There is one and only one configuration for this application.
ospl-config:
  namespaces:
    - auxtel
    - maintel
    - obssys
    - kafka-producers
    - ospl-daemon
  domainId: 0
  shmemSize: 104857600
  maxSamplesWarnAt: 50000
  schedulingClass: Default
  schedulingPriority: 0
  monitorStackSize: 6000000
  waterMarksWhcHigh: 8MB
  deliveryQueueMaxSamples: 10000
  squashParticipants: true
  namespacePolicyAlignee: Lazy
The list of namespaces MUST contain at least the same namespaces as cluster-config. The networkInterface is the name specified by the multus CNI and is the same for all sites that we currently deploy to. The rest of the configuration is meant for handling setup, services and features related to the shared memory configuration. Again, those values are typically set once and are normally propagated to all sites we deploy to.
If one wants to adjust configuration parameters for testing without affecting other sites, a site specific configuration file can be used.
3.4.3 OSPL Daemon Configuration
The OSPL daemon configuration has a global values.yaml file that sets the namespace for all sites. All other configuration should be handled in a site YAML configuration file. Below is the configuration from the values-ncsa-teststand.yaml configuration file.
ospl-daemon:
  image:
    repository: ts-dockerhub.lsst.org/ospl-daemon
    tag: c0013
    pullPolicy: Always
    nexus3: nexus3-docker
  env:
    LSST_DDS_PARTITION_PREFIX: ncsa
    OSPL_INFOFILE: /tmp/ospl-info-daemon.log
    OSPL_ERRORFILE: /tmp/ospl-error-daemon.log
  shmemDir: /scratch.local/ospl
  osplVersion: V6.10.4
3.4.4 Kafka Producer Configuration
The Kafka producer configuration has a global values.yaml file that sets the namespace and producer CSC configuration for all sites. A snippet of the configuration is shown below.
kafka-producers:
  namespace: kafka-producers
  producers:
    auxtel:
      cscs: >-
        ATAOS
        ATDome
        ATDomeTrajectory
        ATHexapod
        ATPneumatics
        ATPtg
        ATMCS
    maintel:
      cscs: >-
        MTAOS
        Dome
        MTDomeTrajectory
        MTPtg
    ...
Each key under producers is the name for that given producer along with the list of CSCs that producer will monitor.
Warning
Any changes to the values.yaml file will be seen by all sites at once, so give careful thought to adjustments there.
The Docker image and other producer configuration is handled on a site basis. Here is an example from the values-ncsa-teststand.yaml file.
kafka-producers:
  image:
    repository: lsstts/salkafka
    tag: v1.1.2_salobj_v5.11.0_xml_v5.1.0
    pullPolicy: Always
  env:
    lsstDdsPartitionPrefix: ncsa
    brokerIp: cp-helm-charts-cp-kafka-headless.cp-helm-charts
    brokerPort: 9092
    registryAddr: https://lsst-schema-registry-nts-efd.ncsa.illinois.edu
    partitions: 1
    replication: 3
    waitAck: 1
    logLevel: 20
The env information is specifically tailored to the NCSA test stand. The image information is applied to all producers at this site. You can override the set of producers deployed, reconfigure them if necessary, or add new ones for a specific site. You can also change the image information for a given producer; if you do, you must ensure that the different image can interact with the others already deployed without interfering with their functioning. Below is an example of doing all of the above.
kafka-producers:
  image:
    repository: lsstts/salkafka
    tag: v1.1.2_salobj_v5.11.0_xml_v5.1.0
    pullPolicy: Always
  env:
    lsstDdsPartitionPrefix: ncsa
    brokerIp: cp-helm-charts-cp-kafka-headless.cp-helm-charts
    brokerPort: 9092
    registryAddr: https://lsst-schema-registry-nts-efd.ncsa.illinois.edu
    partitions: 1
    replication: 3
    waitAck: 1
    logLevel: 20
  producers:
    comcam: null
    auxtel: null
    eas:
      cscs: >-
        DSM
    latiss: null
    test:
      image:
        tag: v1.1.3_salobj_v5.12.0_xml_v5.2.0
    ccarchiver:
      cscs: >-
        CCArchiver
    cccamera:
      cscs: >-
        CCCamera
    ccheaderservice:
      cscs: >-
        CCHeaderService
The null is how to remove producers from the values.yaml configuration. The eas producer changes the list of CSCs from DIMM, DSM, Environment to DSM. The test producer changes the site configured image tag to something different. The ccarchiver, cccamera and ccheaderservice producers are new ones specified for this site only.
3.4.5 CSC Configuration
There are a few different variants of CSC configuration, as discussed previously. Most CSC configuration consists of Docker image information and environment variables that must be set, as well as the namespace that the CSC should belong to. The namespace is handled in the CSC values.yaml in order to have it applied uniformly across all sites. An example of a simple configuration showing a specific namespace is shown below.
csc:
  namespace: maintel
CSCs may have other values that need to be applied regardless of site. Here is an example from the mtcamhexapod application.
csc:
  env:
    RUN_ARG: -s 1
  namespace: maintel
The RUN_ARG configuration sets the index for the underlying component that the container will run. Other global environment variables can be specified in this manner.
The Docker image configuration is handled on a site basis to allow independent evolution. This also applies to the LSST_DDS_PARTITION_PREFIX environment variable since those values are definitely site specific. Below is an example site configuration from the mtcamhexapod application for the NCSA test stand.
csc:
  image:
    repository: lsstts/hexapod
    tag: v0.5.2
    pullPolicy: Always
  env:
    LSST_DDS_PARTITION_PREFIX: ncsa
Other site specific environment variables can be listed in the env section if they are appropriate to running the CSC container.
Containers that require the use of the Nexus3 repository, currently identified by the use of ts-dockerhub.lsst.org in the image.repository name, need to configure the image.nexus3 key in order for secret access to occur. An example values.yaml file for the mtptg is shown below.
csc:
  image:
    nexus3: nexus3-docker
  env:
    TELESCOPE: MT
  namespace: maintel
The value in the image.nexus3 entry is specific to the Nexus3 instance that is based in Tucson. This may be expanded to other replicas in the future.
The CSC container may need to override the command script that the container automatically runs on startup. An example of how this is accomplished is shown below.
csc:
  image:
    repository: lsstts/atdometrajectory
    tag: v1.2_salobj_v5.4.0_idl_v1.1.2_xml_v4.7.0
    pullPolicy: Always
  env:
    LSST_DDS_PARTITION_PREFIX: lsatmcs
  entrypoint: |
    #!/usr/bin/env bash

    source ~/miniconda3/bin/activate

    source $OSPL_HOME/release.com

    source /home/saluser/.bashrc

    run_atdometrajectory.py
The script for the entrypoint must be entered line by line, with an empty line between each one, in order for the script to be created with the correct execution formatting. The pipe (|) after the entrypoint key is required to help obtain the proper formatting. Using the entrypoint key activates the use of the ConfigMap API.
If a CSC requires a physical volume to write files out to, the mountpoint key should be used. This should be a rarely used variant, but it is supported. The Header Service will use this when deployed to the summit until the S3 system is available. A configuration might look like the following.
csc:
  ...
  mountpoint:
    - name: www
      path: /home/saluser/www
      accessMode: ReadWriteOnce
      claimSize: 50Gi
The description of the claimSize units can be found at this page.
3.4.6 Collector Applications
As noted earlier, these applications are collections of individual CSC apps aligned with a particular subsystem. The main configuration is the list of CSC apps to include on launch. Here is how the values.yaml file for the maintel app looks.
spec:
  destination:
    server: https://kubernetes.default.svc
  source:
    repoURL: https://github.com/lsst-ts/argocd-csc
    targetRevision: HEAD
env: ncsa-teststand
cscs:
  - mtaos
  - mtcamhexapod
  - mtm1m3
  - mtm2
  - mtm2hexapod
  - mtmount
  - mtptg
  - mtrotator
noSim:
  - mtptg
The spec section is specific to ArgoCD and should not be changed unless you really understand the consequences. The exceptions to this are the repoURL and targetRevision parameters. It is possible the Github repository moves during the lifetime of the project, so repoURL will need to be updated if that happens. There might also be a need to test something that is not on the master branch of the repository. To support that, change the targetRevision to the appropriate branch name. Use this sparingly, as main configuration tracking is on the master branch. The env parameter selects the values-<env>.yaml file for the listed CSC apps. This will change on a per site basis. The cscs parameter is the listing of the CSC apps that the collector app will control. This can also be changed on a per site basis.
As an example of per site configuration, below is the summit configuration of the maintel app.
env: summit
cscs:
  - mtaos
  - mtptg
As you can see, the env parameter is overridden to the correct name and the list of CSCs is much shorter. This is due to the presence of real hardware on the summit. The auxtel collector app follows similar configuration mechanisms but controls a different list of CSC apps, as does the obssys collector app.
[Fau20] | Angelo Fausti. EFD Operations. Technote, March 2020. URL: https://sqr-034.lsst.io. |