Monitoring and metrics collection have recently become a hot topic in the world of container orchestrators (CO). This is especially true in the Cloud Native Computing Foundation (CNCF) ecosystem, where hosted projects like Prometheus and OpenTracing are at the forefront. Why is this? It can be argued that unless you have a way to determine whether a given application, component, storage platform, or CO is performing optimally, performing poorly, or even running at all, then you aren’t running it in production.

This blog covers a Kubernetes-centric approach to collecting Prometheus metrics for ScaleIO, whether ScaleIO is configured with the native Kubernetes ScaleIO driver or through a generic deployment.

There is a new open source project that collects Prometheus metrics on all storage pools within a given ScaleIO cluster. The corresponding Docker images can be found here. It’s designed to be extremely lightweight (the Docker image is only 5MB), to have minimal impact on your Kubernetes and ScaleIO infrastructure (no API polling and no background processing), and to be stateless. If there are multiple ScaleIO clusters in your environment, simply deploy multiple instances of the metrics collector, one per ScaleIO deployment.


Why is it good to NOT have polling?

  1. The polling configuration lives in two different places. Prometheus scrapes for metrics based on an interval, while the collector polls the storage platform based on a sleep. Unless the collector’s polling is aligned with Prometheus’ scrapes, the collector may gather metrics that Prometheus never pulls (see the contrast sketched after this list).
  2. Metrics drift. Think of calling APIs to collect data and then sleeping for 30 seconds. The API calls take a variable amount of time to complete, so doing this repeatedly skews the time between when Prometheus collects the metrics and when the collector actually gathered the data. The worst case is Prometheus collecting the metrics just before the collector updates them. With this collector, your skew will never be more than the time it took to obtain the metrics from the API call.
  3. With a polling collector, the ScaleIO username and password are stored externally, in a JSON file for example, instead of using a single source of truth: the credentials already stored as a Kubernetes Secret for the native ScaleIO driver.
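To make the first and third points concrete, here is a contrived sketch of what a polling-style collector’s own configuration typically looks like. Every setting below is hypothetical and shown only for contrast; this collector has no equivalent knobs, because it gathers data only at scrape time and, for the native driver, reads credentials from the existing Kubernetes Secret.

# Hypothetical configuration for a polling-style collector (contrast only).
poll_interval_seconds: 30                   # a second interval that must be kept in
                                            # sync with Prometheus' scrape_interval (point 1)
credentials_file: /etc/scaleio/creds.json   # credentials kept outside Kubernetes (point 3)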

The GitHub README.md has a quick walkthrough for deploying Prometheus, with links for those looking to deploy Prometheus in a production environment with persistent storage. Note that the Prometheus config.yaml in the repo has an entry at the very end that references a deployment of this metrics collector:

- job_name: 'scaleio'
  scrape_interval: 15s
  static_configs:
    - targets: ['scaleio-metrics:80']

Either use this standard Prometheus configuration or merge these lines into the scrape_configs section of an existing configuration. Update the scrape_interval based on how often metrics collection should take place.
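If you are starting from scratch, a minimal standalone prometheus.yml containing just this job would look roughly like the following (a sketch, assuming default settings otherwise):

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'scaleio'
    scrape_interval: 15s
    static_configs:
      - targets: ['scaleio-metrics:80']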

Then, to deploy the ScaleIO Prometheus metrics collector, use the YAML files located in the GitHub repo and run the commands below. As a reminder, you will need to update the SCALEIO_ENDPOINT and CLUSTER_NAME values in kubernetes-scaleio-prom.yaml to point to your ScaleIO cluster.

# We need to open up the kubernetes-scaleio-prom port to provide access
# to the endpoint.
cd services
kubectl create -f kubernetes-scaleio-prom.yaml
cd ..

# Let's deploy kubernetes-scaleio-prom
# NOTE: Before you deploy, open kubernetes-scaleio-prom.yaml and replace
# SCALEIO_ENDPOINT, CLUSTER_NAME, and your Kubernetes ScaleIO Secret name
# NOTE: If you aren't using the native Kubernetes driver for ScaleIO, you will
# need to provide values for SCALEIO_USERNAME and SCALEIO_PASSWORD
cd deployments
kubectl create -f kubernetes-scaleio-prom.yaml
cd ..
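For orientation, the Deployment portion of kubernetes-scaleio-prom.yaml will look something like the sketch below. The image name and labels are placeholders; only the SCALEIO_ENDPOINT and CLUSTER_NAME environment variables and the scaleio-metrics name come from the configuration described above, so treat the repo’s YAML as the source of truth.

# Sketch only -- the real kubernetes-scaleio-prom.yaml in the repo is authoritative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scaleio-metrics
spec:
  replicas: 1
  selector:
    matchLabels:
      app: scaleio-metrics
  template:
    metadata:
      labels:
        app: scaleio-metrics
    spec:
      containers:
        - name: scaleio-metrics
          image: <scaleio-prometheus-collector-image>    # placeholder
          env:
            - name: SCALEIO_ENDPOINT
              value: "<your-scaleio-gateway-endpoint>"   # placeholder
            - name: CLUSTER_NAME
              value: "<your-cluster-name>"               # placeholder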

Pretty simple! If you check the Prometheus UI, you should see ScaleIO metrics that you can set up alerts for!
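Even before digging into the collector’s specific metric names, Prometheus’ built-in up metric tells you whether the scaleio job’s target is reachable, which makes for a useful first alert. A minimal Prometheus 2.x alerting rules file might look like this (rule name and thresholds are illustrative):

groups:
  - name: scaleio
    rules:
      - alert: ScaleIOMetricsDown
        expr: up{job="scaleio"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "ScaleIO metrics collector is unreachable"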

This covers the case where the native/in-tree Kubernetes ScaleIO driver is being used. What about a generic installation where FlexREX is being used? Or even monitoring an external ScaleIO instance that isn’t used by the Kubernetes cluster at all? In addition to the previously mentioned environment variables, SCALEIO_USERNAME and SCALEIO_PASSWORD will need to be provided for the given ScaleIO cluster. If there are multiple ScaleIO clusters in your environment, a Kubernetes Service and Deployment with a unique identifier must be created per ScaleIO instance.
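As a sketch of that generic or multi-cluster case (all names below are illustrative), you could keep the credentials in a Kubernetes Secret you manage yourself and inject them into a second, uniquely named copy of the collector Deployment through the documented environment variables:

# Sketch only; names are illustrative.
apiVersion: v1
kind: Secret
metadata:
  name: scaleio-cluster2-credentials
type: Opaque
stringData:
  username: <scaleio-username>    # placeholder
  password: <scaleio-password>    # placeholder

# ...and in the second copy of the collector Deployment (e.g. scaleio-metrics-cluster2):
env:
  - name: SCALEIO_USERNAME
    valueFrom:
      secretKeyRef:
        name: scaleio-cluster2-credentials
        key: username
  - name: SCALEIO_PASSWORD
    valueFrom:
      secretKeyRef:
        name: scaleio-cluster2-credentials
        key: password

Pair this with a second Service (and a second entry under targets or a new job in the Prometheus configuration) so each ScaleIO cluster is scraped independently.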


In conclusion, determining the health of your application, and even of underlying CO components like the storage platform, is critically important when deploying into production. Metrics like those provided by this ScaleIO Prometheus collector are the foundation of the monitoring and instrumentation needed to build Prometheus alerts and notifications for when things go wrong, or are about to.

The metrics story shouldn’t stop there, though. This collector provides instrumentation specific to the storage platform itself. In the near future we might see additional metrics around volume operations at a higher layer of abstraction, such as the Container Storage Interface (CSI). Metrics on how often a particular volume is mounted, and possibly how long the mount took, can be of great interest to operations teams. More of this to come!