Monitoring of an Underwater Drone

Monitoring streaming data with Kafka and whylogs

Felipe de Pontes Adachi
Towards Data Science


The importance of data in making informed decisions is now universally agreed upon for nearly every application. This also prompts the need for tools that enable us to make use of this data in a sensible and efficient way.

In this article, I’d like to share an approach to leveraging streaming data: setting up a monitoring dashboard by logging statistical profiles of the streamed data. To do so, I’ll use an underwater Remotely Operated Vehicle (ROV) as a use case. More specifically, we’ll be monitoring for faults in an OpenROV v2.8 from the late OpenROV (now Sofar), a low-cost telerobotic underwater drone. To enhance our fault detection capabilities, we’ll use a regression model, trained with scikit-learn, to model our vehicle’s behavior. That way, our monitoring dashboard will serve two purposes: to monitor the regression model for performance and quality issues, but also (and mostly) to monitor the health of the vehicle itself. Since purposefully breaking the ROV would be troublesome, we’ll manually inject some sensor faults into our data and, hopefully, be able to detect these faults through our dashboard.

As the platform of choice for data logging, we’ll use whylogs, an open-source library from WhyLabs designed to monitor data from ML/AI applications. Whylogs calculates approximate statistics for the logged records, which makes it a fitting choice for pipelines with high volumes of data. Even though this example uses a minimal amount of data, it’s nice to know that the chosen tools allow for future expansion.

Since we’re dealing with continuous streaming of real-time data, we need to choose an adequate event streaming platform. In that sense, using Kafka for this project seemed like a natural choice, and a good opportunity to get acquainted with the tool. That way, we can decouple the sending of telemetry information from our model predictions and logging processes by using a publish/subscribe pattern.

You can follow everything discussed here in the project’s Jupyter notebook. In the project’s repository, you’ll also find the individual Python scripts and required files shown throughout this article.

Table of Contents

  1. Overview
  2. What we end up with
  3. Fault Injection
  4. Developing the Application
    Setting Up Kafka
    Creating the Kafka Clients
    Telemetry Producer
    Prediction Producer
    Telemetry Logger
    Prediction Logger
    Session Logger
  5. The Monitoring Dashboard
    Session 0 — Constant Gain Fault
    Session 1 — Stuck-at (Zero)
    Session 2 — Stuck-at (Maximum)
    Session 3 — Drift and Loss
  6. Conclusion

Overview

Let’s take a closer look at our experiment.

Image by author

The ROV sends its telemetry information in an online fashion to a telemetry topic. This information is simply a collection of JSON objects, in which each object represents a sampling point in time for different features, such as the vehicle’s angular position, current, voltage, etc. Since, in this case, the information has already been gathered, we’ll simply send the JSON content through a Python script, simulating the ROV’s live operation. Given that our goal is to check whether we are able to detect some common sensor faults, we need to manually inject those faults into our dataset before it is sent to our Telemetry Topic. The sensor faults will be explained in more detail in the next section.

The next component is the Prediction Producer, which consumes from the telemetry topic, generates predictions with a trained regression model, and streams the results into a separate Prediction Topic. In this use case, the regression model uses the telemetry information to predict the next timestep of one specific feature, GYROZ, the ROV’s angular velocity in the yaw direction (how fast the vehicle is turning). The model is a linear least squares regression with L2 regularization (i.e., ridge regression) and polynomial basis functions, trained on approximately 24 hours of ROV operation using the scikit-learn library.

The next two components will continuously listen to the Telemetry and Prediction topics in order to build statistical profiles of our data, which will be used in a later manual inspection stage of the logged sessions to hopefully detect possible anomalies during the ROV’s operation. In this case, if a sampling point is more than 5 minutes away from the last one, a new operating session is considered to have started and, for each session, we’re aggregating data into batches of 1 minute. Since we have a sampling frequency of 5 Hz, that translates to about 300 sampling points for each batch.

Lastly, we need an easy way to visualize the statistical results. I decided to centralize the most relevant plot types in an IPython widget. This way, we can inspect the current state of the whylogs summaries from within a Jupyter notebook.

What we end up with

Instead of building up to the conclusion throughout the article, let’s begin by showing what we end up with, in terms of how we’ll monitor our ROV operation and model predictions. Once all of our processes are sending and logging information from the ROV’s operation, the final component is the monitoring dashboard itself, where we hope to centralize all of the necessary information to have an assessment of the vehicle’s behavior.

Considering that the final goal is to detect different kinds of faults, I implemented a monitoring dashboard with customizable parameters, so we can choose the right whylogs plots to detect different situations. In this dashboard, we can select:

  • The individual operating session to inspect.
  • The telemetry information we wish to inspect, such as speed, position, motor inputs, CPU usage, and so on.
  • The linear regression model’s predictions, which are used here to check whether our ROV’s behavior differs significantly from what we expect under nominal conditions.
Image by author

We’ll discuss the results in more detail towards the end of the article, but the bottom line is that, by adjusting the right parameters in our dashboard, we were definitely able to detect the unusual conditions caused by the injected sensor faults. Since the focus here was not to develop the perfect regression model, our predictions alone were not enough to detect the faults in some conditions. However, by combining the insights from both the prediction and the input feature visualizations, we can make sense of what’s going on in each of the inspected operating sessions.

Fault Injection

In this experiment, we’ll use data from 4 distinct operating sessions, each with a duration ranging from 4 to 9 minutes. In each session, a GYROZ sensor fault lasting 5 s is injected at roughly the halfway point. You can take a look at the JSON files, with the faults already injected, in the project’s repository, under the rov_data folder.

In the picture below, the injected fault types are illustrated in orange, applied to a sample sinusoidal signal. In session 0, a constant gain fault is injected, in which the real value is distorted by multiplication with a constant factor, as shown in a). A stuck-at fault is simulated in both sessions 1 and 2, differing in the value it is “stuck” at: while in session 1 the GYROZ value gets stuck at 0, as depicted in b), session 2 has a stuck-at value of 5.56, which corresponds to the maximum range of the ROV’s gyroscope sensor. Finally, a drift sensor fault is injected in session 3, as shown in c). This fault simulates a condition where the real value is distorted by the addition of an offset that grows over time, relative to the start of the fault.

Image by author

In addition to these sensor faults, a loss error is also applied in session 3, simulating a condition where the ROV stops sending information for an extended period of time. Unlike the previous faults, this one affects not only the GYROZ value but the complete set of features streamed by the vehicle. It was simulated simply by removing a contiguous block of data points from our dataset towards the end of session 3’s run.
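To make these fault types more concrete, below is a minimal sketch of how such faults could be injected into a 5 Hz signal with NumPy. The function and parameter names are illustrative assumptions; this is not the script used to produce the files in the rov_data folder.

import numpy as np

def inject_fault(signal, kind, start, duration, fs=5.0,
                 gain=2.0, stuck_value=0.0, drift_rate=0.5):
    """Return a copy of `signal` with a sensor fault injected.

    signal   : 1-D array of sensor readings (e.g., GYROZ)
    kind     : "gain", "stuck", or "drift"
    start    : index where the fault begins
    duration : fault length in seconds (5 s in this experiment)
    fs       : sampling frequency in Hz (the ROV streams at 5 Hz)
    """
    faulty = signal.copy()
    end = start + int(duration * fs)
    if kind == "gain":        # constant gain: value scaled by a constant factor
        faulty[start:end] *= gain
    elif kind == "stuck":     # stuck-at: value frozen (0, or the sensor's max of 5.56)
        faulty[start:end] = stuck_value
    elif kind == "drift":     # drift: offset growing linearly from the fault onset
        t = np.arange(end - start) / fs
        faulty[start:end] += drift_rate * t
    return faulty

# Example: a 5 s stuck-at-zero fault injected halfway through a sample sine wave
t = np.arange(0, 60, 1 / 5.0)
clean = np.sin(0.5 * t)
faulty = inject_fault(clean, "stuck", start=len(clean) // 2, duration=5)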

Developing the Application

Now that we have an overview of the use case, let’s continue by actually implementing the application. The following sections will cover the code required to send and read information to and from our Kafka topics, make predictions with our regression model, and log our statistical profiles with whylogs.

Setting Up Kafka

Before proceeding, we need to set up our streaming platform with Kafka. We’re dealing with low producer throughput, so setting up a local Kafka with Docker will be enough for this project. To do so, you can follow this tutorial. The step-by-step is very well explained in the article, but for a very short version, you can simply copy the docker-compose.yml file in your desired folder and then run docker-compose up. You’ll naturally need to have Docker installed for this step.

Once the containers are up, you can open a shell inside the kafka container, where the Kafka command-line tools are available:

docker exec -it kafka /bin/sh

And check the available topics, just to make sure it’s working:

kafka-topics --list --bootstrap-server localhost:9092

Creating the Kafka Clients

Throughout the next Python scripts, we’ll need to create Kafka consumers and producers repeatedly. Let’s define two simple functions to create the clients, so we can reuse them later:
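As a rough sketch, assuming the kafka-python package, the two helpers could look like this (the names create_producer and create_consumer are assumptions reused in the sketches that follow):

import json
from kafka import KafkaProducer, KafkaConsumer

def create_producer(bootstrap_servers="localhost:9092"):
    # Serialize dictionaries into JSON-encoded bytes before sending
    return KafkaProducer(
        bootstrap_servers=bootstrap_servers,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

def create_consumer(bootstrap_servers="localhost:9092"):
    # Deserialize JSON-encoded bytes back into dictionaries when polling
    return KafkaConsumer(
        bootstrap_servers=bootstrap_servers,
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )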

In both cases, the clients are given the address they should contact in order to bootstrap the initial cluster metadata. In addition, since we’re producing/consuming JSON-formatted values, we need to specify the right treatment for them with the value_serializer/value_deserializer arguments.

In this project, the consumers are set to always start from the beginning of the topic, to make sure each run ends up logging the whole content. To do so, we first have to manually assign the desired topic’s partitions to the consumer before we start.
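A minimal sketch of that setup, assuming kafka-python’s TopicPartition API and the helper defined above:

from kafka import TopicPartition

consumer = create_consumer()

# Manually assign partition 0 of the topic and rewind to the earliest offset,
# so each run re-reads (and re-logs) the full contents of the topic
partition = TopicPartition("telemetry-rov", 0)
consumer.assign([partition])
consumer.seek_to_beginning(partition)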

For more information about integrating whylogs with Kafka, you can refer to this tutorial notebook and this blog post at WhyLabs.

Telemetry Producer

Now that we have Kafka running, we can start by sending ROV data to our telemetry topic. First, we need to create a Kafka producer client that will publish records to our cluster, and read the JSON-encoded files in the rov_data folder to create a list of dictionaries:
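A rough sketch of that step, assuming the files in rov_data hold one JSON object per line (the parsing would need to match the actual file layout in the repository):

import json
import os

producer = create_producer()  # helper from the previous section

# Read every JSON-encoded file under rov_data/ into a single list of dictionaries,
# where each dictionary is one sampling point of the ROV's telemetry
rov_data = []
for fname in sorted(os.listdir("rov_data")):
    with open(os.path.join("rov_data", fname)) as f:
        for line in f:
            rov_data.append(json.loads(line))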

As we can see, our sampling points consist of a series of features:

  • mtarg1, mtarg2, mtarg3: input commands sent to each of the 3 propellers.
  • roll, pitch, yaw: vehicle’s angular position around the X, Y, and Z axes.
  • LACCX, LACCY, LACCZ: vehicle’s linear acceleration along the X, Y, and Z axes.
  • GYROX, GYROY, GYROZ: vehicle’s angular velocity around the X, Y, and Z axes.
  • SC1I, SC2I, SC3I: electric current readings from each of the propellers.
  • BT1I, BT2I: electric current readings from each of the battery packs.
  • vout, iout, and cpuUsage: voltage, current, and CPU usage of the ROV’s BeagleBone microcomputer.

We can now publish our data points to our topic:
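A minimal sketch of the publishing loop, reusing the producer created above:

for record in rov_data:
    # Publish each sampling point to the telemetry topic
    producer.send("telemetry-rov", value=record)
    # Block until delivery, so the topic preserves the original sampling order
    producer.flush()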

If you’d like, before sending the information, it is good practice to create the Kafka topic with the appropriate partitions/replication factor by running the following at kafka-tools:

kafka-topics --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic telemetry-rov

But you don’t have to, since Kafka automatically creates the topic if it doesn’t already exist (as long as automatic topic creation is enabled on the broker, which is the default).

Maintaining the order of our records is very important in this example, so we call a producer.flush() to ensure a new request is made only when the previous one has been delivered.

Prediction Producer

With the telemetry data at hand, we are able to start making predictions with the regression model. Given that we’re making one-step-ahead predictions for GYROZ, we only need to wait one timestep to have our ground truth. By calculating the difference between the last timestep’s prediction and the current actual value, we obtain the so-called residual. The idea is that potential anomalies in the ROV would yield larger residuals, since our prediction would be farther than usual from the actual value.

As before, we create the consumer client (to read from the telemetry topic) and the producer client (to publish to our prediction topic). We also have to load our scikit-learn model, persisted with joblib:
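A sketch of that setup, assuming the kafka-python helpers from before and a placeholder file name for the persisted model:

import joblib
from kafka import TopicPartition

# Consumer for the telemetry topic, rewound to the beginning as before
consumer = create_consumer()
partition = TopicPartition("telemetry-rov", 0)
consumer.assign([partition])
consumer.seek_to_beginning(partition)

# Producer for the prediction topic
producer = create_producer()

# Regression model persisted with joblib ("model.joblib" is a placeholder name)
model = joblib.load("model.joblib")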

In our main function, the script will fetch data from the topic continuously and calculate our residuals for each timestep read. If 10 seconds pass without new records being received, we output a warning just to inform the user.

Once we have data for the current and previous timesteps, the residual is calculated:
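A hedged sketch of that calculation is shown below. It assumes the scalers are scikit-learn MinMaxScaler objects stored in BL_x.pickle and BL_y.pickle, and that features is the ordered list of input columns the model was trained on; the helper name and the exact feature handling are assumptions, not the repository’s code.

import pickle
import numpy as np

# Min-Max scalers fitted during training (file names from the repository)
with open("BL_x.pickle", "rb") as f:
    scaler_x = pickle.load(f)
with open("BL_y.pickle", "rb") as f:
    scaler_y = pickle.load(f)

def calculate_residual(previous, current, features):
    """One-step-ahead residual for GYROZ.

    previous, current : telemetry dictionaries for consecutive timesteps
    features          : ordered list of feature names the model was trained on
    """
    # Scale the previous timestep's features and predict the next GYROZ value
    x = np.array([[previous[name] for name in features]])
    y_scaled = model.predict(scaler_x.transform(x))
    # Bring the prediction back to the original scale
    y_pred = scaler_y.inverse_transform(np.asarray(y_scaled).reshape(-1, 1))[0, 0]
    # Residual: how far the prediction landed from the actual reading
    return current["GYROZ"] - y_pred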

In order to make the predictions, we also need to load the Min-Max scalers fitted during the model’s training process (BL_x.pickle and BL_y.pickle) so we can transform the features for prediction and inverse transform to output the unscaled residual.

To keep the correspondence between the telemetry and prediction topics, we want to store a residual for every timestamp, even when calculating it is not possible. So, for the first sampling point, or when two sampling points are too far apart in time, the residual is recorded as a nan value.

Telemetry Logger

Once both Kafka Producers are appropriately set, we can start logging our session profiles with whylogs. Records are polled from the telemetry topic and split into individual sessions. If the next sampling point is more than 5 minutes apart from the last one, the current session is wrapped up and logged before starting the following session. When more than 10 seconds have passed without new information, we also consider that the current session has ended, and log the results.
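In rough terms, the logger’s main loop could look like the sketch below. It assumes kafka-python’s poll() interface, a timestamp field in epoch milliseconds, and the log_session() helper described in the Session Logger section.

SESSION_GAP_S = 5 * 60   # a gap of more than 5 minutes starts a new session
IDLE_TIMEOUT_S = 10      # after 10 s of silence, wrap up the current session

session_records = []
last_timestamp = None

while True:
    messages = consumer.poll(timeout_ms=IDLE_TIMEOUT_S * 1000)
    if not messages:
        # No new data for a while: log whatever we have and keep waiting
        if session_records:
            log_session(session_records)
            session_records = []
        continue
    for records in messages.values():
        for record in records:
            data = record.value
            ts = data["timestamp"] / 1000.0  # assuming epoch milliseconds
            if last_timestamp is not None and ts - last_timestamp > SESSION_GAP_S:
                # The gap is too large: close the previous session first
                log_session(session_records)
                session_records = []
            session_records.append(data)
            last_timestamp = ts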

Prediction Logger

Let’s do the same for our prediction topic. The difference here is that we’ll not only log the residual feature but also make some transformations in order to create and log additional features.

Even under the vehicle’s nominal conditions, we can have spikes in the residual values, leading to a high rate of false positives. To reduce that, we calculate moving averages of the residual over different time frames. In this example, in addition to the unchanged residual, we’ll log the moving average over the last 5, 10, and 15 timesteps (1, 2, and 3 seconds):
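A minimal sketch of that transformation with pandas, applied to a session’s residual series before logging (the column names are assumptions):

import pandas as pd

def add_moving_averages(df, windows=(5, 10, 15)):
    """Add residual moving-average columns for the given window sizes.

    At 5 Hz, windows of 5, 10, and 15 samples correspond to 1, 2, and 3 seconds.
    With min_periods equal to the window size, a window containing any nan,
    or one without enough records yet, yields nan for that timestamp.
    """
    for w in windows:
        df[f"residual_ma_{w}"] = df["residual"].rolling(window=w, min_periods=w).mean()
    return df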

If, for the given sliding window, we have nan values, or we simply don’t have enough records to calculate the moving average, nan is logged for the given timestamp.

Session Logger

Whenever the loggers wrap up a session, the log_session() function is called, which is responsible for initializing a logging session and logging every record of the given session.

To do that, let’s organize our data into a pandas DataFrame, convert our timestamp column into datetime objects, and then split the DataFrame into batches of 1 minute each using df.groupby. That way, each operating session will be logged with statistical profiles for each minute of the ROV’s activity. The key and freq parameters tell us the column to group by and the frequency with which to group, respectively.

With the data appropriately split into batches, we can now call session.logger() for each batch of data, passing along the dataset_timestamp to mark the beginning of each window, and proceed to log each batch.
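Putting it together, log_session() could look roughly like the sketch below, using the whylogs v0 session API; the dataset name and the timestamp column name are assumptions.

import pandas as pd
from whylogs import get_or_create_session

def log_session(records, dataset_name="rov-telemetry"):
    """Split a session's records into 1-minute batches and log each with whylogs."""
    df = pd.DataFrame(records)
    # Convert the epoch timestamps into datetime objects (assuming milliseconds)
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms")

    session = get_or_create_session()
    # Group the session into 1-minute batches (~300 samples at 5 Hz)
    for batch_start, batch in df.groupby(pd.Grouper(key="timestamp", freq="1min")):
        if batch.empty:
            continue
        # dataset_timestamp marks the beginning of each 1-minute window
        with session.logger(dataset_name=dataset_name,
                            dataset_timestamp=batch_start) as logger:
            logger.log_dataframe(batch)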

The Monitoring Dashboard

At this stage, we have our consumers listening to any new records sent and logging them into our whylogs output folder which, according to the .whylogs.yaml configuration file, is the whylogs-output folder. What is left to do is to use the generated statistical profiles to inspect the operating sessions in search of signs of the injected faults.

Whylogs gives us plenty of ways to explore the logged statistical properties. In this example, I’ll limit myself to the built-in plots, which give us a way to visually inspect properties such as feature distributions, missing values, and data types.

Even though the main telemetry feature to be monitored is GYROZ, it would be useful to have an easy way to check the plots for all of the remaining features. As for the residual features, we might want to check the plots for different moving averages. Considering that, for any of these options, we have 3 or 4 plot types that could be of interest, it makes sense to centralize all of them in an interactive monitoring dashboard, where the user can select the specific plots they’re interested in. In this project, this was done in the form of an IPython widget.

In this article, I’ll show only the widget’s outputs, but you can check the full code and a detailed explanation in the accompanying Jupyter notebook. In the project’s repository, you’ll find the pre-logged information ready, so you’re free to skip directly to the dashboard section. Just be sure to install ipywidgets:

python -m pip install ipywidgets

And enable the widgets extension:

jupyter nbextension enable --py widgetsnbextension

Session 0 — Constant Gain Fault

Now, let’s see some output examples. By selecting the first session, we are presented with the date and time of the ROV’s operation, along with different choices for telemetry and residuals plots:

Image by author

This session has a constant gain sensor fault. From the picture above, assessing the GYROZ distribution alone is difficult, since different piloting patterns will lead to different distributions, which is normal. The unchanged residual values are also not very conclusive. But by applying a 3 s moving average, we definitely start to see some signs of unusual behavior. This is a case where monitoring only the vehicle’s sensor readings would not be enough to detect the fault.

Session 1 — Stuck-at (Zero)

Let’s continue to our second example, session 1:

Image by author

In this session, a stuck-at-0 fault was injected. Unfortunately, neither the telemetry nor the residual distributions were of much help in this case. The regression model was not sensitive enough to capture the fault. However, the Data Types plot yields an interesting result: the feature’s value is normally a Float, but a series of values of exactly 0 is inferred by whylogs as a sequence of Integers, making it stand out from the rest.

Session 2 — Stuck-at (Maximum)

The third example is also a stuck-at value, but fixed at the sensor’s maximum range:

Image by author

This one is fairly easy to detect, since the maximum value sits well above the usual distribution. Both the telemetry and residual distribution plots agree that there’s something abnormal around 21:13.

Session 3 — Drift and Loss

To avoid repetition, I won’t display the distribution plots for the Drift sensor fault, as this case produces plots very similar to Session 0. However, the Missing Values plot applied to the residual tells us something about the injected loss error:

Image by author

The residuals are logged as nan values whenever we don’t have a valid previous timestep. Therefore, missing values are expected for a session’s very first sampling points. However, when they appear in the middle of the run, that’s a sign that a prolonged period (>0.5 s) has passed between two sampling points, indicating a loss error. Even though we are not directly monitoring for loss errors, we can see their presence towards the end of the run, at around 09:14.

Conclusion

In this article, I wanted to explore available tools for data streaming & monitoring by applying them in a specific use-case: fault detection of an underwater drone. Even though the MLOps ecosystem is rapidly evolving, I feel like solutions for monitoring ML applications in production are still few and far between. This is why open-source tools like whylogs are a very welcome addition to the ML practitioner’s toolset.

In this project, our regression model was far from perfect, and in many cases the predictions alone were not enough to catch the injected fault conditions. However, by setting up a simple monitoring dashboard, we were able to get a broader view by looking at input and output plots in conjunction, as well as choosing from a number of different plots for different features. For this project, including only the built-in whylogs plots in the dashboard was enough, but we could add even more information by including a separate tab with the available summaries for each complete session.

As for the application itself, there’s also much room for expansion. In this example, we used Kafka locally for demonstration purposes, but I’m certain that things can get more complicated in a production environment. Maybe in the future, we could have entire fleets of ROVs being automatically monitored for faults!

That’s it for now! Thank you for reading, and if you have any questions or suggestions, feel free to reach out!
