Ignition SCADA Data Lake Implementation with AWS (Approach 2)

Introduction

After implementing Approach 1 (the database export method), which ran successfully on a daily schedule, the system was upgraded to a more real-time, event-driven architecture once the Ignition license was updated to include the Cirrus Link MQTT Transmission Module. This upgrade allowed live streaming of SCADA data over MQTT with Sparkplug B, greatly improving responsiveness and data availability.

This second phase replaced batch data exports with a continuous stream of operational data into AWS, allowing near real-time analytics, alerting, and monitoring through the data lake.

Project Steps: Approach 2

Step 1: Enabling MQTT Transmission in Ignition

With the Cirrus Link MQTT Transmission module now active, I configured MQTT tags in Ignition to publish live process data to a broker. The Sparkplug B (Sparkplug_B_v1_0_JSON) format was used for its efficiency in structuring industrial data and maintaining state; an illustrative payload is sketched after the configuration list below.

  • Configured MQTT Transmission settings with AWS IoT Core endpoint
  • Set Sparkplug B namespace and Edge Node configuration
  • Selected tag groups and published using "always" mode to reflect live values
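
For context, MQTT Transmission publishes on the Sparkplug B topic namespace spBv1.0/<group_id>/<message_type>/<edge_node_id>/<device_id>. The sketch below shows roughly what a DDATA message looks like in the Sparkplug_B_v1_0_JSON representation; the tag name and values are hypothetical.

# Illustrative Sparkplug B DDATA message as published by MQTT Transmission
# (JSON representation). Tag names and values are placeholders.
example_topic = "spBv1.0/My MQTT Group/DDATA/Edge Node 6dd918/Temperatures"

example_payload = {
    "timestamp": 1721469600000,      # milliseconds since epoch
    "metrics": [
        {
            "name": "Temperatures/Reactor1",   # hypothetical tag path
            "timestamp": 1721469600000,
            "dataType": "Float",
            "value": 72.4,
        }
    ],
    "seq": 3,                        # Sparkplug sequence number (0-255)
}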

MQTT server configuration with broker endpoint

 

MQTT transmitter configuration

MQTT Sparkplug namespace configuration

Step 2: Connecting to AWS IoT Core

To securely receive MQTT data in AWS, I registered an IoT Thing in AWS IoT Core and created a certificate/key pair for secure communication.

  • Registered a thing
  • Attached a policy allowing iot:Connect, iot:Publish, iot:Receive, iot:RetainPublish
  • Used the endpoint from IoT Core in Ignition's broker config

Once connected, MQTT messages were successfully published to AWS.
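
For reference, the same provisioning can be scripted with boto3. The sketch below is illustrative; the thing, policy, and role names are placeholders, and the resource policy should be scoped down for production.

import json
import boto3

iot = boto3.client("iot", region_name="us-east-1")  # region is a placeholder

# Register the thing that represents the Ignition edge node
iot.create_thing(thingName="ignition-edge-node")  # placeholder name

# Create a certificate/key pair for TLS mutual authentication
cert = iot.create_keys_and_certificate(setAsActive=True)

# Policy allowing the actions used by MQTT Transmission
policy_doc = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["iot:Connect", "iot:Publish", "iot:Receive", "iot:RetainPublish"],
        "Resource": "*",  # tighten to specific client and topic ARNs in practice
    }],
}
iot.create_policy(policyName="ignition-mqtt-policy",
                  policyDocument=json.dumps(policy_doc))

# Attach the policy and the thing to the certificate
iot.attach_policy(policyName="ignition-mqtt-policy", target=cert["certificateArn"])
iot.attach_thing_principal(thingName="ignition-edge-node", principal=cert["certificateArn"])

# The ATS data endpoint to enter in Ignition's MQTT Transmission server settings
endpoint = iot.describe_endpoint(endpointType="iot:Data-ATS")["endpointAddress"]
print(f"ssl://{endpoint}:8883")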

AWS IoT attached policy

AWS IoT certificates

Created thing

 

AWS IoT connection and subscription test

Step 3: Routing Data to S3 via Kinesis Firehose

To persist streamed MQTT data in S3, I created a rule in AWS IoT Core that routes matching MQTT messages to Kinesis Data Firehose.

  • SQL rule: SELECT * FROM 'spBv1.0/My MQTT Group/DDATA/Edge Node 6dd918/Temperatures'
  • Action: Send matching messages to Firehose stream
  • Firehose destination: S3 bucket (same as Approach 1)

I configured Firehose buffering and GZIP compression to reduce storage size and cost.
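
The rule and its Firehose action can also be expressed with boto3, as sketched below; the rule, delivery stream, and IAM role names are placeholders, while the SQL matches the topic filter above.

import boto3

iot = boto3.client("iot", region_name="us-east-1")  # region is a placeholder

# Route Sparkplug DDATA messages from the Temperatures topic into Firehose
iot.create_topic_rule(
    ruleName="block1_mqtt_to_firehose",  # placeholder name
    topicRulePayload={
        "sql": "SELECT * FROM 'spBv1.0/My MQTT Group/DDATA/Edge Node 6dd918/Temperatures'",
        "awsIotSqlVersion": "2016-03-23",
        "actions": [{
            "firehose": {
                "deliveryStreamName": "mqtt-to-s3-stream",  # placeholder
                "roleArn": "arn:aws:iam::123456789012:role/iot-firehose-role",  # placeholder
                "separator": "\n",  # newline-delimited records so Glue/Athena can parse them
            }
        }],
    },
)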

Firehose delivery stream config showing S3 target

Step 4: Organizing S3 Folder Structure

Data streamed through Firehose was saved into S3 using a structured path format:

s3://abrouma-mqtt1/year=2024/month=07/day=20/

This made partitioning easier later in Glue. File format was JSON to preserve full MQTT payload structure.
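
That path layout comes from the Firehose S3 destination settings. A boto3 sketch of a comparable delivery stream configuration is shown below; the IAM role and buffering values are placeholders, while the prefix produces the year=/month=/day= keys above.

import boto3

firehose = boto3.client("firehose", region_name="us-east-1")  # region is a placeholder

firehose.create_delivery_stream(
    DeliveryStreamName="mqtt-to-s3-stream",  # placeholder, referenced by the IoT rule action
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "BucketARN": "arn:aws:s3:::abrouma-mqtt1",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-s3-role",  # placeholder
        # Hive-style partition keys so Glue and Athena can partition by date
        "Prefix": "year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/",
        "ErrorOutputPrefix": "errors/!{firehose:error-output-type}/",
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 300},
        "CompressionFormat": "GZIP",
    },
)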

S3 folder showing streamed JSON files by day

Step 5: AWS Glue Crawler for Real-Time Data

Glue was already configured from Approach 1. I added a new crawler to catalog the real-time streamed data (a boto3 sketch follows the settings below):

  • Crawler name: block1_stream
  • Source: s3://abrouma-mqtt1/
  • Schedule: Every hour
  • Format: JSON
  • Target DB: abrouma_realtime_data
  • Table: abrouma_mqtt1
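
The crawler can be defined with boto3 roughly as follows; the IAM role is a placeholder, and the crawler derives the table name (abrouma_mqtt1) from the S3 path.

import boto3

glue = boto3.client("glue", region_name="us-east-1")  # region is a placeholder

glue.create_crawler(
    Name="block1_stream",
    Role="arn:aws:iam::123456789012:role/glue-crawler-role",  # placeholder
    DatabaseName="abrouma_realtime_data",
    Targets={"S3Targets": [{"Path": "s3://abrouma-mqtt1/"}]},
    Schedule="cron(0 * * * ? *)",  # run at the top of every hour
)

# Trigger an initial run without waiting for the schedule
glue.start_crawler(Name="block1_stream")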

Glue crawler targeting the abrouma-mqtt1 bucket

Step 6: Querying and Analysis in Athena

Using the Glue Data Catalog, I was able to run standard SQL queries against the real-time data in Athena. For example, a quick preview of the latest streamed records:

SELECT *
FROM "AwsDataCatalog"."abrouma_realtime_data"."abrouma_mqtt1"
limit 10;
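
The same preview can be run programmatically, for example with boto3; the results bucket below is a placeholder.

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")  # region is a placeholder

resp = athena.start_query_execution(
    QueryString='SELECT * FROM "AwsDataCatalog"."abrouma_realtime_data"."abrouma_mqtt1" LIMIT 10;',
    QueryExecutionContext={"Database": "abrouma_realtime_data"},
    ResultConfiguration={"OutputLocation": "s3://abrouma-athena-results/"},  # placeholder bucket
)
query_id = resp["QueryExecutionId"]

# Wait for the query to finish, then print the rows
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])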

Athena query editor showing real-time result

Comparison with Approach 1

Stage              | Approach 1 (DB Export)   | Approach 2 (MQTT Streaming)
Ingestion Method   | Daily Python script      | Real-time MQTT over TLS
Source             | MySQL                    | Ignition via MQTT Sparkplug B
Transport          | boto3 (HTTP)             | AWS IoT Core + Kinesis Firehose
S3 Format          | CSV                      | JSON
Freshness          | Daily                    | Seconds
Complexity         | Simple scripting         | Needs IoT config & broker setup

 

Summary of AWS Services Used (Approach 2)

AWS IoT Core

AWS IoT Core acted as the secure entry point for all MQTT data published from Ignition. It managed connections, message routing, and triggered downstream delivery to other AWS services like Kinesis.

Amazon Kinesis Data Firehose

Firehose served as the pipe that streamed data from IoT Core to S3 in near real time. It supported buffering, compression, and S3 path partitioning.

Amazon S3

S3 continued to serve as the central data lake storage. In this phase, it stored real-time JSON files partitioned by date, allowing both cold and hot data to coexist.

AWS Glue

Glue crawlers cataloged the streamed JSON data into queryable tables. This enabled Athena to understand and query the semi-structured MQTT payloads.

Amazon Athena

Athena provided the interface to analyze real-time data from the data lake using standard SQL, enabling faster insights and trend tracking.

Conclusion

Approach 2 greatly improved the responsiveness and flexibility of the SCADA data lake. Moving from batch exports to real-time MQTT streaming enabled better visibility into current plant conditions and laid the groundwork for advanced alerting and dashboards. With both approaches now in place, the system supports both historical analysis and real-time visibility, using only scalable and fully managed AWS services.

Key references:

  1. https://docs.chariot.io/display/CLD80/Connecting+to+AWS+IoT+Core
  2. https://www.cloudthat.com/resources/blog/a-guide-to-ingest-iot-data-to-kinesis-data-firehose-and-store-in-s3

 

Status: Completed

Start Date: November 27, 2024

End Date: December 19, 2024