Building a Data Lake for Ignition SCADA Data - Approach 1
Introduction
In industrial automation, data is everything. It powers how we design, run, monitor, and optimize plant operations. As operations scale, the volume of data grows quickly, and if you're collecting it in real time, local storage alone doesn't cut it for long-term analytics or retention.
For this project, I was tasked with building a cloud-based data lake for an Ignition SCADA deployment. The goal was to securely and reliably offload raw SCADA data from the plant to the cloud for easier storage, reporting, and future machine learning work. The client was already using AWS for other services, so integrating the data lake on the same platform made perfect sense.
This setup was done in two phases:
- Approach 1 (this document): Export SCADA data from the database to Amazon S3 via Python
- Approach 2 (next): Real-time MQTT-based data streaming using the Cirrus Link IIoT modules (after license upgrade)
Project Steps: Approach 1
Step 1: SCADA Sensor Data into Ignition
I first configured OPC UA tags in Ignition to bring in real-time process data from the PLCs. The tags were set to update frequently (1-second scan class) to match the nature of the plant's operations.
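Before building anything downstream, it's worth confirming that the tags are actually updating. A quick check from the Designer's script console might look like the sketch below (this uses Ignition 8's Jython scripting API; the tag paths are placeholders, not the plant's real ones):

# Sanity check from the Ignition script console (Jython):
# confirm a few of the new OPC UA tags are updating with good quality.
# The tag paths below are placeholders; substitute real paths from the tag browser.
paths = ['[default]Block1/Temp1', '[default]Block1/Temp2']
values = system.tag.readBlocking(paths)
for path, qv in zip(paths, values):
    print(path, qv.value, qv.quality)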
Step 2: Logging SCADA Data to a Local Database
To store the live data, I used Ignition's Transaction Groups to log tags into a local MySQL database. The goal was to persist sensor data every second for each active machine.
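Before relying on the nightly export, it also helps to confirm the Transaction Group is actually inserting roughly one row per second. A minimal check, reusing the same environment-variable connection settings and the table/column names that appear in the export script below:

# Verify that the Transaction Group is logging roughly one row per second.
import os
import mysql.connector

conn = mysql.connector.connect(
    host=os.getenv('DB_HOST'),
    user=os.getenv('DB_USER'),
    password=os.getenv('DB_PASSWORD'),
    database=os.getenv('DB_NAME'),
)
cursor = conn.cursor()
cursor.execute(
    "SELECT COUNT(*), MAX(t_stamp) FROM block1_temps "
    "WHERE t_stamp >= NOW() - INTERVAL 1 MINUTE"
)
row_count, latest = cursor.fetchone()
print(f"Rows logged in the last minute: {row_count} (latest: {latest})")
cursor.close()
conn.close()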
Step 3: Exporting Daily Data to Amazon S3
The plant runs 24/6, and the local DB was growing rapidly. To avoid bloating the server and to make the data available remotely, I implemented a Python script that runs at midnight every day to:
- Extract the previous day's data (the day that just ended) from MySQL
- Format it as a CSV
- Upload it to a designated S3 bucket
This approach ensures each day's data is offloaded consistently and stored in a format suitable for long-term analysis.
Why S3?
- S3 offers highly available, scalable storage
- It's cheaper long-term than keeping all data in RDS
- Easy to integrate with AWS analytics services
To automate the daily exports:
- Installed the necessary Python packages (boto3, mysql-connector-python)
- Configured the AWS CLI with aws configure
- Used Windows Task Scheduler to run the script every day at midnight
Python Export Script
import mysql.connector
import boto3
import csv
import os
from datetime import datetime, timedelta

# Database connection settings, read from environment variables
DB_CONFIG = {
    'host': os.getenv('DB_HOST'),
    'user': os.getenv('DB_USER'),
    'password': os.getenv('DB_PASSWORD'),
    'database': os.getenv('DB_NAME')
}

S3_BUCKET = os.getenv('S3_BUCKET')
S3_FOLDER = 'block1/'
TABLES = ['block1_temps']
TIME_COLUMN = 't_stamp'

def dataexport(table_name, conn, export_date):
    """Export one calendar day's rows from the given table to a local CSV file."""
    cursor = conn.cursor()
    query = f"""
        SELECT * FROM {table_name}
        WHERE DATE({TIME_COLUMN}) = %s
    """
    cursor.execute(query, (export_date,))
    rows = cursor.fetchall()
    columns = [desc[0] for desc in cursor.description]

    filename = f"{export_date.strftime('%y-%m-%d')}-{table_name}.csv"
    with open(filename, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(columns)  # header row
        writer.writerows(rows)

    cursor.close()
    return filename

def upload_to_s3(filename, bucket, folder):
    """Upload the CSV to the S3 bucket under the configured folder prefix."""
    s3 = boto3.client('s3')
    s3_key = f"{folder}{filename}" if folder else filename
    s3.upload_file(filename, bucket, s3_key)
    print(f"Uploaded: {filename} → s3://{bucket}/{s3_key}")

def main():
    # The job is scheduled for midnight, so the day that just ended is "yesterday"
    export_date = datetime.now().date() - timedelta(days=1)
    conn = mysql.connector.connect(**DB_CONFIG)
    try:
        for table in TABLES:
            filename = dataexport(table, conn, export_date)
            upload_to_s3(filename, S3_BUCKET, S3_FOLDER)
            os.remove(filename)  # remove the local copy once it is safely in S3
    finally:
        conn.close()

if __name__ == "__main__":
    main()
Step 4: AWS Glue - Making the Data Queryable
Once the CSVs are in S3, they're just objects; S3 stores the files but doesn't interpret their contents. To make the data queryable, I used an AWS Glue Crawler to:
- Scan the S3 folder structure
- Infer schema (columns, types, partitions)
- Create a table in the Glue Data Catalog
This essentially gave me a virtual database table sitting on top of my S3 files.
Configuration Summary:
- Glue Database: abrouma_scada_data
- Crawler name: block1_crawler
- Schedule: Daily at 1 AM (right after the export)
- Table name inferred: block1_temps (one of the tables)
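I set the crawler up through the AWS console, but for reference, an equivalent configuration with boto3 might look like the sketch below. The IAM role ARN and the S3 path are placeholders (the path assumes the exports land under the block1/ prefix of the data bucket); the crawler name, Glue database, and schedule come from the summary above.

# Equivalent Glue crawler setup via boto3 (the actual crawler was created in the console).
import boto3

glue = boto3.client('glue')

glue.create_crawler(
    Name='block1_crawler',
    Role='arn:aws:iam::123456789012:role/GlueCrawlerRole',               # placeholder role ARN
    DatabaseName='abrouma_scada_data',
    Targets={'S3Targets': [{'Path': 's3://your-data-bucket/block1/'}]},  # placeholder bucket
    Schedule='cron(0 1 * * ? *)',                                        # daily at 1 AM
)

# Run it once immediately instead of waiting for the schedule
glue.start_crawler(Name='block1_crawler')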
Step 5: Querying with Amazon Athena
With the schema cataloged by Glue, I was now able to run SQL queries directly on the CSVs stored in S3 using Amazon Athena.
Athena is serverless. I didn't have to spin up any database instances or manage infrastructure. It's fast, efficient, and perfect for reporting, dashboards, or just ad-hoc analysis.
After setting up the S3 path for query results (s3://abrouma-scada/athena-results/), I ran queries like:
select *
from block1_block1
limit 10;
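The same query can also be run programmatically, which is handy for scheduled reports. A minimal boto3 sketch, using the Glue database from Step 4 and the results location configured above (the table name simply mirrors the ad-hoc query):

# Run an Athena query with boto3 and print the result rows.
import time
import boto3

athena = boto3.client('athena')

execution = athena.start_query_execution(
    QueryString='SELECT * FROM block1_block1 LIMIT 10;',
    QueryExecutionContext={'Database': 'abrouma_scada_data'},
    ResultConfiguration={'OutputLocation': 's3://abrouma-scada/athena-results/'},
)
query_id = execution['QueryExecutionId']

# Poll until the query finishes
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

if state == 'SUCCEEDED':
    for row in athena.get_query_results(QueryExecutionId=query_id)['ResultSet']['Rows']:
        print([col.get('VarCharValue') for col in row['Data']])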
Summary of AWS Services Used
Amazon S3
Amazon S3 was the core storage layer in this setup. It stores the exported SCADA data as raw CSV files, providing a scalable and durable place to archive time-series data from the plant.
AWS IAM
IAM was used to create access credentials for the Python export script. It ensured secure access to the S3 bucket and protected the data pipeline with fine-grained permissions.
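For illustration, the kind of least-privilege inline policy involved might look like the sketch below; the IAM user name, policy name, and bucket ARN are placeholders rather than the actual values used in the project.

# Attach a minimal S3 upload policy to the IAM user whose keys the export script uses.
import json
import boto3

iam = boto3.client('iam')

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": "arn:aws:s3:::your-data-bucket/block1/*"  # placeholder bucket
        }
    ]
}

iam.put_user_policy(
    UserName='scada-export-user',          # placeholder IAM user
    PolicyName='scada-s3-export-policy',   # placeholder policy name
    PolicyDocument=json.dumps(policy),
)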
AWS Glue
AWS Glue made it possible to catalog and interpret the raw files in S3. It automatically detected the table structure, data types, and partitions, enabling Athena to query the files as if they were in a database.
Amazon Athena
Athena provided the ability to run SQL queries directly on the S3 files using the schema defined by Glue. It required no infrastructure and gave us immediate insight into the SCADA data without duplicating it into a database.
Conclusion
With Approach 1 (the database export method) fully implemented and running reliably, I transitioned to Approach 2 once the Ignition license was upgraded to include the Cirrus Link MQTT modules. This second phase uses MQTT with Sparkplug B to stream live tag data directly from Ignition to AWS IoT Core, enabling real-time data ingestion into the data lake. With this setup, we move from scheduled batch uploads to a fully event-driven architecture, opening up possibilities for real-time monitoring, alerting, and dynamic dashboards.