Real-Time IoT Alerting: From Simple Thresholds to ML Anomaly Detection

A static threshold alert is a starting point, not a solution — production IoT systems need debounced thresholds, rate-of-change detection, and statistical anomaly detection to avoid alert fatigue while catching real problems. Here's how to build the full stack.

April 1, 2024
12 min read
IoT Alerting · Anomaly Detection · AWS Lambda · MQTT

The first alert system everyone builds is: "if temperature > 80°C, send an email." By month two, the on-call engineer is ignoring alerts because the sensor spikes briefly every time a forklift passes the unit. The challenge isn't detecting anomalies — it's detecting *meaningful* anomalies and routing them to the right person without crying wolf.

Layer 1: Simple Thresholds

The foundation. Fast to implement, easy to understand, and covers most critical failure modes.

// AWS Lambda: simple threshold alert handler
// Triggered by IoT Core Rule: SELECT * FROM 'devices/+/telemetry'
// publishAlert() is assumed to be defined elsewhere (for example, an SNS publish).
exports.handler = async (event) => {
  const { deviceId, temperature, humidity, ts } = event

  const alerts = []

  if (temperature > 80) {
    alerts.push({
      severity: 'CRITICAL',
      type: 'HIGH_TEMPERATURE',
      message: `${deviceId} temperature ${temperature}°C exceeds 80°C limit`,
      deviceId,
      value: temperature,
      threshold: 80,
      ts,
    })
  }

  if (humidity < 10) {
    alerts.push({
      severity: 'WARNING',
      type: 'LOW_HUMIDITY',
      message: `${deviceId} humidity ${humidity}% is critically low`,
      deviceId,
      value: humidity,
      threshold: 10,
      ts,
    })
  }

  for (const alert of alerts) {
    await publishAlert(alert)
  }
}
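The handler above hard-codes each limit. As a minimal sketch, assuming the same payload fields, the checks can be moved into a rule table so that adding a sensor does not mean adding another `if` block. The names `THRESHOLD_RULES` and `evaluateThresholds` are hypothetical and not part of any AWS API.

```javascript
// Hypothetical rule table: fields, limits, and severities mirror the handler above.
const THRESHOLD_RULES = [
  { field: 'temperature', op: '>', limit: 80, severity: 'CRITICAL', type: 'HIGH_TEMPERATURE' },
  { field: 'humidity', op: '<', limit: 10, severity: 'WARNING', type: 'LOW_HUMIDITY' },
]

// Returns one alert object for every rule that the reading violates.
function evaluateThresholds(reading, rules = THRESHOLD_RULES) {
  const alerts = []
  for (const rule of rules) {
    const value = reading[rule.field]
    // Skip missing or non-numeric fields instead of comparing against undefined.
    if (typeof value !== 'number' || Number.isNaN(value)) continue

    const breached = rule.op === '>' ? value > rule.limit : value < rule.limit
    if (breached) {
      alerts.push({
        severity: rule.severity,
        type: rule.type,
        deviceId: reading.deviceId,
        value,
        threshold: rule.limit,
        ts: reading.ts,
      })
    }
  }
  return alerts
}
```

The function is pure, so it can be unit-tested without a Lambda runtime and then called from the handler before publishing.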

Problem: A reading of 81°C that lasts 200ms triggers a CRITICAL page. Sensor noise does this constantly.

Layer 2: Debounced Thresholds

Only alert if the condition persists for a minimum duration. Store state in DynamoDB or ElastiCache.

const { DynamoDBClient, GetItemCommand, PutItemCommand } = require('@aws-sdk/client-dynamodb')
const dynamo = new DynamoDBClient({ region: 'us-east-1' })

const ALERT_DEBOUNCE_SECONDS = 30 // condition must persist 30s

// publishAlert() and resolveAlert() are assumed to be defined elsewhere.
async function checkDebouncedThreshold(deviceId, field, value, threshold, severity) {
  const key = `${deviceId}#${field}`
  const now = Math.floor(Date.now() / 1000)
  const isBreaching = value > threshold

  const existing = await dynamo.send(new GetItemCommand({
    TableName: 'AlertState',
    Key: { alertKey: { S: key } },
  }))

  const state = existing.Item
    ? {
        breachingSince: parseInt(existing.Item.breachingSince.N, 10),
        alerted: existing.Item.alerted.BOOL,
      }
    : null

  if (!isBreaching) {
    // Condition cleared: resolve any open alert and reset state
    if (state?.alerted) await resolveAlert(deviceId, field)
    await dynamo.send(new PutItemCommand({
      TableName: 'AlertState',
      Item: { alertKey: { S: key }, breachingSince: { N: '0' }, alerted: { BOOL: false } },
    }))
    return
  }

  if (!state || state.breachingSince === 0) {
    // First breach: record start time
    await dynamo.send(new PutItemCommand({
      TableName: 'AlertState',
      Item: { alertKey: { S: key }, breachingSince: { N: String(now) }, alerted: { BOOL: false } },
    }))
    return
  }

  const durationSeconds = now - state.breachingSince

  if (durationSeconds >= ALERT_DEBOUNCE_SECONDS && !state.alerted) {
    // Condition has persisted long enough: fire alert once
    await publishAlert({ deviceId, field, value, threshold, severity, durationSeconds })
    await dynamo.send(new PutItemCommand({
      TableName: 'AlertState',
      Item: { alertKey: { S: key }, breachingSince: { N: String(state.breachingSince) }, alerted: { BOOL: true } },
    }))
  }
}
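The debounce logic is easier to reason about, and to test, when the state transition is separated from storage. The sketch below is one possible way to express the same three cases (cleared, first breach, persisted breach) as a pure function. `nextDebounceState` and its `action` values are hypothetical names, and the caller is assumed to load and save `state` in DynamoDB or ElastiCache.

```javascript
const DEBOUNCE_SECONDS = 30 // same 30-second window as the DynamoDB version

// state: { breachingSince: number, alerted: boolean } or null if nothing is stored yet.
// Returns the state to persist plus an action: 'NONE', 'FIRE', or 'RESOLVE'.
function nextDebounceState(state, isBreaching, nowSeconds, debounceSeconds = DEBOUNCE_SECONDS) {
  if (!isBreaching) {
    // Condition cleared: resolve only if an alert was actually sent.
    return {
      state: { breachingSince: 0, alerted: false },
      action: state && state.alerted ? 'RESOLVE' : 'NONE',
    }
  }

  if (!state || state.breachingSince === 0) {
    // First breaching sample: start the clock, do not alert yet.
    return { state: { breachingSince: nowSeconds, alerted: false }, action: 'NONE' }
  }

  const durationSeconds = nowSeconds - state.breachingSince
  if (durationSeconds >= debounceSeconds && !state.alerted) {
    return { state: { ...state, alerted: true }, action: 'FIRE' }
  }

  // Still breaching, but either too early or already alerted.
  return { state, action: 'NONE' }
}
```

With this shape, a short spike (one breaching sample followed by a normal one) produces `NONE` twice, while a breach sustained for 30 seconds produces exactly one `FIRE`, followed by one `RESOLVE` when the value recovers.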

Layer 3: Compound and Rate-of-Change Alerts

Some conditions only matter in combination. A temperature of 75°C is fine in a furnace room, alarming in a server room.

// Compound alert: temperature high AND cooling system offline
function checkCompoundAlert(deviceId, readings, deviceConfig) {
  const { temperature, coolingSystemOnline } = readings
  const { maxTemp } = deviceConfig

  if (temperature > maxTemp * 0.9 && !coolingSystemOnline) {
    return {
      severity: 'CRITICAL',
      type: 'THERMAL_RUNAWAY_RISK',
      message: `${deviceId}: ${temperature}°C with cooling offline. Imminent thermal shutdown.`,
    }
  }
  return null
}

// Rate-of-change alert: temperature changing faster than 2°C/minute in either direction
// Assumes history is sorted by ts ascending, with ts in epoch milliseconds.
function checkRateOfChange(history, windowMs = 60000) {
  if (history.length < 2) return null
  const recent = history.filter(r => r.ts > Date.now() - windowMs)
  if (recent.length < 2) return null

  const oldest = recent[0]
  const newest = recent[recent.length - 1]
  const elapsedMinutes = (newest.ts - oldest.ts) / 60000
  if (elapsedMinutes <= 0) return null // duplicate timestamps would divide by zero
  const ratePerMinute = (newest.value - oldest.value) / elapsedMinutes

  if (Math.abs(ratePerMinute) > 2) {
    return {
      severity: 'WARNING',
      type: 'RAPID_TEMPERATURE_CHANGE',
      ratePerMinute: ratePerMinute.toFixed(2),
      message: `Temperature changing at ${ratePerMinute.toFixed(1)}°C/min`,
    }
  }
  return null
}
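The rate above is computed from only the first and last sample in the window, so a single noisy endpoint can swing the result. One alternative, which is not what the code above does, is to fit a least-squares line through every sample in the window and use its slope. The helper name `slopePerMinute` is hypothetical, and samples are assumed to carry `ts` in epoch milliseconds.

```javascript
// Least-squares slope of value against time, in units per minute.
// samples: [{ ts: epoch milliseconds, value: number }, ...]
function slopePerMinute(samples) {
  if (samples.length < 2) return null

  // Work in minutes relative to the first sample to keep the numbers small.
  const t0 = samples[0].ts
  const xs = samples.map(s => (s.ts - t0) / 60000)
  const ys = samples.map(s => s.value)
  const n = samples.length

  const meanX = xs.reduce((a, b) => a + b, 0) / n
  const meanY = ys.reduce((a, b) => a + b, 0) / n

  let numerator = 0
  let denominator = 0
  for (let i = 0; i < n; i++) {
    numerator += (xs[i] - meanX) * (ys[i] - meanY)
    denominator += (xs[i] - meanX) ** 2
  }

  if (denominator === 0) return null // all timestamps identical
  return numerator / denominator
}
```

The returned slope can be compared against the same 2°C/minute limit with `Math.abs(slope) > 2`.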

Layer 4: Statistical Anomaly Detection

Thresholds require you to know what's abnormal in advance. Statistical methods catch anomalies you didn't predict.

Z-Score (standard deviations from mean):

function zScoreAnomaly(history, currentValue, threshold = 3.0) {
  if (history.length < 30) return null // need enough data

  const values = history.map(r => r.value)
  const mean = values.reduce((a, b) => a + b, 0) / values.length
  const stdDev = Math.sqrt(
    values.reduce((sum, v) => sum + Math.pow(v - mean, 2), 0) / values.length
  )

  if (stdDev === 0) return null

  const zScore = Math.abs((currentValue - mean) / stdDev)

  if (zScore > threshold) {
    return {
      type: 'STATISTICAL_ANOMALY',
      severity: zScore > 5 ? 'CRITICAL' : 'WARNING',
      zScore: zScore.toFixed(2),
      mean: mean.toFixed(2),
      stdDev: stdDev.toFixed(2),
      message: `Value ${currentValue} is ${zScore.toFixed(1)} standard deviations from mean (${mean.toFixed(1)})`,
    }
  }
  return null
}
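`zScoreAnomaly` recomputes the mean and standard deviation from the full history on every reading. Where keeping that history per device is expensive, the same statistics can be maintained incrementally with Welford's online algorithm. The sketch below is an assumption about how that might look, not part of the pipeline above, and the function names are hypothetical. It tracks all samples seen so far rather than a sliding window, so a long-running baseline adapts slowly.

```javascript
// Running statistics: count, mean, and m2 (sum of squared deviations from the mean).
function createRunningStats() {
  return { count: 0, mean: 0, m2: 0 }
}

// Welford update: folds one new value into the stats without storing history.
function updateRunningStats(stats, value) {
  const count = stats.count + 1
  const delta = value - stats.mean
  const mean = stats.mean + delta / count
  const m2 = stats.m2 + delta * (value - mean)
  return { count, mean, m2 }
}

// Absolute z-score of value against the running stats, or null if not enough data.
function runningZScore(stats, value, minSamples = 30) {
  if (stats.count < minSamples) return null
  const stdDev = Math.sqrt(stats.m2 / stats.count) // population std dev, as in zScoreAnomaly
  if (stdDev === 0) return null
  return Math.abs((value - stats.mean) / stdDev)
}
```

The three numbers in the stats object are small enough to store next to the debounce state for each device and field.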

IQR-based outlier detection (more robust to skewed distributions):

function iqrAnomaly(history, currentValue, multiplier = 2.5) {
  if (history.length < 4) return null // too few samples to estimate quartiles

  const sorted = history.map(r => r.value).sort((a, b) => a - b)
  const q1 = sorted[Math.floor(sorted.length * 0.25)]
  const q3 = sorted[Math.floor(sorted.length * 0.75)]
  const iqr = q3 - q1
  const lower = q1 - multiplier * iqr
  const upper = q3 + multiplier * iqr

  if (currentValue < lower || currentValue > upper) {
    return { type: 'IQR_ANOMALY', lower: lower.toFixed(2), upper: upper.toFixed(2) }
  }
  return null
}
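Picking quartiles by array index, as above, is coarse when the history is short. A variant that interpolates linearly between neighbouring samples is sketched here as one possible refinement. `quantile` and `iqrBounds` are hypothetical helper names, and the 2.5 multiplier is carried over from `iqrAnomaly`.

```javascript
// Quantile with linear interpolation between the two nearest ranks.
// sortedValues must already be sorted ascending; p is between 0 and 1.
function quantile(sortedValues, p) {
  const pos = (sortedValues.length - 1) * p
  const lo = Math.floor(pos)
  const hi = Math.ceil(pos)
  return sortedValues[lo] + (pos - lo) * (sortedValues[hi] - sortedValues[lo])
}

// Returns the lower and upper fences, or null if there are too few samples.
function iqrBounds(values, multiplier = 2.5) {
  if (values.length < 4) return null
  const sorted = [...values].sort((a, b) => a - b) // copy so the caller's array is untouched
  const q1 = quantile(sorted, 0.25)
  const q3 = quantile(sorted, 0.75)
  const iqr = q3 - q1
  return { lower: q1 - multiplier * iqr, upper: q3 + multiplier * iqr }
}
```

For the values 10, 20, 30, 40 this gives q1 = 17.5 and q3 = 32.5, so the IQR is 15 and the fences are -20 and 70.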

For LSTM-based anomaly detection, TensorFlow.js can run inference directly in the Lambda function using a pre-trained model. Store the model in S3 and load it once at cold start, so that it is reused for the lifetime of the Lambda container.
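The "load once per container" part relies on module-scope variables surviving between invocations of a warm Lambda container. A minimal sketch of that caching pattern follows. The loader is injected as a parameter so the example stays independent of any ML library; `getModel` and `loadModel` are hypothetical names, and the actual S3 download and model deserialization are assumed to live inside `loadModel`.

```javascript
// Module scope persists across invocations while the container stays warm,
// so this promise is created at most once per cold start.
let modelPromise = null

// loadModel: hypothetical async function that downloads and deserializes the model.
function getModel(loadModel) {
  if (!modelPromise) {
    modelPromise = loadModel().catch(err => {
      modelPromise = null // clear the cache so the next invocation can retry
      throw err
    })
  }
  return modelPromise
}
```

Caching the promise rather than the resolved model means that concurrent calls during the first load share one download instead of starting several.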

Layer 5: Alert Routing

Different alert severities should reach different channels.

const { SNSClient, PublishCommand } = require('@aws-sdk/client-sns')
const sns = new SNSClient({ region: 'us-east-1' })

async function routeAlert(alert) {
  const routingMap = {
    CRITICAL: [
      process.env.SNS_CRITICAL_ARN, // PagerDuty integration
      process.env.SNS_WHATSAPP_ARN, // WhatsApp Business via SNS → Lambda → API
    ],
    WARNING: [
      process.env.SNS_WARNING_ARN, // Email via SES
    ],
    INFO: [
      process.env.SNS_INFO_ARN, // Slack webhook
    ],
  }

  const targets = routingMap[alert.severity] || routingMap.INFO

  // SNS subjects must be ASCII and under 100 characters: use a plain hyphen and truncate
  const subject = `[${alert.severity}] ${alert.type} - ${alert.deviceId}`.slice(0, 99)

  await Promise.all(targets.map(topicArn =>
    sns.send(new PublishCommand({
      TopicArn: topicArn,
      Message: JSON.stringify(alert),
      Subject: subject,
      MessageAttributes: {
        severity: { DataType: 'String', StringValue: alert.severity },
        deviceId: { DataType: 'String', StringValue: alert.deviceId },
      },
    }))
  ))
}

PagerDuty routing: Use SNS → Lambda → PagerDuty Events API v2. Critical alerts page the on-call engineer; warnings create low-urgency incidents.

WhatsApp: AWS SNS → Lambda → WhatsApp Business Cloud API. Effective for operational teams in regions where WhatsApp is the primary business communication channel.
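For the PagerDuty hop described above, the Lambda subscribed to the critical topic has to translate an alert into an Events API v2 event. The sketch below shows one possible mapping. The function names and the choice of `dedup_key` are assumptions, the routing key is assumed to come from the PagerDuty service integration, and `fetch` is assumed to be available globally (Node.js 18 or later).

```javascript
// Map this article's severities onto the values the Events API v2 accepts.
const PAGERDUTY_SEVERITY = { CRITICAL: 'critical', WARNING: 'warning', INFO: 'info' }

// Builds the JSON body for a "trigger" event.
function toPagerDutyEvent(alert, routingKey) {
  return {
    routing_key: routingKey,
    event_action: 'trigger',
    // Repeated alerts for the same device and type fold into one incident.
    dedup_key: `${alert.deviceId}#${alert.type}`,
    payload: {
      summary: `[${alert.severity}] ${alert.type} on ${alert.deviceId}`.slice(0, 1024),
      source: alert.deviceId,
      severity: PAGERDUTY_SEVERITY[alert.severity] || 'info',
      custom_details: alert,
    },
  }
}

// Sends the event; throws on a non-2xx response so the Lambda invocation is retried.
async function sendToPagerDuty(alert, routingKey) {
  const res = await fetch('https://events.pagerduty.com/v2/enqueue', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(toPagerDutyEvent(alert, routingKey)),
  })
  if (!res.ok) throw new Error(`PagerDuty returned HTTP ${res.status}`)
}
```

Sending the same `dedup_key` with `event_action: 'resolve'` closes the incident, which pairs naturally with the resolve step in the debounce layer.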

Putting It Together: Alert Priority Stack

Process layers in order, short-circuit when an alert fires:

1. Device offline detection (no heartbeat in 5 minutes) → CRITICAL
2. Compound threshold check (high severity) → CRITICAL
3. Simple debounced threshold → WARNING/CRITICAL
4. Rate-of-change check → WARNING
5. Z-score anomaly → INFO/WARNING
6. IQR outlier → INFO

This layering ensures the most actionable, highest-confidence alerts fire first.
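The ordering and short-circuit rule can be sketched as a small evaluator that runs checks from highest to lowest priority and stops at the first alert. This is an illustration under assumptions, not the production code: `firstAlert`, `checkOffline`, and the shape of the `context` object are hypothetical, and each check is assumed to be a synchronous function that returns an alert object or `null`.

```javascript
// checks: array of functions (context) => alert | null, ordered highest priority first.
function firstAlert(checks, context) {
  for (const check of checks) {
    const alert = check(context)
    if (alert) return alert // short-circuit: lower-priority checks are skipped
  }
  return null
}

const OFFLINE_AFTER_MS = 5 * 60 * 1000 // no heartbeat in 5 minutes

// Priority 1: device offline detection.
function checkOffline({ lastHeartbeatMs, nowMs }) {
  if (nowMs - lastHeartbeatMs > OFFLINE_AFTER_MS) {
    return { severity: 'CRITICAL', type: 'DEVICE_OFFLINE' }
  }
  return null
}

// Example wiring with a stand-in threshold check as the second layer.
const exampleChecks = [
  checkOffline,
  ctx => (ctx.temperature > 80 ? { severity: 'WARNING', type: 'HIGH_TEMPERATURE' } : null),
]
```

Calling `firstAlert(exampleChecks, context)` for a device that is both offline and reporting a stale high temperature returns only the `DEVICE_OFFLINE` alert, which is the behaviour the priority list describes.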

For the data storage layer that feeds historical analysis for anomaly detection baselines, see [Time-Series Databases for IoT](/blog/timeseries-databases-iot-influxdb-vs-timestream).

For the full data pipeline context, see [IoT Data Pipeline: Sensor to Dashboard](/blog/iot-data-pipeline-sensor-to-dashboard).

Need help with IoT real-time alerting? [Contact Code Caracal](/contact). We've shipped these systems for clients across 15+ countries.

Written by CodeCaracal Engineering

We write from production experience: every technique in our articles has been deployed to real clients. No academic theory.
