Real-Time IoT Alerting: From Simple Thresholds to ML Anomaly Detection
The first alert system everyone builds is: "if temperature > 80°C, send an email." By month two, the on-call engineer is ignoring alerts because the sensor spikes briefly every time a forklift passes the unit. The challenge isn't detecting anomalies — it's detecting *meaningful* anomalies and routing them to the right person without crying wolf.
Layer 1: Simple Thresholds
The foundation. Fast to implement, easy to understand, and covers most critical failure modes.
// AWS Lambda: simple threshold alert handler
// Triggered by IoT Core Rule: SELECT * FROM 'devices/+/telemetry'
// publishAlert() is assumed to be defined elsewhere (for example, an SNS publish wrapper).
exports.handler = async (event) => {
  const { deviceId, temperature, humidity, ts } = event
  const alerts = []

  if (temperature > 80) {
    alerts.push({
      severity: 'CRITICAL',
      type: 'HIGH_TEMPERATURE',
      message: `${deviceId} temperature ${temperature}°C exceeds 80°C limit`,
      deviceId,
      value: temperature,
      threshold: 80,
      ts,
    })
  }

  if (humidity < 10) {
    alerts.push({
      severity: 'WARNING',
      type: 'LOW_HUMIDITY',
      message: `${deviceId} humidity ${humidity}% is critically low`,
      deviceId,
      value: humidity,
      threshold: 10,
      ts,
    })
  }

  for (const alert of alerts) {
    await publishAlert(alert)
  }
}
Problem: A reading of 81°C that lasts 200ms triggers a CRITICAL page. Sensor noise does this constantly.
Layer 2: Debounced Thresholds
Only alert if the condition persists for a minimum duration. Store state in DynamoDB or ElastiCache.
const { DynamoDBClient, GetItemCommand, PutItemCommand } = require('@aws-sdk/client-dynamodb')

const dynamo = new DynamoDBClient({ region: 'us-east-1' })
const ALERT_DEBOUNCE_SECONDS = 30 // condition must persist 30s

// publishAlert() and resolveAlert() are assumed to be defined elsewhere.
async function checkDebouncedThreshold(deviceId, field, value, threshold, severity) {
  const key = `${deviceId}#${field}`
  const now = Math.floor(Date.now() / 1000)
  const isBreaching = value > threshold

  const existing = await dynamo.send(new GetItemCommand({
    TableName: 'AlertState',
    Key: { alertKey: { S: key } },
  }))
  const state = existing.Item ? {
    breachingSince: parseInt(existing.Item.breachingSince.N, 10),
    alerted: existing.Item.alerted.BOOL,
  } : null

  if (!isBreaching) {
    // Condition cleared: reset state
    if (state?.alerted) await resolveAlert(deviceId, field)
    await dynamo.send(new PutItemCommand({
      TableName: 'AlertState',
      Item: { alertKey: { S: key }, breachingSince: { N: '0' }, alerted: { BOOL: false } },
    }))
    return
  }

  if (!state || state.breachingSince === 0) {
    // First breach: record start time
    await dynamo.send(new PutItemCommand({
      TableName: 'AlertState',
      Item: { alertKey: { S: key }, breachingSince: { N: String(now) }, alerted: { BOOL: false } },
    }))
    return
  }

  const durationSeconds = now - state.breachingSince
  if (durationSeconds >= ALERT_DEBOUNCE_SECONDS && !state.alerted) {
    // Condition has persisted long enough: fire alert
    await publishAlert({ deviceId, field, value, threshold, severity, durationSeconds })
    await dynamo.send(new PutItemCommand({
      TableName: 'AlertState',
      Item: { alertKey: { S: key }, breachingSince: { N: String(state.breachingSince) }, alerted: { BOOL: true } },
    }))
  }
}
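The debounce decision itself can be separated from the DynamoDB calls, which makes it easy to test with replayed readings. The following is a minimal sketch under that assumption; `nextDebounceState` and its `action` values are hypothetical names, not part of the handler above.

```javascript
// Pure debounce transition: no I/O, so it can be unit-tested directly.
// `state` is null or { breachingSince: <epoch seconds, 0 = not breaching>, alerted: <bool> }.
function nextDebounceState(state, isBreaching, nowSeconds, debounceSeconds = 30) {
  if (!isBreaching) {
    return {
      state: { breachingSince: 0, alerted: false },
      action: state && state.alerted ? 'RESOLVE' : 'NONE',
    }
  }
  if (!state || state.breachingSince === 0) {
    return { state: { breachingSince: nowSeconds, alerted: false }, action: 'NONE' }
  }
  const duration = nowSeconds - state.breachingSince
  if (duration >= debounceSeconds && !state.alerted) {
    return { state: { ...state, alerted: true }, action: 'FIRE' }
  }
  return { state, action: 'NONE' }
}

// Replay a short sequence of readings against an 80°C threshold.
const readings = [
  { t: 1000, value: 81 }, // breach starts
  { t: 1010, value: 79 }, // clears after 10s: the spike is ignored
  { t: 1020, value: 82 }, // breach starts again
  { t: 1055, value: 83 }, // 35s later: alert fires
  { t: 1060, value: 84 }, // still breaching: no duplicate alert
  { t: 1070, value: 70 }, // clears: alert resolves
]
let state = null
const actions = readings.map(({ t, value }) => {
  const result = nextDebounceState(state, value > 80, t)
  state = result.state
  return result.action
})
console.log(actions)
// [ 'NONE', 'NONE', 'NONE', 'FIRE', 'NONE', 'RESOLVE' ]
```

With this split, the Lambda only has to load the state, call the transition, persist the new state, and act on `FIRE` or `RESOLVE`.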
Layer 3: Compound and Rate-of-Change Alerts
Some conditions only matter in combination. A temperature of 75°C is fine in a furnace room, alarming in a server room.
// Compound alert: temperature high AND cooling system offline
function checkCompoundAlert(deviceId, readings, deviceConfig) {
  const { temperature, coolingSystemOnline } = readings
  const { maxTemp } = deviceConfig

  if (temperature > maxTemp * 0.9 && !coolingSystemOnline) {
    return {
      severity: 'CRITICAL',
      type: 'THERMAL_RUNAWAY_RISK',
      message: `${deviceId}: ${temperature}°C with cooling offline. Imminent thermal shutdown.`,
    }
  }
  return null
}

// Rate-of-change alert: temperature changing by more than 2°C/minute in either direction
// Assumes history is sorted oldest-first, with ts in epoch milliseconds.
function checkRateOfChange(history, windowMs = 60000) {
  if (history.length < 2) return null
  const cutoff = Date.now() - windowMs
  const recent = history.filter(r => r.ts > cutoff)
  if (recent.length < 2) return null

  const oldest = recent[0]
  const newest = recent[recent.length - 1]
  const elapsedMinutes = (newest.ts - oldest.ts) / 60000
  if (elapsedMinutes <= 0) return null // identical timestamps: rate is undefined
  const ratePerMinute = (newest.value - oldest.value) / elapsedMinutes

  if (Math.abs(ratePerMinute) > 2) {
    return {
      severity: 'WARNING',
      type: 'RAPID_TEMPERATURE_CHANGE',
      ratePerMinute: ratePerMinute.toFixed(2),
      message: `Temperature changing at ${ratePerMinute.toFixed(1)}°C/min`,
    }
  }
  return null
}
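The endpoint-based rate above depends entirely on the first and last samples in the window, so one noisy reading at either end can trip it. One possible variation, not taken from the code above, is to fit a least-squares slope across every sample in the window. A minimal sketch, where `slopePerMinute` is a hypothetical helper and the sample data is made up:

```javascript
// Least-squares slope of value over time, in units per minute.
// `points` is an array of { ts: <epoch ms>, value: <number> }.
function slopePerMinute(points) {
  if (points.length < 2) return null
  const n = points.length
  const t0 = points[0].ts
  // Work in minutes relative to the first sample to keep the numbers small.
  const xs = points.map(p => (p.ts - t0) / 60000)
  const ys = points.map(p => p.value)
  const meanX = xs.reduce((a, b) => a + b, 0) / n
  const meanY = ys.reduce((a, b) => a + b, 0) / n
  let num = 0
  let den = 0
  for (let i = 0; i < n; i++) {
    num += (xs[i] - meanX) * (ys[i] - meanY)
    den += (xs[i] - meanX) ** 2
  }
  if (den === 0) return null // all samples share one timestamp
  return num / den
}

// Nine samples over 60s: flat at 40°C, then a single 43°C reading at the end.
const base = 1700000000000
const samples = Array.from({ length: 9 }, (_, i) => ({
  ts: base + i * 7500,
  value: i === 8 ? 43 : 40,
}))

const endpointRate = (samples[8].value - samples[0].value) / ((samples[8].ts - samples[0].ts) / 60000)
const fittedRate = slopePerMinute(samples)
console.log(endpointRate)          // 3: the endpoint method would exceed the 2°C/min limit
console.log(fittedRate.toFixed(2)) // about 1.60: the fitted slope stays under it
```

The trade-off is that a fitted slope also reacts more slowly to a genuine step change, so the window length and limit would need tuning per sensor.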
Layer 4: Statistical Anomaly Detection
Thresholds require you to know what's abnormal in advance. Statistical methods catch anomalies you didn't predict.
Z-Score (standard deviations from mean):
function zScoreAnomaly(history, currentValue, threshold = 3.0) {
  if (history.length < 30) return null // need enough data

  const values = history.map(r => r.value)
  const mean = values.reduce((a, b) => a + b, 0) / values.length
  const stdDev = Math.sqrt(
    values.reduce((sum, v) => sum + Math.pow(v - mean, 2), 0) / values.length
  )
  if (stdDev === 0) return null // constant history: z-score is undefined

  const zScore = Math.abs((currentValue - mean) / stdDev)
  if (zScore > threshold) {
    return {
      type: 'STATISTICAL_ANOMALY',
      severity: zScore > 5 ? 'CRITICAL' : 'WARNING',
      zScore: zScore.toFixed(2),
      mean: mean.toFixed(2),
      stdDev: stdDev.toFixed(2),
      message: `Value ${currentValue} is ${zScore.toFixed(1)} standard deviations from mean (${mean.toFixed(1)})`,
    }
  }
  return null
}
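Recomputing the mean and standard deviation from the full history on every message means each invocation has to fetch that history first. One alternative, offered here as a sketch and not as part of the function above, is Welford's online algorithm, which keeps only three numbers per device and field. `createRunningStats` is a hypothetical helper name.

```javascript
// Welford's online algorithm: update mean and variance one sample at a time.
function createRunningStats() {
  let count = 0
  let mean = 0
  let m2 = 0 // running sum of squared deviations from the current mean
  return {
    push(value) {
      count += 1
      const delta = value - mean
      mean += delta / count
      m2 += delta * (value - mean)
    },
    // Signed z-score of `value` against the samples seen so far (population std dev).
    zScore(value) {
      if (count < 2) return null
      const stdDev = Math.sqrt(m2 / count)
      if (stdDev === 0) return null
      return (value - mean) / stdDev
    },
    // The three numbers that would need to be persisted between invocations.
    snapshot() {
      return { count, mean, m2 }
    },
  }
}

const stats = createRunningStats()
for (const v of [20, 21, 19, 20, 22, 18, 20, 21, 19, 20]) stats.push(v)

console.log(stats.snapshot())               // count 10, mean about 20, m2 about 12
console.log(stats.zScore(25) > 3)           // true: 25 is far outside the usual range
console.log(Math.abs(stats.zScore(21)) > 3) // false
```

Unlike a sliding window, these statistics never forget old data, so a real deployment would probably need some form of periodic reset or decay; that choice is left open here.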
IQR-based outlier detection (more robust to skewed distributions):
function iqrAnomaly(history, currentValue, multiplier = 2.5) {
  const sorted = [...history.map(r => r.value)].sort((a, b) => a - b)
  if (sorted.length < 4) return null // too few samples for meaningful quartiles

  const q1 = sorted[Math.floor(sorted.length * 0.25)]
  const q3 = sorted[Math.floor(sorted.length * 0.75)]
  const iqr = q3 - q1
  const lower = q1 - multiplier * iqr
  const upper = q3 + multiplier * iqr

  if (currentValue < lower || currentValue > upper) {
    return { type: 'IQR_ANOMALY', lower: lower.toFixed(2), upper: upper.toFixed(2) }
  }
  return null
}
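To see why the IQR bounds hold up better than a z-score when the history itself contains bad readings, consider a made-up history in which a sensor glitch has already written a few values of 500. The sketch below uses simplified stand-ins (`zScoreOf`, `iqrBounds`) for the two functions above; the data is invented for illustration.

```javascript
// Absolute z-score of x against `values` (population standard deviation).
function zScoreOf(values, x) {
  const mean = values.reduce((a, b) => a + b, 0) / values.length
  const variance = values.reduce((s, v) => s + (v - mean) ** 2, 0) / values.length
  return Math.abs(x - mean) / Math.sqrt(variance)
}

// IQR fences using the same floor-index quartiles as iqrAnomaly.
function iqrBounds(values, multiplier = 2.5) {
  const sorted = [...values].sort((a, b) => a - b)
  const q1 = sorted[Math.floor(sorted.length * 0.25)]
  const q3 = sorted[Math.floor(sorted.length * 0.75)]
  const iqr = q3 - q1
  return { lower: q1 - multiplier * iqr, upper: q3 + multiplier * iqr }
}

// 36 normal readings cycling through 19..22°C, plus 4 glitch readings of 500.
const normal = Array.from({ length: 36 }, (_, i) => 19 + (i % 4))
const values = [...normal, 500, 500, 500, 500]
const current = 35

const z = zScoreOf(values, current)
const bounds = iqrBounds(values)
console.log(z < 3)                  // true: the glitches inflate the std dev, so 35 looks normal
console.log(bounds)                 // { lower: 15, upper: 27 }
console.log(current > bounds.upper) // true: the IQR check still flags 35
```

The quartiles sit inside the bulk of the data, so a handful of extreme values barely moves the fences, whereas they drag the mean and standard deviation a long way.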
For LSTM-based anomaly detection, TensorFlow.js can run inference directly in the Lambda function using a pre-trained model. Store the model in S3 and load it once per cold start, so that warm invocations in the same Lambda container reuse the loaded model.
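The load-once pattern can be sketched independently of TensorFlow.js by caching the load promise at module scope. Everything below is illustrative: `createCachedLoader` is a hypothetical helper, and `fakeLoadModel` is a stand-in for whatever code would download the model from S3 and deserialize it.

```javascript
// Cache the in-flight promise so every invocation in the same container
// shares a single load, including invocations that arrive while it is loading.
function createCachedLoader(loadFn) {
  let cached = null
  return function getModel() {
    if (!cached) {
      cached = loadFn().catch(err => {
        cached = null // allow a retry on the next invocation
        throw err
      })
    }
    return cached
  }
}

// Stand-in loader for illustration only. It returns a trivial "model"
// that averages its input window.
let loadCount = 0
async function fakeLoadModel() {
  loadCount += 1
  return { predict: window => window.reduce((a, b) => a + b, 0) / window.length }
}

// Module scope: created once per container, reused across warm invocations.
const getModel = createCachedLoader(fakeLoadModel)

async function handler(event) {
  const model = await getModel()
  return model.predict(event.window)
}
```

Calling `handler` repeatedly in one process should invoke the loader only once, which is the behavior a warm Lambda container would see.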
Layer 5: Alert Routing
Different alert severities should reach different channels.
const { SNSClient, PublishCommand } = require('@aws-sdk/client-sns')

const sns = new SNSClient({ region: 'us-east-1' })

async function routeAlert(alert) {
  const routingMap = {
    CRITICAL: [
      process.env.SNS_CRITICAL_ARN, // PagerDuty integration
      process.env.SNS_WHATSAPP_ARN, // WhatsApp Business via SNS → Lambda → API
    ],
    WARNING: [
      process.env.SNS_WARNING_ARN, // Email via SES
    ],
    INFO: [
      process.env.SNS_INFO_ARN, // Slack webhook
    ],
  }

  // Unknown severities fall back to the INFO channel
  const targets = routingMap[alert.severity] || routingMap.INFO

  await Promise.all(targets.map(topicArn =>
    sns.send(new PublishCommand({
      TopicArn: topicArn,
      Message: JSON.stringify(alert),
      // SNS subjects must be ASCII and under 100 characters, so use a plain hyphen
      Subject: `[${alert.severity}] ${alert.type} - ${alert.deviceId}`,
      MessageAttributes: {
        severity: { DataType: 'String', StringValue: alert.severity },
        deviceId: { DataType: 'String', StringValue: alert.deviceId },
      },
    }))
  ))
}
PagerDuty routing: Use SNS → Lambda → PagerDuty Events API v2. Critical alerts page the on-call engineer; warnings create low-urgency incidents.
WhatsApp: AWS SNS → Lambda → WhatsApp Business Cloud API. Effective for operational teams in regions where WhatsApp is the primary business communication channel.
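For the PagerDuty route, the Lambda in the middle mostly translates the alert into an Events API v2 body. The sketch below shows one way that mapping might look; `toPagerDutyEvent`, the dedup key format, and the severity table are assumptions for illustration. The resulting object would be sent as JSON in a POST to `https://events.pagerduty.com/v2/enqueue`.

```javascript
// Internal severity -> PagerDuty severity (allowed values: critical, error, warning, info).
const PAGERDUTY_SEVERITY = { CRITICAL: 'critical', WARNING: 'warning', INFO: 'info' }

// Map an internal alert to a PagerDuty Events API v2 "trigger" event body.
function toPagerDutyEvent(alert, routingKey) {
  return {
    routing_key: routingKey,
    event_action: 'trigger',
    // Assumed convention: one open incident per device and alert type.
    dedup_key: `${alert.deviceId}:${alert.type}`,
    payload: {
      summary: `[${alert.severity}] ${alert.type} on ${alert.deviceId}`,
      source: alert.deviceId,
      severity: PAGERDUTY_SEVERITY[alert.severity] || 'info',
      custom_details: alert,
    },
  }
}

const pdEvent = toPagerDutyEvent(
  { severity: 'CRITICAL', type: 'HIGH_TEMPERATURE', deviceId: 'pump-7', value: 84 },
  'EXAMPLE_ROUTING_KEY' // placeholder, not a real integration key
)
console.log(JSON.stringify(pdEvent, null, 2))
```

Reusing the same `dedup_key` with `event_action: 'resolve'` is how the matching incident would be closed once the condition clears.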
Putting It Together: Alert Priority Stack
Process the layers in order and short-circuit as soon as one of them fires an alert.
This layering ensures the most actionable, highest-confidence alerts fire first.
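A minimal sketch of that short-circuit evaluation follows. The helper name `evaluateLayers`, the specific ordering, and the inline checks are all assumptions for illustration; in practice each entry would call the corresponding function from the layers above.

```javascript
// Run checks in priority order and stop at the first one that returns an alert.
function evaluateLayers(checks, context) {
  for (const { name, check } of checks) {
    const alert = check(context)
    if (alert) return { layer: name, ...alert }
  }
  return null
}

// Hypothetical ordering: hard limits first, statistical signals last.
const checks = [
  {
    name: 'threshold',
    check: ctx => (ctx.temperature > 80 ? { severity: 'CRITICAL', type: 'HIGH_TEMPERATURE' } : null),
  },
  {
    name: 'rate-of-change',
    check: ctx => (Math.abs(ctx.ratePerMinute) > 2 ? { severity: 'WARNING', type: 'RAPID_TEMPERATURE_CHANGE' } : null),
  },
  {
    name: 'statistical',
    check: ctx => (ctx.zScore > 3 ? { severity: 'WARNING', type: 'STATISTICAL_ANOMALY' } : null),
  },
]

// Both the rate-of-change and statistical checks would match here,
// but only the earlier layer is reported.
const result = evaluateLayers(checks, { temperature: 70, ratePerMinute: 3.2, zScore: 4.1 })
console.log(result)
// { layer: 'rate-of-change', severity: 'WARNING', type: 'RAPID_TEMPERATURE_CHANGE' }
```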
For the data storage layer that feeds historical analysis for anomaly detection baselines, see [Time-Series Databases for IoT](/blog/timeseries-databases-iot-influxdb-vs-timestream).
For the full data pipeline context, see [IoT Data Pipeline: Sensor to Dashboard](/blog/iot-data-pipeline-sensor-to-dashboard).
Need help with IoT real-time alerting? [Contact Code Caracal](/contact) — we've shipped these systems for clients across 15+ countries.