IoT Monitoring and Observability: Metrics, Logs, and Distributed Tracing
At 3 AM, a dashboard goes dark for 200 devices. Is it a network outage? A broker crash? A firmware bug that crept into last night's OTA? A misconfigured IoT rule? Your cloud backend?
Without proper IoT monitoring and observability, answering that question takes hours of log-grepping across five AWS services. With the right stack, it takes two minutes, and a Slack alert has already fired.
This guide covers what we monitor, how we instrument it, and how we build dashboards that actually help on-call engineers find problems fast.
The Four Observability Signals in IoT
Classical observability talks about metrics, logs, and traces. IoT adds a fourth: device health signals — data that originates on the device itself and must survive the journey to your monitoring system even when the rest of your stack is degraded.
| Signal | Source | Tooling |
|---|---|---|
| Device health | Firmware (heartbeat, RSSI, free heap) | AWS IoT → CloudWatch |
| Infrastructure metrics | AWS IoT Core, Lambda, DynamoDB | CloudWatch Metrics |
| Structured logs | Lambda, backend, broker | CloudWatch Logs Insights |
| Distributed traces | End-to-end message flow | AWS X-Ray |
What to Monitor: The Essential Metric Set
Device-Side Metrics
Every firmware build should publish a heartbeat every 60 seconds containing:
// Firmware: heartbeat payload
void publishHeartbeat() {
  StaticJsonDocument<256> doc;

  // Connectivity
  doc["rssi"] = WiFi.RSSI();
  doc["reconnects"] = reconnectCount;

  // Resources
  doc["freeHeap"] = esp_get_free_heap_size();
  doc["uptime"] = (uint32_t)(esp_timer_get_time() / 1000000); // seconds

  // Application health
  doc["mqttDropped"] = droppedMessages; // messages lost when offline
  doc["lastSensorErr"] = lastSensorErrorCode;
  doc["fwVersion"] = FW_VERSION;

  char payload[256];
  serializeJson(doc, payload);

  // C++ string literals do not interpolate ${...}, so build the topic explicitly.
  // Assumes DEVICE_ID is a null-terminated C string.
  char topic[64];
  snprintf(topic, sizeof(topic), "devices/%s/heartbeat", DEVICE_ID);
  mqttClient.publish(topic, payload, 1);

  droppedMessages = 0; // reset counter after reporting
}
An IoT Rule forwards these heartbeats to CloudWatch as custom metrics. When freeHeap drops below 20 KB, you have a memory leak. When reconnects spikes, there is a network problem. When mqttDropped is non-zero, your publish interval is too aggressive for the connection quality.
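The three rules of thumb above can be expressed as a small check that runs on each heartbeat, for example in a Lambda behind the IoT Rule. The following is a minimal sketch, not a definitive implementation. It assumes the payload fields match the firmware example, that `reconnects` is cumulative since boot, and that a jump of more than `maxReconnectDelta` between two heartbeats counts as a spike. `evaluateHeartbeat`, `HealthIssue`, and `maxReconnectDelta` are hypothetical names introduced for illustration; only the 20 KB heap threshold comes from the text.

```typescript
// Heartbeat fields as published by the firmware example.
interface Heartbeat {
  rssi: number
  reconnects: number // assumed cumulative since boot
  freeHeap: number // bytes
  uptime: number // seconds
  mqttDropped: number // reset by the firmware after each report
  lastSensorErr: number
  fwVersion: string
}

type HealthIssue = 'low_heap' | 'reconnect_spike' | 'dropped_messages'

const LOW_HEAP_BYTES = 20 * 1024 // the 20 KB threshold from the text

// Returns the issues a heartbeat indicates. `previous` is the last heartbeat
// seen from the same device, if any. `maxReconnectDelta` is an assumed tuning knob.
function evaluateHeartbeat(
  current: Heartbeat,
  previous: Heartbeat | undefined,
  maxReconnectDelta = 3
): HealthIssue[] {
  const issues: HealthIssue[] = []

  if (current.freeHeap < LOW_HEAP_BYTES) {
    issues.push('low_heap')
  }

  // Only compare counters within the same boot: a lower uptime means the
  // device restarted and the cumulative counter started over.
  const sameBoot = previous !== undefined && current.uptime >= previous.uptime
  if (sameBoot && current.reconnects - previous.reconnects > maxReconnectDelta) {
    issues.push('reconnect_spike')
  }

  if (current.mqttDropped > 0) {
    issues.push('dropped_messages')
  }

  return issues
}
```

Returning a list of issue codes rather than a boolean keeps the function easy to map onto separate custom metrics or alarms, one per issue type.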
Cloud-Side Metrics (AWS IoT Core)
AWS IoT Core publishes these natively to CloudWatch — enable them in the IoT console:
- Connect.Success / Connect.ClientError: authentication failures often spike before fleet-wide problems
- PublishIn.Success / PublishIn.Throttled: throttling means you've hit the per-account limit
- RuleExecution.Success / RuleExecution.Failure: silent failures in IoT Rules are common and dangerous
- Subscribe.Success: devices renewing subscriptions after reconnect
The metric we watch most closely: RuleExecution.Failure. IoT Rules fail silently by default: a misconfigured SQL filter drops every message without any visible error unless you've set up a Dead Letter Queue and an alarm.
Application SLOs: The Metrics That Matter to the Business
Technical metrics describe your system. SLO metrics describe your promise to customers:
// Lambda: custom SLO metric for end-to-end ingestion latency
import { CloudWatch } from '@aws-sdk/client-cloudwatch'

const cw = new CloudWatch({ region: 'us-east-1' })

// getDeviceType(deviceId) is assumed to be defined elsewhere (not shown).
export async function recordIngestionLatency(
  deviceId: string,
  deviceTimestamp: number // epoch milliseconds reported by the device
) {
  // Clamp at zero: device clock skew can otherwise produce negative latencies
  const latencyMs = Math.max(0, Date.now() - deviceTimestamp)
  await cw.putMetricData({
    Namespace: 'IoTApp/SLO',
    MetricData: [
      {
        MetricName: 'IngestionLatencyMs',
        Value: latencyMs,
        Unit: 'Milliseconds',
        Dimensions: [
          { Name: 'DeviceType', Value: getDeviceType(deviceId) },
        ],
      },
    ],
  })
  // Log every sample that exceeds the 5-second budget. The p99 SLO itself is a
  // property of the distribution, so alarm on the p99 statistic of this metric
  // rather than on individual samples.
  if (latencyMs > 5000) {
    console.error(JSON.stringify({
      level: 'ERROR',
      event: 'slo_violation',
      deviceId,
      latencyMs,
      slo: 'ingestion_latency_p99_5s',
    }))
  }
}
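The handler above records one sample at a time, while a p99 target describes the whole distribution of samples in a window. As a rough illustration of the difference, here is a minimal sketch that computes a nearest-rank percentile over a batch of latencies and compares it with the 5-second target. `percentile` and `checkLatencySlo` are hypothetical helpers, and nearest-rank is one of several percentile definitions, so the result may differ slightly from the statistic CloudWatch reports.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error('percentile requires at least one sample')
  }
  const sorted = [...samples].sort((a, b) => a - b)
  // Multiply before dividing to keep the rank arithmetic in integers.
  const rank = Math.ceil((p * sorted.length) / 100)
  return sorted[Math.max(rank, 1) - 1]
}

const SLO_P99_MS = 5000 // the 5-second target from the text

// Evaluates one window of latency samples against the p99 target.
function checkLatencySlo(latenciesMs: number[]): { p99: number; violated: boolean } {
  const p99 = percentile(latenciesMs, 99)
  return { p99, violated: p99 > SLO_P99_MS }
}
```

With 100 samples in a window, a single slow message does not breach the target under this definition, but two do. That is the reason to page on the percentile and only log the individual outliers.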
Target SLOs we typically set for IoT backends:
Structured Logging: Making Logs Queryable
Unstructured logs are archaeology. Structured JSON logs are searchable data. Every Lambda, every backend service, and ideally every significant firmware event should log JSON.
// Consistent log structure for CloudWatch Logs Insights queries
const log = {
  level: 'INFO', // DEBUG | INFO | WARN | ERROR
  service: 'telemetry-ingestor',
  traceId: process.env._X_AMZN_TRACE_ID, // X-Ray trace header set by the Lambda runtime
  deviceId: message.deviceId,
  deviceType: message.deviceType, // lets queries group by device type
  event: 'telemetry_received',
  payloadBytes: rawPayload.length,
  processingMs: Date.now() - startTime,
  region: process.env.AWS_REGION,
}
console.log(JSON.stringify(log))
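One way to keep that structure consistent across handlers is to route every log call through a small helper instead of building the object by hand each time. The sketch below is one possible shape for such a helper. `formatLog` and `createLogger` are hypothetical names, not part of any AWS SDK, and the field set is an assumption based on the example above.

```typescript
type LogLevel = 'DEBUG' | 'INFO' | 'WARN' | 'ERROR'

// Extra fields are restricted to JSON-friendly scalar values.
interface LogFields {
  [key: string]: string | number | boolean | undefined
}

// Builds one JSON log line. The base keys are spread last so a caller cannot
// accidentally overwrite level, service, or event through `fields`.
function formatLog(
  service: string,
  level: LogLevel,
  event: string,
  fields: LogFields = {}
): string {
  return JSON.stringify({ ...fields, level, service, event })
}

// Returns a logger bound to one service name.
function createLogger(service: string) {
  return {
    info: (event: string, fields?: LogFields) =>
      console.log(formatLog(service, 'INFO', event, fields)),
    warn: (event: string, fields?: LogFields) =>
      console.warn(formatLog(service, 'WARN', event, fields)),
    error: (event: string, fields?: LogFields) =>
      console.error(formatLog(service, 'ERROR', event, fields)),
  }
}

// Usage: one logger per service, one event name per log call.
const logger = createLogger('telemetry-ingestor')
logger.info('telemetry_received', { deviceId: 'dev-1', payloadBytes: 128 })
```

Because `JSON.stringify` omits keys whose value is `undefined`, optional fields can be passed through without extra checks.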
With structured logs, CloudWatch Logs Insights queries become powerful:
# Find devices with elevated error rates in the last hour
fields deviceId, level, event
| filter level = "ERROR"
| stats count() as errorCount by deviceId
| sort errorCount desc
| limit 20

# Track p99 processing latency per device type
fields processingMs, deviceType
| filter event = "telemetry_received"
| stats pct(processingMs, 99) as p99 by deviceType
Distributed Tracing with AWS X-Ray
A sensor reading makes at least five hops before it reaches your database: firmware → MQTT → IoT Core → IoT Rule → Lambda → database. When latency spikes, which hop is slow?
X-Ray traces the entire journey. Instrument your Lambda processors:
import AWSXRay from 'aws-xray-sdk-core'
import { DynamoDBClient } from '@aws-sdk/client-dynamodb'

// Wrap AWS SDK calls so X-Ray tracks them automatically
const dynamodb = AWSXRay.captureAWSv3Client(new DynamoDBClient({}))

export const handler = async (event: IoTRuleEvent) => {
  const segment = AWSXRay.getSegment()!
  const subsegment = segment.addNewSubsegment('process-telemetry')
  try {
    subsegment.addAnnotation('deviceId', event.deviceId)
    subsegment.addAnnotation('deviceType', event.deviceType)
    const result = await processTelemetry(event, dynamodb)
    subsegment.addMetadata('result', result)
    return result
  } catch (err) {
    subsegment.addError(err as Error)
    throw err
  } finally {
    subsegment.close()
  }
}
After a week of production traffic, the X-Ray service map shows you exactly where time is spent across every component. We've used this to discover that 40% of ingestion latency was coming from a cold-start DynamoDB connection pool in Lambda — fixed with provisioned concurrency and connection reuse.
Grafana Dashboards for IoT Fleets
CloudWatch dashboards work, but Grafana gives you more flexibility for IoT use cases. With the CloudWatch data source:
Fleet Overview Dashboard panels:
Device Drill-Down Dashboard panels:
Alerting on SLO violations:
Configure alerts that page on-call when:
Avoid alert fatigue — page on SLO violations, not on every individual device blip. IoT devices go offline routinely due to power cycles, network hiccups, and user interactions. Alerting on every missed heartbeat in a 10,000-device fleet generates hundreds of false positives per day.
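A fleet-level rule is one way to put this into practice: page only when the share of silent devices crosses a threshold, not when any single device misses a heartbeat. The sketch below assumes the 60-second heartbeat interval from earlier in this guide. The three-missed-beats cutoff, the 5% paging threshold, and the names `offlineFraction` and `shouldPage` are illustrative assumptions, not recommendations from a specific tool.

```typescript
// Fraction of devices whose last heartbeat is older than `missedBeats` intervals.
// `lastSeenMs` maps deviceId to the epoch-millisecond time of its last heartbeat.
function offlineFraction(
  lastSeenMs: Map<string, number>,
  nowMs: number,
  heartbeatIntervalMs = 60_000, // 60-second heartbeat, as in the firmware section
  missedBeats = 3 // assumed tolerance before a device counts as offline
): number {
  if (lastSeenMs.size === 0) {
    return 0
  }
  const cutoffMs = nowMs - heartbeatIntervalMs * missedBeats
  let offline = 0
  for (const seen of lastSeenMs.values()) {
    if (seen < cutoffMs) {
      offline++
    }
  }
  return offline / lastSeenMs.size
}

// Page only when the offline share exceeds the threshold (assumed default: 5%).
function shouldPage(fraction: number, threshold = 0.05): boolean {
  return fraction > threshold
}
```

Individual offline devices can still be written to a low-priority log or dashboard panel. The point of the threshold is only to decide what is worth waking someone up for.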
The Observability Checklist
Before going to production, verify you have:
The last point is non-negotiable. The best monitoring system is the one that gets tested before the 3 AM outage, not during it.
Need help? [Contact Code Caracal](/contact) — we've shipped these systems for clients across 15+ countries.