IoT Monitoring and Observability: Metrics, Logs, and Distributed Tracing

At 3 AM, a dashboard goes dark for 200 devices. Is it a network outage? A broker crash? A firmware bug that crept into last night's OTA? A misconfigured IoT rule? Your cloud backend?

Without proper IoT monitoring and observability, answering that question takes hours of log-grepping across five AWS services. With the right stack, it takes two minutes, and a Slack alert has already fired.

This guide covers what we monitor, how we instrument it, and how we build dashboards that actually help on-call engineers find problems fast.

The Four Observability Signals in IoT

Classical observability talks about metrics, logs, and traces. IoT adds a fourth: device health signals — data that originates on the device itself and must survive the journey to your monitoring system even when the rest of your stack is degraded.

| Signal | Source | Tooling |
|---|---|---|
| Device health | Firmware (heartbeat, RSSI, free heap) | AWS IoT → CloudWatch |
| Infrastructure metrics | AWS IoT Core, Lambda, DynamoDB | CloudWatch Metrics |
| Structured logs | Lambda, backend, broker | CloudWatch Logs Insights |
| Distributed traces | End-to-end message flow | AWS X-Ray |

What to Monitor: The Essential Metric Set

Device-Side Metrics

Every firmware build should publish a heartbeat every 60 seconds containing:

// Firmware: heartbeat payload
void publishHeartbeat() {
  StaticJsonDocument<256> doc;
  // Connectivity
  doc["rssi"]          = WiFi.RSSI();
  doc["reconnects"]    = reconnectCount;
  // Resources
  doc["freeHeap"]      = esp_get_free_heap_size();
  doc["uptime"]        = esp_timer_get_time() / 1000000; // seconds
  // Application health
  doc["mqttDropped"]   = droppedMessages;    // messages lost when offline
  doc["lastSensorErr"] = lastSensorErrorCode;
  doc["fwVersion"]     = FW_VERSION;
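  char payload[256];
  serializeJson(doc, payload);
  // C++ string literals do not expand ${...}, so build the topic explicitly.
  // Assumes DEVICE_ID is a null-terminated C string.
  char topic[64];
  snprintf(topic, sizeof(topic), "devices/%s/heartbeat", DEVICE_ID);
  mqttClient.publish(topic, payload, 1);
  droppedMessages = 0; // reset counter after reporting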
}

An IoT Rule forwards these heartbeats to CloudWatch as custom metrics. When freeHeap drops below 20 KB, you have a memory leak. When reconnects spikes, there is a network problem. When mqttDropped is non-zero, your publish interval is too aggressive for the connection quality.
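The rules of thumb above can be sketched as a small triage function that a heartbeat consumer might run. This is only an illustration: the `Heartbeat` interface mirrors the firmware payload, while `triageHeartbeat`, the finding names, and the reconnect spike threshold are hypothetical. Only the 20 KB heap floor comes from the text. It also assumes `reconnects` is a counter that accumulates since boot, since the firmware snippet never resets it.

```typescript
// Shape of the heartbeat payload published by the firmware snippet.
interface Heartbeat {
  rssi: number
  reconnects: number    // assumed cumulative since boot
  freeHeap: number      // bytes
  uptime: number        // seconds
  mqttDropped: number   // reset by the firmware after each report
  lastSensorErr: number
  fwVersion: string
}

type Finding = 'memory_leak_suspected' | 'network_unstable' | 'publish_rate_too_high'

const FREE_HEAP_FLOOR_BYTES = 20 * 1024 // from the text: below 20 KB suggests a leak
const RECONNECT_SPIKE = 3               // hypothetical: new reconnects per heartbeat interval

// Compare the current heartbeat with the previous one and list what looks wrong.
function triageHeartbeat(prev: Heartbeat | undefined, curr: Heartbeat): Finding[] {
  const findings: Finding[] = []

  if (curr.freeHeap < FREE_HEAP_FLOOR_BYTES) {
    findings.push('memory_leak_suspected')
  }

  // A spike is a large increase since the last heartbeat, so it needs a baseline.
  if (prev !== undefined) {
    // After a reboot the counter restarts, so the current value is the whole delta.
    const rebooted = curr.uptime < prev.uptime
    const delta = rebooted ? curr.reconnects : curr.reconnects - prev.reconnects
    if (delta > RECONNECT_SPIKE) {
      findings.push('network_unstable')
    }
  }

  if (curr.mqttDropped > 0) {
    findings.push('publish_rate_too_high')
  }

  return findings
}
```

Working on deltas rather than raw counter values keeps a long-running device with many historical reconnects from being flagged forever.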

Cloud-Side Metrics (AWS IoT Core)

AWS IoT Core publishes these natively to CloudWatch — enable them in the IoT console:

Connect.Success / Connect.ClientError: authentication failures often spike before fleet-wide problems

PublishIn.Success / PublishIn.Throttle: throttling means you've hit the per-account limit

RuleExecution.Success / RuleExecution.Failure: silent failures in IoT Rules are common and dangerous

Subscribe.Success: devices renewing subscriptions after reconnect

The metric we watch most closely: RuleExecution.Failure. IoT Rules fail silently by default — a misconfigured SQL filter drops every message without any visible error unless you've set up a Dead Letter Queue and an alarm.

Application SLOs: The Metrics That Matter to the Business

Technical metrics describe your system. SLO metrics describe your promise to customers:

// Lambda: custom SLO metric — end-to-end ingestion latency
import { CloudWatch } from '@aws-sdk/client-cloudwatch'
const cw = new CloudWatch({ region: 'us-east-1' })
export async function recordIngestionLatency(
  deviceId: string,
  deviceTimestamp: number
) {
  const latencyMs = Date.now() - deviceTimestamp
  await cw.putMetricData({
    Namespace: 'IoTApp/SLO',
    MetricData: [
      {
        MetricName: 'IngestionLatencyMs',
        Value: latencyMs,
        Unit: 'Milliseconds',
        Dimensions: [
          { Name: 'DeviceType', Value: getDeviceType(deviceId) },
        ],
      },
    ],
  })
  // Log individual samples above 5 seconds so they are searchable;
  // the p99 SLO itself is evaluated over the metric, not per sample
  if (latencyMs > 5000) {
    console.error(JSON.stringify({
      level: 'ERROR',
      event: 'slo_violation',
      deviceId,
      latencyMs,
      slo: 'ingestion_latency_p99_5s',
    }))
  }
}

Target SLOs we typically set for IoT backends:

Ingestion latency p99 < 5 seconds (sensor read to database write)

Heartbeat miss rate < 0.1% over 1 hour

Rule execution success rate > 99.9%

Device online rate > 97% for battery-powered devices
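To make the targets above concrete, here is a minimal sketch of how three of them could be evaluated over one window of data. The `SloWindow` shape, the function names, and the violation labels are hypothetical. The percentile uses the simple nearest-rank method, which may differ slightly from what CloudWatch reports for the same samples.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('percentile of an empty sample set')
  const sorted = [...samples].sort((a, b) => a - b)
  // Multiply before dividing to keep the rank arithmetic in integers.
  const rank = Math.ceil((p * sorted.length) / 100)
  return sorted[Math.max(rank, 1) - 1]
}

// Hypothetical summary of one evaluation window (for example, one hour).
interface SloWindow {
  latenciesMs: number[]      // ingestion latency samples
  heartbeatsExpected: number
  heartbeatsReceived: number
  ruleSuccesses: number
  ruleFailures: number
}

interface SloReport {
  latencyP99Ms: number
  heartbeatMissRate: number
  ruleSuccessRate: number
  violations: string[]
}

function evaluateSlos(w: SloWindow): SloReport {
  const latencyP99Ms = percentile(w.latenciesMs, 99)
  const heartbeatMissRate =
    w.heartbeatsExpected === 0 ? 0 : 1 - w.heartbeatsReceived / w.heartbeatsExpected
  const ruleTotal = w.ruleSuccesses + w.ruleFailures
  const ruleSuccessRate = ruleTotal === 0 ? 1 : w.ruleSuccesses / ruleTotal

  const violations: string[] = []
  if (latencyP99Ms >= 5000) violations.push('ingestion_latency_p99_5s')      // target: p99 < 5 s
  if (heartbeatMissRate >= 0.001) violations.push('heartbeat_miss_rate_0.1pct') // target: < 0.1%
  if (ruleSuccessRate <= 0.999) violations.push('rule_success_rate_99.9pct')  // target: > 99.9%

  return { latencyP99Ms, heartbeatMissRate, ruleSuccessRate, violations }
}
```

Note how the p99 behaves: with 100 samples, a single slow message does not move it, but two slow messages do. That is the property that makes it a better paging signal than a per-message check.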

Structured Logging: Making Logs Queryable

Unstructured logs are archaeology. Structured JSON logs are searchable data. Every Lambda, every backend service, and ideally every significant firmware event should log JSON.

// Consistent log structure for CloudWatch Logs Insights queries
const log = {
  level: 'INFO',             // DEBUG | INFO | WARN | ERROR
  service: 'telemetry-ingestor',
  traceId: process.env._X_AMZN_TRACE_ID, // X-Ray tracing header set by the Lambda runtime
  deviceId: message.deviceId,
  event: 'telemetry_received',
  payloadBytes: rawPayload.length,
  processingMs: Date.now() - startTime,
  region: process.env.AWS_REGION,
}
console.log(JSON.stringify(log))
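One way to keep that structure consistent across services is to build every line through a single helper. The sketch below is an assumption about how such a helper might look, not a library API: `formatLog`, `LogContext`, and the level ordering are all hypothetical names.

```typescript
type LogLevel = 'DEBUG' | 'INFO' | 'WARN' | 'ERROR'

const LEVEL_ORDER: Record<LogLevel, number> = { DEBUG: 10, INFO: 20, WARN: 30, ERROR: 40 }

// Fields that identify where a log line came from; set once per invocation.
interface LogContext {
  service: string
  traceId?: string
  deviceId?: string
}

// Returns one JSON log line, or undefined when the entry is below minLevel.
function formatLog(
  minLevel: LogLevel,
  ctx: LogContext,
  level: LogLevel,
  event: string,
  fields: Record<string, unknown> = {}
): string | undefined {
  if (LEVEL_ORDER[level] < LEVEL_ORDER[minLevel]) return undefined
  // Later spreads win, so per-call fields cannot overwrite the context,
  // and nothing can overwrite level or event.
  return JSON.stringify({ ...fields, ...ctx, level, event })
}
```

A caller would pass the returned string straight to `console.log`. Keeping the formatter pure (no I/O) makes the field layout easy to unit test.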

With structured logs, CloudWatch Logs Insights queries become powerful:

# Find devices with elevated error rates in the last hour
fields deviceId, level, event
| filter level = "ERROR"
| stats count() as errorCount by deviceId
| sort errorCount desc
| limit 20

# Track p99 processing latency per device type
fields processingMs, deviceType
| filter event = "telemetry_received"
| stats pct(processingMs, 99) as p99 by deviceType

Distributed Tracing with AWS X-Ray

A sensor reading touches at least five components before it reaches your database: firmware → MQTT → IoT Core → IoT Rule → Lambda → database. When latency spikes, which hop is slow?

X-Ray traces the part of that journey that starts at the Lambda invocation, including downstream calls such as DynamoDB writes. Instrument your Lambda processors:

import AWSXRay from 'aws-xray-sdk-core'
import { DynamoDBClient } from '@aws-sdk/client-dynamodb'
// Wrap AWS SDK calls — X-Ray tracks them automatically
const dynamodb = AWSXRay.captureAWSv3Client(new DynamoDBClient({}))
export const handler = async (event: IoTRuleEvent) => {
  const segment = AWSXRay.getSegment()!
  const subsegment = segment.addNewSubsegment('process-telemetry')
  try {
    subsegment.addAnnotation('deviceId', event.deviceId)
    subsegment.addAnnotation('deviceType', event.deviceType)
    const result = await processTelemetry(event, dynamodb)
    subsegment.addMetadata('result', result)
    return result
  } catch (err) {
    subsegment.addError(err as Error)
    throw err
  } finally {
    subsegment.close()
  }
}
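The try/catch/finally pattern in the handler tends to get repeated in every function, so it can be worth factoring out. The following is a sketch of one way to do that. `withSubsegment`, `SegmentLike`, and `SubsegmentLike` are hypothetical names. The two interfaces declare only the methods used in the handler, and the assumption (worth verifying against the SDK typings) is that real X-Ray segments are structurally compatible with them.

```typescript
// Minimal structural types covering only the methods the handler uses.
interface SubsegmentLike {
  addAnnotation(key: string, value: string | number | boolean): void
  addError(err: Error): void
  close(): void
}

interface SegmentLike {
  addNewSubsegment(name: string): SubsegmentLike
}

// Run `work` inside a named subsegment, recording errors and always closing it.
async function withSubsegment<T>(
  parent: SegmentLike,
  name: string,
  annotations: Record<string, string | number | boolean>,
  work: (sub: SubsegmentLike) => Promise<T>
): Promise<T> {
  const sub = parent.addNewSubsegment(name)
  try {
    for (const [key, value] of Object.entries(annotations)) {
      sub.addAnnotation(key, value)
    }
    return await work(sub)
  } catch (err) {
    // Non-Error throwables are wrapped so the subsegment still records something useful.
    sub.addError(err instanceof Error ? err : new Error(String(err)))
    throw err
  } finally {
    sub.close()
  }
}
```

Depending only on a narrow interface also means the wrapper can be exercised in unit tests with a fake segment, without loading the X-Ray SDK.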

After a week of production traffic, the X-Ray service map shows you exactly where time is spent across every component. We've used this to discover that 40% of ingestion latency was coming from a cold-start DynamoDB connection pool in Lambda — fixed with provisioned concurrency and connection reuse.

Grafana Dashboards for IoT Fleets

CloudWatch dashboards work, but Grafana gives you more flexibility for IoT use cases. With the CloudWatch data source, we build two dashboards and a set of alerts:

Fleet Overview Dashboard panels:

Fleet online/offline ratio (gauge)

Messages per minute (time series, by device type)

Top 10 devices by error rate (table)

Geographic heatmap (if devices report location)

Firmware version distribution (pie chart — critical for OTA rollout tracking)

Device Drill-Down Dashboard panels:

RSSI over time (spot connection quality trends)

Free heap over time (catch memory leaks before crash)

Message publish rate vs expected rate (detect stuck firmware)

Reconnect count (find flaky network locations)

Alerting on SLO violations:

Configure alerts that page on-call when:

Fleet online rate drops below 95% for 5 minutes

Ingestion latency p99 exceeds SLO for 10 minutes

IoT Rule failure rate exceeds 0.5%

Any single device has not sent a heartbeat in 10 minutes (critical devices only)

Avoid alert fatigue — page on SLO violations, not on every individual device blip. IoT devices go offline routinely due to power cycles, network hiccups, and user interactions. Alerting on every missed heartbeat in a 10,000-device fleet generates hundreds of false positives per day.
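The "for 5 minutes" part of those conditions is what separates a page from a blip. As an illustration, a sustained-breach check can be sketched like this, assuming one datapoint per minute. `breachesFor` and `Datapoint` are hypothetical names, and a real deployment would more likely express the same rule in the alerting tool's own configuration.

```typescript
interface Datapoint {
  timestamp: number // ms since epoch; assumed one point per minute
  value: number
}

// True only when the most recent `periods` datapoints ALL breach the threshold.
function breachesFor(
  points: Datapoint[],
  threshold: number,
  periods: number,
  direction: 'below' | 'above'
): boolean {
  // Not enough data is treated as "not breaching" here; that is a design choice.
  if (periods <= 0 || points.length < periods) return false
  const recent = [...points]
    .sort((a, b) => a.timestamp - b.timestamp)
    .slice(-periods)
  return recent.every((p) =>
    direction === 'below' ? p.value < threshold : p.value > threshold
  )
}
```

For the fleet online rule, the call would be `breachesFor(onlineRatePoints, 0.95, 5, 'below')`. A single recovered minute inside the window resets the condition, which is what keeps routine reconnects from paging anyone. Treating missing data as healthy is debatable for heartbeat-style signals, where silence is itself the failure, so that branch deserves a deliberate decision.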

The Observability Checklist

Before going to production, verify you have:

Device heartbeat published every 60 seconds, forwarded to CloudWatch

IoT Rule DLQ configured for every rule that touches critical data

Structured JSON logging in every Lambda and backend service

X-Ray active tracing on all Lambda functions

CloudWatch alarms on the three most important SLOs

Grafana dashboard with fleet overview and device drill-down

On-call runbook linked from every alarm description

Tested: can you identify the root cause of a simulated outage in under 5 minutes?

The last point is non-negotiable. The best monitoring system is the one that gets tested before the 3 AM outage, not during it.

Need help? [Contact Code Caracal](/contact) — we've shipped these systems for clients across 15+ countries.
