IoT Engineering

IoT Monitoring and Observability: Metrics, Logs, and Distributed Tracing

When a device stops reporting at 3 AM, how fast can you find the root cause? A proper IoT observability stack tells you in seconds. Here is how to build one.

May 5, 2024
14 min read
IoT Monitoring · Observability · CloudWatch · Grafana

At 3 AM, a dashboard goes dark for 200 devices. Is it a network outage? A broker crash? A firmware bug that crept into last night's OTA? A misconfigured IoT rule? Your cloud backend?

Without proper IoT monitoring and observability, answering that question takes hours of log-grepping across five AWS services. With the right stack, it takes two minutes, and a Slack alert has already fired.

This guide covers what we monitor, how we instrument it, and how we build dashboards that actually help on-call engineers find problems fast.

The Four Observability Signals in IoT

Classical observability talks about metrics, logs, and traces. IoT adds a fourth: device health signals — data that originates on the device itself and must survive the journey to your monitoring system even when the rest of your stack is degraded.

| Signal | Source | Tooling |
|---|---|---|
| Device health | Firmware (heartbeat, RSSI, free heap) | AWS IoT → CloudWatch |
| Infrastructure metrics | AWS IoT Core, Lambda, DynamoDB | CloudWatch Metrics |
| Structured logs | Lambda, backend, broker | CloudWatch Logs Insights |
| Distributed traces | End-to-end message flow | AWS X-Ray |

What to Monitor: The Essential Metric Set

Device-Side Metrics

Every firmware build should publish a heartbeat every 60 seconds containing:

// Firmware: heartbeat payload
void publishHeartbeat() {
  StaticJsonDocument<256> doc;

  // Connectivity
  doc["rssi"] = WiFi.RSSI();
  doc["reconnects"] = reconnectCount;

  // Resources
  doc["freeHeap"] = esp_get_free_heap_size();
  doc["uptime"] = esp_timer_get_time() / 1000000;  // seconds

  // Application health
  doc["mqttDropped"] = droppedMessages;  // messages lost when offline
  doc["lastSensorErr"] = lastSensorErrorCode;
  doc["fwVersion"] = FW_VERSION;

  char payload[256];
  serializeJson(doc, payload);

  // C++ string literals do not interpolate "${...}", so build the topic explicitly.
  // Assumes DEVICE_ID is a null-terminated C string.
  char topic[64];
  snprintf(topic, sizeof(topic), "devices/%s/heartbeat", DEVICE_ID);
  mqttClient.publish(topic, payload, 1);

  droppedMessages = 0;  // reset counter after reporting
}

An IoT Rule forwards these heartbeats to CloudWatch as custom metrics. When freeHeap drops below 20 KB, you have a memory leak. When reconnects spikes, there is a network problem. When mqttDropped is non-zero, your publish interval is too aggressive for the connection quality.
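On the cloud side, the rules of thumb above can be turned into a small classifier that runs on each incoming heartbeat. The TypeScript sketch below is illustrative only: the `Heartbeat` interface mirrors the firmware payload, but `evaluateHeartbeat`, the finding names, and the reconnect threshold of 5 are assumptions, not part of any AWS API. It also assumes `reconnects` is cumulative since boot, since the firmware never resets it.

```typescript
// Mirrors the heartbeat payload published by the firmware.
interface Heartbeat {
  rssi: number;
  reconnects: number;    // assumed cumulative since boot
  freeHeap: number;      // bytes
  uptime: number;        // seconds since boot
  mqttDropped: number;   // messages dropped since the previous heartbeat
  lastSensorErr: number;
  fwVersion: string;
}

type Finding =
  | 'possible_memory_leak'
  | 'network_problem'
  | 'publish_interval_too_aggressive';

const FREE_HEAP_MIN_BYTES = 20 * 1024; // 20 KB threshold from the text
const RECONNECT_SPIKE = 5;             // assumption: tune per fleet

function evaluateHeartbeat(current: Heartbeat, previous?: Heartbeat): Finding[] {
  const findings: Finding[] = [];

  if (current.freeHeap < FREE_HEAP_MIN_BYTES) {
    findings.push('possible_memory_leak');
  }

  // A spike is measured as the increase since the last heartbeat.
  // After a reboot (uptime went backwards) the counter restarted from zero.
  if (previous !== undefined) {
    const rebooted = current.uptime < previous.uptime;
    const delta = rebooted
      ? current.reconnects
      : current.reconnects - previous.reconnects;
    if (delta >= RECONNECT_SPIKE) {
      findings.push('network_problem');
    }
  }

  if (current.mqttDropped > 0) {
    findings.push('publish_interval_too_aggressive');
  }

  return findings;
}
```

Comparing against the previous heartbeat rather than an absolute count keeps a long-running device with many historical reconnects from being flagged forever.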

Cloud-Side Metrics (AWS IoT Core)

AWS IoT Core publishes these natively to CloudWatch — enable them in the IoT console:

  • Connect.Success / Connect.ClientError: authentication failures often spike before fleet-wide problems
  • PublishIn.Success / PublishIn.Throttle: throttling means you've hit a publish rate limit (per connection or per account)
  • RuleExecution.Success / RuleExecution.Failure: silent failures in IoT Rules are common and dangerous
  • Subscribe.Success: devices renewing subscriptions after reconnect

The metric we watch most closely: RuleExecution.Failure. IoT Rules fail silently by default: a misconfigured SQL filter drops every message without any visible error unless you've set up a Dead Letter Queue and an alarm.

Application SLOs: The Metrics That Matter to the Business

Technical metrics describe your system. SLO metrics describe your promise to customers:

// Lambda: custom SLO metric (end-to-end ingestion latency)
import { CloudWatch } from '@aws-sdk/client-cloudwatch'

const cw = new CloudWatch({ region: 'us-east-1' })

export async function recordIngestionLatency(
  deviceId: string,
  deviceTimestamp: number // epoch milliseconds set by the device
) {
  const latencyMs = Date.now() - deviceTimestamp

  await cw.putMetricData({
    Namespace: 'IoTApp/SLO',
    MetricData: [
      {
        MetricName: 'IngestionLatencyMs',
        Value: latencyMs,
        Unit: 'Milliseconds',
        Dimensions: [
          // getDeviceType is an application-specific lookup defined elsewhere
          { Name: 'DeviceType', Value: getDeviceType(deviceId) },
        ],
      },
    ],
  })

  // Log each sample that exceeds the 5-second budget. The p99 SLO itself is
  // evaluated over a time window by a CloudWatch alarm, not per message.
  if (latencyMs > 5000) {
    console.error(JSON.stringify({
      level: 'ERROR',
      event: 'slo_violation',
      deviceId,
      latencyMs,
      slo: 'ingestion_latency_p99_5s',
    }))
  }
}

Target SLOs we typically set for IoT backends:

  • Ingestion latency p99 < 5 seconds (sensor read to database write)
  • Heartbeat miss rate < 0.1% over 1 hour
  • Rule execution success rate > 99.9%
  • Device online rate > 97% for battery-powered devices
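To make these targets concrete, here is a minimal sketch of checking one evaluation window against all four. The `WindowStats` shape and the function names are hypothetical, and the percentile uses the nearest-rank method, which may differ slightly from the interpolation a metrics backend applies. It assumes a non-empty window (at least one latency sample, one expected heartbeat, and one device).

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error('percentile of an empty sample set is undefined');
  }
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Raw counts collected over one evaluation window (hypothetical shape).
interface WindowStats {
  ingestionLatenciesMs: number[];
  heartbeatsExpected: number;
  heartbeatsReceived: number;
  ruleSuccesses: number;
  ruleFailures: number;
  devicesOnline: number;
  devicesTotal: number;
}

interface SloReport {
  ingestionLatencyP99Ok: boolean;
  heartbeatMissRateOk: boolean;
  ruleSuccessRateOk: boolean;
  deviceOnlineRateOk: boolean;
}

function evaluateSlos(w: WindowStats): SloReport {
  const p99 = percentile(w.ingestionLatenciesMs, 99);
  const missRate = 1 - w.heartbeatsReceived / w.heartbeatsExpected;
  const ruleTotal = w.ruleSuccesses + w.ruleFailures;
  // No rule executions in the window counts as "no failures".
  const ruleSuccessRate = ruleTotal === 0 ? 1 : w.ruleSuccesses / ruleTotal;
  const onlineRate = w.devicesOnline / w.devicesTotal;

  return {
    ingestionLatencyP99Ok: p99 < 5000,        // p99 < 5 seconds
    heartbeatMissRateOk: missRate < 0.001,    // < 0.1%
    ruleSuccessRateOk: ruleSuccessRate > 0.999, // > 99.9%
    deviceOnlineRateOk: onlineRate > 0.97,    // > 97%
  };
}
```

Note that a single slow message does not break the latency target: with 100 samples, one outlier still leaves the p99 at the 99th-ranked value.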

Structured Logging: Making Logs Queryable

Unstructured logs are archaeology. Structured JSON logs are searchable data. Every Lambda, every backend service, and ideally every significant firmware event should log JSON.

// Consistent log structure for CloudWatch Logs Insights queries
const log = {
  level: 'INFO',             // DEBUG | INFO | WARN | ERROR
  service: 'telemetry-ingestor',
  traceId: process.env._X_AMZN_TRACE_ID, // X-Ray trace header, set by Lambda
  deviceId: message.deviceId,
  deviceType: message.deviceType, // assumed present on the message; needed for per-type queries
  event: 'telemetry_received',
  payloadBytes: rawPayload.length,
  processingMs: Date.now() - startTime,
  region: process.env.AWS_REGION,
}

console.log(JSON.stringify(log))
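One way to keep that structure consistent across services is to centralize it in a small helper instead of building the object by hand in every handler. This is a sketch under stated assumptions: `formatLog`, `createLogger`, and `LogContext` are hypothetical names, not part of any AWS SDK, and the `write` callback stands in for `console.log`.

```typescript
type LogLevel = 'DEBUG' | 'INFO' | 'WARN' | 'ERROR';

// Fields that are the same for every log line a service emits.
interface LogContext {
  service: string;
  region?: string;
  traceId?: string;
}

// Builds one JSON log line. Returning the string (instead of printing it)
// keeps the function pure and easy to test.
function formatLog(
  ctx: LogContext,
  level: LogLevel,
  event: string,
  fields: Record<string, unknown> = {}
): string {
  // Event-specific fields are spread first so they can never overwrite
  // the core keys that queries depend on.
  return JSON.stringify({
    ...fields,
    level,
    service: ctx.service,
    traceId: ctx.traceId,
    region: ctx.region,
    event,
  });
}

// Convenience wrapper bound to one service and one output sink.
function createLogger(ctx: LogContext, write: (line: string) => void) {
  return {
    info: (event: string, fields?: Record<string, unknown>) =>
      write(formatLog(ctx, 'INFO', event, fields)),
    warn: (event: string, fields?: Record<string, unknown>) =>
      write(formatLog(ctx, 'WARN', event, fields)),
    error: (event: string, fields?: Record<string, unknown>) =>
      write(formatLog(ctx, 'ERROR', event, fields)),
  };
}
```

In a Lambda, the logger would typically be created once per invocation with `console.log` as the sink. Keys whose value is `undefined` (for example a missing `traceId`) are omitted from the output by `JSON.stringify`.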

With structured logs, CloudWatch Logs Insights queries become powerful:

# Find devices with elevated error rates in the last hour
fields deviceId, level, event
| filter level = "ERROR"
| stats count() as errorCount by deviceId
| sort errorCount desc
| limit 20

# Track p99 processing latency per device type
fields processingMs, deviceType
| filter event = "telemetry_received"
| stats pct(processingMs, 99) as p99 by deviceType

Distributed Tracing with AWS X-Ray

A sensor reading touches at least five components before it reaches your database: firmware → MQTT → IoT Core → IoT Rule → Lambda → database. When latency spikes, which hop is slow?

X-Ray traces the entire journey. Instrument your Lambda processors:

import AWSXRay from 'aws-xray-sdk-core'
import { DynamoDBClient } from '@aws-sdk/client-dynamodb'

// Wrap AWS SDK calls; X-Ray tracks them automatically
const dynamodb = AWSXRay.captureAWSv3Client(new DynamoDBClient({}))

// IoTRuleEvent and processTelemetry are application-specific and defined elsewhere
export const handler = async (event: IoTRuleEvent) => {
  const segment = AWSXRay.getSegment()!
  const subsegment = segment.addNewSubsegment('process-telemetry')

  try {
    subsegment.addAnnotation('deviceId', event.deviceId)
    subsegment.addAnnotation('deviceType', event.deviceType)

    const result = await processTelemetry(event, dynamodb)

    subsegment.addMetadata('result', result)
    return result
  } catch (err) {
    subsegment.addError(err as Error)
    throw err
  } finally {
    subsegment.close()
  }
}

After a week of production traffic, the X-Ray service map shows you exactly where time is spent across every component. We've used this to discover that 40% of ingestion latency was coming from a cold-start DynamoDB connection pool in Lambda, which we fixed with provisioned concurrency and connection reuse.
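The question a trace answers can be illustrated without any tracing SDK: given a timestamp at each hop, the slowest hop is the largest gap between neighbours. The sketch below is a hypothetical illustration, not the X-Ray API. The `HopTimestamp` shape and function names are assumptions, and device-side timestamps are only comparable with cloud-side ones if the device clock is synchronized.

```typescript
// One observation of a message at a given hop (hypothetical shape).
interface HopTimestamp {
  hop: string;
  atMs: number; // epoch milliseconds when the message reached this hop
}

interface HopDuration {
  from: string;
  to: string;
  durationMs: number;
}

// Converts an ordered timeline into the time spent between adjacent hops.
function hopDurations(timeline: HopTimestamp[]): HopDuration[] {
  const durations: HopDuration[] = [];
  for (let i = 1; i < timeline.length; i++) {
    durations.push({
      from: timeline[i - 1].hop,
      to: timeline[i].hop,
      durationMs: timeline[i].atMs - timeline[i - 1].atMs,
    });
  }
  return durations;
}

// Returns the hop transition with the largest duration, or undefined when
// the timeline has fewer than two entries.
function slowestHop(timeline: HopTimestamp[]): HopDuration | undefined {
  let worst: HopDuration | undefined;
  for (const d of hopDurations(timeline)) {
    if (worst === undefined || d.durationMs > worst.durationMs) {
      worst = d;
    }
  }
  return worst;
}
```

A tracing service does this bookkeeping across services automatically; the value of the sketch is only to show what "which hop is slow?" means in terms of data.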

Grafana Dashboards for IoT Fleets

CloudWatch dashboards work, but Grafana gives you more flexibility for IoT use cases. With the CloudWatch data source, we build two dashboards.

Fleet Overview Dashboard panels:

  • Fleet online/offline ratio (gauge)
  • Messages per minute (time series, by device type)
  • Top 10 devices by error rate (table)
  • Geographic heatmap (if devices report location)
  • Firmware version distribution (pie chart, critical for OTA rollout tracking)

Device Drill-Down Dashboard panels:

  • RSSI over time (spot connection quality trends)
  • Free heap over time (catch memory leaks before crash)
  • Message publish rate vs expected rate (detect stuck firmware)
  • Reconnect count (find flaky network locations)
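The free-heap panel is most useful when a downward trend is caught early. A minimal sketch, assuming heartbeat samples are available as (uptime, free heap) pairs: fit a least-squares line through the samples and estimate how long until the heap reaches the 20 KB floor mentioned earlier. The function names and the `HeapSample` shape are hypothetical, and a straight-line fit is only a rough model of real leak behavior.

```typescript
interface HeapSample {
  uptimeS: number;       // seconds since boot
  freeHeapBytes: number;
}

// Least-squares slope of free heap over time, in bytes per second.
// A clearly negative slope over a long window suggests a leak.
function heapSlopeBytesPerSecond(samples: HeapSample[]): number {
  const n = samples.length;
  if (n < 2) return 0;

  let sumT = 0;
  let sumH = 0;
  for (const s of samples) {
    sumT += s.uptimeS;
    sumH += s.freeHeapBytes;
  }
  const meanT = sumT / n;
  const meanH = sumH / n;

  let num = 0;
  let den = 0;
  for (const s of samples) {
    const dt = s.uptimeS - meanT;
    num += dt * (s.freeHeapBytes - meanH);
    den += dt * dt;
  }
  return den === 0 ? 0 : num / den;
}

// Estimated seconds until free heap reaches the floor, or null when the
// heap is not shrinking (or there is no data).
function secondsUntilHeapFloor(
  samples: HeapSample[],
  floorBytes: number = 20 * 1024
): number | null {
  const slope = heapSlopeBytesPerSecond(samples);
  if (samples.length === 0 || slope >= 0) return null;
  const latest = samples[samples.length - 1];
  return Math.max(0, (latest.freeHeapBytes - floorBytes) / -slope);
}
```

Fitting over a window of samples rather than comparing two points avoids reacting to the normal short-term jitter in heap usage.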

Alerting on SLO violations:

Configure alerts that page on-call when:

  • Fleet online rate drops below 95% for 5 minutes
  • Ingestion latency p99 exceeds SLO for 10 minutes
  • IoT Rule failure rate exceeds 0.5%
  • Any single device has not sent a heartbeat in 10 minutes (critical devices only)

Avoid alert fatigue: page on SLO violations, not on every individual device blip. IoT devices go offline routinely due to power cycles, network hiccups, and user interactions. Alerting on every missed heartbeat in a 10,000-device fleet generates hundreds of false positives per day.
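The "for 5 minutes" and "for 10 minutes" qualifiers are what separate a page-worthy condition from a blip. The sketch below shows one way to express that: an alert fires only when every datapoint in the trailing window breaches the threshold and the window is fully populated. The `Datapoint` shape, the function name, and the one-minute default period are assumptions; managed alarm services offer equivalent settings, and their handling of missing data may differ.

```typescript
interface Datapoint {
  timestampMs: number;
  value: number;
}

// True only when the trailing window is fully covered by datapoints and
// every one of them breaches. A single healthy datapoint resets the alert.
function sustainedBreach(
  points: Datapoint[],
  nowMs: number,
  windowMs: number,
  isBreach: (value: number) => boolean,
  periodMs: number = 60000 // assumed spacing between datapoints
): boolean {
  const recent = points.filter(
    (p) => p.timestampMs > nowMs - windowMs && p.timestampMs <= nowMs
  );
  const expectedCount = Math.floor(windowMs / periodMs);
  // Missing datapoints are treated as "not enough evidence to page".
  return recent.length >= expectedCount && recent.every((p) => isBreach(p.value));
}

// Example rule from the list above: fleet online rate below 95% for 5 minutes.
function fleetOnlineRateAlert(points: Datapoint[], nowMs: number): boolean {
  return sustainedBreach(points, nowMs, 5 * 60000, (rate) => rate < 0.95);
}
```

Treating gaps as "no alert" is a deliberate choice in this sketch; for heartbeat-style signals the opposite choice (missing data counts as a breach) is often more appropriate.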

The Observability Checklist

Before going to production, verify you have:

  • Device heartbeat published every 60 seconds, forwarded to CloudWatch
  • IoT Rule DLQ configured for every rule that touches critical data
  • Structured JSON logging in every Lambda and backend service
  • X-Ray active tracing on all Lambda functions
  • CloudWatch alarms on the three most important SLOs
  • Grafana dashboard with fleet overview and device drill-down
  • On-call runbook linked from every alarm description
  • Tested: can you identify the root cause of a simulated outage in under 5 minutes?

The last point is non-negotiable. The best monitoring system is the one that gets tested before the 3 AM outage, not during it.

Written by CodeCaracal Engineering
