If it is connected, it can break

Tips to assure quality of connected systems

Srinivas Doppelio

Srinivas Bhagavatula, Co-Founder

The world of IoT is all about connected intelligence. There’s intelligence on the device, and there’s the intelligence of the supersystem on the cloud that the device taps into. It is this collective intelligence that puts the “smart” in smart devices. These devices stay smart through frequent updates that are seamlessly delivered even for complex machines such as cars. But, as developers, how much attention do we pay to the network – that invisible fabric that connects our devices to our cloud apps, and makes all of this happen?

Way back in 1994, Peter Deutsch penned the 7 Fallacies of Distributed Computing. In 1997, James Gosling added the eighth fallacy. That may seem a long time ago, but those fallacies are just as relevant today, and even more so in the context of IoT and moving assets. The 8 fallacies are:

  1. The network is reliable
  2. Latency is zero
  3. Bandwidth is infinite
  4. The network is secure
  5. Topology doesn’t change
  6. There is one administrator
  7. Transport cost is zero
  8. The network is homogeneous

Here is how each of the eight fallacies applies to IoT, with considerations for the device side and the cloud side:

 

Fallacy #1: The network is reliable

Let’s face it, cellular networks drop, and even undersea cables can get snapped. When you are dealing with devices that are in a noisy environment or devices on the move, reliability is a luxury.

On the device side, you can employ three strategies:

  • Retries: When the cost of sending a message is not critical, retry failed sends, and include a unique message ID in each message to help with downstream idempotency.
  • Store and forward: When the network is very spotty, or when retries fail but the messages are critical, store messages on the device and forward them once connectivity returns. When sending fewer, larger messages costs less than sending many small ones, consider sending them as a batch, but keep batch sizes reasonable.
  • Internal queuing / fire-and-forget: If the device blocks while sending a message, it may fail to send other critical messages in a timely manner. Queue outgoing messages internally so that sending does not block the rest of the device logic.
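The three device-side strategies above can be combined in one small component. The following is a minimal Python sketch, not a production implementation: the `DeviceSender` class, the `transport` callable (assumed to return `True` on a successful send) and the back-off parameters are all hypothetical names chosen for illustration.

```python
import random
import time
import uuid
from collections import deque


class DeviceSender:
    """Queues messages locally and forwards them when the network allows."""

    def __init__(self, transport, max_attempts=3, base_delay=0.5, sleep=time.sleep):
        self.transport = transport        # hypothetical callable(message) -> True on success
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.sleep = sleep                # injectable so tests do not actually wait
        self.outbox = deque()             # store-and-forward buffer

    def publish(self, payload):
        """Fire-and-forget for the caller: only enqueue, tagged with a unique ID."""
        message = {"id": str(uuid.uuid4()), "payload": payload}
        self.outbox.append(message)
        return message["id"]

    def flush(self):
        """Send queued messages in order; stop at the first one that keeps failing."""
        sent = 0
        while self.outbox:
            if not self._send_with_retries(self.outbox[0]):
                break                     # network still down: keep the message stored
            self.outbox.popleft()
            sent += 1
        return sent

    def _send_with_retries(self, message):
        for attempt in range(self.max_attempts):
            try:
                if self.transport(message):
                    return True
            except OSError:
                pass                      # treat network errors like a failed send
            if attempt < self.max_attempts - 1:
                # Exponential back-off with jitter, so that a fleet of devices
                # does not retry in lockstep.
                delay = self.base_delay * (2 ** attempt)
                self.sleep(delay + random.uniform(0, delay))
        return False
```

On a real device, `flush` would typically run on its own thread or timer so that `publish` never waits on the network. Because a retried message keeps the same `id`, the cloud side can safely discard duplicates.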

On the cloud side, design for:

  • Idempotency: Design to account for duplicate messages and ensure that they are not duplicated in systems of record.
  • Burst loads: When multiple devices retry or when batch messages arrive, design to handle burst loads with high concurrency. A well-designed retry mechanism would have a back-off, but that may not always be in your control.
  • Missed messages: Constrained devices may not do a store-and-forward, so messages may be missed. Review any hard requirements on message sequence: how do you handle the case where a vehicle sends you trip data but you never received an ignition-on message?
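Idempotency on the cloud side can be as simple as remembering which message IDs have already been processed. The sketch below assumes each message carries a unique `id` field set by the device; the `IdempotentIngestor` class is a hypothetical name, and the in-memory set stands in for what would be a durable store with expiry in a real system.

```python
class IdempotentIngestor:
    """Writes each message to the system of record at most once."""

    def __init__(self):
        self.seen_ids = set()   # assumption: a durable, expiring store in production
        self.records = []       # stand-in for the system of record

    def ingest(self, message):
        """Return True if the message was recorded, False if it was a duplicate."""
        message_id = message["id"]
        if message_id in self.seen_ids:
            return False        # a retry of something already processed
        self.seen_ids.add(message_id)
        self.records.append(message["payload"])
        return True
```

A device that retries after a lost acknowledgement will send the same ID twice; the second call to `ingest` returns `False` and the record is not duplicated.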

 

Fallacy #2: Latency is zero

In local, Gigabit networks, latency may seem to be near-zero, but when it comes to IoT, particularly in remote or movable assets, or when cost considerations require the use of slow networks, this fallacy becomes quite apparent.

On the device side:

  • Queuing / fire-and-forget: Consider a queue so the rest of the device logic is not impeded by a slow message send.
  • Keep connections alive: When connection setup latency is high, keep connections open instead of reconnecting for every message. Keep in mind that the connection can still drop, so the device must be able to reconnect.

On the cloud side, design for:

  • Varying timing: Account for messages arriving at varying intervals: a heartbeat every minute may not exactly be every minute.
  • Bunched messages: Account for a set of messages arriving close together, and others arriving far apart.
  • Out-of-order messages: When devices open a fresh connection for every message, a message sent later can arrive before one sent earlier, so you cannot assume that messages arrive in time sequence.
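One way to cope with messages that arrive out of sequence is to hold them briefly and release them in device-timestamp order. This is a minimal sketch under two assumptions: every message carries a timestamp set by the device, and a fixed grace period is acceptable for your use case. `ReorderBuffer` and its parameters are hypothetical names.

```python
import heapq


class ReorderBuffer:
    """Holds messages for a grace period, then releases them in device-time order."""

    def __init__(self, grace_seconds):
        self.grace_seconds = grace_seconds
        self._heap = []
        self._counter = 0   # tie-breaker so payloads are never compared directly

    def add(self, device_timestamp, payload):
        heapq.heappush(self._heap, (device_timestamp, self._counter, payload))
        self._counter += 1

    def release(self, now):
        """Return payloads whose device timestamp is at least grace_seconds old."""
        ready = []
        while self._heap and self._heap[0][0] <= now - self.grace_seconds:
            ready.append(heapq.heappop(self._heap)[2])
        return ready
```

The grace period is a trade-off: a longer one tolerates more reordering but delays processing, and anything that arrives later than the grace period still has to be handled as a late message.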

 

Fallacy #3: Bandwidth is infinite

This might seem true for high-throughput local area networks, but for reasons of power usage, cost, capability, or simply availability, some IoT devices may use low-bandwidth networks.

On the device side:

  • Minimize message size: Prefer compact binary serialization over verbose text-based protocols. Prefer the likes of protobuf over proprietary encoding.
  • For store-and-forward: Balance number of messages sent in batch with message size, based on bandwidth.

On the cloud side, design for:

  • Binary protocols: Design for compact message schemas and binary serialization.
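To get a feel for the difference in size, the sketch below encodes the same sensor reading as JSON and as a fixed binary layout. It uses Python's standard `struct` module only to stay dependency-free; in practice a schema-based format such as protobuf is preferable to a hand-rolled layout like this one. The field names and the layout are assumptions made for illustration.

```python
import json
import struct

# Assumed layout: device id (uint32), Unix timestamp (uint32),
# temperature in hundredths of a degree C (int16), battery in mV (uint16).
READING_FORMAT = "<IIhH"   # little-endian, 12 bytes in total


def encode_reading(device_id, timestamp, temperature_c, battery_mv):
    return struct.pack(READING_FORMAT, device_id, timestamp,
                       round(temperature_c * 100), battery_mv)


def decode_reading(data):
    device_id, timestamp, temp_centi, battery_mv = struct.unpack(READING_FORMAT, data)
    return {
        "device_id": device_id,
        "timestamp": timestamp,
        "temperature_c": temp_centi / 100,
        "battery_mv": battery_mv,
    }


reading = {"device_id": 1001, "timestamp": 1700000000,
           "temperature_c": 21.5, "battery_mv": 3700}
as_json = json.dumps(reading).encode("utf-8")
as_binary = encode_reading(**reading)
```

Here `as_binary` is 12 bytes, while `as_json` is several times larger because it repeats the field names in every message.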

 

Fallacy #4: The network is secure

In IoT, this fallacy is doubly dangerous: there are more attack vectors, and the risk is compounded by the fact that devices are physically remote from you. This section hardly scratches the surface when it comes to IoT security; that’s a topic that requires its own dedicated blog post series!

On the device side:

  • Physical security: Ensure physical security of the device itself if possible, and lock down hardware interfaces such as JTAG or debug probes that are otherwise needed during development.
  • Data security: Encrypt sensitive data, particularly device identifiers and certificates; consider trusted platform modules or companion chips.
  • Don’t trust the network: Use PKI to ensure that traffic both from the device and from the cloud is secure and from trusted sources.

On the cloud side:

  • Don’t trust the device: Use PKI to ensure that you only trust your devices and their data.
  • Keep an eye out for unexpected traffic: You know how many devices you have and how they behave, so you know how much data and volume to expect. If you see more, you may have rogue devices on your hands.
  • Rotate your keys/certificates: Consider not using long-lived keys or certificates.
  • Keep devices updated: Keep your devices patched for security. Deliver security patches as OTA updates separate from functional ones, so that their cadence and uptake are decoupled from those of functional updates.
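On the cloud side, "don’t trust the device" usually translates to mutual TLS: the server presents its own certificate and also requires each device to present one signed by your CA. Below is a minimal sketch using Python's standard `ssl` module. The function name is hypothetical, and the file paths are parameters you would supply; they are left optional here only so the sketch can run without real certificates.

```python
import ssl


def build_mutual_tls_server_context(ca_file=None, cert_file=None, key_file=None):
    """Create a server-side TLS context that rejects clients without a valid certificate."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    context.verify_mode = ssl.CERT_REQUIRED            # devices must present a certificate
    if ca_file:
        context.load_verify_locations(cafile=ca_file)  # CA that signed your device certificates
    if cert_file:
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)  # the server's own identity
    return context
```

A real deployment needs all three files, plus a plan for rotating them, which is where short-lived certificates come in.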

 

Fallacy #5: Topology doesn’t change

As much as we would like this to be true, variability in IoT deployments is a reality we need to contend with: more capable devices directly connect to the cloud, less capable devices connect through a gateway, and sometimes there are additional network elements in between. Even the cloud side isn’t unchanging – a failover may switch an endpoint to a DR data center, or devices may want to connect to a geographically closer endpoint.

On the device side:

  • Use a fixed configuration endpoint with OTA: Devices can start with a set of pre-configured endpoints, but OTA can reconfigure these. A fixed-configuration endpoint can be what the device falls back to if its pre-configured endpoints are not available.
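The fallback logic described above could look roughly like this. It is a sketch only: the endpoint names are made up, and `is_reachable` is a hypothetical probe (for example, a connection attempt with a timeout) that you would supply.

```python
def choose_endpoint(configured_endpoints, fallback_endpoint, is_reachable):
    """Prefer the OTA-configurable endpoints, in order; otherwise use the fixed fallback."""
    for endpoint in configured_endpoints:
        if is_reachable(endpoint):
            return endpoint
    return fallback_endpoint


# Hypothetical endpoint names, for illustration only.
configured = ["iot-eu.example.com", "iot-us.example.com"]
fallback = "config.example.com"
available = {"iot-us.example.com", "config.example.com"}
chosen = choose_endpoint(configured, fallback, lambda host: host in available)
```

In this example the first configured endpoint is down, so `chosen` is `"iot-us.example.com"`. If both were down, the device would fall back to `"config.example.com"` and could fetch a fresh endpoint list from there.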

On the cloud side, design for:

  • Design OTA for heterogeneous deployments: OTA updates need to target both gateways and nodes, and gateways need to be able to relay node OTA updates to the right nodes.
  • Don’t assume a fixed topology: Unless you have full control over it, consider cases where devices reach your cloud via gateways as well as directly. Gateways may be able to aggregate and filter data, which can change the shape of your data too.

 

Fallacy #6: There is one administrator

Now, this one I would like your thoughts on! Here is my tweak on it from an IoT perspective. Do send in your comments on what this would imply for IoT solutions.

In an IoT context, there are multiple owners of the ecosystem: there are administrators who manage the device hardware, there are others who manage the software lifecycle through OTA, and there are infrastructure administrators who manage the cloud infrastructure and its network. When there are third party devices, the ownership expands considerably. To deal with this:

On the device side:

  • Keep OTA separate from the application: An OTA update may be triggered by a device vendor, but the application update may be triggered by the software vendor. Ensure a validation process wherein a vendor OTA update doesn’t break an application OTA update and vice versa.

On the cloud side:

  • Change management processes: Set up processes to validate infrastructure and application changes relative to each other. Depending on your context, also account for the varied update cadence of third-party devices, and for whether those updates change how the devices interact with your cloud setup.

 

Fallacy #7: Transport cost is zero

While “cost” in this fallacy typically refers to the cost of resources, for remote devices it can also be monetary, such as cellular data charges.

On the device side:

  • Optimize costs: Review whether batching messages or keeping connections open costs less than sending individual messages, and optimize message size.
  • Optimize resource usage: This is potentially in tension with the above. Keeping a connection open is battery-intensive, so a sleep-send-sleep cycle is typical for constrained devices.

On the cloud side:

  • Review the cost of infrastructure: The costs of transport infrastructure (bandwidth, plus the compute and storage resources for message processing and storage) all add up to cloud costs. While it may not seem like much, a cost difference of $0.001 per device per hour, aggregated over a million devices for one year, amounts to $8.76m.
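The arithmetic behind that figure is easy to reproduce for your own fleet size and cost difference. A quick sketch, assuming a 365-day year (the function name is just for illustration):

```python
HOURS_PER_YEAR = 24 * 365   # 8,760 hours, assuming a 365-day year


def annual_fleet_cost(cost_per_device_hour, devices, hours=HOURS_PER_YEAR):
    """Total cost of a per-device, per-hour charge across a whole fleet."""
    return cost_per_device_hour * devices * hours


# $0.001 per device per hour, one million devices, one year: about $8.76 million.
total = annual_fleet_cost(0.001, 1_000_000)
```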

 

Fallacy #8: The network is homogeneous

This fallacy is particularly relevant for IoT scenarios where a large number of device models are involved: one can’t assume that all devices have the same capability or the same network. Additionally, the same device can switch between networks.

On the device side:

  • Don’t assume network characteristics: A device on a cellular network may switch from 4G down to 3G or 2G based on coverage, so design for a flaky network.

On the cloud side:

  • Design for heterogeneous device models: Design for feature flags based on advertised device capability.
  • Design for varying network characteristics of devices: Assume the lowest common denominator when designing for network characteristics.
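Feature flags driven by advertised capability can be as simple as a mapping from each feature to the capabilities it needs. In this sketch the capability and feature names are invented for illustration, and the device is assumed to advertise its capabilities when it connects.

```python
# Hypothetical features and the device capabilities each one requires.
FEATURE_REQUIREMENTS = {
    "live_tracking": {"gps", "cellular"},
    "firmware_delta_updates": {"ota_delta"},
    "basic_telemetry": set(),
}


def enabled_features(advertised_capabilities, requirements=FEATURE_REQUIREMENTS):
    """Return the features whose required capabilities the device advertises."""
    capabilities = set(advertised_capabilities)
    return {feature for feature, needed in requirements.items()
            if needed <= capabilities}
```

A device advertising only `["gps"]` would get `basic_telemetry` and nothing else, while one advertising `["gps", "cellular", "ota_delta"]` would get all three.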

While one builds all these mechanisms for dealing with the vagaries of the network, a fundamental question remains: how do you test them?

On the device, your tests would need to include tests for:

  • Network conditions: delays, dropped packets, and network unavailability
  • Network switches: Between Wi-Fi, 4G, 2G, for example
  • Store-and-forward scenarios, including extended periods of outage

On the cloud side, you would need both functional tests and load tests. Your functional tests would include:

  • Tests for idempotency when messages are retried
  • Tests for assumptions of sequences – when a message is expected after another message, but the first message is missed
  • Tests for periodicity assumptions: some periodic messages being delayed, some dropped altogether: after how many missed heartbeats do you deem a device unreachable?
  • Tests for batched messages for when some messages are sent as a batch and some are sent individually
  • Tests for corrupted messages
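Several of these functional tests need a stream of messages with faults in it. As a starting point, here is a small, seeded fault injector that drops and duplicates messages reproducibly. It is a hand-rolled sketch for unit tests with hypothetical names, not a substitute for a full simulation platform.

```python
import random


def inject_faults(messages, drop_rate=0.0, duplicate_rate=0.0, seed=0):
    """Return the messages as the cloud might see them, with drops and duplicates."""
    rng = random.Random(seed)          # seeded, so a failing test can be replayed exactly
    delivered = []
    for message in messages:
        if rng.random() < drop_rate:
            continue                   # simulate a missed message
        delivered.append(message)
        if rng.random() < duplicate_rate:
            delivered.append(message)  # simulate a device retry after a lost acknowledgement
    return delivered
```

Feeding the output of `inject_faults` into your ingestion code lets you assert, for example, that duplicates never reach the system of record and that a missing first message in a sequence is handled gracefully.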

Your load tests would include:

  • Reduced publish frequency due to varying latency
  • Increased load factor due to device retries

Testing for these scenarios with physical devices and field tests is painstaking and not reproducible. Testing them in a lab isn’t feasible either, since you cannot simulate all the required scenarios without significant test noise. With Doppelio, you can simulate many different kinds of faults, both data and network. You can then stitch these into sequences to create full test scenarios, and run those scenarios at scale to simulate any number of devices, so you can take care of functional and load tests on the same platform. If you would like to discuss this further, do drop in a comment.