The Trouble with Testing for Network Conditions

IoT applications sense the world over time and, at the right moments, act upon the data – either by automatically controlling things or by acting on the information received. The broad steps[1] followed by a typical IoT edge device to sense and respond are:


Loop

    1. Measure/sense at regular intervals
    2. Make decisions locally based on rules, and act accordingly
    3. Send the telemetry data to the cloud application
    4. Accept commands/inquiries from the cloud – act on them, respond appropriately
    5. Sleep, with a possible priority interrupt – originated locally or remotely
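The loop above can be sketched in a few lines of Python. This is a minimal illustration, not firmware: the sensor read is faked with a random value, the local rule and the 25.0 threshold are hypothetical, and cloud command handling and interrupts are elided.

```python
import random
import time

def read_sensor():
    # Stand-in for a real sensor read (hypothetical value range).
    return random.uniform(20.0, 30.0)

def sense_and_respond_loop(cycles=3, interval_s=0.01, threshold=25.0):
    """Toy version of the sense/decide/send/sleep loop described above."""
    telemetry = []
    for _ in range(cycles):
        # 1. Measure/sense at regular intervals
        value = read_sensor()
        # 2. Decide locally based on a rule, and act (e.g., switch a relay)
        actuate = value > threshold
        # 3. Queue telemetry for the cloud application
        telemetry.append({"value": value, "actuated": actuate})
        # 4. Commands/inquiries from the cloud would be handled here
        # 5. Sleep until the next cycle (priority interrupts omitted)
        time.sleep(interval_s)
    return telemetry

readings = sense_and_respond_loop()
```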

In many ways this algorithm is subject to the well-known fallacies of distributed computing: the false assumptions that plague distributed systems. An IoT system is subject to the same fallacies.

The Real Issue:

Network performance is impacted by multiple factors, such as free space path loss, buildings, topological obstacles, weather, mobility characteristics of the receiver, and distance to the mobile radio station. This results in frequent but unpredictable occurrences of packet delays, bandwidth constraints, packet loss, or jitter. These network conditions impact the application’s reliability as messages between the device and the server suffer from (1) delays (2) drops (3) erratic sequencing and (4) duplication.
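These four failure modes – delay, drop, erratic sequencing, and duplication – are easy to model. The sketch below is a toy channel model (the probabilities and the `degrade` helper are illustrative, not a real network), useful for unit-testing how application logic copes with a degraded message stream.

```python
import random

def degrade(messages, drop_p=0.1, dup_p=0.05, reorder_p=0.2, seed=None):
    """Apply lossy-network effects to a message sequence.

    A toy model of an unreliable channel: each message may be dropped
    or duplicated, and adjacent messages may be swapped to mimic
    erratic sequencing. Delay is implicit in reordering here.
    """
    rng = random.Random(seed)
    out = []
    for msg in messages:
        if rng.random() < drop_p:
            continue              # packet lost
        out.append(msg)
        if rng.random() < dup_p:
            out.append(msg)       # packet duplicated
    # Occasionally swap neighbours to model out-of-order delivery.
    for i in range(len(out) - 1):
        if rng.random() < reorder_p:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out
```

Feeding a device simulator's output through `degrade` lets a test assert, for example, that de-duplication and sequence-number handling still produce a consistent state.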

Experienced IoT architects and developers use different strategies to mitigate the risks involved. The strategies are subject to different constraints and trade-offs like device hardware, system software, data throughput, power requirements, security, envisaged scale, etc.

Here are some of the approaches used:

  • Selecting appropriate communication protocol – e.g., MQTT vs AMQP vs Raw TCP
  • Selecting the best protocol client/server implementation – e.g., Paho vs Mosquitto MQTT client
  • Appropriate configurations of the protocol – e.g., MQTT QoS 0/1/2, LWT, retain messages
  • Connection management – e.g., persistent connections, increased keep alive time, server-side throttling
  • Design patterns like pub/sub, and request/response
  • Retry mechanisms along with explicit store and forward
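The last approach – retries with explicit store and forward – can be sketched as a small queue that holds telemetry locally until the network cooperates. This is a minimal sketch under assumed semantics: the injected `send` callable (returning True on success) and the bounded queue that drops the oldest message on overflow are design choices for illustration, not a specific library's API.

```python
from collections import deque

class StoreAndForward:
    """Queue telemetry locally; drain only while sends succeed."""

    def __init__(self, send, max_queue=100):
        self.send = send                       # callable: msg -> bool
        self.queue = deque(maxlen=max_queue)   # oldest dropped on overflow

    def publish(self, msg):
        self.queue.append(msg)
        self.flush()

    def flush(self):
        # Drain in order; stop at the first failure so ordering is kept.
        while self.queue:
            if not self.send(self.queue[0]):
                return                         # network down: retry later
            self.queue.popleft()
```

In a real device, `flush` would be triggered by a reconnect event or a periodic timer, and the queue would typically be backed by flash so messages survive a power cycle.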

These are good in theory but in practice, problems still occur for the following reasons:

  • Not all engineering stakeholders understand the implications of the design choices and how they play out in corner cases (real field cases!)
  • Flawed assumptions about boundary conditions around the different parameters
  • Combinations of factors impacting each other – e.g., power cycles impacting the algorithm flow for handling network conditions, buffer overflows, and server performance at scale
  • Lack of in-depth knowledge about the workings of the protocol

The gaps arising from these design choices, flawed assumptions, and differing levels of understanding among engineering stakeholders sneak through into the design and code. The burden falls on the testing teams to fish out these gaps so that a reliable production system can be released.

Testing Limitations:

Simulating real-life network conditions in the lab is extremely challenging. This is one of the reasons why lab-tested systems often throw up a lot of bugs during field/acceptance tests.

Due to the lack of available options, the natural response is to push much of the testing to the field, heavily impacting effort, cost, and release schedules. But field tests also suffer from limited scenario coverage and poor repeatability, given the lack of control over test variables like network bandwidth, latency, and packet loss.

Potential Solutions:

The only way out is to build a custom testbed leveraging automation and network emulators, so testers can run high-coverage validations in a faster and repeatable manner. Unfortunately, network emulators often require time-consuming setup, complex configuration, and technical knowledge of networks. The complexities of building and maintaining a custom testbed for functional scenarios and network conditions make it unviable given the time, skills, and costs involved.
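As a taste of what such emulation involves, Linux's tc/netem can impose latency, jitter, loss, duplication, and reordering on an interface. This is a config fragment only (the interface name eth0 and the numbers are illustrative; it requires root on a Linux test host), and it hints at why a testbed built this way needs networking expertise to set up and maintain.

```shell
# Impose 200 ms latency with 50 ms jitter, 2% loss, 1% duplication,
# and 5% reordering on eth0 (values and interface are examples).
tc qdisc add dev eth0 root netem delay 200ms 50ms loss 2% duplicate 1% reorder 5%

# Restore normal conditions afterwards.
tc qdisc del dev eth0 root
```

Note that netem shapes a whole interface, so per-scenario control (e.g., degrading only one device's traffic mid-test) needs additional plumbing such as namespaces or dedicated test hosts.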

That is where IoT application-centric test automation platforms like Doppelio come in handy, helping test teams achieve comprehensive coverage at greater speed (10x+). Doppelio can be used for manual and automated testing across the development life cycle – for functional tests including network scenarios, regression testing, load, and performance testing.


References:

[1] *Some of the steps could be optional in some systems, but broadly all IoT systems have all of these in some form considering the end-to-end functionalities including device management.


Rajesh K, Co-Founder