Are you load testing exhaustively enough?
System performance under real life load is critical to the success of any large-scale solution. We do a fair amount of load testing and performance benchmarking, but one common situation is when all our load tests pass but post deployment, we run into various performance challenges. This is more pronounced for IoT solutions, given the peculiarities of IoT systems.
Here are a few reasons why all is well in load testing, but not in production:
Incorrect workload modeling and optimism
The most critical aspect of load testing is workload modeling: how many devices are present, how many come online, how frequently they send data, and how much data they send. While these are the usual suspects, some of them are nuanced, and incorrect modeling can be disastrous:
- How many devices are present: If your solution expects devices to be registered and unregistered frequently, as with consumer devices, then you need to ensure that your load testing accounts for both device telemetry and any APIs for lifecycle management.
- How many devices come online: Devices may either be continuously online or have a specific user-driven ramp-up and ramp-down: consumer devices may follow a user’s wake cycle, a building system may peak when many people enter or exit at certain times, a telematics solution may follow driver schedules. It is important to build these ramp-up/ramp-down patterns into your tests.
- How frequently they send data and how much: Devices could be monitoring devices that periodically send data, or alerting devices that send data only when a condition is met. This periodicity translates into overall system load and must be simulated accurately. For condition-based publishes, the underlying conditions and their frequency need to be modeled; for periodic publishes, the publish interval and its configurability need to be considered.
- How concurrent they are: Unlike user-interactive applications such as web apps, devices typically have no randomized think time. Load varies only because devices start or stop at different times, so modeling device concurrency is especially critical. Consider devices that wake up on a pre-defined schedule, poll certain sensors, and report the readings. Unless that schedule is slightly randomized per device, you can expect extremely high concurrency at specific instants. This is dramatically different from the “average” load, where we optimistically assume x% concurrency.
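The effect of schedule jitter on peak concurrency is easy to see in a small simulation. The sketch below (a hypothetical illustration, not any particular load tool) counts how many devices wake in the same second, with and without a per-device random offset:

```python
import random

def peak_concurrent_starts(num_devices, period_s, jitter_s):
    """Count the largest number of devices waking in the same second,
    when each device wakes every `period_s` seconds, offset by a
    per-device random jitter of up to `jitter_s` seconds."""
    random.seed(42)  # fixed seed so the sketch is reproducible
    counts = {}
    for _ in range(num_devices):
        offset = random.uniform(0, jitter_s) if jitter_s else 0.0
        wake_second = int(offset) % period_s
        counts[wake_second] = counts.get(wake_second, 0) + 1
    return max(counts.values())

# 10,000 devices on a 5-minute reporting schedule:
print(peak_concurrent_starts(10_000, 300, jitter_s=0))
# no jitter: all 10,000 devices wake in the same second
print(peak_concurrent_starts(10_000, 300, jitter_s=300))
# jittered across the period: the per-second peak is a small
# fraction of the fleet, close to the "average" assumption
```

The unjittered case is the one that surprises teams in production: the average rate looks trivial, but the instantaneous concurrency equals the entire fleet.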
Inadequate simulation of true device behavior
IoT devices typically have intelligence in them that can change how they interact with the on-cloud application. They also do not just send one payload, they send multiple – for example, a heartbeat in addition to expected telemetry. They also may have the intelligence to do store and forward when connectivity is lost and retries in case of intermittent failures. Here are a few things to consider:
- Device communication is two-way: Devices send telemetry out, and the application can send commands in. While telemetry modeling is usually done, the distribution of outbound commands can also have a significant impact on the perceived responsiveness of the solution, and so must also be modeled. Some brokers also have different load specifications for inbound and outbound data.
- Simultaneous messages: Devices can send many telemetry payloads at different intervals: a heartbeat every 1 minute, a periodic location payload every 30 seconds, an alert on condition. When these coincide, you can expect a 3x increase in concurrency in this example. Such intermittent peak loads need to be modeled.
- Data variation: Your device may send different payload sizes based on its context: a cold chain monitoring device on a truck may send data for 10 beacons for small containers, or 200 for large containers. This affects both the load on the servers and the load on the network.
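The "simultaneous messages" point can also be made concrete. The sketch below (an illustrative model, with payload schedules assumed from the example above) counts how many of a device's periodic payloads coincide in the same second:

```python
def messages_per_second(periods_s, horizon_s=3600):
    """For payloads published every periods_s[i] seconds, count how
    many messages land in each second over the horizon."""
    rate = [0] * horizon_s
    for period in periods_s:
        for t in range(0, horizon_s, period):
            rate[t] += 1
    return rate

# Heartbeat every 60 s plus location every 30 s
# (condition-based alerts are left out of this sketch):
rate = messages_per_second([60, 30])
print(max(rate))              # peak of 2 coinciding messages per device
print(sum(rate) / len(rate))  # the average rate is far lower
```

Multiply the peak by the fleet size (and add the condition-based alerts) and the intermittent burst your broker must absorb is much larger than the average message rate suggests.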
Inadequate failure scenario modeling
Murphy’s Law always holds, and we need to be able to do what-if modeling in our load tests as well:
- The thundering herd problem: As an example, if the application has an outage, all devices may retry simultaneously, bogging down the system even more. This kind of unintentional DoS needs to be accounted for both in design (for example, by randomizing retry delays) and in load tests, by triggering retries across all simulated devices at once.
- Store and forward behavior: When devices lose network connectivity, they may store all outbound messages and, on resumption of connectivity, may send all stored messages in a burst. Such behavior also needs to be modeled and accounted for.
Infra monitoring and planning
Infrastructure monitoring and right-sizing, of both the application under test and the load infrastructure, are equally critical. A common pitfall is overloaded load generators that cannot produce the required load, leading to incorrect conclusions about the application’s actual performance. Beyond right-sizing, the infrastructure must be monitored for both resource utilization and hidden errors, so that the application’s performance characteristics are not masked by other failures.
At Doppelio, we enable users to model their devices and their characteristics. We provide a way to model many different test scenarios, including network conditions and load profiles, so that tests reflect real-world loads accurately. Having a SaaS model also means that load infrastructure and its monitoring are offloaded to us.
Do let me know what other scenarios you have seen and what helped you avoid stress under high system load. What helped you get the “all is well” feeling?
Srinivas Bhagavatula, Co-Founder