
Quality control in IoT: Why we built a testing lab the client never asked for

Take a seasonal product. Smart irrigation, smart heating, anything with a fixed launch window. Miss the season, lose the year.
Now the question every CTO eventually faces: how do you know the firmware works?
Not “tests pass” or “QA signed off.”
You’re shipping to 50,000 homes, each with its own networks, weather, and ways of using the device. Nobody’s patching by hand once it’s out there.
Most teams can’t answer that question with confidence. And there’s a reason for it.
This article walks through that reason, and one project where it stopped being one.

The gap between prototype and production
A working prototype is roughly 10% of the way to production. The other 90% is everything that surrounds it.

Usually, companies build internal expertise around the first four. Hardware bugs get caught at the assembly line. Supply chain breaks visibly: a vendor’s late delivery is a problem you can put on a chart. But firmware bugs that only show up at scale? Those are different. They land on the wrong team, customer support, and they land months after launch, when there’s nothing graceful left to do about them.
“Scale is half the story. The other half is a sequence. A user opens the app mid-OTA. A gateway drops Wi-Fi during pairing. Two devices re-associate while a third reboots. Conventional testing doesn’t reach this because it’s hard to script. But most of the worst field bugs live in that exact gap.”
Nikita Provatorov, Embedded Team Lead, Head of Engagement, testing stand creator.
Like a race condition in multi-device pairing that only triggers above ten devices on a gateway. Common in real installations, and almost never seen on three units on an engineer’s desk. A memory leak that takes seventy-two hours of uptime to surface, by which time everyone has gone home. OTA recovery that bricks one device in a thousand – a bootloader without a working rollback path. Sounds small. On a fleet of a hundred thousand, that’s a hundred warranty replacements and a hundred support calls opening with “I bought this last week, and it doesn’t turn on anymore.”
Unit tests don’t reach these. An external QA contractor running checklists doesn’t either. By the time you find them, they’ve moved from bugs in your tracker to posts on Reddit.

What follows is a specific case where that gap got closed in time. Eight months from kickoff to launch. Thousands of units in retail, four firmware engineers on the development side, near-zero firmware-related support tickets in the six months after launch. The point of this article is to give you enough material to decide if something like this fits your own product.
Why standard testing approaches break on IoT
If you’ve shipped embedded firmware, at least one of the scenes below will look familiar. Maybe all three. They tend to compound. The small team without a process becomes a distributed team – and the distributed team eventually runs into a launch window that won’t move.

All three are symptoms of the same problem. Whichever scene felt closest to your last project, the cause is the same: firmware validation at production scale has a specific shape that the standard firmware QA process – unit tests, external QA, in-office smoke tests – misses in predictable ways. Let’s see how.
What unit tests and mocks miss
Firmware doesn’t run on a mock. It runs on a specific revision of a specific SoC, on top of an RTOS (FreeRTOS, Zephyr, vendor variants), bound to a BSP that talks to that exact silicon, over real radio interfaces, drawing real current from a real battery. None of that is captured by a mock object.
Take a power-consumption regression our team has chased more than once. On a mock device, the sleep/wake cycle looks perfect – the test confirms that enter_low_power_mode() was called and the state machine is in SLEEP. Green across the board. On real hardware, the battery drains 40% faster than spec, and the cause turns out to be a peripheral that stays partially active in low-power mode on a specific SoC revision – a hardware erratum. The firmware did what the spec said. The mock did what it was told to mock.
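To make the blind spot concrete, here is a minimal sketch of the kind of test that goes green in that scenario. It is written in Python for brevity; the PowerManager class and the hal object are hypothetical stand-ins, and only enter_low_power_mode() and the SLEEP state come from the description above.

```python
from unittest.mock import Mock


class PowerManager:
    """Hypothetical firmware-side state machine, modeled in Python for illustration."""

    def __init__(self, hal):
        self.hal = hal
        self.state = "ACTIVE"

    def sleep(self):
        self.hal.enter_low_power_mode()
        self.state = "SLEEP"


def test_sleep_on_mock():
    hal = Mock()  # stands in for the SoC; it has no errata and draws no current
    pm = PowerManager(hal)
    pm.sleep()
    # Both checks pass, and neither says anything about real current draw.
    hal.enter_low_power_mode.assert_called_once()
    assert pm.state == "SLEEP"
    return True


result = test_sleep_on_mock()
```

The test verifies the call and the state, which is all a mock can verify. A peripheral that stays partially powered on one silicon revision is invisible at this layer.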

A different shape of the same problem shows up in pairing. Several devices enter pairing mode at once and a couple of them get stuck on handshake – a race condition that hides at low device counts and surfaces only past ten or so. A mock BLE stack models concurrency as deterministic scheduling, but real radio timing depends on signal strength, channel contention, and whatever the user’s neighbor’s microwave is doing.
Unit tests sit at one layer of the stack – they’re excellent for logic in individual functions, protocol parsers, and state machines. Firmware validation needs coverage at every other layer too, including the one where firmware meets silicon and silicon meets the radio. That layer can’t be mocked. It has to be tested on real hardware, in conditions close to real ones, on a dedicated test fixture. What the embedded world calls hardware-in-the-loop (HIL) testing.
Why external QA contractors don’t close the gap
External QA earns its place on UX, regression of stable features, and final acceptance. Firmware testing services that go deeper – embedded firmware QA, in less commercial language – are a different job. Contractors run checklists, test what they’re told to test, and don’t have direct access to firmware internals, OTA infrastructure, or your team’s workflow. The bugs they file reflect that.
If you’ve ever opened a ticket that just said “did pairing, device froze” and spent the next half-day reconstructing what the tester actually did, you already know the failure mode.

The five-day round-trip in the diagram isn’t a worst case. It’s a typical week with one shared QA queue and one big regression suite. The deeper problem is what it forces a developer to do: context-switch off the bug, work on something else, return to it the following week with cold context. Iteration loops longer than a day kill the kind of small-correction work that catches subtle bugs early.
The answer most engineering leaders arrive at, sooner or later, is to shift the first filter into the development team. Validate against realistic conditions before anything goes outside. External QA stays in the picture for what it’s good at, and the bugs that reach it are the ones it can actually close.
Scale-only bugs
Some bugs only exist at scale. They stay hidden at three devices, ten devices, even twenty – and surface only once the system has enough devices, enough hours on the clock, or enough variability in network conditions for the failure mode to emerge. They come in four shapes worth naming.

If you’ve shipped to a real fleet, you’ve met at least one shape above. Probably more than one. The question every engineering leader eventually asks is which of these is hiding in the firmware queued up for the next release.
Scale-only bugs aren’t unique to IoT. Server software has its own version – concurrency issues that surface under load, memory fragmentation that emerges over weeks of uptime. The difference is operational. On the kind of server fleet most CTOs are familiar with from past lives, you can attach a debugger to a running instance, dump core, run profiling, observe state in real time. On a fleet of consumer devices in customer homes, none of that is available. The instrumentation has to exist before the bug appears, and the conditions for the bug have to be reproducible outside production.
Field is the worst place to find these bugs. If you’ve ever shipped a remote update to thousands of devices to fix something the lab missed, you already know what “worst” means here. The rest of this article is about a project where that problem was solved before the launch, not after – and the infrastructure built to solve it.
Eight months and a missing environment
The product was Rachio’s Smart Hose Timer: a small battery-powered device that screws onto an outdoor faucet, opens and closes a bi-stable valve on a schedule, and talks to a Wi-Fi gateway that talks to the cloud. Homeowners run the whole thing from a phone app. Rachio’s broader line sits in residential irrigation, and at company scale they’ve reportedly saved something like 150 billion gallons of water across their installed base.


The stack is multi-layer and multi-vendor: C++ on the low-level side (interrupts, valve control, radio drivers), Python above for higher-level logic and OTA management, CMake holding builds together, MQTT moving messages between gateway and cloud, AWS hosting the backend, Aeris Weather feeding in forecasts. Radio is split between Nordic for BLE (the low-power profile a battery-powered device demands) and Espressif for gateway connectivity and OTA. Initial scope was one gateway with eight peer end-devices, with provisional headroom for more. That number, eight, matters in a moment.
When companies outsource firmware development, the typical contract carves out scope and leaves QA to the client. Rachio’s was different. Four firmware engineers from Sirin Software came in alongside the client’s own people. Eight months from kickoff to retail availability. For a product with a custom valve mechanism, multi-device wireless protocol work, OTA integration, and a cloud backend, that’s a fast schedule. The kind of fast you don’t talk your way out of by promising a hotfix sprint at the end.
The deadline wasn’t moveable either. Smart Timer is a seasonal product. If it wasn’t on the shelf before homeowners started thinking about lawns, the launch effectively rolled to next year. If you’ve worked on a seasonal product yourself, you already know the difference between “the deadline is firm” and “the deadline is actually firm.” This was the second kind.
On paper, the starting point looked solid: working prototype, hardware close to final, pilot batches in hand, a clean roadmap from concept through firmware features. Real data to work with from day one, which is rarer than it sounds. What was missing was an IoT device validation environment to answer four uncomfortable questions before launch.
How do you validate firmware on a full eight-device gateway plus headroom for scale-only behavior, when nothing on a single bench surfaces that behavior? How do you simulate real installations with all the variation that residential RF actually contains? How do you give a distributed engineering team hands-on access to working hardware without shipping devices in padded boxes? And, when the time comes, how do you tell the client with data, not optimism, that the build is ready to put in front of hundreds of thousands of people?
Nothing in the standard development cycle answered those. The firmware team was mature, the hardware was in place. What didn’t exist was the environment to put them together and surface the failure modes that hide at single-bench scale.
So our team made a recommendation. Here are the gaps, here’s what they mean for a seasonal release, here’s the testing infrastructure we’d build to close them. The original contract didn’t include it because nobody had thought to scope it that way, and we didn’t wait for the next round of planning to start building. Some pieces of the project we saw before anyone else did. This was one of them.
The other early decision was to build the booth as general-purpose infrastructure rather than a Rachio-specific jig, because any IoT product with a cloud-connected gateway and a fleet of end-devices runs into a similar validation problem. Solving it once for one product was good. Solving it for every product after was better. Three years later that call has shaped how we approach new IoT engagements, but we’ll get to that below.

Design and build went in parallel, and the eventual scope was still being figured out, including the parts the client’s own team would later put to use. That part of the story comes later. Right now there was a product to ship and a season to catch.
A walk around the test fixture
About as tall as the engineer working next to it: 1962 mm at the top edge, roughly two meters. Upper block is 753 mm wide. The working surface below it runs 850 by 400 mm, enough for a laptop, an oscilloscope, and the sort of cable mess that accumulates when you’re actually using a piece of equipment instead of staring at it on a CAD render.

Roller wheels on the base. One person can move it. Loaded weight makes that less casual than wheeling around a desk chair, but it wasn’t built to live in one corner of the office.

Functionally, here’s what sits on it: twenty device slots in the main panel for end-devices arranged in four rows of five, a separate enclosed section for gateways and control electronics on swappable modules, a working surface for instruments and a laptop, and a remote power management module wired through the back-plane. A cloud API runs on top, exposing remote control to anyone on the team with credentials. The rest of this section covers what’s behind each piece.
Designed for the next device
Each slot is two parts. The lower section is universal: JTAG and SWD pins, power lines, a JLink Quick-Release Mount for the debugger. On top sits a device-specific holder that physically grips the unit. The lower section never changes. The holder is what we swap when a new device shows up. Rachio valve, a board from another client, an internally-built test rig: each one gets its own holder while the rest of the slot stays put. Designing that holder is a small mechanical task, not a slot redesign or a rewire.
Power is the part most people get wrong on the first pass.
We didn’t run a programmable bench supply per slot. Instead, three regulated DC-DC rails (1.9 V, 2.3 V, and 3.0 V) with pre-programmed presets, slots grouped by which rail they need. Three rails cover the operating voltages of most low-power embedded devices that anyone is actually shipping. If a future product needs a fourth, adding one is a few hours of work, not a redesign.
Each slot connects to its rail through an individually controllable relay, switched remotely through the cloud API. Every device can be powered up, powered down, or rebooted independently. That covers emulating power loss, OTA-during-reboot edge cases, brownouts, and most of the messier failure modes that single-bench setups can’t reach.
The presets do something else, too. A Rachio valve runs ten to fifteen months on alkaline before voltage sags into the cutoff range, and we didn’t have a year to wait for that to play out. So the rig walks each slot through the depletion curve via API calls, stepping the regulated rail down over hours instead of seasons. Firmware behavior near end-of-life (radio retries, missed transmissions, the bootloader’s last gasp on a brownout) becomes testable on a Tuesday afternoon instead of a calendar reminder twelve months from now.
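The compressed depletion run can be sketched roughly as follows. This is an illustration under stated assumptions, not the rig’s actual script: the set_rail callable stands in for whatever API call switches a slot’s supply, and the equal dwell time per preset is a deliberately simplistic stand-in for a real alkaline discharge curve.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterator, Sequence

PRESETS_V = (3.0, 2.3, 1.9)  # the three rail presets named above, highest first


@dataclass(frozen=True)
class Step:
    start_s: float   # offset from the start of the run, in seconds
    voltage: float   # rail preset to apply at that offset


def depletion_schedule(total_hours: float,
                       presets: Sequence[float] = PRESETS_V) -> Iterator[Step]:
    """Split the run into equal dwell periods, one per preset (assumed shape)."""
    dwell_s = total_hours * 3600 / len(presets)
    for i, voltage in enumerate(presets):
        yield Step(start_s=i * dwell_s, voltage=voltage)


def run(schedule, set_rail: Callable[[int, float], None], slot: int,
        sleep: Callable[[float], None] = time.sleep) -> None:
    """Apply each step to one slot; set_rail is a hypothetical API wrapper."""
    previous = 0.0
    for step in schedule:
        sleep(step.start_s - previous)  # wait until this step is due
        set_rail(slot, step.voltage)
        previous = step.start_s
```

Calling run(depletion_schedule(6), set_rail, slot=4) would walk slot 4 from 3.0 V to 1.9 V over six hours. Injecting sleep keeps the schedule testable without waiting.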

The gateway sits in its own enclosed section, away from the slot panel. Independent power. Swappable modules so a gateway revision can be replaced in seconds. The isolation matters for a few tests: measuring mean power consumption across the full product scope without slot noise polluting the reading; simulating instant gateway power loss without yanking cables; running high-noise radio tests on one side without contaminating measurements on the other.
One slot, up close
There’s the device, say a Rachio valve, held in its mounts with the back of its PCB exposed through what the CAD documentation calls the Product Accessibility Window. The window matters. It’s the difference between probing a node with an oscilloscope in five seconds and disassembling the device to get to it. To the right of the device, a narrower opening for the JTAG/SWD debugger, with the JLink already snapped into its quick-release mount. Cable channels behind the back-plane carry power and signal lines into the slot without crossing the front face.


The point of all that: an engineer running a debug session shouldn’t have to fight the rig for access to the hardware. Every minute spent removing a screw to get to a test point is a minute not spent finding the bug.
Day to day
Pull all these ergonomic decisions together and you get a piece of equipment that engineers don’t have to fight. Working surface big enough for an oscilloscope, a laptop, and the soldering iron everyone forgets to put away. Cable management runs through the back-plane, so the front face stays clear of dangling wires. The roller base means the engineer who needs hands-on time pulls it to their desk. None of these decisions are individually clever. They add up to whether the rig gets used or quietly accumulates dust.
Remote access matters more than the ergonomics, actually. A distributed team is a given on any modern firmware project, and infrastructure that only works when an engineer is physically next to it isn’t infrastructure – it’s a bench setup. Each slot exposes its full operational surface through a cloud API: power on/off, debug log streaming, OTA invocation, per-device command endpoints. An engineer in another country can run a regression suite, push an OTA build, watch logs stream back, and triage a failure without anyone in the office knowing they’re working. Physical presence is needed for specific hardware sessions: probing a node with an oscilloscope, swapping a board revision, manually reshaping the radio environment. That’s it. Full automation of those last cases isn’t worth the engineering cost.
The branding and the build
After final assembly, the client asked us to put their branding on the front panel. The whole rig has since shown up in Rachio’s promotional material alongside the product.

The build itself ran in parallel with the firmware work and covered physical fabrication, electronics integration, software bring-up, and branding panels. One of our engineers drove the mechanical and integration work end-to-end, with a designer producing the branding panels.
Building general-purpose infrastructure when a project shape calls for it has become part of how we work. The Rachio engagement gave us the right hardware mix and the right validation problem to design this particular fixture. It’s been earning its place since. We don’t build a dedicated lab for every client. When a project really needs one, it’s a separate engagement, scoped and resourced like any other piece of paid work.
The rig at work

The walk-around showed what the rig is. What follows shows what it does. Software moves through it from commit to slot to validated firmware to a morning dashboard. The same channels that script the testing turn out to be useful for customer support, then for the next project, then for the project after that. Some outcomes we planned for. Some surprised us.
Remote control over every slot
The rig is one MCU as the brain, a chain of shift registers driving the relays, and a thrown-together cloud API on top.
The rest is detail.
Twenty slots, each one needs to be powered up, powered down, or rebooted independently, from a remote terminal, from a test script, or from a phone. Current draw on each voltage rail needs to be readable, both as a sanity check (a slot that just got 5 V instead of 3 V is a problem you want to catch before something starts smoking) and as a metric for power consumption logging during long-running tests. Battery-powered devices like the Rachio valve added a third requirement: emulate a slow battery voltage drop and read current in different operating modes (active, sleep, radio TX). On a single bench you can do all of this with bench equipment and patience. On a twenty-slot rack with multiple engineers running things from different time zones, you can’t.
The brain is an ESP8266 microcontroller. Cheap, Wi-Fi-enabled, able to drive the whole relay chain from three GPIO pins, with libraries you can pull off GitHub in a morning. For an in-house tool, fine.
Between the ESP and the relays sits a TXB0104, a bidirectional level converter that bridges the 3.3 V logic of the ESP to the 5 V logic of the shift registers. Direct connection sometimes works. Direct connection plus a long cable run plus a noisy supply often doesn’t. Five dollars of level converter is worth the difference.

The shift registers are three 74HC595 chips wired in a daisy chain. Each chip has eight outputs, so the chain gives twenty-four output bits, one per relay, driven by three GPIO pins of the ESP (DATA, CLOCK, LATCH). Writing a 24-bit value across the chain is one operation. Without shift registers we’d have needed either a GPIO expander board or a more expensive MCU with thirty-some accessible pins, both of which cost more and complicate routing. Three chips in a row, three wires, done.
The relays themselves are six 4-channel modules built around the Songle SRD-05VDC-SL-C – 5 V coil, 10 A / 250 VAC contacts. Twenty-four channels in total. Each channel ties one slot to one of the voltage rails. Switching a relay takes a few milliseconds and produces an audible click that tells you, in passing, whether the firmware actually did what you asked it to.
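The 24-bit write can be modeled in a few lines. The sketch below is Python rather than the ESP firmware, and the mapping of bit N to relay N, the MSB-first order, and the three pin callbacks are assumptions made for illustration. Real relay modules are often active-low, which would invert the mask.

```python
from typing import Callable, Iterable

CHAIN_BITS = 24  # one bit per relay channel, as described above


def relay_mask(channels_on: Iterable[int]) -> int:
    """Build the 24-bit value with one bit set per energized relay channel."""
    mask = 0
    for channel in channels_on:
        if not 0 <= channel < CHAIN_BITS:
            raise ValueError(f"channel {channel} outside 0..{CHAIN_BITS - 1}")
        mask |= 1 << channel
    return mask


def shift_out(mask: int,
              set_data: Callable[[int], None],
              pulse_clock: Callable[[], None],
              pulse_latch: Callable[[], None]) -> None:
    """Clock the mask into the chain, then latch all outputs at once."""
    # MSB first: the last bit shifted in (bit 0) lands on the first output
    # of the first register in the chain.
    for bit in range(CHAIN_BITS - 1, -1, -1):
        set_data((mask >> bit) & 1)
        pulse_clock()
    pulse_latch()  # outputs change only here, so relays never see a half-written value
```

The latch step is the reason the write behaves as one operation: the relays see the old pattern until all twenty-four bits are in place.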

Power for the slots comes from three LM2596 step-down regulators set to 1.9 V, 2.3 V, and 3.0 V. Three rails; every slot is grouped under the rail it needs. Adding a fourth rail for an unusual voltage is a few hours of work – a new regulator, a new INA219, a few more relay channels. The architecture doesn’t fight you when product requirements drift.
Current monitoring uses one INA219 per rail, each sitting on the I2C bus shared with the ESP. That means current readings cost no GPIO at all, and adding sensors in the future doesn’t push you into a new MCU. Their I2C addresses are pin-configured through A0/A1, so several of them coexist on the same bus without conflicts.
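For readers who haven’t used the part, here is a rough sketch of how raw INA219 readings turn into the sanity check described earlier. The register scaling (4 mV per bus-voltage step, 10 µV per shunt-voltage step) and the address pairs follow the INA219 datasheet; the 0.1 Ω shunt, the 5% tolerance, and which rail sits at which address are assumptions, not details of the actual rig.

```python
# I2C addresses selected by tying (A1, A0) to GND or VS; rail assignment is assumed.
RAIL_ADDRESSES = {1.9: 0x40, 2.3: 0x41, 3.0: 0x44}


def bus_voltage_v(raw: int) -> float:
    """Decode the bus voltage register: bits 15..3 hold the value, 4 mV per LSB."""
    return (raw >> 3) * 0.004


def shunt_current_ma(raw: int, shunt_ohms: float = 0.1) -> float:
    """Decode the signed shunt voltage register (10 uV per LSB) into milliamps."""
    if raw & 0x8000:          # two's complement, 16-bit
        raw -= 1 << 16
    return raw * 10e-6 / shunt_ohms * 1000


def rail_ok(measured_v: float, expected_v: float, tolerance: float = 0.05) -> bool:
    """True when the measured rail is within the assumed tolerance of its preset."""
    return abs(measured_v - expected_v) <= expected_v * tolerance
```

With these helpers, a slot that reads 5 V on the 3.0 V rail fails rail_ok immediately, before anything starts smoking.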

The software side is Blynk. Blynk gave us an HTTPS API, a mobile app, and zero maintenance burden in about two days of work. The one thing it didn’t ship was a C library for ESP that worked outside the Arduino IDE. One of our interns wrote it. Test scripts call the API to cut power to a slot, engineers tap a button on their phone for the same effect. Same API, two interfaces.

The whole module sits on its own sheet that slides into the rack like a drawer. Pull it out, swap an ESP, push it back in. A maintenance event takes minutes instead of hours, which matters because in-house tooling that requires hours of downtime to fix tends to acquire workarounds, then dependencies on the workarounds, then nobody wants to touch it.

A word on what we just described. ESP8266 + Blynk + a composite panel is not the architecture you’d put in a product going to market. For an internal R&D rig running on a private subnet, the choices are sized to what the rig has to do, which is flip relays and read current. Picking the right level of engineering for the use case is a judgment call our team takes seriously. A productized version of this stack would require a more capable MCU and a custom backend. This one is a tool. The tool needs to work, not to scale.
From commit to slot
Standard CI/CD with one important detail: every build artifact stays available, so any engineer can pull any historical version in under a minute.
The firmware CI/CD pipeline itself is unremarkable. GitHub Actions or Jenkins, depending on what the project needs. Custom runners where firmware builds require specific toolchains, like the right ARM GCC version for Nordic chips. Anyone who’s shipped embedded firmware in the last decade has a working build pipeline. The pipeline isn’t where the differentiation lives.
What matters is what comes out of it.
Every successful build pushes a generous bundle to S3: the firmware image, the AWS Jobs document for OTA delivery (or its equivalent), JLink addresses for the right hardware revision, build metadata, dependency versions, toolchain hash. If something can be generated at build time, generate it. If you’ve ever rebuilt a firmware version manually at 2am during an incident because the original artifacts were garbage-collected, you don’t need this argument made twice.
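One way to picture that bundle is as a manifest plus a predictable key layout. The sketch below is an assumption throughout: the field names, file names, and the firmware/&lt;hardware revision&gt;/&lt;version&gt;/ prefix are illustrative, not the layout used on the project.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class BuildBundle:
    version: str
    hw_revision: str
    toolchain_hash: str
    dependencies: dict = field(default_factory=dict)

    @property
    def prefix(self) -> str:
        """Object-store prefix under which every artifact of this build lives."""
        return f"firmware/{self.hw_revision}/{self.version}/"

    def object_keys(self) -> dict:
        """Map each (hypothetical) artifact name to its full key."""
        names = ("firmware.bin", "ota_job.json", "jlink_addresses.json", "manifest.json")
        return {name: self.prefix + name for name in names}

    def manifest(self) -> str:
        """Serialize build metadata so the bundle describes itself."""
        return json.dumps(asdict(self), sort_keys=True, indent=2)
```

Because the keys are derived from the version and hardware revision alone, a deploy tool can locate everything it needs from those two values.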
The deployment interface that pushes those artifacts to a slot has to be cheap to use. Pushing a firmware version to a specific device is a single command: device ID, version, done. Everything else gets resolved automatically from what the build pushed.

Caching every build is the second non-negotiable. v2.3.7 from a customer ticket on Tuesday means v2.3.7 flashed to a slot by Wednesday morning, not v2.3.7 reconstructed from a year-old commit by Friday afternoon. Disk is cheap. The toolchain version that built v2.3.7 a year ago may or may not still exist in any clean form, and figuring that out is exactly the work you don’t want to be doing during a customer incident.
What this pipeline produces lands on the test rig through one interface. A test script asks for a version. An engineer types one command at a terminal. Same call, two callers. The pipeline doesn’t care who’s pulling from it, and the test infrastructure doesn’t care how the bundle got there. That separation is what lets the layer above stay sane two years from now.
Black-box tests, every night
One command pushes a firmware version to any device on the rack, a set of scripts walks through every user-facing scenario, and the result is a go/no-go verdict. Feedback cycles are measured in hours instead of days.
The OTA harness has dozens of configuration parameters under the hood. You almost never touch them. You hand it a target equipment ID and a firmware version, and everything else (test mode, retry counts, validation timeouts, pairing parameters) resolves automatically from the build artifacts described in the previous block. The full parameter set is still accessible when a scenario calls for it.
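The resolution order described above can be sketched as a layered merge: built-in defaults, then values shipped with the build, then explicit overrides. The parameter names and default values below are hypothetical, chosen only to mirror the examples in the text.

```python
from typing import Optional

# Hypothetical defaults; the real harness has dozens of parameters.
DEFAULTS = {
    "test_mode": "full",
    "retry_count": 3,
    "validation_timeout_s": 120,
}


def resolve_params(equipment_id: str, version: str,
                   bundle_params: dict,
                   overrides: Optional[dict] = None) -> dict:
    """Merge defaults, build-artifact values, and explicit overrides, in that order."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        # Catch typos early instead of silently ignoring an override.
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    params = {**DEFAULTS, **bundle_params, **overrides}
    params["equipment_id"] = equipment_id
    params["firmware_version"] = version
    return params
```

The common case needs only the two required arguments. The overrides dictionary is the escape hatch for the rare scenario that needs a non-standard timeout or retry count.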
The validation suite around the harness is black-box. For every user-triggerable action, a script invokes the flow and checks the result through cloud logs or indexed records. Go or no-go, no opinions about how the firmware got there. We deliberately stay away from white-box tests on firmware. If you’ve ever watched your test suite turn into a refactor blocker, where engineers start “fixing” the tests because the tests are easier to change than the code, you’ve seen the failure mode we’re avoiding. Black-box tests survive refactors and reflect what the user actually experiences, which is what we’re shipping.
Two examples show the difference.
- Pairing. The script triggers pairing on an end-device through its command endpoint, triggers pairing on the gateway through another endpoint, watches the cloud logs to confirm the two devices have found each other, and times the handshake. Fully unattended, every night, covering most real-world pairing cases. The harder variant of the same test is what we call noisy pairing. While one device is trying to pair, the rig power-cycles the rest of the already-paired fleet to flood the BLE channel with traffic. The interesting number is the delta: how much longer the handshake takes when the air is busy. On this device, it ran up to fifteen seconds longer than the clean-channel baseline, and we extended the mobile app’s pairing timeouts to match. Without the rig, that number would have come from a customer support ticket six months after launch.
- Recovery after connection loss is the example that needed the rig to exist at all. The script powers up the device, confirms it associated with the gateway, then uses the power management API to cut power to the gateway. Wait thirty seconds. Power back on. Confirm the device rejoined automatically and didn’t lose any queued events while disconnected. If you’ve ever stood next to a rack flipping a power switch on a stopwatch trying to reproduce a customer report, you already know why this test only ever runs when the infrastructure runs it for you.
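The recovery scenario translates almost line for line into a script. The sketch below runs against an in-memory FakeRig so it is self-contained; the method names (set_power, is_associated, trigger_event) are hypothetical and stand in for the rig’s power management API and the device command endpoint.

```python
import time
from typing import Callable


class FakeRig:
    """In-memory stand-in for the rig API so the sketch runs without hardware."""

    def __init__(self) -> None:
        self.powered = {"gateway": True, "device": False}
        self.generated = 0   # events the device has produced
        self.delivered = 0   # events visible on the cloud side

    def set_power(self, target: str, on: bool) -> None:
        self.powered[target] = on
        if self.is_associated():
            self.delivered = self.generated  # queue flushes once the link is back

    def is_associated(self) -> bool:
        return self.powered["gateway"] and self.powered["device"]

    def trigger_event(self) -> None:
        self.generated += 1
        if self.is_associated():
            self.delivered += 1


def recovery_test(rig, outage_s: float = 30.0,
                  sleep: Callable[[float], None] = time.sleep) -> str:
    """Return 'go' if the device rejoins and no queued event is lost."""
    rig.set_power("device", True)
    if not rig.is_associated():
        return "no-go"
    rig.set_power("gateway", False)   # instant gateway power loss
    rig.trigger_event()               # device has to queue this while offline
    sleep(outage_s)
    rig.set_power("gateway", True)
    rejoined = rig.is_associated()
    nothing_lost = rig.delivered == rig.generated
    return "go" if rejoined and nothing_lost else "no-go"


verdict = recovery_test(FakeRig(), sleep=lambda s: None)  # skip the real wait in the demo
```

Against real hardware the association check would poll cloud logs with a timeout rather than return instantly, but the shape of the test stays the same.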
Every test records execution time, retry count, and small anomalies that didn’t fail the check but didn’t match prior runs. All of it lands in a structured log. Call it observability for in-house testing if you like. That second layer is where drift gets caught early.

A handshake that used to last two seconds and now averages three is still passing the test. The threshold isn’t reached. But something in the codebase shifted, and treating green-but-slower as a soft warning is one of the cheapest early signals you can get. Real bugs often arrive two weeks behind the first drift sample.
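A minimal sketch of that soft-warning rule, assuming a simple comparison of the latest run against the mean of prior runs. The 25% drift ratio and the five-second hard limit in the usage line are made-up values for illustration, not the thresholds used on the project.

```python
from statistics import mean
from typing import Sequence


def classify(history_s: Sequence[float], latest_s: float,
             hard_limit_s: float, drift_ratio: float = 1.25) -> str:
    """Return 'fail', 'warn' (green but slower), or 'pass' for one measurement."""
    if latest_s > hard_limit_s:
        return "fail"                      # the test's own threshold was crossed
    baseline = mean(history_s)
    if latest_s > baseline * drift_ratio:
        return "warn"                      # still passing, but drifting from prior runs
    return "pass"


status = classify([2.0, 2.1, 1.9], latest_s=3.0, hard_limit_s=5.0)
```

In this example the three-second handshake stays under the hard limit but sits well above the two-second baseline, so it is flagged as a warning instead of passing silently.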
Aggregated results land on a dashboard that the team checks in the morning. A quick scroll through last night’s runs replaces what used to be a long standup with everyone reciting status. Anything red gets a name attached and goes on the board.
One channel, many uses
The firmware exposes one endpoint that takes a 4-byte value or a short string key. The handler dispatches to whichever internal action matches: reboot, factory reset, enter pairing mode, force an OTA check-in, dump diagnostic state, change radio TX power. New actions get added by extending an enum and writing a handler. The channel itself doesn’t change.
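The dispatch pattern is small enough to show in full. The sketch below models it in Python rather than in the firmware’s own language. The numeric command codes, the little-endian encoding, and the handler return strings are assumptions; only the list of actions comes from the description above.

```python
import struct
from enum import IntEnum
from typing import Callable, Dict


class Command(IntEnum):
    REBOOT = 1
    FACTORY_RESET = 2
    ENTER_PAIRING = 3
    FORCE_OTA_CHECK = 4
    DUMP_DIAGNOSTICS = 5


HANDLERS: Dict[Command, Callable[[], str]] = {}


def handles(command: Command):
    """Register a handler; adding an action means one enum entry and one function."""
    def register(func: Callable[[], str]) -> Callable[[], str]:
        HANDLERS[command] = func
        return func
    return register


@handles(Command.REBOOT)
def reboot() -> str:
    return "rebooting"


@handles(Command.FACTORY_RESET)
def factory_reset() -> str:
    return "factory reset"


def dispatch(payload: bytes) -> str:
    """Accept a 4-byte little-endian code or an ASCII key and run the handler."""
    if len(payload) == 4:
        (code,) = struct.unpack("<I", payload)
        command = Command(code)            # raises ValueError on an unknown code
    else:
        command = Command[payload.decode("ascii").upper()]  # KeyError if unknown
    return HANDLERS[command]()
```

One caveat of this particular sketch: a four-character string key would be read as a numeric code, so string keys here are assumed to be longer or shorter than four bytes.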
Of all the pieces in this stack, this is the one we’d build into a product first if we were starting a new IoT project tomorrow.
Cloud-side, the endpoint hangs off the same product channel that a user’s phone app already talks to. Authenticated, scoped per device, no parallel infrastructure to stand up. The credentials a device already has cover the test channel automatically. From a script’s point of view, sending a command looks identical whether the caller is a test runner, a customer support agent, or a developer poking at a bench unit.
That last detail is where the value actually compounds.
When a customer calls in saying their device won’t pair, the support agent triggers pairing remotely through the same command an engineer’s automated test calls. When a complaint comes in about strange behavior, support pulls diagnostic state without shipping the device back. Internal validation and external support run on the same code path, which means a fix or an addition lands in both places at the same time. If you’ve ever maintained two near-identical command APIs, one for QA and one for support, you already know what we’re trying to avoid here.

The graduation in the diagram isn’t a one-off pattern. The diagnostic dump that engineers used during bring-up turns into a “send report to support” button in the customer app. A radio-power switch built for test rigs becomes an “indoor / outdoor mode” toggle once the product team realizes end users have the same calibration problem. Mechanisms built for engineers tend to be useful to power users for the same reasons they’re useful to engineers.
The pattern travels well beyond Rachio. Most cloud-connected embedded products our team has worked on since – smart home firmware development jobs, industrial telemetry, consumer wearables – have gotten a command endpoint baked in from the start, by default. Costs in firmware are negligible: a few hundred bytes of flash and an enum. The cost of not having one shows up later as every test script reinventing its own remote-trigger mechanism, support agents getting blocked on hardware they can’t reach, and customer reports that close as “could not reproduce” because the conditions vanished before anyone could look. If your team is starting a new cloud-connected embedded product, this is the architectural decision we’d ask you to make first.
Build it once, and you’ll spend the next three years finding new uses for it.
Direct results
The Smart Timer reached retail before summer started – the only window that mattered. Miss the season, lose the year. Our success story puts it: “The client successfully launched their product to the market on schedule (in the appropriate season for the product).” And it’s true. Eight months of firmware work for a product of this scope is fast. Without the booth, that timeline meant either a delayed release or shipping a sealed retail box with too many bugs in it.
Customer support saw the result on its end. Firmware tickets stayed close to zero across the first six months on the market. Mostly UX, handled by the client team. Some hardware: replacements, warranty, the occasional broken faucet adapter. Bugs reaching end users were rare enough that support didn’t need a dedicated triage path.
Four firmware engineers on our team closed the full development and validation cycle. The default assumption runs the other way: at hundreds of thousands of units, pre-launch firmware validation gets handed to a dedicated QA team. Mass production firmware testing, in the usual shorthand. What stood in for that team here was the booth: per-slot remote control, regressions running overnight, and a deploy interface that doesn’t punish engineers for using it. Four people maintained coverage a much larger team would normally need.
Internal validation caught most of what could have escaped to external QA. The external pass returned a thin list of issues, which shifted what external QA actually meant on this project. Not the first filter catching missed defects. The final sign-off, which is the role external QA does well anyway.

Bugs found earlier are cheaper to fix. Those surviving to external QA tend to get fixed in a hurry, often badly. Compressing the ratio toward “caught at dev” is what makes a firmware team look organized rather than heroic.
The booth didn’t stop when the office emptied. Regressions ran through nights, weekends, and the gap between sprints. Nobody had to be there.

Setup time for a fresh test session collapsed to nearly nothing. Configuring hardware for a debug run used to take a meaningful chunk of a working day. Now it takes one command in a script. Across four people and eight months, the saved hours added up to numbers nobody bothered tallying.
There’s a financial frame worth setting. The Rachio Smart Timer ships in tens of thousands of units. Every percentage point of field defect rate translates to RMAs, support call volume, App Store reviews that get ratio’d, and brand erosion that takes a year to walk back.
The difference between a 1% defect rate and a 0.05% one can reach millions in support costs. It’s also a measurably different reputation profile twelve months later. The harder figure to set a dollar value on is the second-order effect on team velocity, which compounds across the life of the product.
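To put unit counts behind those two rates, here is a small worked sketch. The fleet size of 50,000 is borrowed from the opening example of this article and is an assumption, not a Rachio shipment figure. Cost per incident is deliberately left out, because no such number appears in the project data.

```python
def affected_units(fleet_size, defect_rate_bp):
    """Units expected to fail, with the rate in basis points (1% = 100 bp)."""
    return fleet_size * defect_rate_bp // 10_000


FLEET = 50_000  # assumption: the fleet size from the article's opening example

high = affected_units(FLEET, 100)  # 1% defect rate
low = affected_units(FLEET, 5)     # 0.05% defect rate
print(high, low, high - low)       # prints: 500 25 475
```

Multiply the gap by your own cost per field incident (replacement unit, shipping, support time) to get the direct part of the bill. Reviews and brand erosion come on top of that.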
Side effects we didn’t plan for
Everything in the previous section was deliberate. What follows wasn’t. Within weeks, the client’s engineering team started reserving slots for their own R&D. We’d built it for ourselves, but once they saw what we’d seen (reliable, fast, scriptable hardware), they started running experiments on it. Validation infrastructure for our own use, accidentally also a research tool for them.
Bug reproduction time fell sharply. The team felt this one most. A customer reports a bug; an engineer remotes in, scripts the configuration that triggered it, and watches the same failure the customer described. Setup used to eat half a debugging session – finding a unit, configuring it, putting it on a network resembling the customer’s. Now those steps run from a script before the engineer sits down. Anyone who’s chased an intermittent IoT bug knows what shifted here.
The command endpoint infrastructure built for testing turned into customer support tooling. Cases that used to need a return started getting closed remotely on the first call. The agent triggers a diagnostic dump, sees the device’s state, walks the customer through a recovery path or pushes the fix directly. “Wait a week for a return” became “we’ll fix this in five minutes.” Logistics overhead dropped with the return rate. We covered the technical shape of the endpoint earlier. Business impact (fewer returns, fewer angry calls, less brand erosion) showed up in client metrics nobody tried to influence directly.
Then there’s the pattern we genuinely didn’t expect. On the next embedded project after Rachio, a new client walked through our office, saw the booth, paused, and came back later asking whether their product could be tested on it. The project after that, the same thing happened. Another client we work with did the same. The booth now hosts validation for several different embedded products in parallel, none with anything to do with smart irrigation. Designing it as universal infrastructure instead of a Rachio-specific jig was a decision we made years before there was anyone else around to use it. That decision turned into a recurring entry point for new engagements.
Without anyone selling it.


Days became hours. Compressing the feedback loop changed what kinds of work engineers were willing to attempt. Architectural refactors used to sit in the “someday” pile – too risky, too easy to break something invisible. You know the pile. Those refactors became routine because the regression suite would catch what broke before anyone noticed. Engineers got bolder because the cost of being wrong dropped.
Three years later
It’s still in use. What changed is what’s around it.
It’s stopped being “the Rachio lab” and turned into standard validation infrastructure for our embedded work. New projects start with the assumption that they’ll plug into it. The patterns we built (single-command deploys, black-box test scripts, command endpoints) are now the default we apply to every cloud-connected device our team ships. It sits inside our automated QA now. CI/CD routes builds to slot configurations, commit hashes link to regression failures, reports land in the right Slack channel without curation, and failed tests turn into prefilled tickets.
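The last step in that chain, a failed test turning into a prefilled ticket, can be sketched in a few lines. Everything named here (the RegressionResult fields, the ticket layout, the slot number) is an assumption for illustration; the article does not describe the actual pipeline or tracker.

```python
from dataclasses import dataclass


@dataclass
class RegressionResult:
    # Hypothetical record of one test run on one slot.
    name: str
    slot: int
    commit: str
    passed: bool
    log_tail: str = ""


def prefilled_ticket(result):
    """Build ticket fields from one failed result; None if the test passed."""
    if result.passed:
        return None
    return {
        "title": f"[regression] {result.name} failed on slot {result.slot}",
        "commit": result.commit,
        "body": f"Commit: {result.commit}\nSlot: {result.slot}\n\n{result.log_tail}",
    }


def tickets_for_run(results):
    """One ticket per failed test, in the order the results arrived."""
    return [t for t in map(prefilled_ticket, results) if t is not None]
```

Carrying the commit hash and slot in the ticket itself is what lets an engineer go from the notification to a reproduction without asking anyone what was running where.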
“Build the testing infrastructure your team would actually use. Everything that doesn’t get used daily doesn’t work. If running a test takes twenty minutes of setup, an engineer won’t run it. Documentation past ten pages goes unread. A rig that requires physical presence stops working for a distributed team. A workflow that needs three people coordinating to push a build is a workflow that pushes one build a week. Pragmatism wins this category every time. The temptation to over-engineer testing infrastructure is real because it’s the one piece of the project where nobody is breathing down anyone’s neck about deadlines.
Resist it.”
Nikita Provatorov, Embedded Team Lead, Head of Engagement, testing stand creator.
Don’t overbuild
Plenty of products don’t need this level of infrastructure. A single-device Bluetooth accessory with no cloud, no OTA, and no multi-device interactions doesn’t justify a twenty-slot rack. An MVP that hasn’t found product-market fit is the wrong moment to build permanent test infrastructure – most of it gets thrown away. A simple firmware update once a year for a niche industrial device is fine to validate by hand.
The math changes when several factors stack up: scale, cloud, OTA, multiple interacting devices. Then a distributed team. A deadline that doesn’t move. Plus, a product expected to live in the field across several firmware versions.
In that combination, hand-validation stops being viable somewhere around the second or third release. You can wing the first release. By the second, nothing you debugged manually is reproducible without the original setup. By the third, the same hardware-environment problems are recurring. The cost of building infrastructure starts to look small compared to the cost of not having it.
The signals
Infrastructure built for real work changes how clients read the team behind it. A demo in a deck is something anyone can produce. A working test rig that handles twenty devices, runs overnight without anyone in the office, and has been catching bugs for several years is not. That level of reliability is something you can’t fake. Clients who see it in our office tend to ask about it before anything else. The infrastructure becomes the answer to questions we haven’t been asked yet. That’s why design choices like “make it universal from day one” and “make it usable without being in the office” matter – for the team using it daily and for everyone who sees it in use.
Investments in testing infrastructure rarely show an immediate visible payoff. Most of the value lies in the second half of the project, sometimes after launch. Teams thinking only in terms of the current sprint don’t make them. Teams thinking in terms of product lifetime do. It’s a maturity question more than a technical one.
Planning a high-volume IoT release? We’re around to talk through validation for your specifics. We already have a booth. It already knows how to test embedded devices.
FAQ
Doesn't all this slow the team down at the start?
Yeah, for a few weeks. People are figuring out the setup, nothing feels fast. After that, bug repro alone pays it back, and every refactor gets cheaper. Short-term tax.
What should I ask a firmware vendor before signing?
Three things. Show me a test rig from something you've shipped. How long to reproduce a customer bug on real hardware? Can you pull last year's firmware onto a device next week?
Can't we just use Memfault or Mender instead?
Partly. Memfault gives you observability after launch – crash logs, fleet metrics from real devices. Mender handles OTA delivery and rollback. Both are solid. We use them. Neither replaces the lab side, though. You still need a way to validate firmware on real hardware.
What if multiple engineers need the rig at the same time?
Without a booking layer, twenty slots get messy fast. People run incompatible firmware on the same slot, and overnight regressions die because someone rebooted the rail for power-loss testing. We tried a few things and landed on the boring one: a shared dashboard with a calendar view, where you reserve the slot you want for the hours you want it. A Google Sheet would honestly do most of the work. The rig is also on rollers, so if someone needs it at their desk for a session, they wheel it over.
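The core of any booking layer is a single overlap check per slot. Here is a minimal in-memory sketch, assuming twenty slots numbered from 1. The SlotCalendar class and its method names are hypothetical and say nothing about how our dashboard is built.

```python
from datetime import datetime


class SlotCalendar:
    """In-memory reservations: slot number -> list of (start, end, owner)."""

    def __init__(self, slot_count=20):
        self.bookings = {slot: [] for slot in range(1, slot_count + 1)}

    def reserve(self, slot, start, end, owner):
        """Book a slot for [start, end); return False on any overlap."""
        if slot not in self.bookings or start >= end:
            return False
        for other_start, other_end, _ in self.bookings[slot]:
            # Two half-open intervals overlap when each starts before the other ends.
            if start < other_end and other_start < end:
                return False
        self.bookings[slot].append((start, end, owner))
        return True


calendar = SlotCalendar()
nine, noon = datetime(2024, 1, 8, 9), datetime(2024, 1, 8, 12)
eleven, two = datetime(2024, 1, 8, 11), datetime(2024, 1, 8, 14)
print(calendar.reserve(4, nine, noon, "alice"))  # True: slot 4 was free
print(calendar.reserve(4, eleven, two, "bob"))   # False: overlaps alice's booking
```

Half-open intervals mean back-to-back bookings do not collide, so one session can end at noon and the next can start at noon on the same slot.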
Isn't ESP8266 plus Blynk a security problem?
For a product going to market, yes. For an internal R&D rig on a private subnet, no. The rig doesn't carry any data beyond the power-relay state. Worst case, somebody powers down a slot. A version of this we'd ship would need an STM32, a backend we own, and a real security review. This isn't that.
How do you handle OTA credentials?
Like production database credentials, because that's effectively what they are. Anyone who pulls them can push firmware to every device in the field. One-time secret links for the initial handoff, scoped per-job credentials at run time, minimum access on every token. A compromised credential is a few minutes of rotation, not a coordination crisis across teams.
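As an illustration of what "scoped per-job" can mean, here is a sketch of a short-lived signed token that is only valid for one job and one scope. It uses only the Python standard library. The function names, the scope strings, and the token layout are assumptions for this example and not the mechanism used on the project; in production, the scoped credentials offered by your cloud provider or secret manager are a better fit than a hand-rolled token.

```python
import base64
import hashlib
import hmac
import json
import time


def issue_token(secret, job_id, scope, ttl_s, now=None):
    """Sign a short-lived token limited to one job and one scope."""
    now = time.time() if now is None else now
    claims = {"job": job_id, "scope": scope, "exp": now + ttl_s}
    payload = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    signature = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + signature


def check_token(secret, token, required_scope, now=None):
    """Return True only for an untampered, unexpired token with the right scope."""
    now = time.time() if now is None else now
    payload, _, signature = token.partition(".")
    expected = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return False
    claims = json.loads(base64.urlsafe_b64decode(payload))
    return claims["scope"] == required_scope and now < claims["exp"]
```

A token issued for a staging OTA job fails the check for a production push, and it stops working on its own once the time limit passes, which is what keeps a leak small.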



