My hardest bug to debug

When I interview developers, one of the questions I sometimes ask is "What's the hardest bug you've ever debugged?". Most people like to regale war stories of a particular missing semi-colon, a hard to use API or their struggles with modifying old, undocumented code. To date, this is my answer to that question.

The Machine

The solution in question was a machine with multiple barcode readers on it. Specifically Microscan's ID40. We'd made eight of these machines, one of which had been in production for a few months in the UK while the remaining seven has just been set up and installed on site in China. The problem arose on one of the seven in China.

The Issue

We received a report claiming that one of the machines had stopped shortly after it went into production. The customer said that there was an error that they couldn't clear. This error was halting production of product and every minute of downtime is lost revenue. This is not a situation we want for any period of time. I remoted in to find that the camera had stopped sending results to the PLC (Programmable Logic Controller) resulting in an alarm on our software user interface. On this machine the ID40s were wired in to output the pass or fail signal as digital IO to the PLC directly. This indicated that the failure was likely nothing to do with software, however I've learned after a few years in this industry that hardware and PLCs are generally rock solid and 95% of the time it's the software at fault, so I always assume it's the software.

This failure literally means that the PLC believes it triggered the camera and it did not receive a digital IO result before it triggered the camera again. This means that either the camera isn't being triggered by the PLC, the camera isn't outputting a pass/fail signal, or the PLC isn't receiving the pass/fail signal. I assumed this was a one off but if that was the case a reset and resume would clear it and the customer reported that this wasn't working. The machine would continue for one product then stop again. After observing this for myself remotely I told them to restart the machine. I had no logical explanation for this, and hoped it'd just be a one off (it's never a one off). A few hours later I get another call with the same issue. This time the customer has disabled the camera that was giving the issue, but now a second (mandatory) camera is exhibiting the same fault.

Debugging Day 1

At this point this is my thinking:

Identical software has been running in the UK for months, albeit on a machine that's slightly different mechanically.
This would normally be a camera configuration issue but if that was the case the issue would present on the first product, not an hour or so after running.
This is occurring on all cameras on that machine, so it's not a hardware failure.

After some discussion in the office we instructed the customer engineers to power cycle the camera. This isn't the easiest of things to ask a customer to do, especially in production. These devices don't have an on/off button so one has to unscrew a collar, disconnect the cable and then reconnect. This seemed to solve the issue, for another 2.5 hours.

With a temporary solution in place we grab one of the control units yet to ship from the workshop and move it into the office. It's identical hardware so we deploy it as though it is the machine exhibiting problems, connect cameras and hack in a pulse generator to simulate digital input from the physical machine producing the product. All this takes about a day and during this time we learn that the problem is reoccurring almost exactly every 2.5 hours. Meanwhile all the other machines deployed in China are up and running 24 hours a day with no issues.

Debugging Days 2-4

We managed to recreate the issue. The camera appears to lock up. the lights are solid on and the trigger light doesn't come on when the pulse generator sends a trigger to it. At this point I'm pretty confident it's not a software problem. We send a support request to Microscan but keep testing in the office. Over the next couple of days we identify that the issue occurs almost exactly every 4.5k triggers. This explains why the customer in China sees the issue repeatedly every 2.5 hours. We also identify that it only occurs at all if the camera is triggered above a certain rate. This explains why this issue only presents itself on one of nine machines.

Debugging Day 5

There's much debate in the office as to whether it's a PLC or camera issue. Neither can be explained with any theory. The director is involved by this point and he suggests disconnecting the camera's Ethernet cable from the machine. We use the Ethernet connection to set up the camera at the start of the batch using Telnet. We send the match string the camera is looking for in the barcode it's reading. We also use the Ethernet connection to get the image seen by the camera as well as the pass or fail result. This is done via web requests. This is independent of the digital IO result direct to the PLC. No Ethernet connection is not a viable solution in production since the user needs to see the image. It renders our software useless since as far as we're concerned, no product inspection is occurring. However, we try it, and it works. The camera runs overnight without locking up.

Debugging Day 6

We repeat the test multiple times, Ethernet connected then disconnected and the results are repeatable. Suddenly it's certain that it's a software issue. The first theory is that the requesting of the image every trigger is causing a problem so we modify the code to remove that functionality and deploy to the test machine. The problem persists. At this point we're stumped. Here's what we know.

If the Ethernet connection is disconnected after the run has started (thereby allowing the match string to be sent) then the camera work fine
Keeping the Ethernet connected causes the cameras to lock up after roughly 4,500 product, plus or minus a hundred.

This makes no logical sense. Once we've sent the match string and we removed the code to request the image, we're not doing anything with the connection.

Debugging Day 7

I install Wireshark on the machine. I have no rational behind why this might work, it's just a guess. We still have the image requesting removed, so what I expect to see is some initial setup of the camera via Telnet and then nothing. What I actually see is a stream of data coming from the camera to us over Telnet. Inspecting the data it's clear it's sending the match results to us over the Telnet connection we opened to configure the camera settings.

This behaviour isn't documented anywhere we can find, and it's certainly not something we intentionally requested. We're not even listening for the responses.

We looked at how we communicate with the camera. We opened a TCP connection the first time it was needed, and left it open for changes to camera settings via the application. We modified this to close the connection once we had sent the required information and open it again if we needed it. We tested this thoroughly over the next few days and it looked solid.

Conclusion

I still don't understand how this caused the camera to lock up. We were receiving the TCP results via Telnet but we weren't reading the stream. Did it just build up in some buffer? How did this cause the camera to lock up? I still can't answer these questions.

We deployed the new version of the software to the customer and it's been running for almost a year at the time of writing with no further changes required. It's produced approximately 10 million products since then.

What I learned from this is the importance of load testing. We tested our machines but not to the level we needed to. We didn't know how many product the customer would be likely to run. We'd run a couple of thousand product in the office and the customer turns it on and wants to run 180k products in a 32 hour batch. We hadn't tested anything near that level of continuous operation.

Since then we ask every customer what the largest batch run they will do is and then we test with double that.