How a timezone error stopped a production system

Introduction

A few months after installing a system, we received a report of an issue from the customer. They said that they were entering one expiry date for the product, and that date was being printed on the product, but a different date was being reported to the external endpoints. We refer to these endpoints as the 'upper levels'. The dates were off by a month. These endpoints are used to report the status of products to government-level databases, which can be queried should any fault be found in the pharmaceuticals later on. So this was a big deal.

I requested more information from the customer and they reported that the date they were inputting was the 1st of April and what was being reported was the 31st of March. As soon as I heard this I assumed some kind of timezone error, but I had no idea where it would be, why it would only show up months into production, or why it wouldn't occur just after a timezone change.

Timezones are not a new idea for us. We knew from the start that we had a system that dealt with dates a lot. Each product has a unique code applied to it. The time this code is generated is stored and must correspond (with some tolerance) to the time it was printed. In addition, we often print expiry dates or dates of manufacture on the products. We were also aware that this product had the potential to be shipped to multiple timezones, be in use for decades and be run 24 hours a day. This meant it needed to cope with timezone changes during production.

I'm a big fan of Jon Skeet's library NodaTime (https://nodatime.org/) and agree with his blog post entitled "What's wrong with DateTime anyway?" (https://blog.nodatime.org/2011/08/what-wrong-with-datetime-anyway.html). We'd used NodaTime throughout the application to help with some of the difficulties of timezones. I'd hoped that this would have prevented such issues, because NodaTime requires you to be very explicit about what you are doing in places where DateTime makes assumptions.

The explanation below includes the actual code that was used in production at the time.

The good. Obtaining the expiry date

The date is obtained by querying an external system. Since the date represents a piece of business data, an expiry date, there is no time component. As such, the external system delivers it to us as the string 190401. This is a non-standard format but we know that it is yyMMdd. It is then held in memory in our system as a NodaTime LocalDate. NodaTime describes this struct as follows.

LocalDate is an immutable struct representing a date within the calendar, with no reference to a particular time zone or time of day.

This is a perfect description of the information we have when the external system returns us the string 190401. We have no timezone, no offset and no time at all. Technically we also have no information about the calendar in use, but it's known to be Gregorian. So far so good.
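To make the parsing step concrete, here is a minimal sketch. It is written in Python using only the standard library, not the C# and NodaTime used in the real system, so that it runs without any dependencies. Python's date stands in for LocalDate, and the function name parse_expiry is hypothetical.

```python
from datetime import date, datetime


def parse_expiry(raw: str) -> date:
    """Parse a yyMMdd string into a plain calendar date."""
    # strptime yields a datetime; .date() drops the time-of-day so that
    # only the information the external system actually gave us remains.
    return datetime.strptime(raw, "%y%m%d").date()


expiry = parse_expiry("190401")
print(expiry.isoformat())  # 2019-04-01
```

The result carries no time, offset or zone, which matches what the string 190401 actually tells us.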

The bad. Storing the expiry date

For any product we're producing, we store the data that is to be printed on it. This data varies from product to product based on the country of destination and the local regulations there. For example, U.S.-destined products have what is known as an NDC (National Drug Code) number. Products for Europe have a GTIN (Global Trade Item Number) and products for Portugal have an NHRN (National Healthcare Reimbursement Number). All products have an expiry.

Because the values could be of different types, we decided to store them all as a string.

The ugly. Converting a date by manufacturing data

This raises the problem of how to convert a LocalDate into a string. This was done by converting the LocalDate to a NodaTime Instant which is defined as follows

An instant is defined by an integral number of 'ticks' since the Unix epoch (typically described as January 1st 1970, midnight, UTC, ISO calendar), where a tick is equal to 100 nanoseconds. There are 10,000 ticks in a millisecond. An Instant has no concept of a particular time zone or calendar: it simply represents a point in time that can be globally agreed-upon.

Once we have an instant we get its representation as a long and save it to the database as a string. The problem here is that we're manufacturing data in the conversion. We began with a date with no time component and turned it into a date and time. Below is the code showing how we did this. The code uses a new type called a ZonedDateTime, which is defined as follows.

A LocalDateTime in a specific time zone and with a particular offset to distinguish between otherwise-ambiguous instants. A ZonedDateTime is global, in that it maps to a single Instant.

value here is of type LocalDate

expiry = value;

Get the current timezone set by Windows.

DateTimeZone timeZone = DateTimeZoneProviders.Bcl.GetSystemDefault();

Take the LocalDate called 'expiry' at the time 'Midnight' to get a LocalDateTime object. This is exactly what it sounds like: a LocalDate with a time component. Assign this LocalDateTime to the current local time zone to give a ZonedDateTime object. This is again what it sounds like: we're adding a timezone to the LocalDateTime.

ZonedDateTime x = timeZone.MapLocal(expiry.At(LocalTime.Midnight)).First();

Convert the ZonedDateTime to an Instant. This is why all the above was necessary. NodaTime won't let you convert a LocalDate directly to an Instant, because such a conversion doesn't make sense: you don't have enough information. Instead we add made-up information to allow us to do the conversion.

Instant expiryInstant = x.ToInstant();

Convert the Instant to the number of ticks since the Unix epoch and then convert that to a string. Set the attribute called 'expiry' on the product to the value of that string.

SetAttribute(CodeFormatAttribute.Expiry, expiryInstant.Ticks.ToString());

So we're saving the expiry date as an instant which is equivalent to midnight on the expiry date in the timezone Windows is currently set to.
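The effect of the storage step can be sketched as follows. This is an illustrative Python version, not the production C#. It assumes the machine is in the UK and that the expiry date falls in British Summer Time, modelled here as a fixed +1 hour offset. The names expiry_to_ticks_string and BST are hypothetical.

```python
from datetime import date, datetime, time, timedelta, timezone

TICKS_PER_SECOND = 10_000_000  # one tick is 100 nanoseconds
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
BST = timezone(timedelta(hours=1), "BST")  # assumption: zone in force on the expiry date


def expiry_to_ticks_string(expiry: date, zone: timezone) -> str:
    """Mimic the storage step: date -> local midnight -> instant -> ticks string."""
    # The time (midnight) and the zone are both invented here; the
    # original value had neither.
    local_midnight = datetime.combine(expiry, time.min, tzinfo=zone)
    elapsed = local_midnight - UNIX_EPOCH  # aware datetimes subtract as instants
    ticks = (elapsed.days * 86_400 + elapsed.seconds) * TICKS_PER_SECOND
    return str(ticks)


stored = expiry_to_ticks_string(date(2019, 4, 1), BST)
print(stored)  # 15540732000000000, i.e. 2019-03-31 23:00:00 UTC
```

The stored number no longer says "1st of April". It identifies a single point in time, and which calendar date that point falls on depends on the zone used to read it back.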

Why is the correct date printed?

So now we know that we are saving the date with some made-up time information. With the time being midnight, this sounds like a timezone issue, because in the example the customer gave the date was decremented by one day. If the date and time 2019-04-01 00:00:00 is stored and then somehow a timezone offset of minus one hour is applied, the resulting value would be 2019-03-31 23:00:00. This sounds logical, but then why would the correct date be printed on the product? We tested this system for well over six months, so we'd have seen issues from changing timezones.

The answer is that when the value is accessed, it's converted back from a string to a LocalDate using the local timezone, which reverses the original conversion correctly. The code is below:

Get the string representation of the product attribute called 'expiry'. Store it in the variable 'expiryString'

string expiryString = GetAttribute(CodeFormatAttribute.Expiry);

Execute a function that does what its name suggests. It gets a LocalDate object from a Unix time using the system timezone.

LocalDate? date = DateTimeHelper.GetLocalDateFromTicks(expiryString);

Validate and return

if (date.HasValue)
{
       return date.Value;
}
return DateTimeHelper.GetLocalDateFromLongTicks(0);

This code works fine as the reverse to the code described previously. A LocalDate is stored as a Unix time based on the local timezone and then retrieved in the same way. This is the code used for retrieving the data to send to the printer so this works without problems.
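The reason the printer path never showed the defect can be sketched like this: as long as the same zone is used in both directions, the invented midnight cancels out. This is a Python illustration with hypothetical helper names (to_ticks, from_ticks) and fixed offsets standing in for real timezones.

```python
from datetime import date, datetime, time, timedelta, timezone

TICKS_PER_SECOND = 10_000_000
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_ticks(expiry: date, zone: timezone) -> int:
    """Store: local midnight on the date, expressed as ticks since the epoch."""
    midnight = datetime.combine(expiry, time.min, tzinfo=zone)
    return int((midnight - UNIX_EPOCH).total_seconds()) * TICKS_PER_SECOND


def from_ticks(ticks: int, zone: timezone) -> date:
    """Retrieve: view the instant in the given zone and keep only the date."""
    instant = UNIX_EPOCH + timedelta(seconds=ticks // TICKS_PER_SECOND)
    return instant.astimezone(zone).date()


expiry = date(2019, 4, 1)
zones = [timezone(timedelta(hours=h)) for h in (-5, 0, 1, 9)]
# Same zone on the way in and on the way out: the date always survives.
print(all(from_ticks(to_ticks(expiry, z), z) == expiry for z in zones))  # True
```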

If the date is converted correctly in both directions, how is the wrong date sent to the upper levels?

We need to be able to recover from a user turning the machine off, a power failure, or the application crashing in a production run without losing data. In order to do this we have to commit the data to persistent storage. All the production order data such as the GTIN, NDC, lot and expiry are saved to a SQL database as the string that they are held as in the volatile memory.

For most of the duration of the project we believed that we only needed to send the unique codes on products to the upper levels endpoints. Close to shipping the machines we learned that we actually had to send a concatenation of all data printed on the product, including the expiry.

The system wasn't architected to support communication between the component which stored production data and the component which reported codes to the upper levels. Because of the short timescale, a less than optimal solution was used where we just grabbed the data from the database, since the component that sent the data to the upper levels already had a connection to that.

This is how the string representing the expiry was pulled from the database and converted into the format required for the external endpoint. This format was the same one that was used for us to receive the expiry, yyMMdd.

Find the string with the product attribute name of 'expiry'. Convert that string to an integer. Convert that integer to an instant. Convert that instant to a string in the format yyMMdd using DateTimeFormatInfo.CurrentInfo. In this case this means UTC.

NodaTime.Instant.FromTicksSinceUnixEpoch(long.Parse(activeRunCodeFormats[CodeFormatAttribute.Expiry])).ToString("yyMMdd", DateTimeFormatInfo.CurrentInfo)

The important point here is that this is done in UTC, not local time. I didn't write this code but personally I'd have expected this to use local time.

This is the issue. We were storing the expiry as a Unix time in local time, but retrieving it as though it was stored in UTC. This was then sent to the upper levels endpoint.
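The mismatch can be reproduced in a few lines. The sketch below is Python rather than the production C#, and assumes the stored value was written on a UK machine for an expiry date in British Summer Time (a fixed +1 hour offset here). The function name format_expiry is hypothetical.

```python
from datetime import datetime, timedelta, timezone

TICKS_PER_SECOND = 10_000_000
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
BST = timezone(timedelta(hours=1), "BST")

# Ticks for 2019-04-01 00:00 BST, which is 2019-03-31 23:00 UTC.
stored = "15540732000000000"


def format_expiry(ticks_string: str, zone: timezone) -> str:
    """Read the stored instant in the given zone and format it as yyMMdd."""
    seconds = int(ticks_string) // TICKS_PER_SECOND
    instant = UNIX_EPOCH + timedelta(seconds=seconds)
    return instant.astimezone(zone).strftime("%y%m%d")


print(format_expiry(stored, BST))           # 190401, as sent to the printer
print(format_expiry(stored, timezone.utc))  # 190331, as sent to the upper levels
```

Both calls read exactly the same stored value. The only difference is the zone used to interpret it.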

Why now?

This issue occurred in May 2018, over a month since the clocks had gone forward in March. Why hadn't this occurred on all products since March?

Because the current local time is irrelevant. What matters is the expiry, which is just short of a year in the future. The pharmaceutical manufacturers only work in months for expiry dates, so the previous production run had an expiry of 2019-03-01. This production run was the first that had an expiry of 2019-04-01, and when do the clocks go forward in 2019? March 31st, 2019-03-31. The date they were using as the expiry was the first ever expiry date that was in BST and not GMT. Since GMT is equivalent to UTC, the defect had no visible effect on the earlier expiry dates. It's only when using an expiry that is in BST that the issue presents.
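To see why the previous run was unaffected, the two expiry dates can be compared directly. The Python sketch below hand-codes a simplified version of the current UK rule (BST runs from the last Sunday of March to the last Sunday of October) so that it needs no timezone database. That rule and all function names are assumptions made for illustration, not part of the original system.

```python
import calendar
from datetime import date, datetime, time, timedelta, timezone


def last_sunday(year: int, month: int) -> date:
    """Return the last Sunday of the given month."""
    d = date(year, month, calendar.monthrange(year, month)[1])
    return d - timedelta(days=(d.weekday() + 1) % 7)  # weekday(): Monday=0, Sunday=6


def uk_offset_at_midnight(d: date) -> timedelta:
    """UTC offset in the UK at local midnight on d (simplified rule)."""
    # The clocks change after midnight, so the changeover Sunday in March
    # still starts in GMT and the one in October still starts in BST.
    in_bst = last_sunday(d.year, 3) < d <= last_sunday(d.year, 10)
    return timedelta(hours=1 if in_bst else 0)


def reported_date(expiry: date) -> date:
    """Date seen when local midnight is stored and then read back as UTC."""
    zone = timezone(uk_offset_at_midnight(expiry))
    local_midnight = datetime.combine(expiry, time.min, tzinfo=zone)
    return local_midnight.astimezone(timezone.utc).date()


for expiry in (date(2019, 3, 1), date(2019, 4, 1)):
    print(expiry, "->", reported_date(expiry))
# 2019-03-01 -> 2019-03-01
# 2019-04-01 -> 2019-03-31
```

The March expiry lands in GMT, where local midnight and UTC midnight coincide, so the wrong read-back happens to give the right date. The April expiry is the first to land in BST.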

Conclusion

Reasons for the issue:

Unknown data was manufactured and added to the information we had in order to turn it into a type that could be stored in a certain medium
The retrieval of this data for sending wasn't architected because the requirement was determined late in the day
The code written to take the date from the database didn't consider what timezone that date was saved in

I believe NodaTime is a great library. When we're converting the LocalDate to an Instant, it's very clear from reading the code what we're doing and what's wrong with it. The word 'Midnight' is even present, making it blindingly obvious. Sadly no one caught the issue at the time. The only ambiguity comes from DateTimeFormatInfo.CurrentInfo, which is a built-in .NET property.

The fix

Since this specific branch of code is in production it was essential to change as little as possible. As such the solution was to replace this code:

NodaTime.Instant.FromTicksSinceUnixEpoch(long.Parse(activeRunCodeFormats[CodeFormatAttribute.Expiry])).ToString("yyMMdd", DateTimeFormatInfo.CurrentInfo)

With this code:

string expiry = "";
if (activeRunCodeFormats.ContainsKey(CodeFormatAttribute.Expiry))
{
	LocalDate? v = DateTimeHelper.GetLocalDateFromTicks(activeRunCodeFormats[CodeFormatAttribute.Expiry]);
	if (v.HasValue)
	{
		LocalDate d = v.Value;
		expiry = $"{d.YearOfCentury:D2}{d.Month:D2}{d.Day:D2}";
	}
}

We're intentionally reusing the same code as we do to translate it from a string in volatile memory because that's been proven to work. This code uses NodaTime instead of the inbuilt time methods.

In the code destined for future systems we tackled the root of the problem and stopped using Unix time to save a date only value.
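The text above doesn't say what replaced the tick count. As one possible approach, assumed here purely for illustration, a date-only value can be persisted as an ISO 8601 date string. The Python sketch below uses hypothetical function names (store_expiry, load_expiry).

```python
from datetime import date


def store_expiry(expiry: date) -> str:
    """Persist only what we know: year, month and day."""
    return expiry.isoformat()  # e.g. "2019-04-01", no time and no zone


def load_expiry(stored: str) -> date:
    """Recover the date without consulting any timezone."""
    return date.fromisoformat(stored)


stored = store_expiry(date(2019, 4, 1))
print(stored)               # 2019-04-01
print(load_expiry(stored))  # 2019-04-01
```

Because no time or zone is ever attached, no reader can reinterpret the value in a different zone and arrive at a different day.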

Changes we've made since this issue

The entire system for storing attributes as strings has been removed. They're now stored as objects.
The storage and retrieval of attributes has been simplified a lot and expanded to make it more flexible.
NodaTime is used wherever possible. It's one of our core principles of code we produce. These principles are displayed on the wall and known to everyone.
As a principle, we don't add data that we don't have to values in order to make them more convenient to use.
Code reviews are now more rigorous. We have a culture of very detailed code reviews now where points raised are treated as constructive to the code base and not individual criticism.