May 2018 Day One Outage Postmortem

Between May 7-10, 2018, Day One experienced a significant sync outage. We know our users expect a robust sync service, and this event was not up to our standards. We want to inform our users what happened and explain what we will do to address it in the future.

Brief summary

A hardware failure on May 7 caused an initial sync outage. In the process of restoring sync service on May 8, a small number of new accounts were accidentally assigned the same account ID as existing accounts due to the use of an incomplete backup. This led to these new accounts gaining access to journals created by the original owner of that account ID. We disabled sync again as soon as we became aware of this. This issue affected 106 users, less than 0.01% of our accounts.

Sync service has now been restored correctly and further journal sharing will not happen. An update to our apps will be released shortly which will automatically remove the content that was accidentally shared from unauthorized devices. We will also be making further adjustments to our systems to prevent a similar issue from occurring even in the case of an incomplete backup, and will evaluate larger changes as well.

Journals using end-to-end encryption were not subject to any accidental disclosure.

We know that any accidental disclosure of any user’s personal data is a serious breach of trust. We believe that being honest with our users is the best policy in this unfortunate situation, and are doing our best to make things right. We will be granting free lifetime Premium memberships to the 106 users whose journals were temporarily disclosed to another user, and we will do our best to contact them individually. We’ll do everything we can to earn your trust as we move forward with our future plans for Day One.

Outage details

On Monday, May 7, Day One employees received notice that there was a hardware issue with one of our database servers. We began a process to remove that server from our database cluster and rebalance the load to the remaining servers.

The rebalancing operation failed. This began our initial sync outage. In the interest of restoring sync service quickly, we decided to build a new database cluster and restore it from a recent backup. A new cluster was provisioned and restored from the backup.

Early on Tuesday, May 8, the new database cluster had been restored from the backup and was ready to go. We enabled the sync servers again, and initially things seemed to be going well. However, within a few hours we received reports that some users were seeing content belonging to another user. This was unacceptable, and we immediately turned off sync again to prevent further unintentional sharing.

At this point, a message was put out on social media that Day One Sync would be unavailable indefinitely while we investigated the issue and considered our options. We absolutely could not make sync available again until we fully understood what was happening and could guarantee that it would not happen again.

On the morning of Wednesday, May 9, we determined the root cause of the issue. The backup we had used in the restore was incomplete—it contained all the journal data, but was missing some user accounts. Specifically, it was missing all accounts created after March 22. One result of this missing data was that accounts created after that date were unable to log in. Another result was a limited amount of unintentional data sharing.

Each journal record in the database has an “accountID” field, which determines which account has access to that journal. Since all journal data was successfully restored but some user accounts were not, there were journals in the database owned by accounts that no longer existed in that database. For example, “My Travel Journal” might be owned by account 123456, even though that account no longer existed. New user accounts are created with sequential IDs. Since the restored cluster did not contain the newest account IDs, new accounts created on May 8 received lower IDs than expected, and those IDs overlapped with existing accounts in the original database. As a result, those new accounts had IDs matching some of the existing journal records, and received access to a few existing journals.
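The collision described above can be illustrated with a minimal sketch. Everything in it is hypothetical: the `Journal` class, the `next_account_id` allocator, and the tiny ID values are stand-ins chosen for readability, not Day One's server code. It assumes only what the paragraph states, namely that new IDs are sequential and that access is decided by matching the journal's account ID.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Journal:
    name: str
    account_id: int  # stands in for the "accountID" field on each journal record


def next_account_id(accounts: Set[int]) -> int:
    # Hypothetical sequential allocator: one past the highest ID the database knows.
    return max(accounts, default=0) + 1


def visible_journals(account_id: int, journals: List[Journal]) -> List[str]:
    # An account sees every journal whose account_id matches its own ID.
    return [j.name for j in journals if j.account_id == account_id]


# The original cluster had accounts 1-5. Account 5 owns a journal; account 4 owns none.
journals = [Journal("Work Notes", 2), Journal("My Travel Journal", 5)]

# The incomplete backup restored every journal but dropped accounts 4 and 5.
restored_accounts = {1, 2, 3}

first_new_id = next_account_id(restored_accounts)   # reuses ID 4
restored_accounts.add(first_new_id)
second_new_id = next_account_id(restored_accounts)  # reuses ID 5
restored_accounts.add(second_new_id)

print(visible_journals(first_new_id, journals))   # []
print(visible_journals(second_new_id, journals))  # ['My Travel Journal']
```

Both new accounts receive a reused ID, but only the second one lands on an ID that still has a journal attached, which is why not every reused ID resulted in a disclosure.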

There were 326 accounts created with an incorrect account ID during the brief May 8 availability window. Of those 326, only 106 had existing journals on the server created by another account. This means that 106 accounts created on March 22-23, 2018, had data that could have been viewed by another account. Their ID numbers fell between 1104506 and 1104831, a range that spans exactly 326 IDs.
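The figures above can be checked with a few lines of arithmetic. The ID bounds and the count of 106 come from this post. The last calculation rests on an assumption: it uses the highest reused ID as a rough lower bound on the total number of accounts, which holds only if IDs started near 1 and were assigned without large gaps.

```python
FIRST_ID = 1104506
LAST_ID = 1104831
AFFECTED = 106

# Both endpoints are included, hence the + 1.
reused_ids = LAST_ID - FIRST_ID + 1

# Share of the reused IDs that still had journals attached.
share_with_journals = AFFECTED / reused_ids

# Assumption: the total number of accounts is at least LAST_ID,
# so this percentage is an upper bound on the affected share.
upper_bound_pct = AFFECTED / LAST_ID * 100

print(reused_ids)                           # 326
print(round(share_with_journals * 100, 1))  # 32.5
print(upper_bound_pct < 0.01)               # True
```

Under that assumption, the result agrees with the earlier statement that the issue affected less than 0.01% of accounts.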

We do not currently have information on how many of those journals used end-to-end encryption, but any such journals would have been protected against disclosure.

After our investigation on Wednesday, we determined that the best course of action was to investigate the cause of the rebalancing failures on the original database cluster and restore it to working order. We discovered a few configuration errors that caused steadily increasing load on the database over the course of the rebalance, which eventually caused it to fail.

On Wednesday evening, we corrected the configuration issues and successfully rebalanced the original cluster. We made the decision to delay restoring sync service until Thursday morning, when our engineering and support staff would be available to address any concerns. On Thursday, May 10, at 8 AM MDT, sync was again activated.

There is a large backlog of outstanding requests to the sync servers, so sync performance may be slightly degraded while the servers catch up and we address any lingering concerns.

What will we do now?

We will shortly release an update to our apps (version 2.6.4) that will automatically remove the shared journals from any accounts that should not have access. After the update is installed, any journals containing the disclosed content will be removed from the unauthorized user’s device. Affected users will have 30 days after installing the update to sign in to their Day One account, which will allow the app to verify that they are the original owner of that content and restore access to the journal. The app will notify users if they are affected by this change.

In order to prevent a similar issue from happening again, we will be implementing the following server changes shortly:

  1. When creating a new account ID, we will verify that no journals exist referencing that account ID.

  2. New account IDs will be created with a random two-digit number appended to the primary incrementing ID. This means that even if we were to accidentally start creating account IDs at too low a number in the future, the chance of any account ID collision would be very small.

  3. We will fix the issue that caused some user accounts to be excluded from the backup.
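Changes 1 and 2 can be sketched together as follows. This is an illustration under stated assumptions, not the actual server implementation: the function names are hypothetical, the two-digit suffix is assumed to range from 00 to 99 and to be appended as decimal digits, and the set of account IDs referenced by journals stands in for what would be a database lookup.

```python
import random
from typing import Optional, Set, Tuple


def make_account_id(counter: int, rng: Optional[random.Random] = None) -> int:
    # Change 2: append a random two-digit number (assumed 00-99) to the counter.
    if rng is None:
        rng = random.Random()
    return counter * 100 + rng.randrange(100)


def allocate_account_id(
    counter: int,
    journal_owner_ids: Set[int],
    rng: Optional[random.Random] = None,
) -> Tuple[int, int]:
    # Change 1: reject any candidate ID that existing journals already reference.
    # Returns the new account ID and the next counter value to use.
    while True:
        candidate = make_account_id(counter, rng)
        counter += 1
        if candidate not in journal_owner_ids:
            return candidate, counter


# Worst case for counter 7: journals reference every ID from 700 to 799,
# so the allocator must skip ahead to counter 8.
orphaned_owner_ids = set(range(700, 800))
new_id, next_counter = allocate_account_id(7, orphaned_owner_ids)

print(new_id // 100)  # 8
print(next_counter)   # 9
```

In this sketch the random suffix only lowers the odds that a reused counter value matches an existing ID, while the journal check rejects any ID that journals already reference.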

We sincerely apologize to the 106 users whose data was inadvertently exposed to another account. We will be providing these users lifetime Premium memberships to Day One, and will attempt to contact them directly to address any additional concerns. There was no large-scale breach of user data. No unauthorized party gained access to our database or servers, and we remain confident in the security of the data found on our servers. Nevertheless, for these 106 users, we recognize that this is a serious breach of trust. We recognize that trust is earned, not given, and we hope to have the opportunity to earn your trust in the future. We encourage users to enable end-to-end encryption on their journals. We released this feature in June 2017, and it protects your data even in the event of other failures.

We appreciate the patience of our users as we have dealt with this unexpected situation. We know many of you rely on Day One to record your most important memories. We commit to do better in the future. We’ll do everything we can to rebuild your trust.

–Paul
