Wednesday, August 17, 2005

How Not to Run a Company

From: "David Koretz" <david@bluetie.com>
Date: August 12, 2005 3:45:11 PM EDT
To: <david@bluetie.com>
Subject: Full Report on BlueTie Mail Store & Backup

BlueTie Customer,

I know it has been a challenging week, and I want to update you on
the history and current status of the email problems you have
experienced.

This email is being sent now that service has been completely
restored, and all necessary information on the occurrence was
gathered, as that was our first priority.

I would like to personally apologize for any issues this may have
caused you, and assure you that we have made significant changes and
will continue to make changes to ensure that this can never happen
again.

What happened?
On Thursday, August 4, the MCI hosting facility in Elmsford, NY lost
both primary and backup HVAC, causing the ambient temperature to rise
to 112 degrees Fahrenheit. Had we been made aware of the condition,
we would have immediately shut down our servers to prevent
overheating. Because of a failure in MCI's process, we were not
notified. Our servers therefore continued to run, and the redundant
RAID array on one Mail Store was damaged. Only customers on this one
Mail Store have been affected.

MCI is producing an RFO (Reason for Outage) document on Friday August
12 that will document the causes, immediate resolution, and long term
prevention of future occurrences of this outage. I will be glad to
share the document with you once we have it.

As a result of the damage to that RAID array, on Monday, August 8,
BlueTie experienced intermittent delays and two temporary outages
affecting only the BlueTie customers housed on that particular
mailstore, which includes your company.

The first was a result of a single drive in the RAID array failing at
approximately 11:45AM EST. After running diagnostics and recovery,
the RAID array was fixed and put back in service at 1:00PM EST and
service was restored to all users.

The second occurred at about 4:00PM EST when 3 drives in the same
RAID array failed simultaneously. Important points about this outage:

- All mail received while the mailstore was down was queued. None of
  it was lost, and it was delivered when your new mailstore came
  online.
- We worked all through the night to recreate all of your accounts on
  a new mailstore. By Tuesday morning, all users were able to log in
  and send and receive email. All accounts were functioning normally.
  This was our highest priority.

The next step was to recover your mail from the backup system and
insert it into your new account. This was completed on Wednesday,
August 10 at approximately 4:00PM EST.

Then we merged the new email (sent or received since 4:00 PM EST
Monday August 8) with the recovered mail. This was started at 8:30PM
EST on Wednesday and was completed at 2AM EST Thursday morning. In
order to preserve data integrity we had to once again queue mail,
meaning that incoming mail would not be delivered to your new account
until the rest of the process was completed.

The final steps were to update the attributes on the mails in the
accounts on the new Mail Store (attributes are used to make email
fast and searchable) and release the queue, allowing mails received
since 8:30PM EST Wednesday to flow into your account.

We expected the entire process to be completed by 9AM EST Thursday
morning.

Between 10am and 1pm, users began to see emails restored into their
accounts. At 1:10pm we released the queue, and now all queued mail
has been delivered.

Further Complications
As we worked through the recovery process, it became increasingly
apparent that there were problems with our current backup process:

- Incremental backups were not being successfully created on a
  nightly basis.
- The full weekly backup did not back up email audit data or messages
  more than 100 days old.
- The full weekly backup retained items that users had marked as
  deleted or as junk.
- The full weekly backups after July 20 did not include some of the
  most recent messages.

What we restored into your account is much less than perfect:

- Because there are no incremental backups, emails sent or received
  between the evening of August 3 and 4pm August 8 cannot be
  recovered. They simply do not exist anywhere.
- Significant numbers of emails that should have been on the backup
  were not, and therefore they are also lost. Specifically, non-POP3
  users will be missing messages more than 100 days old and may be
  missing messages sent or received after July 20.
- Large numbers of deleted and junk mails were put back into your
  mailbox.

What is in your account now is all the mail that we were able to
retrieve. There is no other resource or archive that might contain
the data.

What about email audit files?
All customers using email audit should have received their CD/DVD for
the month of May. Because June, July and August through the 8th were
on the failed mailstore and were not backed up, that data is lost.

Can anything be retrieved from the damaged RAID array?
There is less than a 1% chance that data can be recovered based on
what we know about what happened to the drives. However, we have
contracted a data recovery specialist firm to try everything possible
to get information from the drives. I will send you another update
as the work progresses.

What is BlueTie doing to prevent this problem from recurring?
- Installing temperature sensing equipment in the MCI facility and
  connecting it to our monitoring applications so we can monitor
  conditions there independently.
- Completely reworking the backup policy and process to give you full
  real-time replication across multiple boxes, in addition to full
  daily backups of your entire mailbox.
- Improving our disaster recovery process and conducting quarterly
  disaster recovery exercises to verify the process.
- Moving to geographically distributed mirror servers for all
  components in the BlueTie architecture. One set will remain in the
  MCI facility; the other will be in our data center at our corporate
  headquarters in Rochester, NY. These servers are architected to
  fail over in less than four seconds in the event of an outage.
- Low-level device monitoring has been implemented. This will report
  disk drive deterioration so preventive action can be taken before a
  drive fails.
- Adding checksum verification to the backup process to validate that
  the backup and mailbox are identical.
- Backing up audit data until it is burned onto CD or DVD for
  delivery to you.
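
For illustration only, the checksum verification mentioned above
could be sketched roughly as follows in Python. This is not BlueTie's
actual process: the one-file-per-message directory layout and the
names file_digest, tree_digests and compare_backup are all
assumptions made for the example.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_backup(mailbox: Path, backup: Path) -> dict[str, list[str]]:
    """Report how a backup tree differs from the live mailbox tree.

    Hypothetical layout: each message is one file under the root.
    """
    live = tree_digests(mailbox)
    saved = tree_digests(backup)
    return {
        # In the mailbox but absent from the backup.
        "missing": sorted(live.keys() - saved.keys()),
        # In the backup but no longer in the mailbox.
        "extra": sorted(saved.keys() - live.keys()),
        # Present in both, but with different contents.
        "changed": sorted(
            k for k in live.keys() & saved.keys() if live[k] != saved[k]
        ),
    }
```

Under these assumptions, a backup would count as verified only when
all three lists come back empty. A "missing" entry would correspond
to recent messages that never reached the backup, and an "extra"
entry to deleted or junk items the backup had retained.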

BlueTie is Issuing a Service Credit
BlueTie will be crediting you for the entire month of August. You
will receive a bill marked Paid In Full for the month, along with
another letter from me that outlines the completed improvements we
have made to give you confidence that we have permanently resolved
the problem.

We are also going to have a technical auditor come in and evaluate
our new backup process and redundancy and give us an independent
verification and appraisal to give you further confidence. I am happy
to share this verification with you once we have completed the process.

I appreciate your patience and understanding during this situation,
and am totally committed to ensuring we deliver an incredibly
reliable solution that you are thrilled with.

Best regards,

David Koretz
President & CEO
BlueTie Inc.

This message is intended solely for the individual(s) to whom it is
addressed. If you are not the intended recipient, any dissemination
or copying is strictly prohibited. If you believe you received this
message in error, please notify the sender and delete from your
system. Thank you.
