Page Delivery Issues Observed
Postmortem

What happened?

Between 02:40 and 06:40 PM UTC, the backend system serving our customer sites experienced a higher-than-usual error rate. As a result, delivery times for uncached content increased. Because the vast majority of content was cached, the availability of customer sites was only marginally impacted.

How did it happen?

Around 02:40 PM UTC, a large project went live, which led to a ~5x peak in load on our backend system. A cache issue on the new project made the situation worse. Together, these overloaded the version-picker service (causing out-of-memory (OOM) errors), which in turn caused persistently high error rates and slow delivery of uncached content.

What are we doing now?

  1. The cache issue was addressed in the project's CDN config, which immediately brought the error rate and delivery times back to normal levels.
  2. We increased the memory of the version-picker service.
  3. We increased the alerting sensitivity in our monitoring so that overload of backend services is detected earlier and more reliably.
Posted Apr 08, 2021 - 13:27 UTC

Resolved
This incident has been resolved.
Posted Apr 07, 2021 - 16:13 UTC
Investigating
We are observing issues affecting page delivery for Project Helix customers. The issue is under active investigation, and we are working to resolve it as quickly as possible.
Posted Apr 07, 2021 - 13:53 UTC