Helix Logging unable to publish new sites
Postmortem

What happened?

From May 13th to May 19th, the Helix Publish service failed to create new log configurations for Fastly service configurations that had not been used with Helix before. This failure led Helix CLI to abort the publish attempt, making it impossible to publish new sites. If you tried and failed to launch a site on Helix last week, we apologize for making it impossible to start new sites with Helix.

How did it happen?

Helix Publish uses Google Cloud Platform IAM to create a service account and a service account key for each Fastly service configuration that is published through Helix. Google Cloud Platform has a limit of 100 service accounts per account, and deleted service accounts count against this quota for up to 30 days after deletion.
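
The effect of the quota rule can be sketched in a few lines of Python. This is an illustrative model only, not code from Helix Publish: the function names are hypothetical, the limit of 100 and the 30-day window are taken from the paragraph above, and the sketch conservatively assumes that every deleted account keeps its slot for the full 30 days.

```python
from datetime import datetime, timedelta

SERVICE_ACCOUNT_LIMIT = 100          # quota stated in this postmortem
DELETION_HOLD = timedelta(days=30)   # upper bound stated in this postmortem


def quota_usage(active_count, deleted_at, now):
    """Return how many quota slots are occupied at `now`.

    active_count: number of live service accounts
    deleted_at:   deletion timestamps of removed service accounts
    """
    # Assumption: a deleted account holds its slot for the full 30 days.
    still_counted = sum(1 for ts in deleted_at if now - ts < DELETION_HOLD)
    return active_count + still_counted


def can_create(active_count, deleted_at, now):
    """True if one more service account fits under the limit."""
    return quota_usage(active_count, deleted_at, now) < SERVICE_ACCOUNT_LIMIT
```

With made-up numbers, 40 live accounts plus 70 accounts that were created and deleted within the last two weeks occupy 110 slots, so `can_create` returns `False` even though only 40 accounts are visible. Workloads that create and delete accounts in quick succession, such as test runs, can therefore exhaust the quota on their own.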

On May 13th our integration tests started failing, indicating that service accounts could no longer be created because the quota had been exceeded. We contacted Google Cloud Platform Support to understand the issue (the quota counts deleted accounts, too) and to obtain a resolution (an increased service account quota). Once the quota was increased, the service resumed operations and the tests passed again.

What are we doing now?

1. establish automated monitoring of the Helix Publish Service

2. give more team members access to the Google Cloud Platform support account

3. use separate accounts for integration tests and production
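
The list above does not say what the automated monitoring will check. One signal that would have flagged this incident early is quota headroom, and a minimal sketch of such a check follows. The function name, the status labels, and the 80% warning threshold are assumptions chosen for illustration; only the limit of 100 comes from this postmortem.

```python
def quota_status(used, limit=100, warn_ratio=0.8):
    """Classify service account quota usage.

    used:       slots currently counted against the quota,
                including recently deleted accounts
    limit:      quota ceiling (100 per this postmortem)
    warn_ratio: hypothetical fraction of the limit that triggers a warning
    """
    if used >= limit:
        return "exhausted"
    if used >= limit * warn_ratio:
        return "warn"
    return "ok"
```

A scheduled job could call this with the current usage figure and alert on anything other than `"ok"`, leaving time to request a quota increase before publishing fails.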

Posted May 20, 2019 - 13:39 UTC

Resolved
Our quotas have been increased and the service operates normally.
Posted May 18, 2019 - 07:31 UTC
Update
The underlying issue has been confirmed by Google, and the cause is under investigation. We expect to post the next update by the end of the day tomorrow.
Posted May 14, 2019 - 07:16 UTC
Identified
The Helix Logging service, which is used when running `hlx publish` and which configures logging of HTTP requests, is not able to create log databases for new Fastly service configs. Service configurations that have been published in the past are unaffected. The underlying issue is resource exhaustion due to a configuration change in the database service, and we are working with the service provider to resolve the issue.
Posted May 13, 2019 - 17:49 UTC