However, a failsafe designed to handle just such a problem backfired. When the Logfwdr configuration was unavailable, the failsafe would forward logs for all customers, not only those with a configured Logpush job. In this case, the five-minute glitch caused a massive spike in the number of logs to be sent, overloading the buffering system, Buftee, and making it unresponsive.
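To make the failure mode concrete, here is a minimal Python sketch of "fail open" behavior. It is not Cloudflare's code: the function `select_recipients`, its parameters, and the use of `None` to represent an unavailable configuration are all assumptions made for illustration.

```python
from typing import Optional, Set


def select_recipients(all_customers: Set[str],
                      configured: Optional[Set[str]]) -> Set[str]:
    """Return the customers whose logs should be forwarded.

    `configured` is the set of customers with a log job, or None when the
    configuration cannot be loaded (a hypothetical representation).
    """
    if configured is None:
        # Fail open: with no configuration available, forward for everyone.
        return set(all_customers)
    # Normal path: forward only for customers that have a job configured.
    return all_customers & configured


all_customers = {f"customer-{i}" for i in range(1000)}
configured = {"customer-1", "customer-2", "customer-3"}

normal = select_recipients(all_customers, configured)
failed_open = select_recipients(all_customers, None)
print(len(normal), len(failed_open))
```

With these made-up numbers, the fail-open path selects every customer instead of three, which is the kind of sudden multiplication in volume the article describes.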
Buftee provides buffers for each Logpush job, containing 100% of the logs generated by the zone or account referenced by that job, so the failure to process one customer’s job will not affect progress on others. It contained safeguards against being overwhelmed by a massive increase in the number of buffers — but those safeguards had not been configured, Cloudflare said.
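The article does not say how Buftee's safeguards work. As one possible illustration, the Python sketch below caps the number of per-job buffers and rejects writes that would create a new buffer beyond that cap. The class `BufferPool`, the `max_buffers` parameter, and the rejection policy are hypothetical and are not taken from Buftee.

```python
from collections import deque
from typing import Deque, Dict


class BufferPool:
    """Hypothetical pool of per-job buffers with a cap on how many may exist."""

    def __init__(self, max_buffers: int) -> None:
        self.max_buffers = max_buffers
        self.buffers: Dict[str, Deque[str]] = {}
        self.rejected = 0

    def append(self, job_id: str, line: str) -> bool:
        """Append a log line to the job's buffer; return False if rejected."""
        buf = self.buffers.get(job_id)
        if buf is None:
            if len(self.buffers) >= self.max_buffers:
                # Safeguard: refuse to create another buffer rather than
                # let a sudden surge of new jobs exhaust the system.
                self.rejected += 1
                return False
            buf = self.buffers[job_id] = deque()
        buf.append(line)
        return True


pool = BufferPool(max_buffers=3)
for i in range(10):
    pool.append(f"job-{i}", "log line")
print(len(pool.buffers), pool.rejected)
```

In this sketch, existing jobs keep working after the cap is reached, while new ones are turned away. If the cap is never set, or set too high, nothing stops the number of buffers from growing with the surge.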
“A short, temporary misconfiguration lasting just five minutes created a massive overload that took us several hours to fix and recover from,” the blog stated. “Because our backstops were not properly configured, the underlying systems became so overloaded that we could not interact with them normally. A full reset and restart was required.”