A bit more context, since you might wonder why customers can cause Sev1s.

Well, I work for a database technology company, and we provide a managed service offering. That offering has SLAs that essentially enforce a 5-minute response time for any “urgent” issue.

A common urgent issue is a customer suddenly loading in a bunch of new data without informing us, which causes the cluster to stop accepting writes.

It’s gotten to the point where most, if not all, urgent pages result in some form of cluster scaling.

Since this is customer-driven behavior, there is no real way to plan for it, and since these particular customers have special requirements (and thus less ability for us to automate scaling operations), I’m unsure whether there is any recourse here.

At this point it doesn’t even feel like an SRE team anymore; we should just be called “on-demand scaling agents”, since we’re constantly trying to scale ahead of our customers.

All in all, I’m starting to feel like this is a management/sales-level issue that I cannot possibly address. If we’re selling this managed service as essentially “magic” that can be scaled whenever customers need, then it seems like we’re being set up for failure at the organizational level. Not to mention we aren’t being smart about the costs behind scaling and factoring them into these contracts.

So, fellow SREs, have you had to have this conversation with a larger org? What works for something like this? What doesn’t? Should I just seek greener pastures at this point?

P.S. - Posted in c/Programming due to the lack of a c/SRE

  • Oliver Lowe@lemmy.sdf.org · 8 months ago

    Inform and throttle. Think about how your own computer works. If storage reaches its max capacity, you get a signal back saying “filesystem full” (or whatever), not “internal storage error”. If the CPU gets busy, it doesn’t crash; things slow down, get queued up, get prioritised (and many other complicated mechanisms I’m not across!).

    You could borrow those ideas, come up with a way to implement the behaviour in your systems, then present them to whoever could allocate the time & money.
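
    Just as a rough illustration of the “inform and throttle” idea, here’s a minimal Python sketch. The thresholds, the paths, and write_to_cluster are all made-up stand-ins for whatever your actual ingest path and headroom targets are.

    ```python
    import shutil
    import time

    # Hypothetical thresholds: tune to whatever headroom your clusters actually need.
    SOFT_LIMIT = 0.80   # above this, start slowing ingest down
    HARD_LIMIT = 0.95   # above this, refuse new writes with a clear error

    def write_to_cluster(batch):
        """Placeholder for the real ingest call."""
        return len(batch)

    def disk_utilisation(path="/var/lib/db"):
        """Fraction of the data volume currently in use."""
        usage = shutil.disk_usage(path)
        return usage.used / usage.total

    def admit_write(batch, path="/var/lib/db"):
        """Admission control in front of the real write path."""
        util = disk_utilisation(path)
        if util >= HARD_LIMIT:
            # A clear, actionable signal instead of the cluster silently
            # tipping over into refusing writes.
            raise RuntimeError(
                f"ingest rejected: storage {util:.0%} full, request a scale-up first"
            )
        if util >= SOFT_LIMIT:
            # Back-pressure: the fuller the disk, the more we slow the producer down.
            time.sleep((util - SOFT_LIMIT) * 10)
        return write_to_cluster(batch)
    ```

    The point is that the customer’s load job gets told “back off, ask for capacity first” while there is still headroom, instead of the first signal being an urgent page on your side.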

    Another approach is to try to get a small, resource-constrained version of the system running and hammer it by loading heaps of data like those customers do. How does it behave? What are the fatal errors, and what can we deal with later?
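
    If it helps, the harness for that can be tiny. A sketch in Python, where client.insert_many is a stand-in for whatever bulk-insert call your real driver exposes and the row shape is arbitrary:

    ```python
    import time

    def hammer(client, batch_size=10_000, max_batches=1_000):
        """Keep loading synthetic rows into the test instance until it pushes
        back, then report what the failure actually looked like."""
        rows_loaded = 0
        for i in range(max_batches):
            batch = [{"id": rows_loaded + n, "payload": "x" * 256}
                     for n in range(batch_size)]
            start = time.monotonic()
            try:
                client.insert_many(batch)   # stand-in for the real driver call
            except Exception as exc:
                print(f"pushed back after {rows_loaded} rows: "
                      f"{type(exc).__name__}: {exc}")
                break
            rows_loaded += batch_size
            print(f"batch {i}: {batch_size} rows in {time.monotonic() - start:.2f}s")
    ```

    Watching where latency climbs and which error finally comes back tells you how much warning you realistically get before the cluster stops accepting writes.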

    • th3raid0r (OP) · 8 months ago

      That is exactly what we do. The problem is that, as a managed service offering, it is on us to scale in response to these alerts.

      I think people are misunderstanding my original post. When I say a customer cluster will go into stop-writes, that does not mean it is not functional. It is an entirely intended function of the database, there so that no important data is lost or overwritten.

      The problem is more organizational: we have a 5-minute SLA to respond to these types of events, and they can happen on any random customer impulse.

      I don’t have a problem with customers that can correctly project their load and let us know in advance. Those are my favorite customers. But they’re not most of our customers.

      As for automation: as I exhaustively detailed in another response, we do have another product that handles this a lot better, and it’s the one we are mass-marketing far more heavily. The one where I’m feeling all the pain is our enterprise-level managed service offering, which goes to customers with “special requirements” that usually mean they will never get automation as robust as the other product line’s.