A bit more context, since you might wonder why customers can cause Sev1s.

Well, I work for a database technology company, and we provide a managed service offering. That offering has SLAs that essentially enforce a 5-minute response time for any “urgent” issue.

A common urgent issue is a customer suddenly loading in a bunch of new data without informing us, which causes the cluster to stop accepting writes.

It’s gotten to the point where most, if not all, urgent pages result in some form of cluster scaling.

Since this is customer-driven behavior, there is no real way to plan for it, and since these particular customers have special requirements (and thus less ability for us to automate scaling operations), I’m unsure whether there is any recourse here.

At this point it doesn’t even feel like an SRE team anymore; we should just be called “on-demand scaling agents”, since we’re constantly trying to scale ahead of our customers.

All in all, I’m starting to feel like this is a management/sales-level issue that I cannot possibly address. If we’re selling this managed service as essentially “magic” that can be scaled whenever customers need, then it seems like we’re being set up for failure at the organizational level. Not to mention we aren’t being smart about the costs behind scaling and factoring them into these contracts.

So, fellow SREs, have you had to have this conversation with a larger org? What works for something like this? What doesn’t? Should I just seek greener pastures at this point?

P.S. - Posted in c/Programming due to the lack of a c/SRE

  • Oliver Lowe@lemmy.sdf.org · 8 months ago

    Inform and throttle. Think about how your own computer works. If storage reaches its max capacity, you get a signal back saying “filesystem full” (or whatever), not “internal storage error”. If the CPU gets busy, it doesn’t crash; things slow down, get queued up, get prioritised (and many other complicated mechanisms I’m not across!).

    You could borrow those ideas, come up with a way to implement the behaviour in your systems, then present them to whoever could allocate the time & money.
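
    Just as a rough illustration of the “inform and throttle” idea, here’s a minimal Python sketch. The thresholds, the paths, and write_to_cluster are all made-up stand-ins for whatever your actual ingest path and headroom targets are.

    ```python
    import shutil
    import time

    # Hypothetical thresholds: tune to whatever headroom your clusters actually need.
    SOFT_LIMIT = 0.80   # above this, start slowing ingest down
    HARD_LIMIT = 0.95   # above this, refuse new writes with a clear error

    def write_to_cluster(batch):
        """Placeholder for the real ingest call."""
        return len(batch)

    def disk_utilisation(path="/var/lib/db"):
        """Fraction of the data volume currently in use."""
        usage = shutil.disk_usage(path)
        return usage.used / usage.total

    def admit_write(batch, path="/var/lib/db"):
        """Admission control in front of the real write path."""
        util = disk_utilisation(path)
        if util >= HARD_LIMIT:
            # A clear, actionable signal instead of the cluster silently
            # tipping over into refusing writes.
            raise RuntimeError(
                f"ingest rejected: storage {util:.0%} full, request a scale-up first"
            )
        if util >= SOFT_LIMIT:
            # Back-pressure: the fuller the disk, the more we slow the producer down.
            time.sleep((util - SOFT_LIMIT) * 10)
        return write_to_cluster(batch)
    ```

    The point is that the customer’s load job gets told “back off, ask for capacity first” while there is still headroom, instead of the first signal being an urgent page on your side.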

    Another approach is to try to get a small, resource-constrained version of the system running and hammer it by loading heaps of data like those customers do. How does it behave? What are the fatal errors, and what can we deal with later?
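
    If it helps, the harness for that can be tiny. A sketch in Python, where client.insert_many is a stand-in for whatever bulk-insert call your real driver exposes and the row shape is arbitrary:

    ```python
    import time

    def hammer(client, batch_size=10_000, max_batches=1_000):
        """Keep loading synthetic rows into the test instance until it pushes
        back, then report what the failure actually looked like."""
        rows_loaded = 0
        for i in range(max_batches):
            batch = [{"id": rows_loaded + n, "payload": "x" * 256}
                     for n in range(batch_size)]
            start = time.monotonic()
            try:
                client.insert_many(batch)   # stand-in for the real driver call
            except Exception as exc:
                print(f"pushed back after {rows_loaded} rows: "
                      f"{type(exc).__name__}: {exc}")
                break
            rows_loaded += batch_size
            print(f"batch {i}: {batch_size} rows in {time.monotonic() - start:.2f}s")
    ```

    Watching where latency climbs and which error finally comes back tells you how much warning you realistically get before the cluster stops accepting writes.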

    • th3raid0r (OP) · 8 months ago

      That is exactly what we do. The problem is that, as a managed service offering, it is on us to scale in response to these alerts.

      I think people are misunderstanding my original post. When I say a customer cluster will go into stop-writes, that does not mean it is not functional. It is an entirely intended function of the database, there so that no important data is lost or overwritten.

      The problem is more organizational: we have a 5-minute SLA to respond to these types of events, and they can happen on any random customer impulse.

      I don’t have a problem with customers that can correctly project their load and let us know in advance. Those are my favorite customers. But they’re not most of our customers.

      As for automation: as I exhaustively detailed in another response, we do have another product that handles this a lot better, and it’s the one we are mass-marketing far more heavily. The one where I’m feeling all the pain is our enterprise-level managed service offering, which goes to customers with “special requirements” that usually mean they will never get automation as robust as the other product line’s.