Failure Injection on Kubernetes with SMI and Linkerd

lemoncucumber 5 years ago

It's worth noting that Istio has built-in support for failure injection (i.e. without needing to run a separate service to return 500s): https://istio.io/docs/tasks/traffic-management/fault-injecti...

As far as I know Linkerd does not (yet) have such a feature though, so this post seems like a reasonable alternative.

adlleong 5 years ago

That's right, and there are some interesting trade-offs. Having failure injection built-in is convenient but running a separate service gives you full control over the error responses. This can be useful if you want to simulate responses with error bodies, for example.
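To illustrate the "full control" point, here is a minimal sketch of a stand-in service that answers every request with a 500 and a JSON error body. It is not the service from the blog post; the payload shape, the `AlwaysFail` and `make_server` names, and the port are all assumptions made up for the example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical error payload; shape it like the errors your real service returns.
ERROR_BODY = {"error": "internal", "message": "injected failure"}


class AlwaysFail(BaseHTTPRequestHandler):
    """Answers every request with HTTP 500 and a JSON error body."""

    def _fail(self):
        payload = json.dumps(ERROR_BODY).encode("utf-8")
        self.send_response(500)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    # Fail the common HTTP methods the same way.
    do_GET = do_POST = do_PUT = do_DELETE = _fail

    def log_message(self, format, *args):
        # Silence the default per-request logging.
        pass


def make_server(host="0.0.0.0", port=8080):
    """Build the failing server (host and port are assumed defaults)."""
    return HTTPServer((host, port), AlwaysFail)


# To run it standalone: make_server().serve_forever()
```

Pointing a share of traffic at something like this is what lets you choose the status code and body yourself, rather than taking whatever a built-in fault injector returns.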

adlleong 5 years ago

I'm the author of this blog post and I'm more than happy to answer any questions people have!

jrockway 5 years ago

So... if I were going to inject failures into my service mesh, it would be my service mesh that I'd be counting on to do the retry after the failure. Does it even make sense to do it in that case?

ihcsim 5 years ago

It will be helpful for testing your retries, and getting insights into your client's behaviour in the event of service failures.

samstave 5 years ago

You know what would be an interesting service:

Chaos Monkey / failure-injection-as-a-service, where you define the parameters by which you want to be assessed...

Kind of like pen test contractors...

So OK, let me spin up an environment and attack the fuck out of it. Show me where I'm weak, so that in prod... I'm good.

0vermorrow 5 years ago

You mean like https://www.gremlin.com/ ?
One of the founders of Gremlin is an engineer who worked at Netflix and probably worked on Chaos Monkey as well :)
- adlleong 5 years ago
  
  If I understand correctly, one of the limitations of doing application-level failure injection with Gremlin is that you need to integrate it into your code: https://www.gremlin.com/docs/application-layer/installation/
  It might be interesting to combine these approaches and use a traffic split to send a percentage of traffic to Gremlin instead of integrating into the code directly.
- barbecue_sauce 5 years ago
  
  Shh, you're going to wake up the TinkerPop Gremlin guy.
jrockway 5 years ago

These kinds of errors are going to happen in production, whether you inject them or let them occur naturally. Any release process that doesn't go perfectly with a drain / rebalance / start new version / rebalance per (backend, proxy) combo is going to have a timeout or broken connection between the proxy and the backend as it restarts. Should you return 502 to your users when that happens? Nope, just retry on a different backend. This lets you test that.
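As a rough sketch of "just retry on a different backend" (in practice a proxy or mesh does this for you, so this is only to make the logic concrete): the function name, the set of retryable statuses, and the `send` callback below are all assumptions, and it is only safe for requests that are idempotent.

```python
# Statuses treated as "the backend was unavailable" (an assumed policy).
RETRYABLE_STATUSES = {502, 503, 504}


def call_with_failover(backends, send):
    """Try each backend in turn and return the first non-retryable response.

    `send` is a hypothetical callback taking a backend and returning
    (status, body); it may raise ConnectionError or TimeoutError.
    """
    if not backends:
        raise ValueError("no backends to try")
    last_failure = None
    for backend in backends:
        try:
            status, body = send(backend)
        except (ConnectionError, TimeoutError) as exc:
            # Broken connection or timeout, e.g. the backend is restarting.
            last_failure = exc
            continue
        if status in RETRYABLE_STATUSES:
            last_failure = (status, body)
            continue
        return status, body
    # Every backend failed: surface the last failure to the caller.
    if isinstance(last_failure, Exception):
        raise last_failure
    return last_failure
```

Injecting failures into one backend is a way to check that this path actually runs, and that the user only sees a 502 once every backend has failed.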