A CRE walked onto Twitter and asked a load of questions!

Adrian's tweet thread started with this:

He was 100% spot on with his statement, but he didn't leave it at that! He then continued to outline a list of things that any ecommerce customer should consider to help ensure they are ready for shoppers to buy things, rather than being greeted by a virtual "not open" sign at arguably the busiest time of year for online retailers.

First thing: how is your biz going to break? This is the pre-mortem analysis. Think of the possibilities which will break serving: traffic surge, bad binary/config/data push, platform partial outage, query of death… How do you detect them, what’s your reaction?

I'm going to break the response to this particular tweet into three parts, starting with "how is your biz going to break?"

The way to start answering this is with a pre-mortem and some experimentation: chaos engineering is the practice of deliberately injecting failures, beginning in a non-production environment, to learn how your systems actually break. As you get more comfortable with implementing chaos engineering practices you can then start applying those processes to your production environment.
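
As a small, concrete illustration of the kind of experiment involved, a chaos exercise can be as simple as stopping a randomly chosen instance in a test project and watching how the rest of the stack copes. The sketch below is exactly that, a sketch: the project, zone and naming filter are hypothetical, and in a real set-up you would use a proper framework with guard rails rather than a bare script.

```python
import json
import random
import subprocess

# Hypothetical test project and zone -- never point this at production
# until you are comfortable with the blast radius.
PROJECT = "my-test-project"
ZONE = "europe-west1-b"

def list_instances():
    """Return the instances currently serving in the test environment."""
    out = subprocess.check_output([
        "gcloud", "compute", "instances", "list",
        "--project", PROJECT, "--zones", ZONE,
        "--filter", "name~^web-",   # hypothetical naming convention
        "--format", "json",
    ])
    return json.loads(out)

def stop_random_instance():
    """Inject a failure: stop one instance and let the group recover."""
    victim = random.choice(list_instances())["name"]
    print(f"Chaos experiment: stopping {victim} in {ZONE}")
    subprocess.check_call([
        "gcloud", "compute", "instances", "stop", victim,
        "--project", PROJECT, "--zone", ZONE,
    ])

if __name__ == "__main__":
    stop_random_instance()
    # Now watch your dashboards: did the load balancer drain traffic, did the
    # managed instance group recreate the VM, did any alerts fire?
```

Run it, observe, and write down what you expected versus what actually happened; that gap is your pre-mortem output.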

Looking at the next part of the tweet: "Think of the possibilities which will break serving: traffic surge, bad binary/config/data push, platform partial outage, query of death."

By introducing failures you'll learn what will break serving, and adapting to mitigate that becomes an ongoing process. Having a scalable, resilient architecture is fundamental (see the links above). Decoupled microservice architectures are better able to cope with partial failure. You also need to determine what happens when you hit your scaling maximums: are the resulting error codes acceptable, or will you implement break-glass procedures to scale beyond your maximum?

And finally: "How do you detect them, what's your reaction?" Detection comes from the monitoring, probes and alerting discussed below; your reaction is defined by the incident response process you put in place, which the responses to the next few tweets cover.

What can break you that’s out of your control? Platform, 3P dependencies, payment processing etc. How do you detect problems with them: what’s their SLA with you (assume they perform exactly at that level), how do you escalate concerns to them and what happens when you do?

Put in place probes/monitors to provide an indication of whether the SLAs defined by your partners are being met. This can range from running regular curl requests against your partners' endpoints to replicating full user journeys at regular intervals.
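
Here's a minimal sketch of such a probe, assuming a hypothetical partner endpoint and an SLA of successful responses in under two seconds. In practice you would run it on a schedule (cron, Cloud Scheduler) and feed the results into your monitoring rather than just printing them:

```python
import time
import requests

# Hypothetical partner endpoint and SLA thresholds -- replace with the
# values actually agreed in your contract.
PARTNER_URL = "https://api.example-partner.com/health"
LATENCY_SLO_SECONDS = 2.0
TIMEOUT_SECONDS = 5.0

def probe():
    """Issue one probe request and report whether it met the agreed SLA."""
    start = time.monotonic()
    try:
        resp = requests.get(PARTNER_URL, timeout=TIMEOUT_SECONDS)
        latency = time.monotonic() - start
        ok = resp.status_code == 200 and latency <= LATENCY_SLO_SECONDS
        print(f"status={resp.status_code} latency={latency:.2f}s ok={ok}")
        return ok
    except requests.RequestException as exc:
        print(f"probe failed: {exc}")
        return False

if __name__ == "__main__":
    if not probe():
        # In a real set-up this would raise an alert or page on-call, and
        # repeated failures would trigger your partner escalation process.
        print("Partner SLA probe failed -- check escalation runbook")
```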

You should ensure you have a reliable support escalation process. Test it regularly to make sure the people who need to be able to log support calls are actually authorised to do so.

Create an incident response plan that clearly defines how to contact support, who is authorised to do so, and how to reach the personnel who are recorded as authorised to communicate with GCP's support team.

Keep track of incidents using a bug tracking or helpdesk system. Carry out regular reviews: the insights gained from analysing your problems help you iterate on and improve both your processes and the reliability of your applications. Create your SRE/on-call/support process, and create standardised postmortem forms.

Use the example checklist below to help you formulate your support escalation process.

Example Support process checklist

This list is designed for setting up a support escalation process with GCP, but replace GCP with your third-party supplier and it is pretty much the same process (or should be!).

How do you detect problems seen by your customers? Do you have support forums/phone lines/bug queues? How do your support agents spot multiple customers with same problem and escalate to ops/SRE oncall?

I'm going to split the response to this one into two parts.

Firstly: "How do you detect problems seen by your customers?"

Secondly: "Do you have support forums/phone lines/bug queues? How do your support agents spot multiple customers with same problem and escalate to ops/SRE oncall?"

I'm going to assume that you have a decent internal support set-up for your systems, with appropriate escalation. If you've read the response to the previous tweet, you'll know you need an incident response plan, and thus an internal support chain and execution path.

With your automated functional and UI testing, plus the occasional manual test, you should hopefully be able to spot problems before your customers do and address them proactively.
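
On the question of how agents spot multiple customers hitting the same problem, one simple approach is to group recent support tickets by symptom and escalate to the ops/SRE on-call when a cluster appears. A rough sketch, using hypothetical ticket data in place of whatever your helpdesk system actually exports:

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical tickets as exported from your helpdesk system.
TICKETS = [
    {"opened": datetime(2018, 11, 23, 10, 1), "symptom": "checkout 502"},
    {"opened": datetime(2018, 11, 23, 10, 3), "symptom": "checkout 502"},
    {"opened": datetime(2018, 11, 23, 10, 4), "symptom": "slow search"},
    {"opened": datetime(2018, 11, 23, 10, 6), "symptom": "checkout 502"},
]

WINDOW = timedelta(minutes=15)   # how far back to look
THRESHOLD = 3                    # same-symptom tickets before escalating

def symptoms_to_escalate(tickets, now):
    """Return symptoms reported by several customers in the recent window."""
    recent = [t["symptom"] for t in tickets if now - t["opened"] <= WINDOW]
    counts = Counter(recent)
    return [symptom for symptom, n in counts.items() if n >= THRESHOLD]

if __name__ == "__main__":
    now = datetime(2018, 11, 23, 10, 10)
    for symptom in symptoms_to_escalate(TICKETS, now):
        # In practice this would page the ops/SRE on-call rather than print.
        print(f"Escalate: '{symptom}' reported by {THRESHOLD}+ customers")
```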

What’s your criteria for allowing binary/config/data pushes during sensitive period? What’s the lockdown window, who approves exceptional pushes, what’s notification channel for those pushes?

Implement a freeze, then test, test and test again the release that will be running your production workload on Black Friday! Development can carry on in a copy of your development environment; test that hard too, as it may need to be pressed into service if that bug you thought could wait actually needs to be fixed and deployed. With GCP, keeping copies of your environments to validate changes is straightforward. As discussed earlier, rolling updates and blue/green deployments can all be employed.

Communication channels are important: the internal support process you have set up should be used to communicate the lockdown window, who can approve exceptional pushes, and where those pushes will be announced.
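
To make the lockdown window enforceable rather than just a calendar entry, a pre-deploy gate in your CI/CD pipeline can refuse pushes during the freeze unless an approved exception is supplied. This is only a sketch: the dates, the approvers list and the way an approval is passed in are hypothetical stand-ins for whatever change-management tooling you actually use.

```python
import sys
from datetime import date

# Hypothetical lockdown window around the event.
FREEZE_START = date(2018, 11, 19)
FREEZE_END = date(2018, 11, 27)

# Hypothetical list of people allowed to approve exceptional pushes.
APPROVERS = {"cto@example.com", "head-of-ops@example.com"}

def push_allowed(today, approved_by=None):
    """Allow pushes outside the freeze; inside it, only with a valid approver."""
    in_freeze = FREEZE_START <= today <= FREEZE_END
    if not in_freeze:
        return True
    return approved_by in APPROVERS

if __name__ == "__main__":
    # e.g. called from the pipeline as: python deploy_gate.py cto@example.com
    approver = sys.argv[1] if len(sys.argv) > 1 else None
    if not push_allowed(date.today(), approver):
        print("Deploy blocked: lockdown window in effect and no valid approver")
        sys.exit(1)
    print("Deploy allowed")
```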

How do you “turn off the faucet”? What advertising or partner drives traffic to your site? How do you deflect excessive traffic coming through a specific url(pattern)? What is your client retry behavior if your site is returning [345]00s?
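
One part of that tweet worth sketching out is the client retry behaviour: if your site starts returning 429s or 5xxs under load, clients that retry immediately just pile more traffic on. The usual pattern is exponential backoff with jitter, honouring Retry-After where present, and not retrying ordinary 4xx errors at all. A rough illustration with a hypothetical endpoint and made-up limits:

```python
import random
import time
import requests

URL = "https://shop.example.com/api/cart"   # hypothetical endpoint
MAX_ATTEMPTS = 5
BASE_DELAY = 0.5   # seconds

def get_with_backoff():
    """Retry only retryable statuses, backing off exponentially with jitter."""
    for attempt in range(MAX_ATTEMPTS):
        resp = requests.get(URL, timeout=5)
        if resp.status_code < 400:
            return resp
        if resp.status_code in (429, 500, 502, 503, 504):
            # Honour Retry-After if the server sent one (assumes the numeric
            # form; the date form would need parsing), otherwise back off.
            retry_after = resp.headers.get("Retry-After")
            delay = float(retry_after) if retry_after else (
                BASE_DELAY * (2 ** attempt) + random.uniform(0, 0.5))
            time.sleep(delay)
            continue
        # Other 4xx errors are not going to get better by retrying.
        resp.raise_for_status()
    raise RuntimeError(f"giving up on {URL} after {MAX_ATTEMPTS} attempts")
```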

How have you practiced in advance of the event? Have you run table-top/Wheel-of-misfortune exercises with your oncall ops, dev and support practicing comms and coordination? Have you involved your platform/3P dependents?

So, if you've been following along, by now you should have a resilient architecture, the ability to mitigate excessive traffic, and alerts and a bug tracking system set up. You should also have a well-defined incident response process and escalation path, which includes communicating with your partners (GCP among them). Well before Black Friday (or whatever the event is), conduct a fire drill exercise. Talk with your partners and explain that you will be conducting a fire drill; maybe you'll decide not to give them an exact date, just that it will be before the event. Then, on the day of the fire drill, simulate high loads, introduce errors and validate all support channels. Do the right people respond when expected? Are the communication channels working? What failed, and were you able to keep your systems available? Conduct a postmortem, action the critical issues, then run another fire drill. Have things worked better this time round?
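
"Simulate high loads" doesn't have to mean a full load-testing platform on day one; even a small script that fans out concurrent requests and records the error rate will tell you whether your dashboards, alerts and on-call paging actually fire. A toy sketch (hypothetical staging URL and numbers; use a dedicated load-testing tool for a real drill):

```python
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "https://staging.shop.example.com/"   # hypothetical drill target
REQUESTS_TOTAL = 500
CONCURRENCY = 50

def hit(_):
    """Issue one request and classify it as ok / error / failed."""
    try:
        return "ok" if requests.get(URL, timeout=5).status_code < 500 else "error"
    except requests.RequestException:
        return "failed"

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        results = list(pool.map(hit, range(REQUESTS_TOTAL)))
    errors = results.count("error") + results.count("failed")
    print(f"{REQUESTS_TOTAL} requests, {errors} errors "
          f"({100 * errors / REQUESTS_TOTAL:.1f}% error rate)")
    # Cross-check: did your dashboards show the spike, and did any alerts fire?
```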

Outside of the full-blown fire drills, smaller, more focused exercises should be carried out to ensure that communication and coordination with on-call folks work as expected, and that all personnel get the chance to experience an actual or simulated on-call shift.

What’s your exec involvement in incidents? What’s the threshold for waking them up; what defined role do they play; how do you decide when they can go back to bed? Same considerations apply to dev and ops leads.

Your escalation procedures need to include your execs. In the support escalation checklist earlier there is an item "Define escalation path": that means defining the path all the way up to the top of the decision-making chain, along with the circumstances in which each level needs to be contacted!

Read the SRE books to understand how to look after the well-being of your staff. It's just as important as the tech!

Most fundamental: what monitoring signals do you look at to detect customer unhappiness? These are your Service Level Indicators (SLIs). Know the levels at which they predict users escalating to your support channels.

There are several types of monitoring signals that you should collect: infrastructure component metrics and SLIs.

I've mentioned Stackdriver earlier; it has a number of features for collecting the above, including metrics from VPC flow logs, which can be exported to Stackdriver Logging.

SLIs are metrics associated with the overall health of the system. These metrics are generally broader than specific infrastructure components and might be represented as something like API QPS rates, orders/min, or cart adds/second.
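
As a small illustration of turning raw request data into SLIs, the sketch below computes an availability SLI (the fraction of requests served successfully) and a latency SLI (the fraction served under a threshold) from a batch of request records, then compares them to an alerting level. The records, threshold and alert level are all hypothetical:

```python
# Hypothetical request records, e.g. parsed from load balancer logs.
REQUESTS = [
    {"status": 200, "latency_ms": 120},
    {"status": 200, "latency_ms": 480},
    {"status": 500, "latency_ms": 30},
    {"status": 200, "latency_ms": 950},
]

LATENCY_THRESHOLD_MS = 500
# Illustrative level at which users start hitting your support channels.
ALERT_BELOW = 0.99

def availability_sli(requests):
    """Fraction of requests that did not fail with a server error."""
    good = sum(1 for r in requests if r["status"] < 500)
    return good / len(requests)

def latency_sli(requests):
    """Fraction of requests served faster than the latency threshold."""
    fast = sum(1 for r in requests if r["latency_ms"] < LATENCY_THRESHOLD_MS)
    return fast / len(requests)

if __name__ == "__main__":
    for name, value in [("availability", availability_sli(REQUESTS)),
                        ("latency", latency_sli(REQUESTS))]:
        flag = "ALERT" if value < ALERT_BELOW else "ok"
        print(f"{name} SLI = {value:.3f} ({flag})")
```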

Read the postmortem from last year’s event. What’s the status of the action items from it? Anything unresolved that could bite you again? If no such PM, even more important to start one for this year. Your future selves will thank you.

Get out that old PM and do as Adrian says: make sure any unresolved issues that could bite you again are addressed before the event.

If you’ve frozen changes, then at some point you unfreeze (post-event). Anticipate higher breakage rate when this happens — backed-up changes build a higher rate of user visible bugs.

Earlier we discussed a robust CI/CD pipeline, along with the use of rolling updates and blue/green deployment strategies. The blue/green approach in particular lets you continue development and deploy to a copy of the active environment, then cut traffic over once you are happy with it.
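
By way of illustration, if your service happened to run on App Engine, a blue/green style cutover can be driven with a traffic split: deploy the new ("green") version with no traffic, validate it, shift a slice of traffic across, and roll back by shifting it back. The sketch below wraps the gcloud CLI; the service and version names are hypothetical, and the same idea applies to load-balanced instance groups or GKE.

```python
import subprocess

SERVICE = "default"        # hypothetical App Engine service
BLUE, GREEN = "blue", "green"

def run(*args):
    """Echo and execute a gcloud command."""
    print("+", " ".join(args))
    subprocess.check_call(args)

def deploy_green():
    """Deploy the new version alongside the live one, without traffic."""
    run("gcloud", "app", "deploy", "--version", GREEN, "--no-promote", "--quiet")

def shift_traffic(green_share):
    """Move a fraction of traffic to green; 0.0 rolls fully back to blue."""
    if green_share >= 1.0:
        splits = f"{GREEN}=1"
    elif green_share <= 0.0:
        splits = f"{BLUE}=1"
    else:
        splits = f"{GREEN}={green_share},{BLUE}={1 - green_share}"
    run("gcloud", "app", "services", "set-traffic", SERVICE,
        "--splits", splits, "--split-by", "random", "--quiet")

if __name__ == "__main__":
    deploy_green()
    shift_traffic(0.1)   # canary 10% and watch your SLIs...
    shift_traffic(1.0)   # ...then cut over, or shift_traffic(0.0) to roll back
```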

Pay your event oncalls well (vacation and/or cash). Make successful learning/improvement from BFCM a material contribution to promotion/better pay — reward what you want more of.

At some stage it may make sense to give your interns a corporate credit card and hourly budget: get them to do periodic buys of cheap items, and report (with screenshots) any errors or excessive latency direct to your oncall engineers.

Expect the event postmortem to drive a significant amount of engineering tasks in early 2019: leave room in your planning for actions arising from the PM

At this stage though you probably want to have a stiff green tea / espresso and hunker down for the event; it’s too late to change the fundamentals, so set a calendar reminder for August to start planning for next year, and do what you can. Good luck! See you on the other side.

And when it's time to start planning, we have this waiting for you:

Adrian provided some great advice, and I took on the challenge of replying to his tweets, so if you bump into a CRE in a bar you can respond in kind to their questions about your preparedness, now that you have some more resources to review and prepare with.

With work and perseverance you can be successful. While hope is not a strategy, a little bit of good luck on top of some hard preparation work is always welcome.
