Improve the community relations process for data center switchover
Closed, ResolvedPublic

Description

While working on previous data center (DC) switchovers, we observed a lot of room for improvement. The goal of this task is to provide a better quality of service.

Here is the improvement plan:

  • Discuss with Erica about having a third person involved in the process. Rationale:
    • We want to add quality checkins, with one person doing the task and the other checking up on it.
    • Since we account for only 1.5 FTE in total and take vacations and other time off, a third person would be a reliable backup to cover those absences.
    • Based on Tech News experience, we prefer not to rely on volunteers to help for now, even if some would be available and skilled.
  • Create a Phabricator template, maintained by us, to request banners for server switches
    • Reuse the current steps we have
    • Add a checkup step for translations (are they done, are links functioning, etc?)
    • Add a checking step for each item, written down in the checklist.
  • Improve messages sent to communities
    • We only have reusable messages, which cover our needs: one for DC switchover.
    • Change the date format so that translators don't have to translate dates each time the message is used (reuse the Tech News format; see the related translators-l@ thread). But leave room for local times for languages that cover multiple time zones.
    • Decide whether we keep the option to translate the time of the event, which risks mistakes if (1) the UTC time changes or (2) the local time observes DST.
  • Document publicly the fact that:
    • two people work on a task, the assignee and the checker
    • the assignee does the job, and the checker checks it
    • assignee can change but in this case, the task is reassigned
    • we debrief after each task, and we iterate on the process if needed
    • provide links to the two messages so that translators can find them
  • Document a Q&A process so that another CRS can check a banner configuration.

History
Until March 16 2022, this task covered regular maintenance read-onlys and databases switchover (DC switchover). This process changed (T303605), and the scope of the current task was redefined to only take care of DC switchover.
Regular maintenance read-only times were discussed in this task. The conclusion of the discussion is that we don't need to announce them anymore as the read-only time is minimal.

Details

Other Assignee
sgrabarczuk

Event Timeline

Trizek-WMF updated Other Assignee, added: sgrabarczuk.
Trizek-WMF updated the task description.

Work done
I changed the date format on the Server switches message, so that translating the dates is no longer needed. This avoids sending a message with inaccurate dates.
I had to do it twice, since I goofed on my first attempt by oversimplifying the date format, which led to an incorrect display of the date for some languages. We learn new things every day, sometimes the hard way.

I changed the Read-only limited announcement to remove the number of wikis affected (one less step to change), and I tweaked the language a little bit. I also edited the date format.

The new date format for both messages is the same as on Tech News. It should be well known to our usual translators. To avoid issues with less experienced translators, I created several documentation messages, like this one, to warn them.

Open questions regarding Server switches
Regarding server switches, we still have some issues with the hour of the switch, especially in translation item #11. We initially chose to provide multiple times, so that we cover every part of the world. However, since some countries observe daylight saving time, some of these times would be wrong from time to time.

We need to find a solution:

  • only display UTC time, with a link to https://zonestamp.toolforge.org/. It is bulletproof, we can change it on our side and we are sure that users get it. But it requires one extra action for the reader to display their local time. It is the option applied on the Read-only limited announcement.
  • display various times, as is done now: "14:00 UTC (07:00 PDT, 10:00 EDT, 15:00 WEST/BST, 16:00 CEST, 19:30 IST, 23:00 JST, and in New Zealand at 02:00 NZST on Wednesday 15 September).", with the risk of the wrong time being displayed, because of DST, or just because we move the UTC hour.
  • only have server switches when we are safe regarding timezones and our usual UTC time, which is kind of fun to schedule but not really practical.

For practical reasons, I'd go for the first option, but a second opinion is welcome. I kept this part untranslated, and I froze the message while we discuss it.
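
The DST risk of the second option can be illustrated with a short sketch. It converts the same 14:00 UTC slot to local times on two different dates. The zone list is an assumption chosen to mirror the abbreviations quoted above, and the December date is purely hypothetical. It relies on Python's standard `zoneinfo` module and the system time zone database.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Assumed zones, picked to match the abbreviations quoted in the message.
ZONES = [
    "America/Los_Angeles",
    "America/New_York",
    "Europe/London",
    "Europe/Paris",
    "Asia/Kolkata",
    "Asia/Tokyo",
    "Pacific/Auckland",
]


def local_times(utc_moment):
    """Return the local wall-clock time in each zone for one UTC moment."""
    return {
        zone: utc_moment.astimezone(ZoneInfo(zone)).strftime("%H:%M %Z")
        for zone in ZONES
    }


# Same UTC hour, two dates: the last server switch, and a hypothetical winter one.
september = local_times(datetime(2021, 9, 14, 14, 0, tzinfo=timezone.utc))
december = local_times(datetime(2021, 12, 14, 14, 0, tzinfo=timezone.utc))

for zone in ZONES:
    changed = "" if september[zone] == december[zone] else "  <- differs"
    print(f"{zone:22} {september[zone]:12} {december[zone]:12}{changed}")
```

Zones that observe DST show a different local hour on the second date even though the UTC hour is identical, so a translated list of local times would need to be rechecked for every announcement.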

Checkup
We have to add the following to the checkup:

  • Since I haven't been able to update all translations, only messages with 100% of translations done must be used.
  • All <tvar> items named "date" must be updated each time we plan to send a new message.
  • Message documentation has to be updated with the right time every time we plan to send a new message.
  • Any change to a date format must be applied to the qqq documentation, and translators have to be notified, especially the ones using a specific date format (Japanese and Korean, for instance).

Relying on translations of timezones is not possible. If the switch changes from 06:00 to 14:00 and the time is not updated in translations, we will have messages with some incorrect information. It is not acceptable regarding the quality of service we want to provide.

We benefit from https://zonestamp.toolforge.org, which uses Unix time as a URL parameter. As a consequence, it is easy to format our message to use it: https://zonestamp.toolforge.org/{{#time:U|2021-11-11T06:00}}.
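
As a rough cross-check outside the wiki, the same link can be built in a few lines of Python. This is a minimal sketch, assuming the timestamp given to `{{#time:U|...}}` is read as UTC; the helper name `zonestamp_url` is made up for illustration.

```python
from datetime import datetime, timezone

ZONESTAMP_BASE = "https://zonestamp.toolforge.org/"


def zonestamp_url(utc_iso):
    """Build a zonestamp link from an ISO 8601 date and time, read as UTC."""
    # fromisoformat() returns a naive datetime here; attach UTC explicitly
    # so the result does not depend on the machine's local time zone.
    moment = datetime.fromisoformat(utc_iso).replace(tzinfo=timezone.utc)
    return f"{ZONESTAMP_BASE}{int(moment.timestamp())}"


print(zonestamp_url("2021-11-11T06:00"))
# prints https://zonestamp.toolforge.org/1636610400
```

Comparing this output with the rendered link is a quick way to confirm that the date in the message was updated correctly.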

As a consequence, I edited the messages so that the time of the maintenance window is linked to zonestamp, with no translations. Translations haven't been updated yet, since the work continues.

As we have a new wave of read-onlys, I'm taking it as an opportunity to work on the template we wanted to create.

I planned to work on a Phabricator form for read-onlys and server switches. The goal is to prefill the task with the different steps we need to go through in order to inform the communities. I thought we already had a template, but I can't find it. If anyone finds it, please let me know.

I'm sorry I stepped on your toes (I missed this ticket), but given the recent discussions (last week), I personally think we should just stop announcing them altogether. Hopefully this reduces your work.

Trizek-WMF changed the task status from Stalled to In Progress. Mar 16 2022, 1:26 PM
Trizek-WMF lowered the priority of this task from High to Medium.
Trizek-WMF updated the task description.
Trizek-WMF renamed this task from Improve the community relations process for read-only times and server switches to Improve the community relations process for server switches. Apr 13 2022, 12:59 PM
Trizek-WMF updated the task description.
Trizek-WMF changed the task status from In Progress to Open. May 11 2022, 5:44 PM

I keep this task as a background one, but I'm not focusing on it right now.

I've removed this from the descriptions of my regular duties (both the team page on Meta and the internal table). Last calendar year, unless I'm mistaken, we did this 0 times. To me, it looks like this: should our team receive any request, either Benoît or I will react. But in terms of regular workflows and time allocation, it's been 0%-ish.

Last server switch was on Tuesday 14 September 2021.

If server switches happen in the future, then we have a process to cover them. It requires a bit of documentation work to be usable by any CRS when the request comes.

If server switches aren't a thing anymore, then we can decline this task.

@Trizek-WMF @Ladsgroup do you think we are done here, or?

It's a bit confusing in general. What does "server switch" mean in this context? We have a couple of different things:

  • Primary database switchover: That leads to maybe a minute of read-only time (mostly less) in a subset of wikis, e.g. only ruwiki, frwiki and jawiki in case of a s6 switchover. It happens quite often, every other week.
  • Datacenter switchover: This usually takes a lot more read-only time, around five minutes, but happens way less often. It is supposed to be once a year, but SRE currently doesn't have the capacity to do it. We will look into doing this later, but it doesn't mean it's not happening anymore. It's just stalled for now.
  • x1 primary switchovers: These switchovers impact many random features across all wikis, like notifications, short URLs, etc. They don't happen that often, something along the lines of once a year, and the read-only time for those features is short.
  • misc services switchovers: For example, mailing lists might get a read-only database. This will be short but also doesn't happen that often. Similar to the one above.

For the first bullet point, my recommendation is to simply stop adding them to Tech News. I have written essays on this topic, e.g. T303605: Stop announcing and scheduling primary database switchovers or T313398#8143761. My plan is to push for it again after fixing T314975: Rdbms library is too aggressive in changing to read-only. Its patch just got merged, so hopefully next week I can test it, make sure it works, and then get back to getting rid of these announcements.

For the latter ones, we probably still need Tech News and, for the datacenter switchovers, possibly even a banner, but they happen way less often, so that is not an issue.

@Ladsgroup, this task is (now) only about Datacenter switchovers, "the big one". Even if it happens less often, once a year, it is important for us to have procedures to announce it. :)

Regarding primary database switchover, if you are totally confident that they will be unnoticed, then we can stop announcing them on Tech News. Maybe we should have a communication procedure in the case of a failure, assuming that they will always have the same impact, a longer-than-expected read-only?

Sure. The DC ones will happen soonish. Not at the moment, due to capacity issues in serviceops and data persistence. Serviceops got better and will get even better soon (but they have the mw-on-k8s project, which takes a lot of resources), but we are still looking for a DBA. If you know anyone, please refer them!

Yeah, let's pick this up again after the patch is deployed in production and I'm sure it fixes the flood of pre-switchover read-onlys.

I'm fairly certain that this patch reduces the read-only time to a minimum. I suggest adding a final Tech News item saying that this won't be announced anymore and that users could see some short read-only times at 7 AM UTC on Tuesdays and Thursdays.

I rely on your expertise. Let's have a final announcement.

If we need to go back to the previous way in the future, we keep the benefit of existing procedures and translations.

Now that we clarified this, I'll keep this task active to finish the documentation work on data center switches.

Awesome. Thanks. If that helps, we are going to have a DC switchover soon (end of this quarter).

The earlier you set the date and share it with us, the better!
I need to book some time to finish the DC switchover documentation and process so that CRS will be ready to assist you.

Trizek-WMF renamed this task from Improve the community relations process for server switches to Improve the community relations process for data center switchover. Jan 30 2023, 4:30 PM
Trizek-WMF updated the task description.

There you go @Trizek-WMF, T328287

I suggest adding a final Tech News item saying that this won't be announced anymore and that users could see some short read-only times at 7 AM UTC on Tuesdays and Thursdays.

I saw the announcement in Tech News and just wanted to share a suggestion. Since the switches won't be announced anymore, we could consider having a calendar where the dates on which the switch would happen are conveyed. This might be helpful in case someone notices a downtime on Tuesday / Thursday and wants to confirm if this is associated with the switchover.

If the switchover literally happens every Tuesday / Thursday this might not be particularly necessary. I suppose there are chances for it to be skipped for a few weeks, though. Hence the suggestion :-)

They should be visible in the deployment calendar: https://wikitech.wikimedia.org/wiki/Deployments

That's really nice. The Tech News announcement did not mention this. It might be a good idea to mention it as well, possibly in the next edition.

Our internal documentation now covers it. Thank you for this suggestion!

I don't want to celebrate this task's second birthday later this year. Please prioritize completion.

Trizek-WMF added a subscriber: UOzurumba.

Things have changed a lot since the creation of this task.

The most recent switchover was an opportunity to test the process I set up in real conditions, i.e. without me piloting. @sgrabarczuk and @UOzurumba completed it successfully. The only step they found missing (sending a message to bot coordination pages) is now part of the process listed below.

Working on these announcements requires mastering MassMessage, CentralNotice banners, and Extension:Translate, basic tools for #CommRel-Specialists-Support folks.
The how-to is now documented at https://office.wikimedia.org/wiki/Community_Relations_Specialists/Announcing_Data_center_switchovers.

This process is now stable. It can be reused and expanded to similar events.