mcrouter daemonset on mw-on-k8s
Open, In Progress, Medium, Public

Description

What?

mcrouter is our memcache router, responsible for the replication and sharding of our memcached data, as well as the reliability of the service. Mcrouter maintains a pool of connections to all servers in our memcache cluster. Currently in MW-on-K8s, mcrouter is one of the many containers in a mediawiki pod. As we have discussed in T277711, mcrouter could be a daemonset.

Why?

First of all, mcrouter is robust and capable of serving *a lot* of traffic. A few reasons why we would be making better use of mcrouter if it were a daemonset:

  • Reduce the mediawiki pod by 2 containers (mcrouter, mcrouter+exporter)
  • Updates to mcrouter's version and configuration are rare (and require no reboots), so we would avoid starting up yet another container during mw deployments
  • Fast fail: since each mcrouter daemonset pod will be receiving more traffic, it will fail over faster to the gutter pool in case of a memcache server failure
  • On bare metal we have 1 mcrouter per 96 php-fpm workers (avg memory ~135 MB) vs 1 mcrouter per 8 php-fpm workers in mw-on-k8s (avg memory ~50 MB per pod)
  • Fewer connections towards the memcache cluster. Each mcrouter maintains a connection pool with each memcached server
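
To make the connection argument concrete, here is a rough back-of-the-envelope sketch in Python. It assumes only what the last bullet states (each mcrouter keeps one connection pool per memcached server). The node, pod and memcached server counts are hypothetical placeholders, not measurements, and the function name is made up for illustration.

```python
def upstream_connection_pools(mcrouter_instances: int, memcached_servers: int) -> int:
    """Total connection pools towards memcached, one per (mcrouter, server) pair."""
    return mcrouter_instances * memcached_servers


# Hypothetical inputs, for illustration only.
NODES = 144             # assumed number of k8s nodes
PODS_PER_NODE = 7       # assumed number of mediawiki pods per node
MEMCACHED_SERVERS = 18  # assumed size of the memcache cluster

# One mcrouter per mediawiki pod (in-pod container).
per_pod = upstream_connection_pools(NODES * PODS_PER_NODE, MEMCACHED_SERVERS)
# One mcrouter per node (daemonset).
per_node = upstream_connection_pools(NODES, MEMCACHED_SERVERS)

print(f"in-pod mcrouter: {per_pod} pools, daemonset: {per_node} pools")
print(f"reduction factor: {per_pod // per_node}")
```

Under these assumptions the reduction factor is simply the number of mediawiki pods per node, since one daemonset pod replaces every in-pod mcrouter on that node.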

Drawbacks

  • Unavailability of the daemonset will cause either a whole node or whole mw-* deployments to fail.
    • Given mcrouter's track record with incidents, this is unlikely to be caused by the software itself
  • Will need extra care when rolling out changes (rare, but still)
  • We may run into mcrouter scaling issues we have not seen so far

Future Work

  • If mcrouter runs as a daemonset, potentially any application using any memcache cluster could use its corresponding service

How?

Roadmap:

  • Create mcrouter chart and namespace
  • Deploy mcrouter as a daemonset, available to each node
  • Create a service with type: ClusterIP and internalTrafficPolicy: Local
    • accessible only within kubernetes
    • this way we will ensure that pods in a node will talk to the local daemonset via "mcrouter-service.mcrouter.svc.cluster.local"
  • Make $wgObjectCaches['mcrouter']['servers'] an environment variable we can define in values.yaml (see T326705: Allow php-fpm to read environment variables from the system, not just from the fcgi request)
    • Will help with switching between the in-pod mcrouter and the mcrouter ds, and thus with testing
    • update php7.4-fpm image to pass env['MCROUTER_SERVER'] in fpm pools
    • update mediawiki chart with the relevant changes
  • point mw-debug to the daemonset
  • update Wikitech
  • update Grafana Dashboards
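
The environment-variable switch from the roadmap could be sketched roughly as follows. This is Python for illustration only, since the real change would live in the PHP config. The fallback address and the function name are hypothetical. Only the MCROUTER_SERVER variable name comes from this task.

```python
import os
from typing import Mapping

# Hypothetical default: address of the in-pod mcrouter container.
IN_POD_MCROUTER = "127.0.0.1:11213"


def mcrouter_server(environ: Mapping[str, str] = os.environ) -> str:
    """Return the mcrouter address, preferring MCROUTER_SERVER when it is set.

    An unset or empty variable falls back to the in-pod mcrouter, so a
    deployment can move to the daemonset (or back) by changing one value.
    """
    value = environ.get("MCROUTER_SERVER", "").strip()
    return value or IN_POD_MCROUTER


if __name__ == "__main__":
    print(mcrouter_server())
```

Keeping the fallback in one place means a deployment that does not set the variable keeps talking to its local container, which is what makes a gradual rollout and a quick rollback possible.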

Details

Repo                                          Branch      Lines +/-
operations/mediawiki-config                   master      +28 -1
operations/deployment-charts                  master      +0 -1
operations/deployment-charts                  master      +3 -0
operations/deployment-charts                  master      +7 -2
operations/deployment-charts                  master      +1 -1
operations/deployment-charts                  master      +3 -0
operations/deployment-charts                  master      +8 -0
operations/deployment-charts                  master      +1 -1
operations/deployment-charts                  master      +1 -1
operations/deployment-charts                  master      +2 -2
operations/deployment-charts                  master      +1 -0
operations/deployment-charts                  master      +1 -0
operations/deployment-charts                  master      +48 -0
operations/docker-images/production-images    master      +26 -1
operations/docker-images/production-images    master      +6 -0
operations/deployment-charts                  master      +145 -0
operations/deployment-charts                  master      +1 -0
operations/puppet                             production  +4 -0
operations/deployment-charts                  master      +82 -126
operations/deployment-charts                  master      +1 K -0
operations/deployment-charts                  master      +89 -109
operations/mediawiki-config                   master      +1 -1

Event Timeline

I pressed submit before finishing my comment:

that amounts to setting clear_env = no in php-fpm I think.

Do you recall any other ENV variables that may be present and cause us issues?

Change 961743 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] (WIP) mcrouter: add chart

https://gerrit.wikimedia.org/r/961743

Moving to our inbox. This will require a change in wmf-config for the "mcrouter" BagOStuff instance.

ServiceOps would like this within the next 14 days to avoid delaying k8s work.

Change 973838 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[operations/mediawiki-config@master] [WIP] mc: Read mcrouter servers from an environment variable

https://gerrit.wikimedia.org/r/973838

Change 973838 merged by jenkins-bot:

[operations/mediawiki-config@master] mc: Make it possible to use mcrouter server set by environment

https://gerrit.wikimedia.org/r/973838

Mentioned in SAL (#wikimedia-operations) [2023-11-21T14:04:24Z] <lucaswerkmeister-wmde@deploy2002> Started scap: Backport for [[gerrit:973838|mc: Make it possible to use mcrouter server set by environment (T346690)]]

Mentioned in SAL (#wikimedia-operations) [2023-11-21T14:05:45Z] <lucaswerkmeister-wmde@deploy2002> lucaswerkmeister-wmde and d3r1ck01: Backport for [[gerrit:973838|mc: Make it possible to use mcrouter server set by environment (T346690)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2023-11-21T14:11:34Z] <lucaswerkmeister-wmde@deploy2002> Finished scap: Backport for [[gerrit:973838|mc: Make it possible to use mcrouter server set by environment (T346690)]] (duration: 07m 09s)

(update): The config change got merged and deployed yesterday. So it's live now. Is there anything else needed from MediaWiki Platform Team side on this ticket?

If nothing, then we can re-assign as needed.

Change 979339 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] deployment_server: add mcrouter service 1

https://gerrit.wikimedia.org/r/979339

Change 979340 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] Add namespace for mcrouter service

https://gerrit.wikimedia.org/r/979340

Change 979107 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] mcrouter: add vanila chart

https://gerrit.wikimedia.org/r/979107

Change 979363 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] mcrouter: add helmfile

https://gerrit.wikimedia.org/r/979363

> (update): The config change got merged and deployed yesterday. So it's live now. Is there anything else needed from MediaWiki Platform Team side on this ticket?
>
> If nothing, then we can re-assign as needed.

Thank you!

Most of the patches are ready to go; the only thing missing is passing the variable to php-fpm. Our options here are either:

  • clear_env = no: generally, not recommended
  • pass it via apache: not great either, but it will work and we are already doing it for SERVERGROUP

I think we can also set it in the php-fpm pool conf like

env[MCROUTER_SERVER] = $MCROUTER_SERVER

and set $MCROUTER_SERVER through kubernetes?

Change 982785 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] (WIP 2) mcrouter: add chart

https://gerrit.wikimedia.org/r/982785

Moving to our Radar as our part is done, I believe. Feel free to move to our Inbox anytime.

Change 982785 abandoned by Effie Mouzeli:

[operations/deployment-charts@master] (WIP2) mcrouter: add chart

Reason:

back to the drawing board

https://gerrit.wikimedia.org/r/982785

> I think we can also set it in the php-fpm pool conf like
>
> env[MCROUTER_SERVER] = $MCROUTER_SERVER
>
> and set $MCROUTER_SERVER through kubernetes?

I think this is going to be the way to go, given that we generally already pass all config settings via ENV variables. A quick test on mwdebug was encouraging. Thanks!

Change 994764 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/docker-images/production-images@master] php: add env[MCROUTER_SERVER] variable

https://gerrit.wikimedia.org/r/994764

Change 994789 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] mw-debug: set MCROUTER_SERVER variable

https://gerrit.wikimedia.org/r/994789

jijiki changed the task status from Open to Stalled. Feb 1 2024, 6:40 PM

Change 994764 merged by Effie Mouzeli:

[operations/docker-images/production-images@master] php: add env[MCROUTER_SERVER] variable

https://gerrit.wikimedia.org/r/994764

Change 979107 merged by jenkins-bot:

[operations/deployment-charts@master] mcrouter: add vanila chart

https://gerrit.wikimedia.org/r/979107

Change 961743 merged by jenkins-bot:

[operations/deployment-charts@master] mcrouter: add chart

https://gerrit.wikimedia.org/r/961743

Change 979339 merged by Effie Mouzeli:

[operations/puppet@production] deployment_server: add mw-mcrouter service 1

https://gerrit.wikimedia.org/r/979339

Change 979340 merged by jenkins-bot:

[operations/deployment-charts@master] Add namespace for mw-mcrouter service 2

https://gerrit.wikimedia.org/r/979340

Change 979363 merged by jenkins-bot:

[operations/deployment-charts@master] mw-mcrouter: add helmfile

https://gerrit.wikimedia.org/r/979363

jijiki changed the task status from Stalled to In Progress. Feb 29 2024, 4:24 PM
jijiki updated the task description.

The mw-mcrouter ds has been deployed on staging (see: mw-mcrouter staging).

We will continue with eqiad as soon as we come up with some sensible namespace limits. Some interesting numbers (from codfw, as it is handling more traffic):

  • 144 wikikube hosts + 74 mediawiki (to be reimaged to k8s)
  • ~7 mediawiki pods per k8s node
  • ~1GB max rss memory used by mcrouter containers on a single k8s node
  • ~0.8 max cpu used by mcrouter containers on a single k8s node

Change #1015338 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/docker-images/production-images@master] php7.4-fpm: introa

https://gerrit.wikimedia.org/r/1015338

Change #1015342 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] mediawiki: add MW__MCROUTER_SERVER variable in chart

https://gerrit.wikimedia.org/r/1015342

Change #1015338 merged by Effie Mouzeli:

[operations/docker-images/production-images@master] php7.4-fpm: pass the env[MCROUTER_SERVER] variable to php

https://gerrit.wikimedia.org/r/1015338

Change #1015342 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: add MW__MCROUTER_SERVER variable in chart

https://gerrit.wikimedia.org/r/1015342

Change #994789 merged by jenkins-bot:

[operations/deployment-charts@master] mw-debug: set MCROUTER_SERVER variable

https://gerrit.wikimedia.org/r/994789

Change #1020207 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] mw-api-int: use mcrouter daemonset on codfw

https://gerrit.wikimedia.org/r/1020207

Change #1020207 merged by jenkins-bot:

[operations/deployment-charts@master] mw-api-int: use mcrouter daemonset on codfw

https://gerrit.wikimedia.org/r/1020207

Mentioned in SAL (#wikimedia-operations) [2024-04-17T09:08:42Z] <jiji@deploy1002> Started scap: Switch mediawiki in eqiad to use node-local mcrouter ds - T346690

Change #1020765 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] admin_ng: Bump coredns replicas to 6

https://gerrit.wikimedia.org/r/1020765

Change #1020768 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] mediawiki-common: add a dot to the mcrouter url

https://gerrit.wikimedia.org/r/1020768

Change #1020765 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: Bump coredns replicas to 6

https://gerrit.wikimedia.org/r/1020765

Change #1020768 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki-common: add a dot to the mcrouter url

https://gerrit.wikimedia.org/r/1020768

Change #1020774 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] mediawiki-common: use mcrouter ds only on codfw

https://gerrit.wikimedia.org/r/1020774

Change #1020774 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki-common: use mcrouter ds only on codfw

https://gerrit.wikimedia.org/r/1020774

Change #1020778 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] admin_ng: Bump coredns memory for wikikube

https://gerrit.wikimedia.org/r/1020778

Change #1020778 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: Bump coredns memory for wikikube

https://gerrit.wikimedia.org/r/1020778

jijiki changed the task status from In Progress to Stalled. May 15 2024, 1:09 PM

Current status:

  • codfw mw-on-k8s pods use 'mcrouter-main.mw-mcrouter.svc.cluster.local.:4442'
  • eqiad is using the local mcrouter container.
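
The trailing dot in the codfw service name makes it fully qualified. One plausible benefit (an assumption, as the task does not state the motivation) is that a resolver configured with a search list and an ndots threshold then sends a single query instead of trying every search domain first. The sketch below models that resolver rule in Python. The search list, the example-ns namespace and the ndots value of 5 are assumed typical in-cluster settings, not values taken from this task.

```python
from typing import List, Sequence


def candidate_queries(name: str, search: Sequence[str], ndots: int = 5) -> List[str]:
    """List the names a stub resolver would try, in order, for `name`.

    - A name ending in "." is absolute and is queried as-is.
    - A name with fewer than `ndots` dots is tried against each search
      domain first, and as an absolute name last.
    - Otherwise the absolute name is tried first, then the search domains.
    """
    if name.endswith("."):
        return [name]
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + expanded
    return expanded + [absolute]


# Assumed in-cluster search list for a pod in a hypothetical namespace.
SEARCH = ["example-ns.svc.cluster.local", "svc.cluster.local", "cluster.local"]

for host in ("mcrouter-main.mw-mcrouter.svc.cluster.local",
             "mcrouter-main.mw-mcrouter.svc.cluster.local."):
    print(host, "->", len(candidate_queries(host, SEARCH)), "candidate queries")
```

Under these assumed settings the name without the dot has only four dots, so the search domains are tried before the real name, while the dotted form resolves with one lookup.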

blocked for now on T363186

I have been trying to figure out how much the DNS resolution of 'mcrouter-main.mw-mcrouter.svc.cluster.local.:4442' costs, by using XHGui. I don't think I have found anything terrible, apart from a few microseconds (μs).

eqiad (container):

image.png (1×2 px, 240 KB)

codfw (ds):

image.png (1×2 px, 241 KB)

jijiki changed the task status from Stalled to In Progress. Tue, Jun 4, 2:12 PM

Change #1039197 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/mediawiki-config@master] mc.php: if $_SERVER['MCROUTER_SERVER'] is set, resolve it

https://gerrit.wikimedia.org/r/1039197

Change #1039197 abandoned by Effie Mouzeli:

[operations/mediawiki-config@master] mc.php: if $MCROUTER_SERVER is set, resolve it

Reason:

bad idea

https://gerrit.wikimedia.org/r/1039197

Change #1039197 restored by Effie Mouzeli:

[operations/mediawiki-config@master] mc.php: if $MCROUTER_SERVER is set, resolve it

https://gerrit.wikimedia.org/r/1039197

Change #1047030 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] mw-mcrouter: add ClusterIP for eqiad

https://gerrit.wikimedia.org/r/1047030

Change #1047032 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] mw-mcrouter: add ClusterIP for codfw

https://gerrit.wikimedia.org/r/1047032

Change #1047043 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] mw-debug: point mediawiki to mw-mcrouter's clusterIP

https://gerrit.wikimedia.org/r/1047043

Change #1047030 merged by jenkins-bot:

[operations/deployment-charts@master] mw-mcrouter: add ClusterIP for eqiad

https://gerrit.wikimedia.org/r/1047030

Change #1047050 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] mediawiki: switch to using the mw-mcrouter daemonset

https://gerrit.wikimedia.org/r/1047050

Change #1047043 abandoned by Effie Mouzeli:

[operations/deployment-charts@master] mw-debug: point mediawiki to mw-mcrouter's clusterIP

Reason:

will try again

https://gerrit.wikimedia.org/r/1047043

Change #1047050 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: switch eqiad to use the mw-mcrouter daemonset

https://gerrit.wikimedia.org/r/1047050

Mentioned in SAL (#wikimedia-operations) [2024-06-18T11:58:00Z] <effie> Slowly pointing mediawiki in eqiad to mw-mcrouter daemonset - T346690

We attempted to roll out on eqiad, with mediawiki using mcrouter's cluster IP directly, but we started seeing many errors on the mediawiki side, so we had to roll back.

Change #1047075 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] mcrouter: Temporarily disable in codfw

https://gerrit.wikimedia.org/r/1047075

Change #1047075 merged by jenkins-bot:

[operations/deployment-charts@master] mcrouter: Temporarily disable in codfw

https://gerrit.wikimedia.org/r/1047075

Change #1047032 merged by jenkins-bot:

[operations/deployment-charts@master] mw-mcrouter: add ClusterIP for codfw

https://gerrit.wikimedia.org/r/1047032

Change #1039197 abandoned by Effie Mouzeli:

[operations/mediawiki-config@master] mc.php: store mcrouter location in apcu

Reason:

https://gerrit.wikimedia.org/r/1039197