The Evolution of Code Deploys at Reddit


The original original deployment scheme was pretty much "^C the reddit process running in screen on the server."

At least I hope it was in screen...

It was not.

u/Nysor

Thanks for all these informative blog posts. Quite glad that Reddit has and is still evolving and scaling over time.

u/raldi

I thought the original was that you would type some new Lisp into the REPL that was serving the site, wasn't it?

And sometimes you even remembered to save it to a file later?


That's right. We ran Reddit out of a long-lived CMUCL process. Instead of reloading everything, I (I'm always hesitant to use I instead of We, but it was really just me doing the dev at the time) updated the process one function at a time by pasting in newly updated code. Because of this, the instance of Reddit running in production wasn't necessarily documented in any specific code. It was more like a well-aged cheese culture: impossible to reproduce.

That's terrifying.


Pfft. files

u/-registeredLurker-

What are you? Why is your username displayed differently (at least in RiF) than a normal user, mod, or admin?

u/etecoon3

Hovering over it shows "admin emeritus". So I would assume a former admin who is no longer one, but he still got a cool flair and possibly other benefits/privileges from his time in the position.


Detachtty?

Maybe! The only reason I'm certain it wasn't screen is that I distinctly remember learning about screen thinking, "this will make things so much easier!"


How will you respond to a billion dollar company threatening to dox a user? Is that not against the rules? How can users still feel safe on this platform while this is allowed?

Happy cake day, Alex! Please ban r/the_donald.

u/FountainLettus

Hey while I have you here, any idea what Reddit rules the_donald hasn't broken yet? This is getting a little out of control

Please provide a means to redditors for unsigning up for alpha/beta testing new features.

The new profile page just doesn't work for me and I haven't found a way to get rid of it.

Sorry to bother you here, the means for contacting the admins doesn't have a category/link for this.

Thanks for reddit, I love it and use it every day.

u/alexbarrett

A watch command running in a tmux session is how I "deployed" one of my hobby projects for a year. I did finally - earlier this year - move it to the crontab, but it did the job just fine. Not much downtime at all.

u/kemitche

I'm pretty sure that every single website started as a process running in screen or tmux.

u/alexbarrett

My day job is primarily PHP for which that isn't the case. Most sites are served by PHP-FPM and therefore managed by systemd from the outset.

For other stacks like Python and Node.js you may well be right.

u/Schmittfried

Are you sure? I thought Apache with mod_php is way more popular.


My first website was a process running on nohup... It was bad, I had to pkill it.

u/rydan

I've never seen a website hosted by screen or tmux. Most things are going to be either Apache, nginx, or Tomcat. All of these will be running in the background typically from the moment the system started.

Nobody is suggesting that this is a good way to host a production website, just that you start developing that way

u/Drunken_Economist

That explains all the outages back then . . .

all the outages

Oh sweet summer child, let me introduce you to my company's product...


Why do you hope it was in screen? What would be the downside of not using screen?

The underlying joke is that doing it in screen without some sort of supervising or monitoring process is already pretty janky. Doing it without screen is "wow how did we make it this far" level.

In addition to what the other commenter said, without screen (or something similar), when your ssh connection to the server dies, the server process will too.

There's nohup for that though.

u/half0wl

Loving these blog posts so far.

One question: when did reddit move to AWS and what challenges did the team face? It's not immediately clear from the article, but I assume this was done in 2013 (cloud + autoscale)?


2009 I think? Somewhere around there.

The migration itself was pretty harrowing but that was mostly our fault because we didn't budget for how long postgres would take to build indices after we migrated the data. I think we'd planned for a couple of hours but it ended up taking until the following morning. We played a lot of bzflag while we waited.

After we finished it turned out that the site was unusably slow, but CPU profiling didn't show any difference. After a lot of debugging we figured out that AWS's network had similar throughput to our physical network but horrible latency. We scrambled to find every network call in a loop (vs batched calls) that previously hadn't mattered but now was a huge problem. We also had trouble getting acceptable performance out of EBS. That saga won't really fit in a single comment, but the short version is that of the ~20 EBS volumes we allocated, one would be inexplicably 10x slower than the others. AWS's response was "yeah, that happens". Our solution was to allocate a bunch at a time, run performance tests, pick the N fastest, and delete the rest. That usually worked, but sometimes a good drive would turn into a bad one after a long time in service.

A few weeks after the migration we loaded up the 20 or so physical servers we had into the back of my Saturn. It was too much weight for the poor girl and the wheel wells are still scratched because every little bump would scrape the tyres against them. Since then the pile of servers has migrated multiple times from office to office and I believe is currently in a garage.

u/raldi

After a lot of debugging we figured out the cause and had to scramble to find every network call in a loop (vs batched calls) that previously hadn't mattered but now was a huge problem.

IIRC we also added a memcache layer on each appserver host, in addition to the authoritative memcache on a nearby host.

Ah yeah, the still venerable StaleCache :)

It's a small additional cache local to each app server with a very short TTL that has a copy of the most frequently accessed data, like the current r/all listing or the biggest few subreddits. This way a request can say "it's okay if I get an old copy of this data" where it knows it's okay to be a little out of date, and not have to go over the network for it.
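For readers who want the shape of the idea, here's a minimal sketch of a stale-tolerant local cache in front of a remote one. This is my illustration, not reddit's actual code; the remote client and the TTL value are assumptions.

import time

class LocalStaleCache:
    """Tiny per-appserver cache with a short TTL, sitting in front of the
    authoritative remote cache. Illustrative sketch only, not reddit's code."""

    def __init__(self, remote_cache, ttl_seconds=1.0):
        self.remote = remote_cache      # e.g. a memcached client (assumed)
        self.ttl = ttl_seconds
        self.local = {}                 # key -> (expires_at, value)

    def get(self, key, stale_ok=False):
        if stale_ok:
            hit = self.local.get(key)
            if hit and hit[0] > time.monotonic():
                return hit[1]           # serve the possibly-stale local copy
        value = self.remote.get(key)    # otherwise pay the network round trip
        self.local[key] = (time.monotonic() + self.ttl, value)
        return value

A caller rendering something hot like the r/all listing could then opt in with cache.get('listing:all', stale_ok=True) (hypothetical key name).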

Also, hi raldi!

u/raldi

Hi! Remember when I used to wear a wristwatch, and you used to wear that beard we've all been passing around for the last ten years?

Is it my turn yet?


you used to wear that beard we've all been passing around for the last ten years?

I've tried so hard to forget

u/kayzzer

It's worth noting, for people reading this who are not familiar with AWS, that EBS is tons better now than it was 7 years ago. The gp2 drives have pretty consistent performance.


I'd say to start with a hosted/cloud provider but I think you'd have to do the maths on the costs and your team's skill on both. There are huge upsides and downsides for both. With physical hardware you can probably do it for cheaper in the long run, but you're way less flexible. And I think for a startup in particular flexibility is paramount.

The nice thing about physical hardware is that your costs are mostly fixed and you don't have the "noisy neighbour" problem. You can run custom hardware like fancy GPUs or giant RAID arrays and you can guarantee that your servers are physically and topologically close to each other. You can do things like have a separate dedicated network card with a direct gigabit uplink between your app servers and caches. But the downside is that you have to do all of that stuff. Ordering hardware takes a lot of time so you have to do proper capacity planning. Order too many and now you're stuck with them. You can't scale up and down with load. You'll need to tune your databases and run your own DHCP and solve all of that common networking stuff time and time again.

On a hosted provider you're paying someone else to do all of that. Providers like AWS can host things like RDS or elasticache for you so you don't have to be an expert DBA to get a performant database. You don't ever have to put more RAM in a machine because you can just upgrade to the next instance size. You don't have to sell off your obsolete hardware, you just terminate the instances. You don't ever have to drive down to the datacentre to swap out a failed hard drive, you just delete the EBS and make a new one (but you do still have to plan for them failing, and even more so). Of course the downside is that now AWS is doing all of that stuff for you so you don't have a lot of control over it. If an RDS is misbehaving there's nothing you can really do to fix it, it's all opaque to you. You can't have custom configuration, you can't flush the query cache, and you can't go and look at it to see which warning light is blinking on the front of the machine or read the dmesg to see whether it's using the right driver. Paying AWS to do this stuff for you is also really expensive, and dealing with their support when something on their end is broken can be very, very painful.

If you can do it better and cheaper on the team you have, do that instead. But on a small team, I bet you can't.

I'll add 2 more hidden gotchas to on-prem DIY: power and HVAC. Depending on the size of your infrastructure, how fast you need to grow, and how well you plan, power and HVAC can sneak up on you and bite you in the ass. I have been in too many situations where all the purchasing decisions, project deadlines, and customer commitments were made only to realize that the UPS would be at 120% capacity, the generator can't handle the full load, there aren't enough outlets, and there isn't enough cooling to keep the massive new infrastructure at recommended operating temperatures... Or they skate just under those limits and the moment a surge hits, that fancy generator is pumping power into a dead UPS... or an AC unit goes down and causes a cascading failure as the others try to pick up the slack. 130-degree datacenters suck...


Definitely. I've mostly dealt with colos rather than fully on-prem but I'm sure that has a whole range of different headaches.

reddit's old colo was around the corner from the then-famous 365 Main facility that housed most of the companies you've ever heard of. The massive HVAC couldn't keep up with SF summers, so when the summer hit peak heat we would start to lose servers to thermal shutdown. We could tell when it was happening because they would shut down in order: the physically highest first and the others in turn. I remember considering moving the less vital ones to the top, but I don't think we ever got around to it.

One company I worked for with real on-prem had a server closet in a converted bathroom. (That wasn't the problem with it; it was a good conversion but a little funny to have tile walls and a faucet.) The problem was that the HVAC couldn't keep up with the heat load. I think the HVAC equipment itself was okay but the external venting couldn't support the air flow. For a long time we had to leave the door open with a box fan pointing out. It always sounded like WWWHHHHIIIIIRRRRRRRRR so the desks near it emptied out pretty quickly.


I may be biased, but I agree with the tldr. When your company grows, and you realize you need some servers in Asia, it becomes a problem. Suddenly you need to buy more hardware, find a DC, hire staff overseas, comply with additional regulation, etc. And that's all on top of the technical challenges of running multiple DCs in different geographical areas.

u/anttirt

Providers like AWS can host things like RDS or elasticache for you so you don't have to be an expert DBA to get a performant database.

Just a note on this, RDS runs on default settings and tuning them is up to you. You might still need that DBA (or at least have someone learn to be one part-time).

The other stuff like turnkey multi-az failover is just a blessing though.


u/Chii

What we need is more commoditization of utility computing - I hear that Amazon makes amazing margins on their AWS prices. In my books, that shouldn't be allowed, and only competition can prevent it. At the moment, Google and Microsoft are trying to get into the game, but AWS has a de facto monopoly over this (which is why they can charge through the nose, and you can't do anything but pay).

What if there's some kind of standard, so that your software stack can seamlessly be migrated between these utility services?

u/featherfooted

In my books, that shouldn't be allowed, and only competition can prevent it.

I'll believe it when I see it. Microsoft Azure is coming closer, largely helped by the monumental decision to support Linux in 2015. Meanwhile, Google Cloud remains a lolcow.

AWS enjoys its huge market share and margins because the competition sucks. The other products aren't as good, aren't as reliable, and aren't cheap enough to warrant jumping ship.

u/jacques_chester

What if there's some kind of standard, so that your software stack can seamlessly be migrated between these utility services?

What you do here is pick a platform to abstract away the infrastructure provider. I'm most familiar with Cloud Foundry, which can be run on AWS, GCP, Azure, OpenStack, whatever VMWare calls theirs this month, RackHD and I forget what else.

I work for Pivotal, which does a lot of the work on Cloud Foundry. We run large clouds with it on AWS, GCP and Azure ourselves. We have customers who have migrated from one provider to another. As I understand it, the stickiest bit is usually when you have apps that hardcode paths to blobstores and databases, which they're not supposed to do; that's what service binding is for.

k8s runs on a bunch of infra too and is the building block for platforms like Deis and OpenShift.

Having dealt with each of AWS, GCP and Azure, my personal favourite is GCP. It is cheaper, faster and easier to use.

Well, there is. For cloud container stuff, there’s Kubernetes, and you can always rent dedicated servers (2-3 OoM cheaper than AWS) and run K8s on them.

u/robertgentel

Nearly always cloud. Especially in the beginning, when you are not yet sure what you will even need (and when buying the hardware outright means incurring the cost up front).

Something I personally can recommend is renting dedicated servers and throwing Kubernetes at them. A lot cheaper than actually using EC2 instances, and you don’t have to deal with most issues either.


Google has a pre-set-up Kubernetes though. No need to touch the masters. Throw your deployments at it and go. Free for up to 5 nodes. Five 32-core nodes can get a new business pretty stinking far.

u/badmonkey0001

find every network call in a loop (vs batched calls)

No persistent connections? Or was that eating up connection bandwidth with a big pile of tiny transactions?


We keep persistent connections, so it's not the 3-way handshake that's the issue. It's just the regular old packet round trip time.

For instance, let's say you want the rendered versions of links A, B, and C. You can send requests that look like this:

link_a = cache.get('A')
link_b = cache.get('B')
link_c = cache.get('C')

So that turns into network traffic that looks like:

link_a = cache.get('A')  |  > GET A
                         |  < data for A
link_b = cache.get('B')  |  > GET B
                         |  < data for B
link_c = cache.get('C')  |  > GET C
                         |  < data for C

But what's invisible here is that you have to flush the socket in between each of those lines for the result to be in the link_a variable before you go on to get link_b. That means that you're sending one packet containing len("GET A\n") bytes, waiting for it to go through the OS buffers and go out the wire and for them to send you an ACK and then for them to send you a len("data for A\n")-byte response and for you to send them an ACK back again. Then you do all of that two more times for the other two items. Doing the requests async/in parallel helps a little but still carries huge overhead: the TCP packet header alone is ~20 bytes and our example requests are only 6!

On a really fast network all of this additional overhead probably doesn't matter much but if you're on a shared wire with thousands of other AWS customers and your packets go through a bunch of switches and whatever network virtualisation AWS runs, it adds a huge amount of latency even if the throughput is the same.

Instead what you'd rather do is a flow more like:

cache.get_multi(['A', 'B', 'C'])  |  > GET A B C
                                  |  < data for A
                                  |  < data for B
                                  |  < data for C

Then you send one packet, and receive one packet, instead of doing a bunch of unnecessary round trips.
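To put rough, purely illustrative numbers on it (assumptions of mine, not measurements from the migration): even a modest per-round-trip latency dominates once you multiply it by every item on a page.

# Back-of-the-envelope arithmetic with made-up numbers.
round_trip_ms = 0.5          # assumed network round trip per request
items = 1000                 # cache keys needed to render a listing

sequential_ms = items * round_trip_ms   # one round trip per cache.get()
batched_ms = 1 * round_trip_ms          # one round trip for a single get_multi()

print(sequential_ms)   # 500.0 ms spent just waiting on the network
print(batched_ms)      # 0.5 ms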

The thing is, the naive version of it with all of the round trips is really easy to write without realising it. You've probably written something like this in an ORM before

class MyOrm:
    @classmethod
    def get_thing(cls, key):
        cached = cache.get(key)
        if cached:
            return cached
        return cls.get_from_db(key)

So without having a batch interface, you have to call it like

def do_whatever(keys):
    for key in keys:
        thing = MyOrm.get_thing(key)
        thing.do_something(...)

do_whatever here is probably talking about things in terms of your application. In the case of reddit, it's all about Links and Votes and Comments. Network particulars just aren't at the forefront of your mind at that point. Even if they were, the details are non-obvious: they're all hidden down the call stack, far away from where you're thinking of things at that level.

Writing it with batching to start isn't any harder; what's harder is turning all of your loops inside out in a program that's already pervasively very loopy. A good rule of thumb is that if your function (or any of its callees) can do I/O, it should probably take a list instead of a single item. It's similar to making matrix/scientific code easier to vectorise.
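As a sketch of what the batched shape could look like (hypothetical helper names; I'm assuming the cache client exposes memcached-style get_multi/set_multi and that a batched DB lookup exists to fall back to):

class MyOrm:
    @classmethod
    def get_things(cls, keys):
        # One cache round trip for all keys (get_multi assumed to exist on
        # the cache client, as it does for memcached-style clients).
        found = cache.get_multi(keys)
        missing = [k for k in keys if k not in found]
        if missing:
            # One DB query for everything the cache missed, e.g. WHERE key IN (...).
            from_db = cls.get_from_db_multi(missing)   # hypothetical batched lookup
            cache.set_multi(from_db)
            found.update(from_db)
        return [found[k] for k in keys]

def do_whatever(keys):
    # The loop now iterates over already-fetched objects; no I/O per iteration.
    for thing in MyOrm.get_things(keys):
        thing.do_something()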

A very similar thing happens with DB transactions, as you alluded to. In something like MySQL or Postgres, UPDATE/INSERT is actually pretty cheap; it's the COMMIT that's expensive. So lots of tiny transactions have a very similar effect.
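A hedged sketch of the same idea on the write side, assuming psycopg2 and a made-up votes table; the point is just that the COMMIT cost gets paid once instead of once per row:

import psycopg2

conn = psycopg2.connect("dbname=example")   # placeholder connection string

def save_votes_one_by_one(votes):
    # N transactions: every iteration pays a full COMMIT.
    for link_id, direction in votes:
        with conn:                           # commits when the block exits
            with conn.cursor() as cur:
                cur.execute(
                    "INSERT INTO votes (link_id, direction) VALUES (%s, %s)",
                    (link_id, direction),
                )

def save_votes_batched(votes):
    # One transaction: many cheap INSERTs, a single expensive COMMIT.
    with conn:
        with conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO votes (link_id, direction) VALUES (%s, %s)",
                votes,
            )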

u/badmonkey0001

UPDATE/INSERT is actually pretty cheap, it's the COMMIT that's expensive.

Ah, the joy of record/table locking. Great answer! Thanks for the detail.

In something like MySQL or Postgres UPDATE/INSERT is actually pretty cheap, it's the COMMIT that's expensive.

Postgres partially mitigates that due to the way it handles MVCC, where the Vacuum is the expensive part, and actual transactions are basically free. But yes, with MySQL this is a major issue.

u/beefsack

Latency really highlights inefficient database querying. I've actually started simulating fairly heavy latency in dev environments from the beginning of projects, usually on par with AWS cross AZ latency (this is a reasonable latency to expect in high availability applications.)

Starting around then we started printing every SQL query we run in development mode and that made a big difference at the time. If it adds enough log spam that you can't find what you're looking for, you're probably making too many queries.
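If your stack happens to be SQLAlchemy-based (an assumption on my part, not a claim about reddit's setup), getting that kind of development query log is close to a one-liner:

import logging
from sqlalchemy import create_engine

# Print every statement the engine executes; noisy on purpose in dev.
logging.basicConfig()
logging.getLogger("sqlalchemy.engine").setLevel(logging.INFO)

# echo=True is the per-engine equivalent of the logger setting above.
engine = create_engine("postgresql://localhost/example_dev", echo=True)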


OH man, I used to play bzflag a lot. I wonder how much that's still active.

An unnamed member of the team had secretly modified his bzflag game client to show him where the best weapons were on the map. He also seemed to win every round. A coincidence, I'm sure

Almost certainly. Can't imagine why that would help.

Related: I found that it still exists and has some small number of people playing. I also learned that I am still awful at it.


Since then the pile of the servers has migrated multiple times from office to office and I believe are currently in a garage.

I'm sure r/homelab would be very happy to take those off your hands

u/ameoba

Shit, I've worked places where importing a Dev db was an overnight job. Doing that on prod ain't bad.


The move to AWS was in 2009. I started here a year after that, so I'll let others answer that part of the question!

u/kemitche

deploy timings are down to 7 minutes for around 800 servers despite the extra waiting for safety.

Hot diggity, that warms my heart to hear that speed has been kept reasonable even with more services and servers!

I still remember the time I accidentally pasted the deploy password into IRC...

Yeah! Gotta keep rollin'. Hope things are going well!

u/kemitche

Sure are - miss y'all though! Sounds like you're still having a good time :)

cough don't know if you saw the hiring thing. nudge nudge wink wink know-what-I-mean

You get a punch card when you come back. Free small sundae on third rehire!

u/rram

You've never given me a small sundae!

how did you hear about the sundae??

u/kemitche

It's tempting. There's something seductive about reddit...

u/rram

It's me

u/rram

Hello Mr. Spicy. We meet again.

u/kemitche

And now we get drinks, right?

u/rram

I haven't stopped drinking since last time


I still remember the time I accidentally pasted the deploy password into IRC...

You guys didn't keep them in source control at first, have a bunch of meetings about managing secrets, and then never do anything and stick with source control?!


u/jacques_chester

More modern platforms make it easier to limit the blast radius of misbehaving app code, which makes operations more comfortable with unblocking the gate. I've seen it with Cloud Foundry because that's what I know best, but the dynamic is the same for any of the major platforms.

I work for Pivotal. We have customers who went from thinking about months-per-deployment to deployments-per-hour.

u/Freakin_A

Recent Pivotal Cloud Foundry customer at a large enterprise. We went from production deploys taking 6 months and 70+ steps to multiple deploys per day. Capacity adds took just as long before but now we can autoscale up and down on a daily basis.


Hey, we run a multi-million-dollar business and still deploy more than 60 times a week.

u/Freakin_A

At Etsy, one of their mandates is that all new developers will write code and deploy to production on their first day.


Probably with more of an emphasis on not being totally fucking stupid, like that one poor guy's place was.

u/Freakin_A

That company must be filled with morons. I felt so bad for that guy; he was blamed and fired for their incompetence.


Yep we used to do this but it's not so practical now


"It’s important to keep an eye on where you are evolving to so that you keep moving in a useful direction."


Thanks!

u/damiankw

TIL reddit uses an IRC bot to manage internal systems, just like I do!

If it works, it works!


Hey, it's a bit better now! :P

u/senatorpjt

I'm just curious what's in these 200 commits a week, because the site isn't perceptibly different to me than it was nine years ago.


Take a peek in r/changelog to see some of the stuff going on. Plenty of projects happening, but not all of them are gonna jump out at you on the front page like if we added Clippy or something. There's also a fair amount that has to happen behind the scenes just to keep a site like this going as it grows: both in infrastructure and to support the community.

It looks like you've asked for upvotes, would you like me to delete your comment?

~ ClippySnu


They are likely committing every atomic unit of change. So, a typo fix is a commit as much as a feature is and a feature may have a huge number of commits before being made live.


We do indeed try to keep distinct changes in separate commits, but the numbers in the blog post are about the number of deploys, each of which may be shipping many commits of changes. But yeah, a single feature might be made up of many smaller deploys to get pieces out in feature-flagged or A/B test form.
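For anyone unfamiliar with the pattern, here's a bare-bones sketch of shipping code dark behind a flag with a percentage rollout. The names and the flag store are hypothetical; reddit's real experiments system is of course more involved.

import hashlib

# Hypothetical flag store; in practice this would live in config or an
# experiments service, not a hard-coded dict.
FEATURE_FLAGS = {
    "new_profile_page": {"enabled": True, "rollout_percent": 10},
}

def feature_enabled(name, user_id):
    flag = FEATURE_FLAGS.get(name)
    if not flag or not flag["enabled"]:
        return False
    # Stable bucketing: a given user always lands in the same bucket.
    bucket = int(hashlib.sha1(f"{name}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < flag["rollout_percent"]

def render_profile(user_id):
    if feature_enabled("new_profile_page", user_id):
        return "new profile page"    # placeholder for the gated code path
    return "old profile page"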


Are you in management by any chance?

u/senatorpjt

No, but I'm old and don't like change, so it makes sense that the answer seems to be the mobile site/apps, which I don't use. What's weird is that since day one people have been complaining about how the search sucks, and they still do.

There's so much to Reddit under the hood that you and I don't know about. I'm gonna throw out a random number and say 90% of the code is probably agnostic to whether it's used by the mobile app, website, or web app. There's an insane amount of work that has to go towards maintaining a site of this scale.

We're also not even considering things like event tracking, A/B testing, performance improvements, site reliability improvements, etc. You can't just launch a website at the scale of Reddit and walk away and let it collect money. It's a very full time operation.


I read this hoping I could sing a song for you guys (because I want to work for reddit), but I don't really get it. Great work, though.

I applied last year and meant it. My resume was no-joke built for the positions (both Brand Strategist and Senior Designer). You all never got back to me :( I assumed I wasn't a good fit.

u/sawyerwelden

Rip. I understand when companies reject someone, but it sucks not getting a notification of rejection or a notification that the position has been filled.


An admin giving an admin gold is cheating.

u/gooeyblob

Is it?

Damn you.


Would you ever consider offering remote work positions?

u/gooeyblob

Yep! Depends on the position though. Doesn't hurt to apply!


u/gooeyblob

I'd say again, depends on the position quite a bit, as well as your experience and many other factors. Please apply if you'd like to get that conversation started!

u/Missionmojo

This is why I haven't applied. This is the work I do in my current position.

u/media_guru

Not a fan of remote devs, eh?

u/gooeyblob

Not true! It depends on the position, but we're more open to remote work these days than in the past.

u/jaxspider

Ah yes, that time reddit gave the ultimatum to move to SF or get fired.

Pepperidge Farm remembers.

u/gooeyblob

No one's denying that or trying to rewrite history (that's how I ended up in San Francisco, in fact), but it's a much different company than it was at that time. I feel pretty confident that we're able to support remote work much better these days, and we've already had a few folks working remotely for some time.

u/djolord

I love this article. I'm currently trying to come up to speed with AWS and understand the options when setting up the various services and interactions. Being the only SW engineer employee at my company I am also the de facto DevOps guy and I'm struggling with the "proper" way to make the best use of what AWS has to offer.

Would you be willing to share the general service layout you use in AWS or give some guidance? Thanks in advance!

Thanks! We've done a few AMAs in the past that might give more broad-scope info on how we do things here https://www.reddit.com/r/sysadmin/comments/3h0o7u/were_reddits_ops_team_aua/.

u/djolord

Awesome! Thanks for sharing.

No prob! Also, whoops that's a pretty old one. Here's a more recent AMA: https://www.reddit.com/r/sysadmin/comments/57ien6/were_reddits_infraops_team_ask_us_anything/

u/tarxzf

I haven't used this personally, but you may find AWS's CloudFormation useful: https://aws.amazon.com/cloudformation/?hp=tile&so-exp=below

u/tHEbigtHEb

I'm sort of in the same position as you (sole back-end guy) consulting with a couple of companies, and I found Ansible to be a good system for managing deployments. I can write a playbook for one app and use it for others easily. It takes away all the hairiness from deployments, especially for multiple hosts. You should check it out; the documentation is really great and there are many tutorials too. Hit me up if you've got any doubts.


u/cloudadmin

Spinnaker looks great. If you don't mind me asking, what does it buy you over using Jenkins Pipelines? I'm not quite sure I see the difference in use cases between the two tools.


Thanks for your hard work! We just recently switched to Spinnaker at work and it's been so much better compared to what we used to use (Salt)

u/seesure

Curious about the reasons why you've not gone the blue/green deploy route. Can you comment on this?

We do no-downtime deployments at the process level with Gunicorn and Einhorn. There are no immediate benefits to us doing them at the infrastructure layer.

u/jacques_chester

Which to me would count as a blue/green: start the replacement, wait for liveness, redirect, drain, kill the old.

I think the details depend a little bit on how happy you are with accepting the process as the unit of isolation. You can get the same speed of replacement with a general PaaS like OpenShift or Cloud Foundry, you can get heavier VM- or raw metal-level guarantees from a tool like BOSH.

Disclosure: I work on Cloud Foundry, though not on BOSH.


Blue green can be annoying at scale. Start 1000 new servers? Why?


Sounds a lot like what I've been working on for the last 10 years. I develop on and manage a team of Python developers doing backend work for a major cloud company. Among other things, I wrote the script that upgrades our production environments. Our upgrade process deploys code to over 1000 servers in various datacenters all around the world. The deployment is a little complicated due to security and whatnot. Two years ago we found the key to zero downtime on a WSGI stack. Since then, we can make a build, deploy changes, and restart all the servers in about 20 minutes with no disruptions.

The biggest improvement we've made in the last couple of years is integrating our development process with both Gerrit and Jenkins. This let us streamline our deployment process, allowing us to check in code and get it reviewed, QA'ed, and into a deployable state in a matter of hours. I'd like to optimize it even more, but at this point it's not even worth trying to improve the system due to all the approval processes and bureaucracy we have to deal with.

I'm interested to know if you use any sort of git management software or branching workflow for your developers?

Thanks for the quick rundown of your system. That'd be cool to hear more about.

We use GitHub for most of our dev work, including using its review systems (with some chatbot stuff overlayed on it to keep everyone in the loop). The flow is a pretty standard fork-and-branch model where people fork the source repos, push topic branches to their forks, and pull request onto master of the source. CI is currently done with Drone.

Do you use any code analysis tools like pylint or pychecker in your pipeline?

Yup! flake8 and pylint feature pretty heavily. You can see the template we use for a new, empty, service here: https://github.com/reddit/baseplate-cookiecutter/tree/master/baseplate_cookiecutter/%7B%7Bcookiecutter.project_slug%7D%7D

u/scalator2

So, are you guys happy with the current deployment stack (rollingpin, Einhorn, Gunicorn, etc.)? Did you look into container orchestrators like Swarm, Kubernetes, or Mesos?


There are already teams at reddit that use Mesos for orchestration. We're also exploring containers with Kubernetes and carefully moving in that direction.

Our current stack is stable and we mostly don't get woken up in the middle of the night. We are fairly happy with it, but there is always room for improvement. As reddit grows, we need to provide tools to maintain our velocity while making sure we're not compromising the security or stability of our services. Our infrastructure will evolve to support that.

u/NovaP

K8s is fun. Always good to know you can hit one of the servers with a sledgehammer and everything still works

u/jacques_chester

My understanding is that upgrading etcd is the main hairy part.

u/NovaP

Using kubectl makes it a lot easier. I've got 3 servers running a K8s cluster to mess around with it. Minecraft is my stand in application.

Edit: sorry I meant kubeadm.

u/jacques_chester

Yeah, I can't say I know. I may be conflating what I've heard about Kubo from Google and Pivotal engineers with what I've heard at Pivotal about etcd and consul.

tl;dr it hasn't been kittens and unicorns for us.


Hey an admin I haven't seen before. May I ask what your role is at reddit?


We're definitely checking out different options, no set plan at the moment for where the production infrastructure's going next. There are definitely things that suck about this system but it does generally do its job at getting code onto servers in a reasonable timeframe. The main problems we're focusing on now are more about what we can do to help the growing engineering team get more stuff out the door quickly and safely.

u/atrommer

Fantastic post. How do you manage automated QA coverage in this pipeline?

Also, how are changes to the data model pushed out safely as part of releases?

How do you manage automated QA coverage in this pipeline?

This is something we're trying to get better at. We currently have Sentry for error reporting and Drone for CI but the monolith is pretty complex and doesn't have a terribly great test suite so a lot of it is just trying your best and reverting if you made a mistake. We're pushing for better test coverage in the newer generation of services and making sure it's all checked out in CI to help keep up quality as we go.

Also, how are changes to the data model pushed out safely as part of releases?

The main data models are vaguely schemaless (they have a fixed schema that allows for a lot of flexibility) so actual schema changes are pretty rare. The most common schema operation is the addition of a new table in our Cassandra clusters, in which case schema migration isn't a problem because it's new.

u/doterobcn

I loved the article, it provides a good picture, without getting into too many tech details. Fantastic!

u/rad_as_hell

Have you considered third party offerings? Since you're in AWS, did you take a look at CodeDeploy?

u/SikhGamer

Every engineer writes code, gets it reviewed, checks it in, and rolls it out to production regularly. This happens as often as 200 times each week and a deploy usually takes fewer than 10 minutes end-to-end.

Awesome. We can get code to live in about fifteen minutes. We are just doubling the amount of production/live servers we have. Right now our deploys don't really scale. It is something we are currently looking at solving.

Over time, the number of servers needed to serve peak traffic grew. This meant that deploys took longer and longer. At its worst, a normal deploy took close to an hour. This was not good.

This is where we are at right now.

Really great blog post. Thanks u/spladug & u/foklepoint!


Waiting for the update when Reddit is 100% hosted on AWS Lambda

u/rad_as_hell

Play with the math on cost per execution time for EC2 and Lambda, while thinking about what kinds of execution models EC2 and Lambda excel at :)


Interesting post. Why did you switch from uWSGI to Gunicorn?

But nobody talks about automatic database deployments.


Oh some people do... but you should run screaming. Stateful services are a whole different thing.

u/frymaster

Side note, I hate these blog sites where all you see on the whole first screen is a (very blurry) image. I closed it twice because I thought I'd accidentally clicked on a random image link instead of the blog post.

u/xiongchiamiov

This was originally done with an eye to being more open source-friendly, but it ended up being very useful shortly after.

A good example of how designing for open-source leads to better-designed software. It does take a bit more work, but I really believe it's worth it.

And for something like a deploy tool, hey, every company should be open-sourcing that; there's no advantage you're giving your competitors.

u/ViralInfection

No configuration management tools like Chef or Puppet? Eeep


We use Puppet for configuration management. This post was focused solely on application code deployments. You can read a bit more about the rest of our infrastructure here.

u/ViralInfection

👍 Thanks


We're on v3.4 so it's pretty rough. It gets the job done though. The language looks like it got a lot nicer in newer versions. I do like the declarative model for configuration management.



I'm curious - it seems you still use git to update your servers. Have you considered bundling all your code and dependencies into an artifact to speed up deployments?

This has been talked about within the team. We do want to move to an artifact-based build system. One of the biggest advantages is that rollbacks are faster because you don't have to repeat the work of doing a build again.

We use Debian packages for most service dependencies and those are managed by our config management system (Puppet). These usually aren't in the scope of a normal deploy.

u/solatic

It seems like a lot of issues you guys had revolved around never having packaged the Reddit monolith, and reinventing the wheel inside your deployment tooling to make up for that.

  • Adding new servers? Just point them to your private Reddit package repo, no need to add them to the list of hosts in push.

  • Bad deploy? Revert to the older version in the package repo. Because the package fully defines how to safely uninstall, the package manager takes care of everything for you.

  • Packages must contain versioning info, so there's never any confusion over what "master" is

  • Rolling custom deployment scripts to deal with speed and orchestration issues - again, taken care of by proper package management

Containers are a type of packaging, so if Reddit engineering is going down the container path, then it'll help to solve the problem... but these problems were solved long before containers became the new way to package software.

u/jacques_chester

I kinda blame Heroku for misleading a generation of devs into thinking that version control is the same as deployment.

Adding to the confusion is that for Ruby, their initial focus, the unit of source and the unit of execution are the same thing. So too for other scripting languages. Whereas ecosystems like Java, .NET, C/C++ and Golang have the distinction and consequently have different mental models for "where does the source live?" and "where does the executable live?"

(I thought about this a lot when I worked on Cloud Foundry Buildpacks)

u/ay90

What about testing before deploying?

I wrote a little about that over here.

u/AnimalFarmPig

Assuming Reddit is still built on Python, why are you even bothering to push code to multiple servers? Just have each server mount a shared volume containing the application code (over NFS). When you want to deploy, update the code (once) on the shared volume.

Having a single instance of state rather than copies of state on every single application node solves the problem of "what if new nodes come online and we don't know to push the new code to them?" Any new node that starts will always run the most recent code.


Hi, I don't work for reddit but I work for a pretty large website. We used to use NFS to host our codebase but we moved away from it for a few reasons:

  • Performance is terrible. NFS introduces extra latency since it runs over the network, not a dedicated storage network.

  • It's a single point of failure. NFS can be clustered but it is still hard to make it highly available. When it goes down, your whole website is down.

  • Changes can't be rolled out gradually. Gradual rollout is useful, as pointed out in the blog post, to help identify errors early and roll back.

u/rad_as_hell

Sounds like a wonderful single point of failure. NFS goes down, the process dies, manager tries to restart and finds there isn't any code to run anymore.

u/wangofchung

What if there's a bug in the code?

u/DrakesOnAPlane

Gotta be honest, I don't get any of this yet, but I'm learning on my own from the ground up, and this all seems extremely AMAZING. This must be extremely hard, and we all take this for granted. Much appreciated, Reddit Team!


(paraphrased summary of question since it was accidentally deleted)

why deploy via git+ssh rather than building images and launching new servers?

We considered switching to an AMI based workflow a few times over the years, but 1) it would be a big change from how it already worked so we'd have to have a good reason to do it, and 2) building an image and launching instances based on it takes longer than what we were doing so we didn't see much benefit.

u/Quteness

I accidentally deleted the wrong comment and deleted the parent. Thanks for the answer!

u/Missionmojo

Ya, the time to build the image with Packer would cost a bit, but the benefit is nice. That immutable image is easy to work with as a build artifact passed through to higher environments. It also makes scaling with an ASG nice, and reverting to a previous version is fast (because you already have the image).

u/seriouslulz

Have you ever considered Fabric?

u/Missionmojo

So my main job is working on our AWS deployment framework. We support the deployment of hundreds of microservices using AWS Simple Workflow Service + CloudFormation. Every microservice lives in an ASG. We use immutable AMIs and configuration management on the hosts. So all the config management, alerts, and infrastructure are in SCM beside the code and go through the same code review process. I like the blog post.

u/piratemurray

The big issues facing us today are twofold: improving engineer autonomy while maintaining system security in the production infrastructure, and evolving a safety net for engineers to deploy quickly with confidence.

Would love to hear more about this. Can we get a sneak peek?

I feel developer autonomy comes with trust in both the developer and the system. I want to trust that the developer doesn't do batshit insane things. I also want to trust that the system around them doesn't allow them to blindly make bad decisions without catching it early. Where is that balance though? Especially given the security angle?


Curious as to why you guys didn't consider any famous config management (puppet/chef/ansible etc) in your stack. Why the rewrite?

Also, Gunicorn has graceful restarts, as well as max_requests and jitter to automatically recycle workers.
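For reference, those knobs live in Gunicorn's ordinary Python config file; a minimal sketch (the numbers are placeholders, not anyone's production settings):

# gunicorn.conf.py -- used via: gunicorn -c gunicorn.conf.py app:application
bind = "127.0.0.1:8000"
workers = 4

# Recycle each worker after roughly this many requests...
max_requests = 1000
# ...with jitter so the workers don't all restart at the same moment.
max_requests_jitter = 100

# How long in-flight requests get to finish during a graceful restart.
graceful_timeout = 30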

u/african_cheetah

git push $h:/home/reddit/reddit master

should be git pull

That command was run on the local machine (where deploys were run from) so it did indeed do a git push.

u/the_evergrowing_fool

That website design is completely misleading. I thought the whole post was that top header screenshot for a minute, until I read the title again and my curiosity made me notice the scroll bar and scroll down.

Pretty cool story, keep it up. I'd be interested to know the customer-facing impact: I remember maintenance hours, before, and fewer of them now. How has that changed over time with the evolution?

If you ever want to take those 800 servers down to, oh, 28 or so (about 1/30th as many), you know what technology to look at

Incidentally, we also switched from uWSGI to Gunicorn around this time for various reasons. This didn’t really make a difference as far as deploys are concerned.

Why? Do you consider uWSGI inferior?

u/kwekie1993

Seems like a tough read for me who's not that familiar with this field. Is this considered to be under DevOps? So I know what to read up on to learn more.
