* No, using a colo saves you way more than just 20%. In one of our facilities, our org has maybe ~15k servers under management and dc costs are ~ 1 million/mo. Build out of the cages and racks wasn't incredibly expensive, I think ~2 million. Power isn't insane either.
* I have no idea what he's talking about re: the fiber statement. We have a blend of different backbone providers that give us ~100 gigabit for ~$0.60 per gig. Compared to bandwidth costs for AWS, it's dirt cheap.
* You don't need ultra fast hot swappable robotic arms if you're not FAANG. With a handful of dc techs and a simple monitoring system, you can swap out disks just as easily, at the expense of a little more opex.
* PCI requires no consultants, its terms are pretty black and white - you will need an ISA though. HIPAA can be more of a bear, but your org will need this consultation regardless of whether you own a dc or put it in the cloud. AWS does not give you automatic compliance.
* Yes I can totally believe that working at Google, you'd give TCOs that make Google look good.
Literally none of us except Lyft's operations knows what's good for Lyft. 100 million a year is a ton of cash, but gives them presence around the world. My company has datacenters around the world and it is very operationally complex to maintain them. It's definitely not for everyone, but this guy is making it sound worse than it actually is.
I guess what bothers me the most about the original tweet(s) is the false dichotomy of "build your own datacenter" or "use AWS". How about co-locating at one of the many, many Tier 3 certified datacenters all over the world? There are tons of providers with locations all over the country and world. Or beyond that, what about letting someone else (e.g. Compass Data Centers) build a datacenter for you?
Very true. For example, we're running most of our log aggregation and time series handling on rented dedicated servers. It took more work and skill to get to work compared to something like AWS elasticsearch, but the setup is dirt-cheap for what it does including redundancy in case of hardware failures. Or, my last employer used to run a lot of their data warehousing and BI on rented physical systems for similar reasons - high IO demand, built-in redundancy.
There are many steps you can take before buying concrete to build a DC. Even before you have to buy your own rack.
And to build a ride sharing company. First Lyft should have built an expertise in infrastructure? Should they also learn how to build cars too, because it might save them money in the long run?
And even with current expenses, building out a core competency in infrastructure that has nothing to do with your domain of expertise can be a distraction.
Reductio ad absurdum: Lyft is a taxi dispatch company. Outsource the mobile and web apps, send customer service offshore, drivers/cars (err, never mind), outsource analytics to Hortonworks, Cloudera, or an ML consultancy. Done; none of those are core.
I have no idea if that’s how they operate or not or if you’re being facetious, but that’s a very valid business model. I know companies that did something similar starting out and then brought the expertise in house.
In fact, I was offered a job as a “digital transformation consultant”/“enterprise architect” to be the face of a company that did just that - take over a complete project for funded startups or other companies who had an idea but no in house expertise. Of course I would be the face of the company where most of the work would be “rural sourced” and outsourced.
> First Lyft should have built an expertise in infrastructure?
The alternative being that they should have built an expertise in using AWS instead?
I'm not particularly trying to argue either way; I'm just saying that claiming that using AWS doesn't need expertise is a false premise. AWS isn't special here, either.
If instead you use some sort of abstraction and/or tooling, now you need expertise in that tooling.
You only need a few people on prem with AWS expertise. You would be surprised how much infrastructure and management grunt work you can outsource to cheaper managed service providers who can then outsource some of their grunt work overseas.
I think you can make that argument when you're just starting out. But when you're an established company at Lyft's size... yeah, you can have some expertise in infrastructure.
It might make some sense for Lyft to wait for their scale to stabilise so that they don't waste money investing in infrastructure that is at the wrong scale.
Because they convinced VCs it was a good idea? But a company losing billions of dollars a year is not exactly a great model for sound business practices....
It depends too on your workload and how efficient your software is. If you have a bursty or unpredictable workload then you’re buying servers for your peak, and paying for wasted hardware. Every time I’ve done or seen the calculations in those circumstances it never makes sense to buy. Bursting into the cloud comes up, but that gets complicated very fast too.
Agreed, this guy makes no sense at all. Why would anyone build their own data center? Talk about going to the extreme; did he forget you can also rent space inside of data centers???
It's better to rent a datacenter than to build one for companies that reach a certain threshold. At a certain amount of load AWS becomes too limited and expensive. Certainly not true for the typical AWS sites or companies.
Even Amazon uses their own separate, dedicated systems for a lot of stuff - not public AWS I heard.
Any company that owns the infrastructure management tech like AWS would be stupid not to use it themselves. It's likely that what you heard was a misstatement—maybe they use their own isolated instances of the AWS tech, possibly to dogfood new features.
I think it's because of base compensation of employees, which still holds true, because Microsoft pays way less than FANG.
Microsoft is the most valuable company in the world. When it comes to stock, Facebook probably would not be in FANG anymore because of their drastic stock drop over the last year.
It’s been used in finance (as well as FANG) for a long time at this point. Often investors buy all of these companies at the same time to reduce volatility and get exposure to big tech.
Maybe I read the OP wrong but I've never heard of building out and/or holding data center real estate assets. I worked for a large data center and even they leased their buildings. Their customers included government, big banks, large tech (even FAANG level) and they either colo'd or leased space.
There are dedicated REITs that hold only high-tech/data center real estate.
No, going to a doctor's appointment isn't what an ambulance is for. The comment specifically said "non-urgent". But when patients miss appointments because they can't get there or afford $20 for an Uber, the health system loses a lot more than $20, so it's worth it financially for them to partner with Uber/Lyft to get patients to appointments.
Ah yes, I guess the specific Uber program mentioned is for non-urgent cases, I apologize. But anecdotally I know of many cases where Uber is being used instead of ambulances.
I understand the economic principle of increasing revenue, and I agree that this partnership is worth it financially; but honestly, who the fuck cares about the health care system not getting revenue?
Providing transportation for patients is so that __people can get healthcare__! A hospital/HCSP not making a few hundred/thousand is completely insignificant when compared to the real problem: People can't get healthcare or can't get to it.
I just take issue with this viewpoint where the concern for the dollar is placed before people's needs.
Ambulances are absolutely not for non-urgent, non-critical situations and being able to auto-schedule a ride at time of appointment creation is a positive addition.
All this sounds like a distraction for a ride sharing company. They aren’t an infrastructure company. Why would they want to both try to build out a ride sharing and infrastructure at the same time? Why fight a battle on two fronts?
On top of that you have to hire people to manage it.
I don't think the title is a fair representation of what's going on. Hear me out.
The way it's phrased, it's like each ride directly translates into 14 more cents for Amazon. But really, the figure comes from averaging Lyft's Amazon costs over Lyft's total rides.
Now, in some cases, that would be valid accounting, if all (or nearly all) of your costs (with respect to that service) are variable. For example, most of the cost of sending a letter is the truck/plane/employee, which are variable, so it makes sense to divide total letters over total truck etc costs.
But would Lyft pay twice as much to Amazon if they served twice as many rides? Would they have to double every AWS service they're buying? I don't think so. They might have to add more of one or two kinds of server/services, but most of that 14 cents figure comes from amortizing fixed costs, which will go down if they scale up further.
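To make the averaging point concrete, here's a rough sketch. The fixed/variable split below is entirely made up (picked only so the average lands on $0.14 at the starting volume), and `average_cost_per_ride` is just an illustrative helper, but it shows how an averaged figure can sit well above what one extra ride actually costs.

```python
def average_cost_per_ride(fixed_cost, variable_cost_per_ride, rides):
    """Total spend divided by total rides: the kind of figure the title quotes."""
    return (fixed_cost + variable_cost_per_ride * rides) / rides


# Hypothetical numbers, NOT Lyft's real figures.
FIXED = 10_000_000      # spend that does not change with ride volume
VARIABLE = 0.04         # marginal spend for each additional ride
RIDES = 100_000_000     # rides in the period

for multiplier in (1, 2, 4):
    avg = average_cost_per_ride(FIXED, VARIABLE, RIDES * multiplier)
    print(f"{multiplier}x rides: average ${avg:.3f}/ride, marginal ${VARIABLE:.2f}/ride")
```

With these assumed inputs the average drops from $0.140 to $0.090 to $0.065 as volume doubles and doubles again, while the marginal cost stays at $0.04 the whole time.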
This theory has unfortunately never panned out for me. To the extent that I get a little anxious about my employer getting too aggressive with PR. Organic growth always plays out better for the engineers in the short to medium term.
IT's version of the Laws of Thermodynamics is that nearly every interesting action you can do programmatically is going to take at least O(log n) time, where n is the amount of data you have (at rest, in flight, or both).
If I have 16x as much traffic as I used to deal with, my load isn't 16x higher, it's 64x higher - if I'm fortunate and we've engineered for scale. But it could easily be three orders of magnitude higher (16×16×4) if I'm not.
For the 64x scenario I tell my boss to buy more hardware and I put medium to high priority tickets in the backlog to improve the worst cases. In the 3 orders situation, those tickets bump all of our other priorities, and there are more of them. Performance is now a feature. It sucks up a significant and very visible fraction of our engineering budget, which comes with its own sort of opportunity costs.
If you try to scale vertically, we all know that except at the low end, buying a server that's twice as beefy costs far more than twice as much. If you go horizontal, there are plateaus where your network topology has to get more complex (hardware cost, maintenance cost, speed, pick two). Personally, I think the biggest lie in cloud computing is that 10G ethernet fixes all of your network topology problems (ie, it's treated as magic that you don't have to worry about). As disks and PCIe get faster over the next couple years I think that'll be back on people's radars.
I'll say, I'm so used to seeing "log" denote either lg or ln (depending on whether the context is closer to maths or computing) that seeing the claim that log(16)=1.2 threw me for a loop.
This despite it being fundamentally meaningless because of course it needs to be multiplied by some time-dimensioned value anyway.
Still, that means doubling your traffic adds a fixed amount of computation due to the log n component. That doesn't turn 16x into 64x. It turns 16x into 16x+y, where if you 16x again it turns into 256x+2y.
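For anyone who wants to poke at the numbers, here's a minimal sketch. It assumes total load is n operations that each cost log2(n), which is only one reading of the O(log n) claim, and `load` / `growth_ratio` are names made up for this example.

```python
import math


def load(n):
    """Total work if each of n operations costs log2(n) (an assumed model)."""
    return n * math.log2(n)


def growth_ratio(n, factor=16):
    """How much total work grows when n is multiplied by `factor`."""
    return load(n * factor) / load(n)


for exponent in (10, 20, 30):
    n = 2 ** exponent
    print(f"n = 2^{exponent}: 16x traffic -> {growth_ratio(n):.1f}x load")
```

Under that model, 16x traffic comes out to roughly 22x, 19x, and 18x load for starting sizes of 2^10, 2^20, and 2^30: more than 16x, but the extra shrinks as the starting n grows and is nowhere near 64x for realistic sizes.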
You aren't off base: this is a snapshot in time, so there's no way to know whether those costs are linear and scale with more rides without seeing historical figures on AWS spend vs revenue/# of rides.
But it could very likely be true that doubling the number of rides would have a potentially material impact on additional AWS spend for Lyft.
Their model should scale linearly, because everything can be clustered/sharded by geo so nicely, and the scope of a typical transaction is one passenger vs a few dozen potential drivers. That does not significantly change as they grow.
Unlike someone like FB, with their many-to-many everything, Lyft/Uber is a data architect's dream.
There's definitely something that has to scale linearly as they serve more rides; the question is how much of that 14 cents the linear-scaling stuff accounts for.
I would think the bulk of it. Given their ride volume, they should be way past the fixed costs for stuff that has to be there before the 1st ride takes place. API load should be mostly linear as well. Back office, internal reporting, etc anything that deals with data in aggregate can be non-linear, but negligible in the grand scheme of things.
In theory, yes, but you might be underestimating how much technical debt can be left behind in a scale-up. For all we know, there's a bunch of stuff that just hasn't been cleaned up. As other comments note, I can totally imagine someone saying "yeah, not worth the risk of breaking something as long as it only comes out to 14 cents a ride".
Edit: And, for that matter, they may be using AWS for things other than serving rides, e.g. dev tools and internal service that scale with the number of office employees, not rides.
Yeah that $0.14 per ride may sound reasonable, until you realize (as someone pointed out earlier) that it's essentially an EC2 t3.xlarge for one whole hour. Now, imagine that your model requires a 4-CPU, 16 GB RAM server to handle just one ride per hour - is that really efficient?
... and of course, linear is both a blessing and a curse. Lets you predict/project and buy reserved capacity nicely - yet you never truly get to enjoy the efficiencies from scale.
The more customers I have the more shards I have to run.
Managing 5 shards is not the same amount of work as managing 50. For one the surface area for cross shard communication goes up. What happens if London or Manhattan are too big for one shard? Multi-tenant providers have been wrestling with the Big Customer problem forever, and sharding is only a bandaid.
Under a traditional model (i.e. building a datacenter or colocating in one), I think what you are saying is 100% accurate. You would have fixed costs and variable costs, which make it difficult to break overall costs into a "per-ride" expense. This is because, as you mentioned, some fixed resources in their current state may be able to handle more volume than they currently carry. Thus, as rides go up, not all costs would go up linearly.
However, in the case of AWS and other cloud service providers, virtually all products and services are charged on a per-usage basis. Meaning that if the average Lyft ride requires 60 database api calls (say a 10 minute ride with location updates every 10 seconds = 60 location updates/db writes) then AWS charges you for 60 api calls. If the average ride takes 11 minutes, then it would take 66 api calls and so forth. In this case, this cost is a variable cost and maps directly to a specific ride. If that ride had not existed, then that cost would not have been incurred.
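As a back-of-the-envelope sketch of that per-ride mapping: the per-write price below is a hypothetical placeholder rather than a quoted AWS rate, and the function names are made up for illustration.

```python
def location_writes(ride_minutes, update_interval_s=10):
    """Number of location updates (one database write each) during a ride."""
    return ride_minutes * 60 // update_interval_s


def write_cost(ride_minutes, price_per_write):
    """Usage-based cost of the location writes for a single ride."""
    return location_writes(ride_minutes) * price_per_write


PRICE_PER_WRITE = 0.00000125  # hypothetical per-request price in USD

for minutes in (10, 11):
    writes = location_writes(minutes)
    cost = write_cost(minutes, PRICE_PER_WRITE)
    print(f"{minutes}-minute ride: {writes} writes, ${cost:.6f}")
```

A 10-minute ride gives 60 writes and an 11-minute ride gives 66, so under pure per-request billing the cost tracks ride length directly.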
Take a look at AWS pricing pages and you will see that all prices are broken down into very specific per-usage costs. Some services break this up very deeply, on ingress usage, egress usage, storage usage, api usage, etc all together calculate the cost that the service uses.
Looking at the Lyft Case Study that AWS published (you can view here: https://aws.amazon.com/solutions/case-studies/lyft/), it appears they primarily use Redshift, Kinesis, DynamoDB, and EC2/ECR services. All of these are heavily usage based. The least usage-based service they use would be their EC2 servers. Since a server is rarely running at max-capacity it means that one extra lyft ride doesn't necessarily mean another server. So the EC2 costs for that ride would be the same if it had or had not existed. But it looks like they are heavily reliant on using ECR to auto-scale their servers up and down based on usage. So they are probably running many clusters of smaller servers that move up and down as rides increase. This means that even their EC2 servers are fairly closely mapped to actual usage of their platform. During a slow period they will be running far fewer EC2 servers than during a peak period. So even in this case their costs are relatively variable.
All in all, I would say that you actually could extrapolate the total cost of AWS on a per-ride basis, because that is the variable that affects their costs. If they had 1 less ride this month, their bill would ACTUALLY be 14 cents lower (or somewhat close to it). They do incur a direct cost with each individual ride. If there are fixed AWS costs in there, they probably account for less than 1 penny of that 14 cents. For example, there are a few fixed costs if they are using AWS DirectConnect or AWS Private Link to connect to AWS datacenters. But this probably accounts for a few thousand dollars a month at most and in total would have minimal effect on their per-ride cost at Lyft's current scale.
Just to clarify, I would agree that 14 cents is an accurate figure even if a single ride didn't translate to 14 cents, but doubling the rides increased the spend by an average of 14 cents.
My suspicion is that a bulk of those costs aren't things that scale with more rides coming in.
I've worked at startups with AWS technical debt that ended up costing them an amount way out of line with what they really needed to use.
With that said, I'm definitely confused by (the evidence introduced in) the GP's response (linking to the Lyft/AWS video). I mean, both AWS and Lyft are trumpeting the setup as an efficient use of AWS, and it still comes to $0.14/ride? That doesn't add up.
People seem to have forgotten the productivity loss of running their own data center. At my previous company, most things were just excruciating compared with what I enjoyed at Netflix. Docker containers did not come with persistent volumes; provisioning machines involved lengthy approvals and sometimes long waits; building and deploying services that involved multiple clusters talking to each other required serious devops-fu; production clusters broke because infra upgraded system dependencies in a background job, and mitigating such simple mistakes could take hours if not days. Deploying any non-trivial service required multiple meetings with some infra teams, for resources, for concerns about stability, or whatever. PS, did I mention that we built tens of thousands of Puppet files that very few people understood, and that any update to my system took half an hour to refresh, yet the early members of the infra team thought Puppet was the best f@#$ tool in the world? The list can go on and on. PPS, did I mention that we had a 500-person infra team, and later probably more than a thousand, for a fraction of the traffic/complexity that Netflix had to deal with, while Netflix didn't even have an infra team (to be fair, Netflix had a build team, a platform team, and a monitoring team, which had fewer than 30 people combined)?
In contrast, EC2 alone would have saved us a lot of pain.
You may say that my previous company was not technically great. Maybe. If that's the case, how many companies can be really better? I think the fundamental problem is that companies tend to underestimate the effort to build a world-class infrastructure from scratch as well as the opportunity cost due to loss of productivity. Netflix's leaders were truly wise by claiming that Netflix didn't want to work on undifferentiated heavy lifting early on.
This thread only compares the ROI of hosting in the cloud with building your own datacenter and laying your own oceanic fiber(!!). The option of using a colo and a backbone ISP somehow fell through the cracks.
Of course building one datacenter and your own oceanic fiber to host your own services is a bad option for almost every business - not every business can fill a full DC, and 1 DC is not enough to guarantee global operations.
The original author mentioned how well AWS worked for Netflix, but did not mention that Netflix rolled their own content delivery system in part because of high traffic costs with all the major cloud providers.
My understanding is that most commercial CDNs (Limelight, Akamai, etc) are also more cost-effective than serving directly from AWS. However, commercial CDNs don't know a priori what files are going to be popular, and how to distribute them in different regions and on different types of storage to maximize efficiency. But that's one of the things that Netflix's CDN team can predict. See https://medium.com/netflix-techblog/distributing-content-to-...
> E.g. if you are +30% of the internet traffic (nflx) it doesn't make sense to pay rent to telcos any more and feed their margins. You have the volume and stable demand to justify ownership. For the rest, cloud is where they'll live and die.
And really it’s only 12amps! 120v! What the heck are you going to do with 42U and 1,440 watts?!
I couldn’t believe it was a 15A circuit for the whole cabinet when I went there.
This was several years ago but the other issue I had was DDoS against someone else there collaterally taking down our network. Ended up moving to 55 S Market in San Jose and have been very happy. Those racks are at least 40 amps.
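For scale, the power budget works out like this. It's a quick sketch; the 120 V figure for the 40 A racks is my assumption, since only the amperage is given for those.

```python
def usable_watts(amps, volts):
    """Power available on a circuit at the given current and voltage."""
    return amps * volts


def watts_per_rack_unit(amps, volts, rack_units=42):
    """Average power budget per U if the cabinet were fully populated."""
    return usable_watts(amps, volts) / rack_units


print(usable_watts(12, 120))                   # 1440
print(round(watts_per_rack_unit(12, 120), 1))  # 34.3
print(usable_watts(40, 120))                   # 4800, assuming the 40 A racks are also 120 V
```

About 34 W per U is far below what a typical 1U server draws, which is why a 42U cabinet on that circuit ends up mostly empty.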
Lyft is horribly inefficient in terms of using their AWS resources. I know this first hand. They don't have to build their own DC. They just need to build a better system. Possibly take advantage of Lambdas, because a lot of compute-heavy operations like geo en/decoding are done in services that run at like 25% CPU in production.
Not to pick on you specifically as others are saying more or less the same thing... But if you're such a genius that you know how to slash Lyft's AWS spend without knowing basically a thing about their infrastructure, you should be knocking loudly on their door because I'm sure they have an extremely well-paid job waiting for you.
Otherwise, maybe consider voicing opinions like these about things you actually know something about.
ADDED: Meant to respond to a different comment that didn't suggest any inside knowledge. (Though I'd still point out that even insiders often can't understand how their own company has so many employees in X department, is so inefficient, etc. And this is more or less the case everywhere.)
"I know this first hand" implies they do know something about their infrastructure.
Given that they work for "one of few >10b unicorns" per their comment history, I have a suspicion you may be telling a Lyft employee they don't know anything about Lyft.
Well, that's factual - you can get a couple hours time on an entire AWS instance for $0.14. That certainly seems to point to a cost-inefficient infrastructure.
I don't really have an opinion on how efficient or inefficient their infrastructure is. AWS has lots of services and I'm sure Lyft is using a heck of a lot more of them than just a bunch of EC2 instances.
Some quick math: last I heard, Lyft was doing 1M rides a day. That's 1M transactions per day, or "only" about 11.6 transactions per second. Of course there are a lot more supporting transactions (assuming 10x) to enable one ride - search, matching, status updates. Let's say that adds roughly another 100 transactions per second. Rounding up to 150/sec. What am I missing?
In terms of data usage and storage, matching is by geography, meaning their data is easy to shard.
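Spelled out as a quick sketch, the estimate looks like this. The 1M rides/day and the 10x multiplier are the guesses from the comment itself, not measured figures.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(per_day):
    """Average rate per second for a daily total, ignoring peaks."""
    return per_day / SECONDS_PER_DAY


rides_per_day = 1_000_000          # "last I heard" figure
supporting_multiplier = 10         # assumed supporting transactions per ride

ride_tps = per_second(rides_per_day)
supporting_tps = ride_tps * supporting_multiplier
total_tps = ride_tps + supporting_tps

print(f"rides: {ride_tps:.1f}/s, supporting: {supporting_tps:.1f}/s, total: {total_tps:.1f}/s")
```

That averages out to about 127 transactions per second, so rounding up to 150/sec leaves a bit of headroom, but it is a daily average and says nothing about peak load.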
Your math isn't wrong, but the load isn't simply averaged out over the entire day. There will be peak capacity times way, way higher than the lowest times. I would guess orders of magnitude larger. You have to build for peak capacity. If their target response time is 100ms or something, then that adds in lots of complexity.
I agree the load will fluctuate - but will mostly follow predictable patterns daily and weekly. However, that's what the cloud's elasticity is for - and AWS makes it easy (and inexpensive) to expand and contract when implemented correctly. Man, I would love an opportunity to optimize something like that!
Right?! The cost is way too high for their use case, even allowing a large padding on OP's quick maths.
Part of my team stopped all development for one month and focused solely on reducing AWS costs, and we cut our monthly bill by 50% (probably around 4 devs' yearly salary).
It's not easy, but optimizing AWS resources is an absolute must.
Idk, but I will say that there is definitely stuff going on beyond the ride. I.e. for every n people who open the app only a small number will book a ride. Billing services, driver location tracking, analytics stuff, etc are all going to balloon the amount of computing that the service uses.
There are also location updates happening every few seconds for all active drivers, which is a lot of updates per second. Again, though, easily shardable.
Also remember the drivers are sharing location constantly even when not currently driving a client. And I would guess your transactions per ride is off by at least one order of magnitude, especially if you count a transaction between services (e.g. gps to push, payment to user table, whatever)
10x might be a bit low. During a ride the app is probably constantly in touch with the server updating its position and getting new maps, updates on the positions of others etc. Could be more like 10,000x? And when you are just walking around with the app in your pocket they may be tracking you - dunno.
They must be running some gargantuan machine learning/particle detecting/alien signal analyzing operation; otherwise it's impossible to imagine where these levels of petaflops are being wasted.
Are we talking just about machines? A huge chunk of AWS is actually enterprise SAAS.
Analytics, authentication, support etc. you don't know what Lyft uses exactly here from AWS. For example if they use Auth0 instead of Cognito they pay that part to Auth0. AWS prices are very competitive most of the time.
AWS is the Walmart of enterprise IT, they sell you everything else too.
These posts always make it like "cloud is better vs building datacenters on mars".
There are middlegrounds, like colocation & dedicated servers. If you get dedicated servers 4x cheaper than cloud-shared-vps with remote-hdd, then you can overprovision.
Especially now that hardware is getting bigger you need even less space (assuming your software scales vertically).
And they ALWAYS make it like the next hour you will have 10x requests and your database will autoscale that quickly.
- only parenthetically mention colocation, as if it isn't absolutely normal to rent a rack/cage/suite as needed from an existing DC operator (which in addition would make datacenter rent an OpEx and actual server equipment be amortized over 2-3 years, not the 10 for real property)
- somehow present "intercontinental traffic" as something both necessary and tremendously expensive, as if all the major public clouds aren't charging 10x what the market rates for bandwidth are
- imply that "the cloud" is immune to outages, as if GCP didn't have multiple major global- or region-wide outages over the past few years
I mean, I understand why most companies don't go on-prem despite all that, but this series of tweets is borderline FUD.
I'm not sure the article OP is being disingenuous. It's a thread _specifically_ about Lyft's costs - and Lyft is clearly at a scale where they would NOT be colocating.
Of course they would. They are a taxi service, not a Dropbox or YouTube. They have obscene demand for neither storage nor bandwidth; they need some servers tracking their users' coordinates, running a "who is nearest" search in case someone wants a ride, some GPS navigation/pathfinding during the ride, and some billing code after it.
If you don't blow this up to stupid proportions, that should be able to run on modern hardware in a few racks at a colo for millions of users per day, especially since you can neatly shard the load geographically, thus distributing load (and rented racks) over multiple DCs, ideally with failover in place for emergencies. The only thing they really need to merge is the billing data at the end, but handling billing data of a global userbase of tens or hundreds of millions of users in a single system is a solved problem nowadays and does not even constitute a case worthy of the overused "big data" moniker.
Colo can mean "rent a suite from DRT in one location and a building in another" in addition to space in a rack. Lyft's probably around that size (and it's a good idea to do some colo to gain experience before trying to build out your own anyways)
As this post makes clear, it's not the things you can control in a dedicated/colo environment you should be worried about. You can always hand them more dollars for more storage and compute.
You can never hand them enough dollars to get their SLAs/latency on par with the cloud. There will be outages. There will be delays. There will be scale problems. And you'll have to address these somewhere else, either in your tech stack, your operating model, or your PR department.
Paying a 2-5% TCO premium to have a throat to choke, 5 9s redundancy, GDPR compliance, and the law of large numbers on your side of the court is a pretty fair trade.
14 cents per ride is nothing! Maybe 1% of the cost of a ride.
Maybe they could save a bunch by colo or something else. But would 14 cents per ride really matter at all for their competitiveness? I’m not going to notice a 14 cent difference even if I do bother to price compare Uber with Lyft.
This is a VC fueled market. It isn’t really about small margins of this size.
Most traditional companies are based on having revenues higher than costs.
dotcom v3.0 companies are all about the potential and cornering the market. Amazon was exactly the same - it was founded in 1994 but didn't make a profit until 2001.
For Lyft having a taxi anywhere I want one with low wait times at low cost is going to secure their success. Not new features in their app, not even latency to the data center.
There's no reason they need to be spending such crazy amounts on servers - ostensibly to allow faster iteration. A new version of their app just isn't going to move the needle. Signing up new drivers will though.
I fail to see how the benefits AWS provides are so important they need to spend such crazy amounts.
They need to be seen to be doing something other than bleeding money.
Every extra driver costs them money, every lift costs them money. If they can hold out the illusion of "we know we can save money here when we've got time and have won the market", maybe it keeps the money rolling in.
More like 5% if they cut costs to 0. But realistically they can cut costs by maybe half... so you're taking significant risk to cut losses by... 2.5%?
Maybe they could have a reasonably efficient setup instead?
Not sure how you can possibly pay $0.14 in computing for a single ride (if that's accurate). That's more than 3 hours of a t3.medium instance, for example...
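Rough sketch of that comparison. The hourly rates below are assumed on-demand prices (us-east-1, Linux) used only for illustration, so check the current pricing page before leaning on them.

```python
# Assumed on-demand hourly rates in USD; illustrative, not authoritative.
ASSUMED_HOURLY_RATE = {
    "t3.medium": 0.0416,
    "t3.xlarge": 0.1664,
}


def instance_hours(budget, instance_type):
    """Hours of the given instance type that `budget` dollars would buy."""
    return budget / ASSUMED_HOURLY_RATE[instance_type]


for name in ASSUMED_HOURLY_RATE:
    print(f"$0.14 buys about {instance_hours(0.14, name):.2f} hours of {name}")
```

At those assumed rates, $0.14 is a bit over 3 hours of a t3.medium, or a bit under one hour of a t3.xlarge.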
The tweet took Lyft's total AWS costs and divided it by total rides. It's not a literal accounting of the compute cost for a single ride.
AWS costs that wouldn't really fit into a per-ride accounting:
* Redundancy of instances (regions/AZs)
* Data-storage/duplication/backups
* Non-ride related AWS costs (hosting/processing of analytics, test & automation infrastructure, etc.)
I'm sure there are others. I'm not saying that there's no way they could get their AWS costs down, btw.
Suggesting Lyft needs five nines for 1000s of servers is ridiculous. Its entire event stream comes from a mobile network, which is probably less reliable.
It's not that simple - if Lyft won't connect and everything else works, riders and drivers will turn to competitors. Once they've installed the app and signed up for an account with Uber, they're a lot less likely to stick exclusively or primarily with Lyft.
Well, 5 9's is about 5 minutes of disruption a year. Drivers won't be able to get into Uber in that timeframe. 4 9's would be about 50 minutes a year. Unless it is all concentrated in a couple of events, it may not even make the clients time out.
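The downtime figures are easy to check. This is a minimal sketch assuming a 365-day year, with `downtime_minutes_per_year` as a made-up helper name.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(nines):
    """Allowed downtime per year for an availability of `nines` nines."""
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


for nines in (3, 4, 5):
    print(f"{nines} nines: {downtime_minutes_per_year(nines):.2f} minutes/year")
```

That gives about 525.6, 52.56, and 5.26 minutes per year for three, four, and five nines, so the "5 minutes" and "50 minutes" figures are the right order of magnitude.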
I always see the motivation for the cloud framed mainly as financial, but is there a benefit to staying on-site in terms of knowledge retention (for lack of a better word)? For instance, there is a big push to outsource manufactured parts and goods, mainly due to "cost savings". However, there are long-term cost savings when we keep the parts made in-house, which management always overlooks. Sure, it costs money to maintain the machines, inventory, and tools, but this allows greater flexibility and also retains the knowledge within the company. In addition, it allows the engineers to drive technology within the company faster. Is this the case for data center infrastructure?
You are a bit more flexible in the data center. Like, if you really need very low latency and high bandwidth between two machines, you could buy an infiniband thingy. Expensive AF, but maybe less expensive than setting a horde of software engineers on it for two years to deal with the higher latency.
But, all in all, there aren't all that many practical use cases that really benefit from stuff like that. Most projects are quite happy with some storage, compute, network, and a few managed services. Most companies don't do rocket science with their software and infrastructure.
And there are things that you simply can only afford at scale, like security management for your supply chain, dedicated automation for updating firmware of every tiny controller that's in your hardware, and so on.
My understanding is that the biggies - Google, Amazon, Facebook, etc. - don't bother with individual drive replacements (and certainly not with an expensive robotic arm system). They wait for an entire rack to fault over a certain percentage and then just swap out the whole thing.
Depends on the case. Facebook at least generally replaces drives as they go. In a pipelined fashion rather than as they fail, though. So, a given drive failure doesn't lead to immediate emergency replacement, but sometime in the next few days as someone makes a drive-replacement pass through that area.
Robotic tape libraries (like the StorageTek/Sun/Oracle SL8500) are standard.
However, the robotic part is not to swap parts in case of failure - instead, it is to let ~64 tape drives access ~100k individual tapes within the library and ~inf tapes in off-site storage (as the data stored is not, and _shouldn't_ be, directly accessible).
Our (small) company has a $500-$1000 AWS bill. About 2 months ago it was double that. This is a 3-4 developer company. When we hire another person we'll probably increase our EKS instances by one more box. We already paid down our debt - and it definitely cost us about 2 months with half our dev staff. But our goal wasn't to reduce the AWS cost, it was to reduce the time it takes to replicate an environment in K8s with one script. I'm guessing Lyft has not taken a path like that.
I think what's interesting to know is how much of that 14 cents Amazon keeps as profit and how much goes directly to costs. Some analysis suggests AWS has a 25% profit margin, which means that of that 14 cents, Amazon really only pockets ~3.5 cents per Lyft ride.
Another interesting question is whether that profit margin includes the original capex from AWS's early years, when Amazon was not reporting AWS revenue/profit. There are also all sorts of creative accounting methods to hide capex, such that I wouldn't be surprised if Amazon is selling at cost to buy market share, somewhat analogous to how Uber subsidizes rides to gain early market share.
I am surprised no one is talking about Lyft's risk assessment of their business. Specifically, this line:
Our results of operations vary and are unpredictable from period-to-period, which could cause the trading price of our Class A common stock to decline.
Our results of operations have historically varied from period-to-period and we expect that our results of operations will continue to do so for a variety of reasons, many of which are outside of our control and difficult to predict.
I find the term "period-to-period" rather vague. It could mean quarter-to-quarter or even year-to-year.
Still what they are effectively saying is that Lyft cannot be trusted to provide any growth assessment.
A really elaborate (and admittedly interesting) way of repeating the constant refrain: Things that work for the extremely large-scale usually aren't ideal for the small- or medium- scale. This applies to everything from codebase size to infrastructure size, and the opposing force tends to be engineers' inherent enthusiasm for engineering things. Be honest about your actual needs and know when not to chase the N=∞ case.
Well, that's the question. The guy basically makes the case that it isn't, for this purpose, and that Netflix and Dropbox are. Good insight into the hairy details, but the overall thesis is pretty simple.
I would say he's making the case that Lyft doesn't have a single component with ridiculous scale like Netflix CDN or Dropbox storage. That could just mean Lyft's costs are more evenly distributed.
If you were to say how much you get paid, you would in general say you make x an hour. You wouldn't say "Well, after taxes, rent, food, gas, insurance, etc, I make x-y an hour."
This is a bunch of FUD.
* No, using a colo saves you way more than just 20%. In one of our facilities, our org has maybe ~15k servers under management and dc costs are ~ 1 million/mo. Build out of the cages and racks wasn't incredibly expensive, I think ~2 million. Power isn't insane either.
* I have no idea what he's talking about re: the fiber statement. We have a blend of different backbone providers that give us ~100 gigabit for ~$0.60 per gig. Compared to bandwidth costs for AWS, it's dirt cheap.
* You don't need ultra fast hot swappable robotic arms if you're not FAANG. With a handful of dc techs and a simple monitoring system, you can swap out disks just as easily, at the expense of a little more opex.
* PCI requires no consultants, its terms are pretty black and white - you will need an ISA though. HIPAA can be more of a bear, but your org will need this consultation regardless of whether you own a dc or put it in the cloud. AWS does not give you automatic compliance.
* Yes I can totally believe that working at Google, you'd give TCOs that make Google look good.
Literally none of us except Lyft's operations knows what's good for Lyft. 100 million a year is a ton of cash, but gives them presence around the world. My company has datacenters around the world and it is very operationally complex to maintain them. It's definitely not for everyone, but this guy is making it sound worse than it actually is.
I guess what bothers me the most about the original tweet(s) is the false dichotomy of "build your own datacenter" or "use AWS". How about co-locating at one of the many, many Tier 3 certified datacenters all over the world? There are tons of providers with locations all over the country and world. Or beyond that, what about letting someone else (e.g. Compass Data Centers) build a datacenter for you?
Very true. For example, we're running most of our log aggregation and time series handling on rented dedicated servers. It took more work and skill to get to work compared to something like AWS elasticsearch, but the setup is dirt-cheap for what it does including redundancy in case of hardware failures. Or, my last employer used to run a lot of their data warehousing and BI on rented physical systems for similar reasons - high IO demand, built-in redundancy.
There are many steps you can take before buying concrete to build a DC. Even before you have to buy your own rack.
Thank you, that was my point exactly. There are lots of options between "do nothing" (AWS) and "do everything" (literally build datacenters).
What do you use for your log aggregation and time series?
We run a fairly classical elastic stack for our log aggregation and telegraf/influxdb for time series. And grafana/kibana pulling it all together.
So use a bunch of tier 3 providers. Who is going to coordinate all of this? With AWS you have one company to call.
The guy you pay with the millions of dollars saved.
And to build a ride sharing company, first Lyft should have built an expertise in infrastructure? Should they also learn how to build cars too, because it might save them money in the long run?
You don't need an expert when you're starting out.
But, uh, wasn't this about their current expenses? Or their expenses at some point after their AWS bill got bigger than 1-2 employees?
And thinking like this can kill a company.
And even with current expenses, building out a core competency in infrastructure that has nothing to do with your domain of expertise can be a distraction.
A distraction to who? You're not supposed to take people away from your core product to do this.
If you can't find anyone to even handle the hiring, let me stay far away from this company that teeters on the verge of collapse when someone gets sick.
Reductio ad absurdum: Lyft is a taxi dispatch company, so outsource the mobile and web apps, send customer service offshore, drivers/cars... err, never mind, and outsource analytics to a Hortonworks/Cloudera/ML consultancy. Done. None of those are core.
I have no idea if that’s how they operate or not or if you’re being facetious, but that’s a very valid business model. I know companies that did something similar starting out and then brought the expertise in house.
In fact, I was offered a job as a “digital transformation consultant”/“enterprise architect” to be the face of a company that did just that - take over a complete project for funded startups or other companies who had an idea but no in house expertise. Of course I would be the face of the company where most of the work would be “rural sourced” and outsourced.
> First Lyft should have built an expertise in infrastructure?
The alternative being that they should have built an expertise in using AWS instead?
I'm not particularly trying to argue either way; I'm just saying that claiming that using AWS doesn't need expertise is a false premise. AWS isn't special here, either.
If instead you use some sort of abstraction and/or tooling, now you need expertise in that tooling.
You only need a few people on prem with AWS expertise. You would be surprised how much infrastructure and management grunt work you can outsource to cheaper managed service providers who can then outsource some of their grunt work overseas.
I think you can make that argument when you're just starting out. But when you're an established company at Lyft's size... yeah, you can have some expertise in infrastructure.
Lyft was 4x smaller 2 years ago ....
It might make some sense for Lyft to wait for their scale to stabilise so that they don't waste money investing in infrastructure that is at the wrong scale.
> Should they also learn how to build cars to because it might save them money in the long run?
Uber's doing that and they're a comparable company.
And how is that working out for their bottom line?
I've asked this before. Lyft doesn't own cars, drivers, servers, and it even outsources its software. What is Lyft's expertise?
how are they outsourcing software?
My bad. It should be support instead of software.
How do you explain why Uber invested tons of money in autonomous driving technology?
Because they convinced VCs it was a good idea? But a company losing billions of dollars a year is not exactly a great model for sound business practices....
It depends too on your workload and how efficient your software is. If you have a bursty or unpredictable workload then you’re buying servers for your peak, and paying for wasted hardware. Every time I’ve done or seen the calculations in those circumstances it never makes sense to buy. Bursting into the cloud comes up, but that gets complicated very fast too.
Right, I'd expect Lyft's utilization, and that of any user facing service really, to fluctuate pretty significantly.
Uber has invested heavily in making their own in house serverless system to address these fluctuations too.
!? serverless lol
it operates on magic now :p
Agreed, this guy makes no sense at all. Why would anyone build their own data center? Talk about the extreme, did he forget you can also rent space inside of data centers???
It's better to rent a datacenter than to build one for companies that reach a certain threshold. At a certain load AWS becomes too limited and expensive. Certainly not true for the typical AWS sites or companies.
Even Amazon uses their own separate, dedicated systems for a lot of stuff - not public AWS I heard.
Any company that owns the infrastructure management tech like AWS would be stupid not to use it themselves. It's likely that what you heard was a misstatement—maybe they use their own isolated instances of the AWS tech, possibly to dogfood new features.
You are in for a surprise.
https://jatins.gitlab.io/me/amazon-internal-tools/
I'm pretty sure Amazon uses AWS. There's probably still legacy applications running on pre-AWS infra though.
They don't. There are internal MAWS movements though, meaning Move to AWS. https://jatins.gitlab.io/me/amazon-internal-tools/
On an unrelated note, when did FAANG become a thing? I don't recall seeing this used anywhere a mere year ago.
Also, why is there no M for Microsoft in that abbreviation?
probably lack of growth in their stock prices?
MSFT is up 25% YoY and 200% over the last 5. They’re at an all time high.
I think it's because of base compensation of employees, which still holds true, because Microsoft pays way less than FANG.
Microsoft is the most valuable company in the world. When it comes to stock, Facebook probably would not be in FANG anymore because of their drastic stock drop over the last year.
A for Amazon is well-known for its terrible work env for the employees.
It’s been used in finance (as well as FANG) for a long time at this point. Often investors buy all of these companies at the same time to reduce volatility and get exposure to big tech.
Since at least 2015. Folks sometimes use FANG and overload the 'A' for Apple and Amazon.
They’ve been saying it for years on Chinese boards.
Lyft is US and Canada only. Not around the world.
Maybe I read the OP wrong but I've never heard of building out and/or holding data center real estate assets. I worked for a large data center and even they leased their buildings. Their customers included government, big banks, large tech (even FAANG level) and they either colo'd or leased space.
There are dedicated REITs that hold only high-tech/data center real estate.
also this thinking seems very one-size-has-to-fit-all, cloud/datacenter focused. why would lyft need HIPAA compliance??
Both Lyft & Uber have begun partnering with healthcare providers to provide non-urgent patient transportation.[1]
[1] https://www.fiercehealthcare.com/tech/lyft-announces-integra...
Dystopian af
Care to explain why?
Probably because that’s what an ambulance is for but taking one will bankrupt the average citizen.
No, going to a doctor's appointment isn't what an ambulance is for. The comment specifically said "non-urgent". But when patients miss appointments because they can't get there or afford $20 for an Uber, the health system loses a lot more than $20, so it's worth it financially for them to partner with Uber/Lyft to get patients to appointments.
Ah yes, I guess the specific Uber program mentioned is for non-urgent cases, I apologize. But anecdotally I know of many cases where Uber is being used instead of ambulances.
https://www.nytimes.com/2018/10/01/upshot/uber-lyft-and-the-...
https://slate.com/technology/2018/02/when-should-you-uber-to...
The comment[1]'s only line was indicating that this was for non-urgent cases.
But yes, it was disturbing when Tesla was using Ubers for people who had exposed bone.
[1] https://news.ycombinator.com/item?id=19400560
I understand the economic principle of increasing revenue, and I agree that this partnership is worth it financially; but honestly, who the fuck cares about the health care system not getting revenue?
Providing transportation for patients is so that __people can get healthcare__! A hospital/HCSP not making a few hundred/thousand is completely insignificant when compared to the real problem: People can't get healthcare or can't get to it.
I just take issue with this viewpoint where the concern for the dollar is placed before people's needs.
Ambulances are absolutely not for non-urgent, non-critical situations and being able to auto-schedule a ride at time of appointment creation is a positive addition.
Why is that? Because existing services don’t cover these patients or scenarios?
All this sounds like a distraction for a ride sharing company. They aren’t an infrastructure company. Why would they want to both try to build out a ride sharing and infrastructure at the same time? Why fight a battle on two fronts?
On top of that you have to hire people to manage it.
Lyft doesn't need a global presence, they only operate in the US and a few Canadian cities.
What type and model of networking do you use on 15K servers? Can you share some specs?
Lyft doesn't need presence around the world - they only operate in the US.
I think they'd be much better off just renting some space in a datacenter and running the show themselves!
Why does Lyft need around the world presence? What in a primarily app driven carsharing service is particularly latency-sensitive?
Data locality laws, for starters.
I don't think the title is a fair representation of what's going on. Hear me out.
The way it's phrased, it's like each ride directly translates into 14 more cents for Amazon. But really, the figure comes from averaging Lyft's Amazon costs over Lyft's total rides.
Now, in some cases, that would be valid accounting, if all (or nearly all) of your costs (with respect to that service) are variable. For example, most of the cost of sending a letter is the truck/plane/employee, which are variable, so it makes sense to divide total truck etc. costs by total letters.
But would Lyft pay twice as much to Amazon if they served twice as many rides? Would they have to double every AWS service they're buying? I don't think so. They might have to add more of one or two kinds of server/services, but most of that 14 cents figure comes from amortizing fixed costs, which will go down if they scale up further.
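To make the fixed-vs-variable point concrete, here's a toy sketch. Every number in it is made up for illustration (the fixed monthly spend, the variable cost per ride, and the ride counts are hypothetical, not Lyft's actual figures).

```python
def cost_per_ride(fixed_monthly: float, variable_per_ride: float, rides: int) -> float:
    """Average cost per ride: fixed spend amortized over rides, plus the variable part."""
    return fixed_monthly / rides + variable_per_ride


# Hypothetical inputs, chosen only so the first row lands on 14 cents.
FIXED_MONTHLY = 1_000_000    # USD/month that does not change with ride volume
VARIABLE_PER_RIDE = 0.04     # USD that each additional ride actually adds

for rides in (10_000_000, 20_000_000, 40_000_000):
    avg = cost_per_ride(FIXED_MONTHLY, VARIABLE_PER_RIDE, rides)
    print(f"{rides:>11,} rides/month -> ${avg:.3f} per ride")
```

With those made-up inputs the average goes from $0.14 to $0.09 when rides double, even though each extra ride only ever adds 4 cents. How much of the real 14 cents is fixed is exactly the open question.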
Am I off base here?
This theory has unfortunately never panned out for me. To the extent that I get a little anxious about my employer getting too aggressive with PR. Organic growth always plays out better for the engineers in the short to medium term.
IT's version of the Laws of Thermodynamics is that nearly every interesting action you can do programmatically is going to take at least O(log n) time, where n is the amount of data you have (at rest, in flight, or both).
If I have 16x as much traffic as I used to deal with, my load isn't 16x higher, it's 64x higher - if I'm fortunate and we've engineered for scale. But it could easily be three orders of magnitude higher (16 × 16 × 4) if I'm not.
For the 64x scenario I tell my boss to buy more hardware and I put medium to high priority tickets in the backlog to improve the worst cases. In the 3 orders situation, those tickets bump all of our other priorities, and there are more of them. Performance is now a feature. It sucks up a significant and very visible fraction of our engineering budget, which comes with its own sort of opportunity costs.
If you try to scale vertically, we all know that except at the low end, buying a server that's twice as beefy costs far more than twice as much. If you go horizontal, there are plateaus where your network topology has to get more complex (hardware cost, maintenance cost, speed, pick two). Personally, I think the biggest lie in cloud computing is that 10G ethernet fixes all of your network topology problems (ie, it's treated as magic that you don't have to worry about). As disks and PCIe get faster over the next couple years I think that'll be back on people's radars.
If your complexity is O(log n), an increase of 16x only increases your time by log(16)=1.2 (constant, not a multiple).
I'll say, I'm so used to seeing "log" denote either lg or ln (depending on whether the context is closer to maths or computing) that seeing the claim that log(16)=1.2 threw me for a loop.
This despite it being fundamentally meaningless because of course it needs to be multiplied by some time-dimensioned value anyway.
You have a way to process 16x as much traffic without using any new resources?
How does it follow that with O(log n) complexity you get 64x more work with 16x more data?
it's implicitly O(n log n) because you need to do O(log n) for an interesting operation on each request.
Still, that means doubling your traffic adds a fixed amount of computation due to the log n component. That doesn't turn 16x into 64x. It turns 16x into 16x+y, where if you 16x again it turns into 256x+2y.
Logs grow slowly.
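A small sketch of the n log n argument, assuming total work is n requests times log2(n) per request. That model is just the one proposed in this subthread, not a measurement of any real system.

```python
import math


def work(n: float) -> float:
    """Total work if each of n requests costs log2(n): n * log2(n)."""
    return n * math.log2(n)


def growth_factor(n: float, scale: float = 16) -> float:
    """How much total work grows when traffic goes from n to scale * n."""
    return work(scale * n) / work(n)


for n in (16, 1_000, 1_000_000, 1_000_000_000):
    print(f"n={n:>13,}: 16x traffic -> {growth_factor(n):.1f}x work")
```

Under this model the factor is 16 * (1 + 4 / log2(n)), so it is 32x at n=16 and drifts down toward 16x as n grows. It never gets near 64x for any realistic starting n.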
You aren't off base, as this is a snapshot in time, so there's no way to know if those costs are linear and scale with more rides or not without seeing historical figures on AWS spend vs revenue/# of rides.
But it could very likely be true that doubling the number of rides would have a potentially material impact on additional AWS spend for Lyft.
Their model should scale linearly, because everything can be clustered/sharded by geo so nicely, and the scope of a typical transaction is one passenger vs a few dozen potential drivers. That does not significantly change as they grow. Unlike someone like FB, with their many-to-many everything, Lyft/Uber is a data architect's dream.
There's definitely something that has to scale linearly as they serve more rides; the question is how much of that 14 cents the linear-scaling stuff accounts for.
I would think the bulk of it. Given their ride volume, they should be way past the fixed costs for stuff that has to be there before the 1st ride takes place. API load should be mostly linear as well. Back office, internal reporting, etc anything that deals with data in aggregate can be non-linear, but negligible in the grand scheme of things.
In theory, yes, but you might be underestimating how much technical debt can be left behind in a scale-up. For all we know, there's a bunch of stuff that just hasn't been cleaned up. As other comments note, I can totally imagine someone saying "yeah, not worth the risk of breaking something as long as it only comes out to 14 cents a ride".
Edit: And, for that matter, they may be using AWS for things other than serving rides, e.g. dev tools and internal service that scale with the number of office employees, not rides.
Yeah that $0.14 per ride may sound reasonable, until you realize (as someone pointed out earlier) that it's essentially an EC2 t3.xlarge for one whole hour. Now, imagine that your model requires a 4 vCPU, 16GB RAM server to handle just one ride per hour - is that really efficient?
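For anyone checking the instance-hour comparisons in this thread, a rough sketch. The hourly prices are my assumption (on-demand Linux in us-east-1, roughly $0.0416/h for a t3.medium and $0.1664/h for a t3.xlarge), not numbers quoted by anyone here.

```python
# Assumed on-demand Linux prices in USD/hour. Illustrative assumptions only;
# actual prices vary by region, OS, and over time.
ASSUMED_HOURLY_PRICE = {
    "t3.medium": 0.0416,
    "t3.xlarge": 0.1664,
}

COST_PER_RIDE = 0.14  # total AWS spend divided by total rides, per the tweet


def instance_hours(budget: float, instance_type: str) -> float:
    """How many hours of the given instance type the budget buys."""
    return budget / ASSUMED_HOURLY_PRICE[instance_type]


for name in ASSUMED_HOURLY_PRICE:
    print(f"{name}: {instance_hours(COST_PER_RIDE, name):.2f} h per ride")
```

Under those assumed prices, 14 cents buys a bit over 3 hours of a t3.medium, or about 50 minutes of a t3.xlarge.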
Oh, no, I agree it's unreasonable for what they're getting; I'm saying that it might seem small relative to the business priorities.
It didn't sound reasonable to me before, but you really gave me some perspective here. What the hell are they doing? So much geo processing?
... and of course, linear is both a blessing and a curse. Lets you predict/project and buy reserved capacity nicely - yet you never truly get to enjoy the efficiencies from scale.
The more customers I have the more shards I have to run.
Managing 5 shards is not the same amount of work as managing 50. For one the surface area for cross shard communication goes up. What happens if London or Manhattan are too big for one shard? Multi-tenant providers have been wrestling with the Big Customer problem forever, and sharding is only a bandaid.
On a traditional model (ie building a datacenter or colocating in a datacenter) then I think what you are saying is 100% accurate. You would have fixed costs and variable costs which make it difficult to break overall costs into a "per-ride" expense. This is because, as you mentioned, some fixed costs in their current state may be able to handle higher volume than they are currently utilized on. Thus, as rides go up, not all costs would go up linearly.
However, in the case of AWS and other cloud service providers, virtually all products and services are charged on a per-usage basis. Meaning that if the average Lyft ride requires 60 database api calls (say a 10 minute ride with location updates every 10 seconds = 60 location updates/db writes) then AWS charges you for 60 api calls. If the average ride takes 11 minutes, then it would take 66 api calls and so forth. In this case, this cost is a variable cost and maps directly to a specific ride. If that ride had not existed, then that cost would not have been incurred.
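A minimal sketch of that per-ride accounting, assuming one database write per location update. The price per million writes below is a hypothetical placeholder, not a quoted AWS rate.

```python
# Hypothetical price in USD per million write requests (placeholder, not an AWS quote).
ASSUMED_PRICE_PER_MILLION_WRITES = 1.25


def location_updates(ride_minutes: float, interval_seconds: float = 10) -> int:
    """Number of location updates (one DB write each) during a ride."""
    return int(ride_minutes * 60 // interval_seconds)


def write_cost(ride_minutes: float) -> float:
    """USD cost of one ride's location writes under the assumed price."""
    return location_updates(ride_minutes) * ASSUMED_PRICE_PER_MILLION_WRITES / 1_000_000


for minutes in (10, 11):
    print(f"{minutes} min ride: {location_updates(minutes)} writes, ${write_cost(minutes):.6f}")
```

The 10 and 11 minute rides come out to 60 and 66 writes, matching the example. At the placeholder price those writes cost a tiny fraction of a cent, so whether usage-priced calls like this really dominate the bill depends on everything else a ride triggers.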
Take a look at AWS pricing pages and you will see that all prices are broken down into very specific per-usage costs. Some services break this up very deeply, on ingress usage, egress usage, storage usage, api usage, etc all together calculate the cost that the service uses.
Looking at the Lyft Case Study that AWS published (you can view here: https://aws.amazon.com/solutions/case-studies/lyft/), it appears they primarily use Redshift, Kinesis, DynamoDB, and EC2/ECR services. All of these are heavily usage based. The least usage-based service they use would be their EC2 servers. Since a server is rarely running at max-capacity it means that one extra lyft ride doesn't necessarily mean another server. So the EC2 costs for that ride would be the same if it had or had not existed. But it looks like they are heavily reliant on using ECR to auto-scale their servers up and down based on usage. So they are probably running many clusters of smaller servers that move up and down as rides increase. This means that even their EC2 servers are fairly closely mapped to actual usage of their platform. During a slow period they will be running far fewer EC2 servers than during a peak period. So even in this case their costs are relatively variable.
All in all, I would say that you actually could extrapolate the total cost of AWS on a per-ride basis, because that is the variable that affects their costs. If they had 1 less ride this month, their bill would ACTUALLY be 14 cents lower (or somewhat close to it). They do incur a direct cost with each individual ride. If there are fixed AWS costs in there, they probably account for less than 1 penny of that 14 cents. For example there are a few fixed costs if they are using AWS DirectConnect or AWS Private Link to connect to AWS Datacenters. But this accounts for probably a few thousand dollars a month at most and in total would have minimal effect on their per-ride cost at the current scale that Lyft is at.
edit: spelling
Just to clarify, I would agree that 14 cents is an accurate figure even if a single ride didn't translate to 14 cents, but doubling the rides increased the spend by an average of 14 cents.
My suspicion is that a bulk of those costs aren't things that scale with more rides coming in.
You think their fixed AWS costs are more than 4 million per month, even with almost zero users? I sure hope not.
I've worked at startups with AWS technical debt that ended up costing them an amount way out of line with what they really needed to use.
With that said, I'm definitely confused by (the evidence introduced in) the GP's response (linking to the Lyft/AWS video). I mean, both AWS and Lyft are trumpeting the setup as an efficient use of AWS, and it still comes to $0.14/ride? That doesn't add up.
People seem to have forgotten the productivity loss of running their own data center. At my previous company, most things were just excruciating compared with what I enjoyed at Netflix. Docker containers did not come with persistent volumes; provisioning machines involved lengthy approvals and sometimes long waits; building and deploying services that involved multiple clusters that talked to each other required serious devop-fu; production clusters got broken because infra upgraded system dependencies using a background job, and mitigating such simple mistakes could take hours if not days. Deploying any non-trivial service required multiple meetings with some infra teams, for resources, for concerns with stability, or whatever. PS, did I mention that we built tens of thousands of Puppet files that very few people understood and any updates to my system took half an hour to refresh yet the early members of the infra team thought Puppet was the best f@#$ tool in the world? The list can go on and on. PPS, did I mention that we had a 500-person infra team and later probably more than a thousand for a fraction of traffic/complexity compared with what Netflix had to deal with, while Netflix didn't even have an infra team (to be fair, Netflix had a build team and a platform team and a monitoring team, which had fewer than 30 people combined)?
In contrast, EC2 alone would have saved us a lot of pain.
You may say that my previous company was not technically great. Maybe. If that's the case, how many companies can be really better? I think the fundamental problem is that companies tend to underestimate the effort to build a world-class infrastructure from scratch as well as the opportunity cost due to loss of productivity. Netflix's leaders were truly wise by claiming that Netflix didn't want to work on undifferentiated heavy lifting early on.
The thread is much more interesting than the initial tweet.
https://threadreaderapp.com/thread/1102401615263223809.html
This thread only compares ROI of hosting in cloud with building own datacenter and laying own oceanic fiber(!!). The option of using a colo and magistral ISP somehow fell through the cracks.
Of course building one datacenter and own oceanic fiber to host your own services is a bad option for almost every business - not every business can fill a full DC, and 1 DC is not enough to guarantee global operations.
The original author mentioned how well AWS worked for Netflix, but did not mention that Netflix rolled their own content delivery system in part because of high traffic cost with all the major cloud providers.
My understanding is that most commercial CDNs (Limelight, Akamai, etc) are also more cost-effective than serving directly from AWS. However, commercial CDNs don't know a-priori what files are going to be popular, and how to distribute them in different regions and on different types of storage to maximize efficiency. But that's one of the things that Netflix's CDN team can predict. See https://medium.com/netflix-techblog/distributing-content-to-...
Exactly, AWS is perfect for running the search and UI features but you have to have the content at the edge or get killed on bandwidth.
He did mention that about netflix:
https://twitter.com/MohapatraHemant/status/11024101732036608...
> E.g. if you are +30% of the internet traffic (nflx) it doesn't make sense to pay rent to telcos any more and feed their margins. You have the volume and stable demand to justify ownership. For the rest, cloud is where they'll live and die.
Hurricane Electric currently have a 42U cabinet and 1 gigabit of bandwidth for $400pcm. That's the cost of about 6 AWS t2.large VMs
That's 50 cents per hour to colocate dozens of servers, and there's 324TB of data thrown in for free.
Obviously you have the tin to buy and maintain which will cost you. Leasing costs for a typical HP server are say $100/month.
10G transit on its own is around $2k/month with H.E, or 1 cent for 16GB if it's fully saturated.
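The colo numbers above can be reproduced with a few lines. This is a sketch assuming a 730-hour average month for the hourly figure, a 30-day month of fully saturated link for transfer, and decimal units (1 TB = 1000 GB); the prices are the ones quoted in this comment.

```python
HOURS_PER_MONTH = 730               # average month, used for the hourly cost
SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month, used for transfer volume


def monthly_transfer_tb(gbps: float) -> float:
    """Decimal TB moved in a 30-day month with the link fully saturated."""
    gigabytes_per_second = gbps / 8
    return gigabytes_per_second * SECONDS_PER_MONTH / 1000


def gb_per_cent(gbps: float, monthly_price_usd: float) -> float:
    """GB of transfer bought per US cent at full saturation."""
    total_gb = monthly_transfer_tb(gbps) * 1000
    return total_gb / (monthly_price_usd * 100)


print(f"cabinet: ${400 / HOURS_PER_MONTH:.2f}/hour")
print(f"1 Gbps:  {monthly_transfer_tb(1):.0f} TB/month")
print(f"10 Gbps at $2000/month: {gb_per_cent(10, 2000):.1f} GB per cent")
```

That gives about 55 cents an hour for the cabinet, 324 TB a month on the 1 gigabit link, and roughly 16 GB per cent on the 10G transit. Real utilization will be well below full saturation, so the effective per-GB price is higher.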
> That's 50 cents per hour to colocate dozens of servers, and there's 324TB of data thrown in for free.
That deal actually only includes 15A of power, which isn't enough to fill the cabinet if your machines aren't idle.
And really it’s only 12amps! 120v! What the heck are you going to do with 42U and 1,440 watts?!
I couldn’t believe it was a 15A circuit for the whole cabinet when I went there.
This was several years ago but the other issue I had was DDoS against someone else there collaterally taking down our network. Ended up moving to 55 S Market in San Jose and have been very happy. Those racks are at least 40 amps.
My bays tend to have 2x 32A provision, but we run grown up voltages in Europe.
I've never heard the phrase "magistral ISP" and I can't make the connection to https://www.merriam-webster.com/dictionary/magisterial, could you explain that?
I meant backbone network provider, as opposed to a consumer ISP.
Great app, didn't know it exists! Thank you.
And if mods feel it's appropriate, then I don't mind if the link is swapped.
The twitter desktop page shows the whole thread already.
The link shows the thread.
Lyft is horribly inefficient in terms of using their AWS resources. I know this first hand. They don't have to build their own DC. They just need to build a better system. Possibly take advantage of Lambdas, because a lot of compute-heavy operations like geo en/decoding are done in services that run at like 25% CPU in production.
Not to pick on you specifically as others are saying more or less the same thing... But if you're such a genius that you know how to slash Lyft's AWS spend without knowing basically a thing about their infrastructure, you should be knocking loudly on their door because I'm sure they have an extremely well-paid job waiting for you.
Otherwise, maybe consider voicing opinions like these about things you actually know something about.
ADDED: Meant to respond to a different comment that didn't suggest any inside knowledge. (Though I'd still point out that even insiders often can't understand how their own company has so many employees in X department, is so inefficient, etc. And this is more or less the case everywhere.)
"I know this first hand" implies they do know something about their infrastructure.
Given that they work for "one of few >10b unicorns" per their comment history, I have a suspicion you may be telling a Lyft employee they don't know anything about Lyft.
I actually meant to respond to a different comment which merely said this was a few hours of an AWS instance.
Well, that's factual - you can get a couple of hours' time on an entire AWS instance for $0.14. That certainly seems to point to a cost-inefficient infrastructure.
I don't really have an opinion on how efficient or inefficient their infrastructure is. AWS has lots of services and I'm sure Lyft is using a heck of a lot more of them than just a bunch of EC2 instances.
I know exactly how they _can_ improve things but the engineering organization is not well run to do big infra-only projects like that.
Some quick math: last I heard, Lyft was doing 1M rides a day. That's 1M transactions per day, or "only" 11 transactions per second. Of course there's a lot more (assuming 10x) supporting transactions to enable one ride - search, matching, status updates. Let's say that adds another 100 transactions per second. Rounding up to 150/sec. What am I missing?
In terms of data usage and storage, matching is by geography, meaning their data is easy to shard.
So, how's this $8M per month?
edit: fixed a typo
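The back-of-envelope estimate above, written out as a sketch. The 1M rides/day figure and the 10x multiplier are the commenter's guesses; the function name and the evenly-spread-load assumption are added here for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_tps(rides_per_day: int, supporting_multiplier: int = 10) -> dict:
    """Average transactions per second if load were spread evenly over a day.

    supporting_multiplier is the guessed number of extra transactions
    (search, matching, status updates) behind each ride.
    """
    ride_tps = rides_per_day / SECONDS_PER_DAY
    supporting_tps = ride_tps * supporting_multiplier
    return {
        "ride_tps": ride_tps,
        "supporting_tps": supporting_tps,
        "total_tps": ride_tps + supporting_tps,
    }


if __name__ == "__main__":
    for name, value in average_tps(1_000_000).items():
        print(f"{name}: {value:.1f}")
```

This lands a bit under the 150/sec figure. It is a daily average only, so it says nothing about peak load or about how many internal calls each transaction fans out into.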
Your math isn't wrong, but the load isn't simply averaged out over the entire day. There will be peak capacity times way, way higher than the lowest times. I would guess orders of magnitude larger. You have to build for peak capacity. If their target response time is 100ms or something, then that adds in lots of complexity.
I agree the load will fluctuate - but will mostly follow predictable patterns daily and weekly. However, that's what the cloud's elasticity is for - and AWS makes it easy (and inexpensive) to expand and contract when implemented correctly. Man, I would love an opportunity to optimize something like that!
Right?! The cost is way too high for their use case, even considering a large padding to OP's quick maths.
One month, part of my team stopped all development and focused solely on reducing AWS costs, and we cut our monthly bill by 50% (probably around 4 devs' yearly salaries).
It's not easy, but optimizing AWS resources is an absolute must.
Idk, but I will say that there is definitely stuff going on beyond the ride. I.e. for every n people who open the app only a small number will book a ride. Billing services, driver location tracking, analytics stuff, etc are all going to balloon the amount of computing that the service uses.
There are also location updates happening every few seconds for all active drivers, which is a lot of updates per second. Again, though, easily shardable.
Also remember the drivers are sharing location constantly even when not currently driving a client. And I would guess your transactions per ride is off by at least one order of magnitude, especially if you count a transaction between services (e.g. gps to push, payment to user table, whatever)
10x might be a bit low. During a ride the app is probably constantly in touch with the server updating its position and getting new maps, updates on the positions of others etc. Could be more like 10,000x? And when you are just walking around with the app in your pocket they may be tracking you - dunno.
they must be running some gargantuan machine learning/particle detecting/alien signal analyzing operation, otherwise it's impossible to imagine where these levels of petaflops are being wasted
>> So, how's this $8M per month?
Microservices? :-)
Are we talking just about machines? A huge chunk of AWS is actually enterprise SAAS.
Analytics, authentication, support etc. you don't know what Lyft uses exactly here from AWS. For example if they use Auth0 instead of Cognito they pay that part to Auth0. AWS prices are very competitive most of the time.
AWS is the Walmart of enterprise IT, they sell you everything else too.
One could say: AWS is the... Amazon.com of enterprise IT, they sell you everything else too :D
These posts always make it like "cloud is better vs building datacenters on mars".
There are middlegrounds, like colocation & dedicated servers. If you get dedicated servers 4x cheaper than cloud-shared-vps with remote-hdd, then you can overprovision.
Especially now that hardware is getting bigger you need even less space (assuming your software scales vertically).
And they ALWAYS make it like the next hour you will have 10x requests and your database will autoscale that quickly.
Agreed. It's disingenuous of the article OP to:
- only parenthetically mention colocation, as if it isn't absolutely normal to rent a rack/cage/suite as needed from an existing DC operator (which in addition would make datacenter rent an OpEx and actual server equipment be amortized over 2-3 years, not the 10 for real property)
- somehow present "intercontinental traffic" as something both necessary and tremendously expensive, as if all the major public clouds aren't charging 10x what the market rates for bandwidth are
- imply that "the cloud" is immune to outages, as if GCP didn't have multiple major global- or region-wide outages over the past few years
I mean, I understand why most companies don't go on-prem despite all that, but this series of tweets is borderline FUD.
I'm not sure the article OP is being disingenuous. It's a thread _specifically_ about Lyft's costs - and Lyft is clearly at a scale where they would NOT be colocating.
Of course they would. They are a taxi service, not a Dropbox or YouTube. They have obscene demand for neither storage nor bandwidth. They need some servers tracking their users' coordinates, running a "who is nearest" search in case someone wants a ride, some GPS navigation/pathfinding during the ride and some billing code after it.
If you don't blow this up to stupid proportions, that should be able to run on modern hardware in a few racks at a colo for millions of users per day, especially since you can neatly shard the load geographically, thus distributing load (and rented racks) over multiple DCs, ideally with failover in place for emergencies. The only thing they really need to merge is the billing data at the end, but handling billing data of a global userbase of tens or hundreds of millions of users in a single system is a solved problem nowadays and does not even constitute a case worthy of the overused "big data" moniker.
Colo can mean "rent a suite from DRT in one location and a building in another" in addition to space in a rack. Lyft's probably around that size (and it's a good idea to do some colo to gain experience before trying to build out your own anyways)
As this post makes clear, it's not the things you can control in a dedicated/colo environment you should be worried about. You can always hand them more dollars for more storage and compute.
You can never hand them enough dollars to get their SLAs/latency on par with the cloud. There will be outages. There will be delays. There will be scale problems. And you'll have to address these somewhere else, either in your tech stack, your operating model, or your PR department.
Paying a 2-5% TCO premium to have a throat to choke, 5 9s redundancy, GDPR compliance, and the law of large numbers on your side of the court is a pretty fair trade.
Change the TCO to 2x-5x then you're right.
Good luck with the throat choking.
Also, what does GDPR have to do with renting a VPS in the cloud?
Amazon got 2 cents from me typing this note and reading ycombinator for an hour.
I love when indirect costs are naively calculated for marginal cost estimates.
HN runs on a dedicated box colo'd at M5 Hosting :-P
A single box?
Yes.
https://news.ycombinator.com/item?id=11916168
that is awesome
A single core inside 1 server!
14 cents per ride is nothing! Maybe 1% of the cost of a ride.
Maybe they could save a bunch by colo or something else. But would 14 cents per ride really matter at all for their competitiveness? I’m not going to notice a 14 cent difference even if I do bother to price compare Uber with Lyft.
This is a VC fueled market. It isn’t really about small margins of this size.
It may be 1% of the cost of the ride, but it's likely a substantially larger part of the profits of the ride.
Lyft lost a billion dollars on $2B of revenue in 2018.
So saving $8m a month, or $100m a year, isn't going to turn that round.
Cutting 10% of losses is hardly something to sneer at.
If you're still losing $100m a month you have bigger problems than saving a couple million in infrastructure costs.
Moving from AWS, or doubling AWS spending, will make no difference to the company's viability, it's not worth the time in meetings to discuss it.
Man this industry is crazy. Most traditional companies fight hard to gain even 1% in net margin.
Most traditional companies are based on having revenues higher than costs.
dotcom v3.0 companies are all about the potential and cornering the market. Amazon was exactly the same - it was founded in 1994 but didn't make a profit until 2001.
For Lyft, having a taxi anywhere I want one, with low wait times and at low cost, is what is going to secure their success. Not new features in their app, not even latency to the data center.
There's no reason they need to be spending such crazy amounts on servers - ostensibly to allow faster iteration. A new version of their app just isn't going to move the needle. Signing up new drivers will though.
I fail to see how the benefits AWS provides are so important they need to spend such crazy amounts.
They need to be seen to be doing something other than bleeding money.
Every extra driver costs them money, every lift costs them money. If they can hold out the illusion of "we know we can save money here when we've got time and have won the market", maybe it keeps the money rolling in.
More like 5% if they cut costs to 0. But realistically they can cut costs by maybe half... so you're taking significant risk for cutting losses by... 2.5%?
You wouldn't be able to cut all of it. Also, making those major infrastructure changes is really expensive.
Maybe they could have a reasonably efficient setup instead?
Not sure how you can possibly pay 0.14$ in computing for a single ride (if that's accurate). That's more than 3 hours of a t3.medium instance for example...
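For what it's worth, the "more than 3 hours" figure checks out under one assumption: an on-demand t3.medium price of $0.0416/hour. That is a list price recalled for a US region, not something stated in the thread, and it varies by region and over time. A minimal sketch:

```python
# Assumption: on-demand Linux price for t3.medium in a US region.
# Check current AWS pricing before relying on this number.
T3_MEDIUM_HOURLY_USD = 0.0416


def instance_hours(budget_usd: float, hourly_rate_usd: float) -> float:
    """How many instance-hours a given budget buys at a flat hourly rate."""
    return budget_usd / hourly_rate_usd


if __name__ == "__main__":
    hours = instance_hours(0.14, T3_MEDIUM_HOURLY_USD)
    print(f"$0.14 buys about {hours:.2f} hours of a t3.medium")
```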
The tweet took Lyft's total AWS costs and divided it by total rides. It's not a literal accounting of the compute cost for a single ride.
AWS costs that wouldn't really fit into a per-ride accounting:
* Redundancy of instances (regions/AZs)
* Data-storage/duplication/backups
* Non-ride related AWS costs (hosting/processing of analytics, test & automation infrastructure, etc.)
I'm sure there are others. I'm not saying that there's no way they could get their AWS costs down, btw.
Well, each driver needs a mobile+plan, at ~$100/month fully amortized. Definitely does <1000 rides/month.
APPL, VRZN each get 5c/ride - just for showing up.
I started this comment as a joke. But now I am not so sure.
And while i am at it...
Suggesting Lyft needs 5-9s for 1000s of servers is ridiculous. Its entire event stream comes from a mobile network which is probably less reliable.
It's not that simple - if Lyft won't connect and everything else works, riders and drivers will turn to competitors. Once they've installed the app and signed up for an account with Uber, they're a lot less likely to stick exclusively or primarily with Lyft.
Well, 5 9's is about 5 minutes of disruption a year. Drivers won't be able to get into Uber in that timeframe. 4 9's would be about 50 minutes a year. Unless it is all concentrated in a couple of events, it may not even make the clients time out.
OK, but that's just the AWS expense. How much per ride does Lyft spend on its own devs (and admin staff) and boxes (and electric and RE costs)?
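The downtime budgets behind those "nines" can be computed directly. A small sketch, assuming a 365-day year and that availability is measured over the whole year; the function name is made up here.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for N nines of availability.

    nines=5 means 99.999% availability, nines=4 means 99.99%, and so on.
    """
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} nines: {downtime_minutes_per_year(n):.2f} minutes/year")
```

Five nines works out to a little over 5 minutes a year and four nines to a little over 50, matching the rounded figures above.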
That + business costs are pretty much the rest of the total cost, right?
I always see the motivation for the cloud framed mainly in financial terms, but is there a benefit to staying on-site in terms of knowledge retention (for lack of a better word)? For instance, there is a big push to outsource manufactured parts and goods, mainly due to "cost savings". However, there are long-term cost savings when we keep the parts made in-house, which management always overlooks. Sure, it costs money to maintain the machines, inventory, and tools, but this allows greater flexibility and also retains the knowledge within the company. In addition, it allows the engineers to drive technology within the company faster. Is this the case for data center infrastructure?
You are a bit more flexible in the data center. Like, if you really need very low latency and high bandwidth between two machines, you could buy an infiniband thingy. Expensive AF, but maybe less expensive than setting a horde of software engineers on it for two years to deal with the higher latency.
But, all in all, there aren't all that many practical use cases that really benefit from stuff like that. Most projects are quite happy with some storage, compute, network, and a few managed services. Most companies don't do rocket science with their software and infrastructure.
And there are things that you simply can only afford at scale, like security management for your supply chain, dedicated automation for updating firmware of every tiny controller that's in your hardware, and so on.
"...you'll never have hot-swappable everything managed by ultra fast robotic arms that replace hard drives in seconds."
Does anyone know anything more about these robotic hard drive replacers?
I've seen it done for tape libraries.
My understanding is that the biggies - Google, Amazon, Facebook, etc. - don't bother with individual drive replacements (and certainly not with an expensive robotic arm system). They wait for an entire rack to fault over a certain percentage and then just swap out the whole thing.
Depends on the case. Facebook at least generally replaces drives as they go. In a pipelined fashion rather than as they fail, though. So, a given drive failure doesn't lead to immediate emergency replacement, but sometime in the next few days as someone makes a drive-replacement pass through that area.
Robotic tape libraries (like the StorageTek/Sun/Oracle SL8500) are standard.
However, the robotic part is not to swap parts in case of failure - instead, it is to let ~64 tape drives access ~100k individual tapes within the library and ~inf tapes in off-site storage (as the data stored is not, and _shouldn't_ be, directly accessible).
Our (small) company has a $500-$1000 AWS bill. About 2 months ago it was double that. This is a 3-4 developer company. When we hire another person we'll probably increase our EKS instances by one more box. We already paid down our debt - and it definitely cost us about 2 months with half our dev staff. But our goal wasn't to reduce the AWS cost, it was to reduce the time it takes to replicate an environment in K8s with one script. I'm guessing Lyft has not taken a path like that.
I think what's interesting to know is how much of that 14 cents Amazon keeps as profit and how much goes directly to costs. Some analysis suggests AWS has a 25% profit margin, which means that of that 14 cents, Amazon really only pockets ~3.5 cents per Lyft ride.
Another interesting question is whether that profit margin includes the original capex from the early years, when Amazon was not reporting AWS revenue/profit separately. There are also all sorts of creative accounting methods to hide capex, such that I wouldn't be surprised if Amazon is selling at cost to buy marketshare, in a way somewhat analogous to how Uber subsidizes rides to gain early marketshare.
I am surprised no one is talking about Lyft's risk assessment of their business. Specifically, this line:
> Our results of operations vary and are unpredictable from period-to-period, which could cause the trading price of our Class A common stock to decline.
> Our results of operations have historically varied from period-to-period and we expect that our results of operations will continue to do so for a variety of reasons, many of which are outside of our control and difficult to predict.
I find the term "period-to-period" rather vague. It could mean quarter-to-quarter or even year-to-year.
Still what they are effectively saying is that Lyft cannot be trusted to provide any growth assessment.
A really elaborate (and admittedly interesting) way of repeating the constant refrain: Things that work for the extremely large-scale usually aren't ideal for the small- or medium- scale. This applies to everything from codebase size to infrastructure size, and the opposing force tends to be engineers' inherent enthusiasm for engineering things. Be honest about your actual needs and know when not to chase the N=∞ case.
Is lyft considered extremely large scale?
Well, that's the question. The guy basically makes the case that it isn't, for this purpose, and that Netflix and Dropbox are. Good insight into the hairy details, but the overall thesis is pretty simple.
I would say he's making the case that Lyft doesn't have a single component with ridiculous scale like Netflix CDN or Dropbox storage. That could just mean Lyft's costs are more evenly distributed.
This is why Twitter is terrible for having real conversations. This was painful to read because of formatting and structure.
There's the threadreader version https://threadreaderapp.com/thread/1102401615263223809.html
But also terse. I like that
I dunno, hetzner and contabo sell boxes for $20/mo
I still find it weird that Lyft needs so much processing per ride, it just doesn’t sound efficient
And a lot of ppl seem to find this normal. It's not, this could be money that pays the driver more
Certification has nothing to do with DC right? It depends on the product. You have to get pci done even if you use AWS
Something to consider in all estimates - whether your cloud provider could run into a conflict of interest with you.
Their payment processor also takes a big cut. This is just a cost of doing business.
how much per ride goes in taxes?
How are they paying this much?
Why would you need to build underwater fiber????????? I don't understand.
Especially for Lyft, where you can be pretty sure most stuff won’t cross oceanic boundaries: “Raoul is arriving in a white Cunard QM2-G32”
Look out for ‘Uber for Ferries’ in this summer’s YC batch!
The title is clickbaity.
They get paid that ~14 cents but they don't 'make' that much.
What they make is what is left after costs.
If you were to say how much you get paid, you would in general say you make x an hour. You wouldn't say "Well, after taxes, rent, food, gas, insurance, etc, I make x-y an hour."