How we decreased GitLab repo backup times from 48 hours to 41 minutes

about.gitlab.com

593 points by nayak 6 months ago

divbzero 6 months ago

The performance improvement that GitLab contributed to Git is slated to be released with v2.50.0:

https://github.com/git/git/commit/bb74c0abbc31da35be52999569...

Nemo_bis 6 months ago

Nice that GitLab is active upstream! Do I see correctly that the most active contributors to git are currently GitLab employees?
- dfc 6 months ago
  
  I dont know how you came to that conclusion. Just looking at the top 4 contributors over the last 2 years[1] it looks like one works for google(gitster), two work at Github and one works at GitLab(pks-t).
  [1]: https://github.com/git/git/graphs/contributors?from=6%2F3%2F...
  
  Nemo_bis 6 months ago
  
  I had looked at the monthly activity. Indeed on a longer timespan it looks different, thanks.
  
  dfc 6 months ago
  
  That's weird I looked at the monthly and did not see any Gitlab employees. Were you looking at a different repo?
  
  nayak 6 months ago
  
  OP here. As of recently, GitLab has a dedicated Git team [1]. So our contributions to the project will hopefully increase a lot more :)
  [1]: https://handbook.gitlab.com/handbook/engineering/infrastruct...

LgWoodenBadger 6 months ago

IME, it has always turned out to be the correct decision to eliminate any n^2 operation in anything I’ve written.

I don’t write exotic algorithms, but it’s always astounding how small n needs to be to become observably problematic.

EdwardCoffin 6 months ago

Bruce Dawson says: I like to call this Dawson’s first law of computing: O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there.
https://bsky.app/profile/randomascii.bsky.social/post/3lk4c6...
- paulddraper 6 months ago
  
  The second law is that O(n * log n) is for practical intents and purposes O(n).
  
  sn9 6 months ago
  
  Skiena has a great table in his algorithms book mapping time complexity to hypothetical times for different input sizes.
  For n of 10^9, where lgn takes 0.03 us and n takes 1 s, nlgn takes 29.9 s and n^2 takes 31.7 years.
  
  swyx 6 months ago
  
  more from table please?
  
  johnisgood 6 months ago
  
  I would rather have the table and related content. Name of the book?
  
  EdwardCoffin 6 months ago
  
  It's probably The Algorithm Design Manual 2ed by Steven S. Skiena, figure 2.4
  The second table on this [1] page is pretty similar, though not the same.
  [1] https://a1120.cs.aalto.fi/notes/round-efficiency--bigoh.html
  
  EdwardCoffin 6 months ago
  
  To be clear though, that isn't his second law, at least as of two months ago, according to https://bsky.app/profile/randomascii.bsky.social/post/3lk4c6...
  
  to11mtm 6 months ago
  
  Fair, but `n log n` definitely is the historical "good enough to actually sleep at night" in my head, every time I see it I think of the prof who taught my first CSC course and our data structures course due to how often it came up.
  Also, the wise statement that 'memory is fairly cheap compared to CPU for scaling'. It's insane to see how often folks would rather manually open and scan a 'static-on-deploy' 20-100MB Json file for each request vs just parsing it into structures in memory (where, for most cases, the in memory usage is a fraction of the json itself) and just caching the parsed structure for the length of the application.
  
  hinkley 6 months ago
  
  Not often but occasionally I will chose the nlogn algorithm which obviously has no bugs over the O(n) algorithm with no obvious bugs.
  Less brittleness is worth paying a few percent. Especially if it unmuddies the waters enough for someone to spot other accidental (time) complexity.
  
  refulgentis 6 months ago
  
  Considerably more than a few percent, IMHO. :)
  But I also don't dabble in this area nearly enough to know whether there's years of tears and toil finding out repeatedly that O(n) is ~impossible to implement and verify :)
  | n | n log n | | 5 | 8.0472 | | 10 | 23.0259 | | 25 | 80.4719 | | 50 | 195.6012 | | 100 | 460.5170 |
  
  Someone 6 months ago
  
  Depends on the constants and on the value of n. If the constant for the O(n log n) algorithm is five times that of the O(n) algorithm, the O(n) algorithm is faster for n < 100.
  If you expect that n < 100 will always hold, it may be better to implement the O(n) algorithm and add a logging warning if n > 250 or so (and, maybe, a fatal error if n > 1000 or so), instead of spending time to write both versions of the algorithm and spend time finding the cut off value for choosing between the two.
  
  hinkley 6 months ago
  
  Fatal errors tend to blow up in production rather than test.
  One of the simplest solutions for detecting cyclic graphs is instead of collecting a lookup table or doing something non-concurrent like marking the nodes, is to count nodes and panic if the encountered set is more than an order of magnitude more than you expected.
  I came onto a project that had done that before and it blew up during my tenure. The worst case graph size was several times the expected case, and long term customers were growing their data sets vertically rather than horizontally (eg, ever notice how much friction there is to making new web pages versus cramming more data into the existing set?) and now instead of 10x never happening it was happening every Tuesday.
  I was watching the same thing play out on another project recently but it got cancelled before we hit that threshold for anything other than incorrect queries.
  
  refulgentis 6 months ago
  
  Just wanted to say you're one of my favorite posters. Can't put an exact reason on why, but at some point over the last 15 years I learned to recognize your name simply from consistent high quality contributions. Cheers.
  
  hinkley 6 months ago
  
  This is magic thinking about how C, memory hierarchies, networking, and system calls work.
  
  rcxdude 6 months ago
  
  It's also often in the range where constant factors can make a big difference over a wide range of n
  
  brucedawson 6 months ago
  
  Should it be my second law?
  https://bsky.app/profile/randomascii.bsky.social/post/3lr24s...
  
  paulddraper 6 months ago
  
  Yes, that isn't actually Dawson's second law.
  
  hinkley 6 months ago
  
  But sometimes a big enough C can flip which solution helps you hit your margins.
  
  hansvm 6 months ago
  
  In my mind, that's always been the point in dropping log factors. The algorithms are comparable enough that the actual implementation starts to matter, which is all we're really looking for in a Big-O analysis.
- hinkley 6 months ago
  
  I made the “mistake” in an interview of equating two super-quadratic solutions in an interview. What I meant was what Dawson meant. It doesn’t matter because they’re both too ridiculous to even discuss.
  
  dieortin 6 months ago
  
  They’re too ridiculous… unless a more optimal solution does not exist
  
  hinkley 6 months ago
  
  Absolutely not.
  If the cost of doing something goes above quadratic, you shouldn't do it at all. Because essentially every customer interaction costs you more than the one before. You will never be able to come up with ways to cover that cost faster than it ramps. You are digging a hole, filling it with cash and lighting it on fire.
  If you can't do something well you should consider not doing it at all. If you can only do it badly with no hope of ever correcting it, you should outsource it.
  
  Retric 6 months ago
  
  Chess engines faced worse than quadratic scaling and came out the other side…
  Software operates in a crazy number of different domains with wildly different constraints.
  
  crabmusket 6 months ago
  
  I believe hinkley was commenting on things that are quadratic in the number of users. It doesn't sound like a chess engine would have that property.
  
  tux3 6 months ago
  
  They did make it sound like almost anything would necessarily have n scale with new users. That assumption is already questionnable
  There's a bit of a "What Computational Complexity Taught Me About B2B SaaS" bias going.
  
  delusional 6 months ago
  
  All of modern Neural Network AI is based on GEMM which are O(n^2) algorithms. There are sub-cubic alternatives, but it's my understanding that the cache behavior of those variants mean they aren't practically faster when memory bound.
  n is only rarely related to "customers". As long as n doesn't grow, the asymptotic complexity doesn't actually matter.
  
  saagarjha 6 months ago
  
  The GEMM is O(n^3) actually. Transformers are quadratic in the size of their context window.
  
  hinkley 6 months ago
  
  I read that as a typo given the next sentence.
  I’m on the fence about cubic time. I was mostly thinking of exponential and factorial problems. I think some very clever people can make cubic work despite my warnings. But most of us shouldn’t. General advice is to be ignored by masters when appropriate. That’s also the story arc of about half of kung fu movies.
  Did chess solvers really progress much before there was a cubic approximation?
  
  delusional 6 months ago
  
  > I read that as a typo given the next sentence.
  Thank you for the courtesy.
  > I think some very clever people can make cubic work despite my warnings.
  I think you're selling yourself short. You don't need to be that clever to make these algorithms work, you have all the tools necessary. Asymptotic analysis is helpful not just because it tells us a growth, but also because it limits that growth to being in _n_. If you're doing matmul and n is proportional to the size of the input matrix, then you know that if your matrix is constant then the matmul will always take the same time. It does not matter to you what the asymptotic complexity is, because you have a fixed n. In your program, it's O(1). As long as the runtime is sufficient, you know it will never change for the lifetime of the program.
  There's absolutely no reason to be scared of that kind of work, it's not hard.
  
  hinkley 6 months ago
  
  Right but back up at the top of the chain the assertion was that if n grows as your company does then IME you’re default dead. Because when the VC money runs out you can’t charge your customers enough to keep the lights on and also keep the customers.
  
  saagarjha 6 months ago
  
  I mean, the general advice is that if you actually understand what you're doing, then you don't need general advice.
  
  crote 6 months ago
  
  That only matters when the constants are nontrivial and N has a potential to get big.
  Not every app is a B2C product intending to grow to billions of users. If the costs start out as near-zero and are going to grow to still be negligible at 100% market share, who cares that it's _technically_ suboptimal? Sure, you could spend expensive developer-hours trying to find a better way of doing it, but YAGNI.
  
  hinkley 6 months ago
  
  I just exited a B2B that discovered they invested in luxury features and the market tightened their belts by going with cheaper and simpler competitors. Their n wasn’t really that high but they sure tried their damnedest to make it cubic complexity. “Power” and “flexibility” outnumbered, “straightforward” and even “robust” but at least three to one in conversations. A lot of my favorite people saw there was no winning that conversation and noped out long before I did.
  The devs voted with their feet and the customers with their wallets.
  
  patrick451 6 months ago
  
  There are two many obvious exceptions to even start taking this seriously. If we all followed this advice, we would never even multiply matrices.
  
  marcinzm 6 months ago
  
  > no hope of ever correcting it
  That's a pretty bold assumption.
  Almost every startup that has succeeded was utterly unscalable at first in tons of technical and business ways. Then they fixed it as they scaled. Over-optimizing early has probably killed far more projects and companies than the opposite.
  
  hinkley 6 months ago
  
  > That’s a pretty bold assumption.
  That’s not a bold assumption it’s the predicate for this entire sidebar. The commenter at the top said some things can’t be done in quadratic time and have to be done anyway, and I took exception.
  >> unless a more optimal solution does not exist
  Dropping into the middle of a conversation and ignoring the context so you can treat the participants like they are confused or stupid is very bad manners. I’m not grumpy at you I’m grumpy that this is the eleventeenth time this has happened.
  > Almost every startup
  Almost every startup fails. Do you model your behavior on people who fail >90% of the time? Maybe you, and perhaps by extension we, need to reflect on that.
  > Then we fixed it as we scaled
  Yes, because you picked a problem that can be architected to run in reasonable time. You elected to do it later. You trusted that you could delay it and turned out to be right.
  >> unless a more optimal solution does not exist
  When the devs discover the entire premise is unsustainable or nobody knows how to make it sustainable after banging their heads against it, they quickly find someplace else to be and everyone wonders what went wrong. There was a table of ex employees who knew exactly what went wrong but it was impolitic to say. Don’t want the VCs to wake up.
  
  account42 6 months ago
  
  Not all n's grow unbounded with the number of customers. If anything, having a reasonable upper bound for how high a n you have to support is the more common case - and you're going to need that with O(n) as well.
  
  endgame 6 months ago
  
  I feel this is too hardline and e.g. eliminates the useful things people do with SAT solvers.
  
  hinkley 6 months ago
  
  The first SAT solver case that comes to mind is circuit layout, and then you have a k vs n problem. Because you don’t SAT solve per chip, you SAT solve per model and then amortize that cost across the first couple years’ sales. And they’re also “cheating” by copy pasting cores, which means the SAT problem is growing much more slowly than the number of gates per chip. Probably more like n^1/2 these days.
  If SAT solvers suddenly got inordinately more expensive you’d use a human because they used to do this but the solver was better/cheaper.
  Edit: checking my math, looks like in a 15 year period from around 2005 to 2020, AMD increased the number of cores by about 30x and the transistors per core by about 10x.
  
  IshKebab 6 months ago
  
  That's quite a contortion to avoid losing the argument!
  "Oh well my algorithm isn't really O(N^2) because I'm going to print N copies of the answer!"
  Absurd!
  
  hinkley 6 months ago
  
  What I’m saying is that the gate count problem that is profitable is in m³ not n³. And as long as m < n^2/3 then you are n² despite applying a cubic time solution to m.
  I would argue that this is essentially part of why Intel is flagging now. They had a model of ever increasing design costs that was offset by a steady inflation of sales quarter after quarter offsetting those costs. They introduced the “tick tock” model of biting off a major design every second cycle and small refinements in between, to keep the slope of the cost line below the slope of the sales line. Then they stumbled on that and now it’s tick tick tock and clearly TSM, AMD and possibly Apple (with TSM’s help) can now produce a better product for a lower cost per gate.
  Doesn’t TSM’s library of existing circuit layouts constitute a substantial decrease in the complexity of laying out an entire chip? As grows you introduce more precalculated components that are dropped in, bringing the slope of the line down.
  Meanwhile NVIDIA has an even better model where they spam gpu units like mad. What’s the doubling interval for gpu units?
  
  Tainnor 6 months ago
  
  Gaussian elimination (for square matrices) is O(n^3) arithmetic operations and it's one of the most important algorithms in any scientific domain.
  
  hinkley 6 months ago
  
  I’ll allow that perhaps I should have said “cubic” instead of “quadratic” - there are much worse orders in the menagerie than n^3. But it’s a constraint we bang into over and over again. We use these systems because they’re cheaper than humans, yes? People are still trying to shave off hundredths of the exponent in matrix multiplication for instance. It makes the front page of HN every time someone makes a “breakthrough”.
  
  Tepix 6 months ago
  
  So, how would you write a solver for tower of Hanoi then? Are you saying you wouldn't?
  
  hinkley 6 months ago
  
  As a business? Would you try to sell a product that behaved like tower of Hanoi or walk away?
koala_man 6 months ago

Good call. O(N^2) is the worst time complexity because it's fast enough to be instantaneous in all your testing, but slow enough to explode in prod.
I've seen it several times before, and it's exactly what happened here.
- david422 6 months ago
  
  We just had this exact problem. Tests ran great, production slowed to a crawl.
  
  bcrl 6 months ago
  
  I was just helping out with the network at an event. Worked great in testing, but it failed in production due to unicast flooding the network core. Turns out that some of the PoE Ethernet switches had an insufficiently sized CAM for the deployment combined with STP topology changes reducing the effective size of the CAM by a factor of 10 on the larger switches. Gotta love when packet forwarding goes from O(1) to O(n) and O(n^2)! Debugging that in production is non-trivial as the needle is in such a large haystack of packets so as to be nearly impossible to find in the output of tcpdump and wireshark. The horror... The horror...
  
  hinkley 6 months ago
  
  First big project I worked on a couple of us sped up the db initialization scripts so we could use a less trivial set of test data to stop this sort of shenanigans.
  Things like inserting the test data first and turning on constraints and possibly indexes afterward.
esprehn 6 months ago

Modern computers are pretty great at scanning small blocks of memory repeatedly, so n^2 can be faster than the alternative using a map in cases for small N.
I spent a lot of time fixing n^2 in blink, but there were some fun surprises:
https://source.chromium.org/chromium/chromium/src/+/main:thi...
For large N without a cache :nth-child matching would be very slow doing n^2 scans of the siblings to compute the index. On the other hand for small sibling counts it turned out the cache overhead was noticably worse than just doing an n^2. (See the end of the linked comment).
This turns out to be true in a lot of surprising places, both where linear search beats constant time maps, and where n^2 is better than fancy algorithms to compensate.
Memory latency and instruction timing is the gotcha of many fun algorithms in the real world.
- kevincox 6 months ago
  
  This is true. But unless you are sure that current and future inputs will always be small I find it is better to start with the algorithm that scales better. Then you can add a special case for small sizes if it turns up in a hot path.
  This is because performance is typically less important for the fast/small case and it is generally acceptable for processing twice as much to be twice (or slightly more than twice) as slow, but users are far more likely to hit and really burned by n^2 algorithms in things you thought would almost always be small and you never tested large enough sizes in testing to notice.
  I wrote more on this topic here https://kevincox.ca/2023/05/09/less-than-quadratic/
- hinkley 6 months ago
  
  A lot of computations are really higher complexity order functions with stair steps at certain intervals based on hardware trying to pretend they are constant time. All operations cost the same amount until n doubles again and then it’s slower. If you zoom out toward infinity, the stair steps smooth out into a logarithmic curve. It becomes logarithmic in the former case and square root in the latter. Even dividing two numbers or doing a memory address lookup stops being C, which is part of why prime factoring worked for RSA for so long.
  If anyone had made clockless logic work you would see that adding 1 + 1 is in fact faster than adding 2^63 + 1.
  If you put enough data into a hash table the key length has to increase logarithmically to the table size in order to have distinct keys per record. Even Knuth points out that hash tables are really nlogn - something I’m pretty sure my CS professors left out. In multiple classes. Man, did I get tired of hash tables, but I see now why they harped on them. Case on point, this article.
- throwaway2037 6 months ago
  
  > where linear search beats constant time maps
  Can you give an example? You said lots of good things in your post, but I struggling to believe this one. Also, it would help to see some wall clock times or real world impact.
  
  hansvm 6 months ago
  
  Pick any compiled language and test it. Pick an algorithm making heavy use of a small (<10, maybe up to a hundred elements) hashset, and try using a linear structure instead. The difference will be most apparent with complicated keys, but even strings of more than a few characters should work.
  Some example workloads include:
  1. Tokenization (checking if a word is a keyword)
  2. Interpretation (mapping an instruction name to its action)
  3. ML (encoding low-cardinality string features in something like catboost)
  4. JSON parsers (usually key count is low, so parse into a linear-scan hashmap rather than a general-purpose hashmap)
  Details vary in the exact workload, the hardware you're using, what other instructions you're mixing in, etc. It's a well-known phenomenon though, and when you're doing a microoptimization pass it's definitely something to consider. 2x speedups are common. 10x or more happen from time to time.
  It's similar to (but not _quite_ the same as) the reason real-world binary search uses linear scans for small element counts.
  When you go to really optimize the system, you'll also find that the linear scan solution is often more amenable to performance improvements from batching.
  As to how much it matters for your composite program? Even at a microoptimization level I think it's much more important to pay attention to memory access patterns. When we wrote our protobuf parser that's all we really had to care about to improve performance (33% less execution time for the entire workload, proto parsing being much better than that). You're much more likely to be able to write sane code that way (contrasted with focusing on instructions and whatnot first), and it's easier to layer CPU improvements on top of a good memory access pattern than to go the other way around.
  
  crote 6 months ago
  
  You've got to keep in mind that computers aren't the 1-instruction-at-a-time purely sequential machines anymore.
  Let's say you've got a linear array of bytes, and you want to see if it contains a specific value. What would a modern CPU need to execute? Well, we can actually compare 64 values at a time with _mm512_mask_cmp_epu8_mask! You still need a little bit of setup and a final "did any of them match" check, of course. Want to compare 512 values? You can probably do that in less than 10 clock cycles with modern machines
  Doing the same with a hash set? Better make sure that hash algorithm is fast. Sure it's O(1), but if calculating the hash takes 20 cycles it doesn't matter.
  
  ncruces 6 months ago
  
  This is a good point.
  A string search algorithm that uses SIMD to do quickly discard a majority of 16, 32 or 64 attempts in parallel, and then verify the surviving ones quadratically (again 16, 32 or 64 bytes at a time) can go a very long way against a sublinear algorithm that understands needle structure, but necessarily needs to process the haystack one byte at a time.
parthdesai 6 months ago

My rule of thumb for 80%-90% of the problems is, if you need complicated algorithm, it means your data model isn't right. Sure, you do need complicated algorithms for compilers, db internals, route planning et all, but all things considered, those are minority of the use cases.
- anttihaapala 6 months ago
  
  This is not a complicated algorithm. A hash map (dictionary) or a hash set is how you would always do deduplication in Python, because it is easiest to write / least keystrokes anyway. That is not the case in C though, as it is much easier to use arrays and nested loops instead of hash maps.
  
  throwaway2037 6 months ago
  
  > That is not the case in C though, as it is much easier to use arrays and nested loops instead of hash maps.
  I am confused. There are plenty of open source, fast hash map impls in C.
  
  saagarjha 6 months ago
  
  Yes, the problem is getting them into your project.
  
  account42 6 months ago
  
  That's only a problem if you have never done any C development.
- paulddraper 6 months ago
  
  This isn't knapsack. This is a dict lookup.
- LtWorf 6 months ago
  
  I wrote a funny algorithm to group together words that end the same way to write them once in my binary wordlist file, since there is an array that points to the start character and a \0 to end the word. My initial solution was O(n²) but it was too slow on a real wordlist so I had to come up with something better. In the end I just sort the list with quicksort, but revert the order of the words and then the groupable ones end up nearby each other.
SAI_Peregrinus 6 months ago

I'd say the exception is when `n` is under about 10, and is counting some sort of hardware constrained thing (e.g. some operation over all CAN interfaces pesent on an OBDII connector can be O(n^(2)) since n will always be between 1 and 4). If you wouldn't have to physically replace hardware for `n` to increase, you really need to avoid n^2 operations. And even then consider them carefully, perhaps explicitly failing if `n` gets too big to allow for noticing rework is needed before new hardware hits the field.
- echelon 6 months ago
  
  > perhaps explicitly failing if `n` gets too big
  That's the problem. A lot of these quadratic time algorithms don't set limits.
  Even 'n!' is fine for small 'n'. Real production use cases don't have small 'n'.
  
  masklinn 6 months ago
  
  > Real production use cases don't have small 'n'.
  Real production use cases absolutely have small n. You don't hear about them, because it's very hard for them to cause issues. Unless the use case changes and now the n is not small anymore and nobody noticed the trap.
  
  lassejansen 6 months ago
  
  Or, phrased differently, if n has an upper limit, the algorithm is O(1).
  
  connicpu 6 months ago
  
  As long as you have tests that regularly exercise your algorithm at n=max where you would notice if they were exceptionally slow
  
  haiku2077 6 months ago
  
  I have an app that's been running an O n^2 algorithm in "production" (free open source app used by various communities) for about half a year now.
  It's been fine because "n" is "number of aircraft flying in this flight simulator" - and the simulator's engine starts to fail above around 2000 anyway. So even in the worst case it's still going to run within milliseconds.
solardev 6 months ago

For those of us without a formal comp sci background, is there a way for the IDE to detect and warn about these automatically? Or any other easy telltale signs to look for?
As a self taught dev, when I encounter nested loops, I have to mentally go through them and try to see if they iterate through each item more than once. But that's not a very foolproof method.
- chacha102 6 months ago
  
  Too much domain knowledge for an IDE to catch. I'm self taught as well, and it comes down to spending more time thinking about the code than writing the code.
  It's a fairly simple thought experiment to ask yourself what if there was 10x items in this array? 100x? That is essentially what the O(n) notation is trying to quantify. You just don't need to do it that formally.
vlovich123 6 months ago

I have an n^3 operation that's currently a huge bottleneck at only 10k elements. Not sure how to fix it.
- IshKebab 6 months ago
  
  I once looked into tree diffing algorithms (the literature is all about diffing XML even though it's really trees). The obvious dumb algorithm (which seems to be what everyone uses) is n^4.

SoftTalker 6 months ago

Cool discovery but the article could have been about 1/10 as long and still communicated effectively. At least they didn't post it as a video, so it was easy to skim to the important details.

hliyan 6 months ago

Interesting that multiple people are noticing this same thing. For me, this could have been:
"We found that a large portion of the 48 hours taken to backup our rails respository was due to a function in `git bundle create` that checks for duplicate references entered as command line arguments. The check contained a nested for loop ( O(N2) ), which we replaced with a map data structure in an upstream fix to Git. The patch was accepted, but we also backported fix without waiting for the next Git version. With this change, backup time dropped to 41 minutes."
m104 6 months ago

It's the "impact" style of technical write-ups: sell the problem and the scale, then present the solution, which is thus presented and understood through the lens of business and customer success.
Generously, this writing style is supposed to show the business value of teams and individuals, for promotions or other recognition. But yeah, it can be frustrating to read this style.
kokada 6 months ago

Yes. I read the whole article thinking that this must have been generated by LLM, because at least the style remembers it.
- bearjaws 6 months ago
  
  Don't take my bullet points away from me
  
  jorvi 6 months ago
  
  They came for the em dashes, and I did not speak up. Then they came for the bullet points, and I did not speak up..
- jaygreco 6 months ago
  
  Glad I wasn’t the only one who thought this. The post is also missing one obvious thing that I expect in any technical post: code snippets. Let me see the code.
  ChatGPT has ruined bullet points for the rest of us…
  No offense but writing this blog post couldn’t take more than a few minutes, why spoil it with LLM? Shoot, use one to check grammar and recommend edits even.
- djdeutschebahn 6 months ago
  
  Exactly thought the same. Reading experience of the post would have been definitely improved, with less text.
- ruuda 6 months ago
  
  That was also my thought.
- Scaevolus 6 months ago
  
  Em dashes and bullet points!
davidron 6 months ago

For those that haven't read the article yet, scroll down to the flame graph and start reading unit it starts talking about back porting the fix. Then stop.
- wgjordan 6 months ago
  
  "How we decreased reading 'How we decreased GitLab repo backup times from 48 hours to 41 minutes' times from 4.8 minutes to 41 seconds"
1oooqooq 6 months ago

it could have been longer. I still don't know why they were doing backup bundles with two refs :)
- 6LLvveMx2koXfwn 6 months ago
  
  They weren't, if you look at the fix [1] the dedupe loop was run in all cases, not just those with known dupes, so the performance hit was any bundle with lots of refs.
  1.https://github.com/git/git/commit/bb74c0abbc31da35be52999569...
  
  rjmunro 6 months ago
  
  But why couldn't they just dedupe the refs from the command line before starting the actual bundling - surely there are never more than a couple of hundred of those (one per branch)?
  
  6LLvveMx2koXfwn 6 months ago
  
  The point is the performance hit had nothing to do with dupe count (which could be zero), and everything to do with ref count.
  
  nayak 6 months ago
  
  Spot on. Some of our repositories at GitLab can contain millions of references.

Arcuru 6 months ago

48 hours is a crazy amount of time to spend just to compress a git folder, it's only a couple GB. 41 minutes still seems like quite a long time.

Why aren't they just snapshotting and archiving the full git repo? Does `git bundle` add something over frequent ZFS backups?

bombela 6 months ago

> Be aware that even with these recommendations, syncing in this way has some risk since it bypasses Git’s normal integrity checking for repositories, so having backups is advised. You may also wish to do a git fsck to verify the integrity of your data on the destination system after syncing.
https://git-scm.com/docs/gitfaq#_transfers
It doesn't tell you how to make a backup safely though.
On a personal scale, Syncthing and Btrfs snapshots work plenty good enough. It's as fast as the storage/network too.
- nightfly 6 months ago
  
  Syncthing is the only way I've ever corrupted a git repo before
  
  hobofan 6 months ago
  
  I think that's why they specified the "BTRFS snapshots" part. Yes, directly syncing a .git directory seems like a recipe for disaster with how often I've seen individual files lagging to sync, but I guess with BTRFS snaphots one can ensure that only a consistent view of a git directory is being backed up and synced.
  
  bombela 6 months ago
  
  Nah I truly do it the wrong way around. Syncthing on the git repos. And one of my device in the Syncthing cluster does btrfs snapshots minutely for recovery and further backups.
  Because it's at a personal scale, the only time I can corrupt a git repo is if I work on the same repo (and it's workdir) from more than one device in the time it takes for Syncthing to replicate the changes.
  But even then it's not a big deal because git fsck is quick. And I have my snapshots, and the syncthing versioning, and git defaults to two weeks before pruning. And because of how git works, using hash to identify contents, files are not easily overwritten either.
  In 10y I only had one git corruption (I ran a command on the same repo on a different machine via ssh, yielding a synctning conflict). Syncthing kept copies of the conflict file. One commit disappeared from the history but not from the database. It was easy to rebase the changes. I think I used git fsck to deleted the syncthing versioned files.
- ndriscoll 6 months ago
  
  If filesystem snapshots weren't safe, wouldn't that also mean git is prone to corrupting your repo in the event of a power loss or crash? That seems like a bad bug.
unsnap_biceps 6 months ago

zfs snapshots are difficult to offsite in non-zfs replicas, say like an S3 bucket.
That said, there's another less known feature that bundles help out with when used with `git clone --bundle-uri` The client can specify a location to a bundle, or the server can send the client the bundle location in the clone results and the client can fetch the bundle, unpack it, and then update the delta via the git server, so it's a lot lighter weight on the server for cloning large repos, and a ton faster for the client for initial clones.
- benlivengood 6 months ago
  
  I think if you want consistent snapshot backups on non-zfs destinations the safest thing is to clone the snapshot and rsync from the clone. Not a single-step operation but preserves the atomicity of the snapshot.
  EDIT: you could also rsync from a .zfs snapshot directory if you have them enabled.
- nightfly 6 months ago
  
  ZFS can send to file or whatever you want to pipe to, you can have incremental sends, and if you convert to bookmarks on the sender you don't have to keep the historical data after you send it
AtlasBarfed 6 months ago

... so they added caching to things that should have been cached?
... is this really the way people "back up" git repos? I mean, it is git, so isn't there some way to mirror changes to the repo in another repo and just use ZFS / snapshots / backup software / etc to do that? It's a distributed version control system. Just make sure the version control information is ... distributed?
broken_broken_ 6 months ago

Reading the article I thought exactly the same! I‘d be curious to know how much time the same would take with zfs.
gregorvand 6 months ago

15 years also seems like a long time

hiddew 6 months ago

"fixed it with an algorithmic change, reducing backup times exponentially"

If the backup times were O(n^2), are they now O(n^2 / 2^n)? I would guess not.

umanwizard 6 months ago

This is not the precise mathematical definition of exponential, but rather the colloquial one, where it just means "a lot".
- cvoss 6 months ago
  
  You shouldn't use a word that can carry a precise mathematical meaning in a sentence that literally uses mathematical notation in order to speak precisely and then expect readers not to interpret the word in the precise mathematical way.
  
  IshKebab 6 months ago
  
  You should if you expect your readers to be normal humans who understand obvious context, and not pedantic HN readers who understand obvious context but delight in nit-picking it anyway.
  
  morepedantic 6 months ago
  
  >pedantic
  Who the fuck do you think is the intended audience for an article about an algorithm in `git bundle create`? I spent approximately two minutes of my life trying to figure out where the O(n^2) algorithm was being invoked in such a way that it influenced an exponential.
  Exponential was bolded in the same sentence as a big-O. 50/50 troll/author oversight.
  
  saagarjha 6 months ago
  
  Maybe not 'morepedantic
  
  Dylan16807 6 months ago
  
  You can, but it's not should.
  
  globular-toast 6 months ago
  
  Ah yes because "normal humans" know what O(n^2) means but damnit they are going to use exponential wrong.
  
  tomjakubowski 6 months ago
  
  I'm a normal human and I know what O(n^2) means. There are dozens of us.
  
  blharr 6 months ago
  
  I somewhat agree, but for lack of a better word, what would you use? Quadratically doesn't have the same punch
  
  morepedantic 6 months ago
  
  Algorithmic? Big-O? Polynomially? Linear improvement? O(n^2) to O(n)? Or if you want to be less mathematically precise: enormous improvement?
  Using exponential in this way in any context is a faux pas, because it's highly ambiguous, and requires context for clarification. But in this situation the context clearly resolved to the mathematically accurate definition, except it was used in the other way.
  
  ndriscoll 6 months ago
  
  Quadratically doesn't have the same punch because it is actually exponentially less than exponentially. So doing it for extra punch (as opposed to not knowing the correct word) in a technical context would just be lying. It'd be like a paper saying they had a result with p less than one in a trillion for "extra punch" when they actually had p=0.1.
  
  bobbylarrybobby 6 months ago
  
  “From quadratic to linear” seems fine.
  
  jrochkind1 6 months ago
  
  If you just mean "a lot" in a non-technical sense, there are plenty of words available. enormously. immensely. tremendously. remarkably. incredibly. vastly.
  
  bobbylarrybobby 6 months ago
  
  “From quadratic to linear” or “... to constant” seems fine.
  
  jjmarr 6 months ago
  
  "by a factor of `n`" also sounds impressive.
  
  globular-toast 6 months ago
  
  Runtimes dropped precipitously.
  
  umanwizard 6 months ago
  
  “Dramatically” ?
  
  remram 6 months ago
  
  "a lot"
  
  hskalin 6 months ago
  
  I find the whole article rather poorly written. Most likely using an LLM.
  
  MyFedora 6 months ago
  
  We simplify the big O notation in computer science. This is standard practice.
  
  rovr138 6 months ago
  
  Just drop the constants, it doesn't matter /s
  Production systems running and melting here...
  
  morepedantic 6 months ago
  
  Especially when the colloquial meaning derives from the mathematical meaning.
  
  sneak 6 months ago
  
  “Words mean things.”
  If you can’t agree with this, then you shouldn’t be speaking or writing, IMO.
  Those who argue that words that mean different things are actually equivalent have no business dealing with language.
  
  morepedantic 6 months ago
  
  I understood every word, phrase, and sentence you wrote. But I did not understand your point. Still, I got the meaning of your words, so presumably you're satisfied.
- morepedantic 6 months ago
  
  >Ultimately, we traced the issue to a 15-year-old Git function with O(N²) complexity and fixed it with an algorithmic change, reducing backup times exponentially.
  No, not in the exact same sentence as a big-O. That's either author error, or an intentional troll. Either way it's egg on their faces.
sn9 6 months ago

The algorithm complexity went down in the function they patched (6x improvement in their benchmark), but in the context of how they benefited with how they were using the algorithm the impact was much larger (improved to taking 1% of the time), which is plausibly exponential (and figuring out the actual complexity is neither relevant nor an economic use of time).
- ndriscoll 6 months ago
  
  > figuring out the actual complexity is neither relevant nor an economic use of time
  The fix was replacing a nested loop with a map. Figuring out that this goes from O(n^2) to O(n) (modulo details like bucket count) is immediate if you know what the words mean and understand enough to identify the problem and make the fix in the first place.
  
  sn9 6 months ago
  
  Yes that's the algorithmic complexity of the function they patched.
csnweb 6 months ago

If you replace an n^2 algorithm with a log(n) lookup you get an exponential speed up. Although a hashmap lookup is usually O(1), which is even faster.
- ryao 6 months ago
  
  That is not true unless n^C / e^n = log(n) where C is some constant, which it is not. The difference between log(n) and some polynomial is logarithmic, not exponential.
  
  csnweb 6 months ago
  
  But if you happen to have n=2^c, then an algorithm with logarithmic complexity only needs c time. Thats why this is usually referred to as exponential speedup in complexity theory, just like from O(2^n) to O(n). More concretely if the first algorithm needs 1024 seconds, the second one will need only 10 seconds in both cases, so I think it makes sense.
  
  ryao 6 months ago
  
  N is a variable in what I posted, not a constant.
  
  wasabi991011 6 months ago
  
  It depends if you consider "speedup" to mean dividing the runtime or applying a function to the runtime.
  I.e. you are saying and f(n) speedup means T(n)/f(n), but others would say it means f(T(n)) or some variation of that.
  
  morepedantic 6 months ago
  
  The man, or llm, used the mathematically imprecise definition of exponential in a sentence with a big-O notation. I don't think he's going to be writing entire arguments formally.
- ndriscoll 6 months ago
  
  They're still using the map in a loop, so it'd be nlogn for a tree-based map or n for a hash map.
wasabi991011 6 months ago

I interpreted that as n->log(n) since log and exp are inverses.
Also because I've often heard tha the quantum Fourier transform is an exponential speedup over the discrete Fourier transform, and there the scaling goes n^2->nlogn.
marcellus23 6 months ago

Meaningless and non-constructive pedantry.
- chrisweekly 6 months ago
  
  I'm not the OP you're responding to, but to be fair, in a sentence about big-O perf characteristics, which includes the word "algorithms", using "exponentially" in a colloquial non-technical sense is an absolutely terrible word choice.
  
  linguistbreaker 6 months ago
  
  Exponentially bad word choice even... since we're using that word however we want now?
  I don't think this is meaningless or non-constructive pedantry - we're a technical community and those are technical words.
- msgodel 6 months ago
  
  I disagree. Misuse of the word "exponential" is a major pet peeve of mine. It's a particular case of the much more common "use mathematically precise phrasing to sound careful/precise" that you often find in less than honest writing.
  Here they are actually using it to refer to growth functions (which is rare for this error) and being honest (which is also rare IMO) but it's still wrong. They should have written about quadratic or quadratic vs linear.
  Regardless sloppy language leads to sloppy thought.
  
  keybored 6 months ago
  
  Sloppy writing is up orders of magnitude lately.
  
  0cf8612b2e1e 6 months ago
  
  It will decimate readership.
  
  sneak 6 months ago
  
  My personal pet peeve is when the term “exponentially” is used to refer to a change between precisely two data points.
  It’s a very specific subset of the one you’re describing.
  “Tell me you don’t know what you’re talking about without telling me you don’t know what you’re talking about.”
  
  morepedantic 6 months ago
  
  >My personal pet peeve
  Just be glad you have only one.
- morepedantic 6 months ago
  
  >pedantry
  I wasted 2 minutes of my life looking for the exponential reduction. So did many others.
  Now I'm wasting more of my life shit posting about it, but at least that's a conscious choice.
  
  account42 6 months ago
  
  No, you spent 2 minutes improving your reading comprehension.
  
  morepedantic 6 months ago
  
  You're insulting people on the internet AND you're wrong, which puts you one Hitler comparison away from the trifecta.
abhashanand1501 6 months ago

In fact O(n^2) is exponentially more than O(log n).
morepedantic 6 months ago

I was also looking for the exponential algorithm.

bob1029 6 months ago

I'm confused why you wouldn't simply snapshot the block-level device if the protocol of the information on top is going to cause this much headache. Quiescing git operations for block level activity is probably not trivial, but it sounds like an easier problem to solve to me.

This is the approach I've taken with SQLite in production environments. Turn on WAL and the problem gets even easier to solve. Customer configures the VM for snapshots every X minutes. Git presumably doesn't have something approximating a WAL, so I understand the hesitation with this path. But, I still think the overall strategy is much more viable and robust to weird edges within git.

lbotos 6 months ago

> Git presumably doesn't have something approximating a WAL, so I understand the hesitation with this path.
Bingo. One of the worst problems is helping a client piece back together a corrupted repo when they are using snapshots. Check my profile to see how I know. :)
It's usually an OMG down scenario, and then you are adding in the "oh no, now the restore is corrupted."
It's fixable but it's definitely annoying.
amtamt 6 months ago

> This is the approach I've taken with SQLite in production environments. Turn on WAL and the problem gets even easier to solve.
A few months back a better solution was provided by SQLite: sqlite3_rsync
https://www.sqlite.org/rsync.html
nonameiguess 6 months ago

Gitlab isn't just a managed service. They release the software for self-hosted instances as well. There is no guarantee or requirement that users all run Gitlab on the same filesystem or even a filesystem that supports block level snapshots at all. Presumably, they want a universal backup system that works for all Gitlabs.
- bob1029 6 months ago
  
  I've never heard of a .git folder that spanned multiple filesystems. It sounds like we are now conflating the git workspace with everything else in the product.
  There are system requirements that a customer would be expected to adhere to if they wanted a valid enterprise support contract with one of these vendors.
  
  to11mtm 6 months ago
  
  I think GP's point is that the filesystems used by someone self-hosting gitlab may not be the same as what gitlab themselves are using.
  File systems can be weird. Sometimes the OS can be weird and fsync type calls may not do what you expect. At least at one point MacOS fsync didn't behave the same way as Linux (i.e. Linux should ensure the write is truly done and not just in cache so long as the drive isn't lying). [0]
  > There are system requirements that a customer would be expected to adhere to if they wanted a valid enterprise support contract with one of these vendors.
  Gitlab has a community edition. Not handling data well would be bad for their public image.
  [0] - https://news.ycombinator.com/item?id=30372218
  
  nonameiguess 6 months ago
  
  I can't reply to the other reply for some reason, but what they said is indeed what I meant. gitlab.com might be running their Gitaly servers on btrfs or zfs or lvm volumes or whatever, but other customers may be using ext2. Gitlab the company could require customers to only run Gitaly on a specific filesystem, but up to now, they never have, it would be pretty shitty to suddenly change their minds after a decade plus of establishing one expectation, and whoever the developer is who submitted the patch to upstream Git and got a technical blog post out of it has absolutely no power to dictate contract terms to enterprise customers personally and instead did what is actually in their power to do.

hinkley 6 months ago

This feels like some of the problems I’ve ended up tackling. The people too close to the problem don’t see it so someone has to reach across the aisle and make something happen in code they normally wouldn’t touch.

Even with the quadratic complexity I suspect this problem has been brewing for a while. This commit has been biding its time for 16 years waiting to ruin someone’s day (let’s not confuse the utility of Make it Work early in a project with justifying retaining that code in perpetuity. Make it Right, Make it Fast.) Backups have probably been going over six hours for years and steadily climbing.

“We” feels a lot like one or two people jumping on a grenade.

divbzero 6 months ago

There seems to be a lesson here about the balance between premature vs. anticipatory optimization. We’re generally warned against premature optimization but perhaps, as a rule of thumb, we should look for optimizations in frequently-called functions that are obvious and not onerous to implement.

Quekid5 6 months ago

If a set-of-strings was trivially available in the source language (at time of implementation) the original programmer would probably have done this (relatively) trivial optimization... This is a symptom of anemic languages like C.

einpoklum 6 months ago

I had a somewhat similar experience when writing a "remove duplicates" extension for Thunderbird:

https://addons.thunderbird.net/en-US/thunderbird/addon/remov...

I used a hash to begin with, using a simplistic digest function to the message headers I was comparing, getting me a 4-byte hash key.

That worked, but was kind of slow.

Finally, the idea came to me to not apply _any_ digesting, and use the combined concatenated headers of the messages as hash keys. About 2k bytes per hash key!

The result: About 20x perf improvement if memory serves.

How is that possible? The reason is that the code all runs in a Javascript machine; and applying the digest was not a built-in function, it was looping over the headers and doing the arithmetic. Thousands upon thousands of JS abstract machine steps. The use of the large hash key may be inefficient, but - it's just one JS object / dictionary operation, and one of the most heavily-optimized in any implementation.

gre 6 months ago

looks like the commit is here: https://github.com/git/git/commit/a52d459e72b890c192485002ec...

mortar 6 months ago

Thanks, this helped me find the submission and related discussion thread - https://lore.kernel.org/git/20250401-488-generating-bundles-...
throawayonthe 6 months ago

> * Warning: this function uses an O(N^2) algorithm.

pjmlp 6 months ago

A very good example that writing code in C doesn't help for performance, when the algorithms or data structures aren't properly taken into consideration.

IshKebab 6 months ago

I would say C makes this sort of thing far more likely because it's usually a ton of effort to obtain suitable containers. In C++ or Rust they have plenty of things like `unordered_set`/`HashSet` built in, so people are much more likely to use it and not go "eh, I'll use a for loop".
In this case Git already had a string set, but it's still not standard so there's a good chance the original author just didn't know about it.
- masklinn 6 months ago
  
  > In this case Git already had a string set, but it's still not standard so there's a good chance the original author just didn't know about it.
  The original commit was made in January 2009 (https://github.com/git/git/commit/b2a6d1c6868b6d5e7d2b4fa912...), strmap was added in November 2020 (https://github.com/git/git/commit/ae20bf1ad98bdc716879a8da99..., strset was added a few days later: https://github.com/git/git/commit/1201eb628ac753af5751258466...). It was first proposed in 2018 (https://lore.kernel.org/git/20180906191203.GA26184@sigill.in... the proposal specifically mentions it fixing possibly quadratic sites).
  As noted in the comment, git did have a sorted string list with bisection search, and that's from 2008 (and it actually dates back to 2006 as the "path list" API, before it was renamed following the realisation that it was a generalised string list). Though as the hashmap proposal notes, it's a bit tricky because there's a single type with functions for sorted and functions for unsorted operations, you need to know whether your list is sorted or not independent of its type.
- morepedantic 6 months ago
  
  I've seen this exact problem in C code many times in my life, especially in kernel space where data structures and memory allocations are fun.
  Ironically, this is much _faster_ for small sets. Sometimes the error is intentional, because the programmer believes that all inputs will be small. IME, those programmers were wrong, but that's the inverse of survival bias.
  
  masklinn 6 months ago
  
  And in fairness I can understand not considering someone might eventually be bundling a repository with tens of thousands of refs back in early 2009.
  
  nayak 6 months ago
  
  Even now, the contrast between repository sizes is wide. Most repos contain 1000s of references, which while not the best to run O(N^2) algorithm, is still okay. But as a Git forge, you also see a share of repositories which contain millions of references.
- pjmlp 6 months ago
  
  Yeah, not something that WG14 will ever care about.
account42 6 months ago

C does help with the C that is often conveniently omitted in the big O notation. That matters just as much as the algorithmic complexity - both need to be reasonable or the software will be slow. No one ever claimed C affects algorithmic complexity.

ashishb 6 months ago

O(n^2) is fast enough to end up I'm production and slow enough to cause problems at scale.

The worst troublesome cases of inefficient production are almost always O(n^2).

hinkley 6 months ago

Had a modular monorepo back when UML was still a thing people did. People were having trouble opening a TogetherJ project - it was taking 30 minutes. I dug into it, and I don’t recall how I timed it but I kept adding more submodules and the runtime went up ridiculously fast. I stopped at a project that took 18 hours to load. Started it before going home one day and timed it the next morning.
When I plotted the runtime, I got n^5 for the fitting curve. That’s the largest polynomial I’ve encountered in the wild. Second place has always been cubic.
Their response was that it had something to do with processing config files per module, and a suggestion to rearrange them as a workaround. They fixed the problem in the next patch and the load time went from 18 hours to a couple minutes.

edflsafoiewq 6 months ago

So they replaced a nested for loop for checking for duplicates with a set data structure. Surprised such a common thing was in git.

SkiFire13 6 months ago

> Ultimately, we traced the issue to a 15-year-old Git function with O(N²) complexity and fixed it with an algorithmic change, reducing backup times exponentially.

I have yet to finish the article, but this means they improved the complexity to something like O(logN), right? I hate when people confuse quadratic improvement for exponential ones.

globular-toast 6 months ago

Really annoying to see someone use exponential wrong when they're talking about performance of algorithms. We're supposed to know what this means! They went from quadratic to linear.

judofyr 6 months ago

Quadratic complexity sits in an awkward sweet spot: Fast enough for medium-sized n to pass first QA, but doomed to fail eventually as n grows.

hinkley 6 months ago

We have a habit of taking our eye off of old problems by trying to juggle several new ones. By the time someone notices that we have a problem, the dogleg in the graphs where the n² solution stopped fitting into the CPU cache has been obvious for months but nobody was looking, and we dance around that fact that we had time to take a reasonable approach to fix the problem if we had noticed it when it became measurable, by adding anxiety to the cleanup work.
And then someone learns from this experience, gets the bright idea to set up an alert for such things, but the alert doesn’t factor in things like customer base growth or feature creep slowly pushing up the expected runtime. Eventually organic load gets close to the alarm and then the fucking thing goes off on a three day weekend (why is it always a long weekend or just before one?) and then we wage war on alarm overreach and the whole cycle repeats itself.
We like to think of ourselves as blazing trails in the wilderness but most of the time we are doing laps around the parking lot.
vjerancrnjak 6 months ago

This particular change was not accidental. It was announced as quadratic.

kristianp 6 months ago

It would have been nice if the post included some details, such as before and after code.

abhashanand1501 6 months ago

Lot of comments complaining that going from O(n^2) to O(log n) is not an exponential improvement, but it is indeed an exponential improvement. In fact O(n) is exponentially more than O(log n).

tomxor 6 months ago

One of the many reasons I moved to self hosting. I use ZFS to backup every 15 minutes, ... could do it even more frequently but that seems a little pointless.

Also moved away from Gitlab because it's so damn slow.

mdaniel 6 months ago

It for sure has an excruciatingly slow UI, but I haven't yet been able to stomach the Ruby magick enough to dig in and find out if it's all of GitLab that's slow, and thus their culture of who-cares just propagates to the UI, or it's literally just the web frontend that's molasses yet because so many folks interact with it over the web that's where the conclusion ends up
I know their backend git proxy is written in golang, their runner agent is written in golang, it spawns CI jobs using containerd, written in golang, and they use postgresql and a redis-esque KV -- although for that part I do believe they're still using Sidekick (ruby) for doing job dispatch, so that one could very easily lolol back into not actioning tasks efficiently
GitHub Actions sure does enjoy just sitting there twiddling its thumbs when I push "run job," and is also both Ruby and their own crazypants ajax-y web framework, so I don't think it's exactly the shining star of performance itself
glookler 6 months ago

ZFS is great, getting off gitlab would also be great, what did you switch to?
- tomxor 6 months ago
  
  Gitea, i can't recommend it enough. Lightening fast compared to the proprietary offerings, clean simple UI like an older Github, and super simple to self host.
  
  mdaniel 6 months ago
  
  So long as you have an existing CI/CD story, or are willing to invest in head-to-heading them, to find one that fits your needs. Because both it and Forgejo's "we're deploying act, what could go wrong" give me the heebie-jeebies. And, aside from that, there are so many FLOSS ones they could have chosen that legitimately work making that decision extra opaque to me
haroldp 6 months ago

This was my thought too, but I figured I just didn't understand the problem. Why use git commands to backup when a file system copy seems like it would do? zfs snapshot takes less than a second on any size repo. zfs send transfers just the changes since the last backup, as fast as your network, more or less.
- tomxor 6 months ago
  
  Yup, once you've used snapshots for backups once, doing any kind of filesystem level backup just seems absurd. I now use it as the basis for every piece of infrastructure I add that holds valuable data. The only place it doesn't really fit is distributed databases.

Lammy 6 months ago

> What this means for GitLab customers — [a bunch of stuff about how customers can now back up more frequently and more robustly]

Realtalk: They should rewrite this post's headline to be in a positive tense instead of leading with a negative word. I'm glad I read the post, because it is a cool and good fix, but I saw “Decreasing […] repo backup” and my first thought was that it was an announcement of some service downgrade like some sort of cost-cutting measure.

serial_dev 6 months ago

I don’t think it’s unreasonable to expect interested people to read five words from the title. I know people don’t always do that, but complaining about it as if they did something wrong is ridiculous.
- robertlagrant 6 months ago
  
  No one was complaining.

ianks 6 months ago

Reminds me of the timeless advice I got for acing leet code interviews: “remember to use a hash map.” Tends to work out pretty well.

SergeAx 6 months ago

I can't shake off the feeling that the second half of the article is generated by a neural network.

SomaticPirate 6 months ago

How was the flame graph created? (Not very familiar with C and the performance tools around it)

landr0id 6 months ago

You can also use https://github.com/mstange/samply to make recording and viewing in the Firefox profiler easier.
It will spin up a localhost server after the trace ends, the profiler uses the localhost server and nothing is shared with Firefox servers unless you explicitly choose to upload the data and create a permalink.
plorkyeran 6 months ago

https://github.com/brendangregg/FlameGraph
You record performance data with `perf`, then use the scripts there to turn it into a SVG.
- IshKebab 6 months ago
  
  I strongly recommend not using this. Instead use pprof - it has a MUCH better interactive flamegraph, plus other nice performance visualisations (e.g. a call graph):
  https://github.com/google/pprof
  go install github.com/google/pprof@latest pprof -http=: prof.out
  I normally collect the profiles with gperftools (https://github.com/gperftools/gperftools) and then just
  LD_PRELOAD=/usr/lib/libtcmalloc_and_profiler.so CPUPROFILE=prof.out <your command>
  I've been meaning to try Samply though. Not sure if it works with pprof.
- immortaljoe 6 months ago
  
  OP here. In this particular case, I used https://github.com/flamegraph-rs/flamegraph

KETpXDDzR 6 months ago

With git it should be straightforward to implement an incremental, dedup backup solution since all objects are stored with their hashs in filename.

mdaniel 6 months ago

Run `man git-repack` or its `man git-gc` friend and recall that most filesystems hate dealing with a bazillion small files
I think there have been several attempts to use S3-ish blobstores as git backends but of the two major git hosters, only one of them is MIT licensed and they don't attempt that stunt, so safe to assume it's not a viable approach

katella 6 months ago

I don't know much about git backups, but why would there be race conditions if you just create the backup off the local repo instead of remote?

treesknees 6 months ago

Duplicate of https://news.ycombinator.com/item?id=44197061

mdaniel 6 months ago

Since that one only has one comment, I'm pretty sure that means the HN front-page is luck of the draw and I wouldn't have bothered even commenting about the other one

DeepYogurt 6 months ago

I miss a good tech blog post. Cheers for this

DamonHD 6 months ago

Here comes an unpopular nitpick: "... we traced the issue to a 15-year-old Git function with O(N²) complexity and fixed it with an algorithmic change, reducing backup times exponentially."

Uh no you didn't. Not possible. At most a polynomial reduction is possible else complexity theory needs a re-write.

(OK, yes, k could be doing some heavy lifting here, but I doubt it.)

If you are going to quote a maths formula then please don't use "exponetially" to mean "lots".

I stopped reading there: I don't want to have to read each word and wonder if they actually meant it, or it's just bad hype.

nayak 6 months ago

OP here. Feedback is always welcome, I did mean exponentially in the colloquial sense. I do see how it is confusing here, will change it.
- serial_dev 6 months ago
  
  You can’t use exponentially in the colloquial sense in a post about software performance.
  If my barista says that his new coffee is exponentially better, it’s ok to use it colloquially.
  If my git hosting provider writes about an impactful performance improvement, it’s not.
- DamonHD 6 months ago
  
  Thank you.
  (I don't think that anyone should use "exponentially" that way: it is an art term with a specific and particular meaning, so find another word if you mean something else! Like misusing specific legal or sporting terms...)
  
  taftster 6 months ago
  
  Which term is appropriate here? What would you suggest? (honest question)
  
  DamonHD 6 months ago
  
  "hugely" or "a lot" or "to O(XXX)" whatever the new XXX complexity is.
  
  taftster 6 months ago
  
  "to O(xxx)" is a good idea, in terms of keeping it mathematical and accurate. I like that best. "hugely" makes me giggle, because I really hear "bigly" when I see/hear it.
  
  ctxc 6 months ago
  
  Art term?
  
  wizzwizz4 6 months ago
  
  https://en.wiktionary.org/wiki/term_of_art
  
  ctxc 6 months ago
  
  Ah, thanks. I could only think of art as in literal, artistic art.
  
  kccqzy 6 months ago
  
  If you have that tendency, you just need to think of TAOCP, this industry's best distillation of intellectual achievement, with the word "art" in its name.
- DamonHD 6 months ago
  
  I'm disappointed that the page has not been fixed...
bombela 6 months ago

I can forgive exponential as a figure of speech. You missed when they describe using a map for the side effect of [key] dedupliction.
Instead of saying "set". The code itself uses the type "strset".

sneak 6 months ago

…and they say leetcode-style problems are out of date.

While perhaps implementing low level sorts is silly, knowing which data structures to use and when, and what underlying big-o performance they have, is critical, as demonstrated here.

gbacon 6 months ago

Algorithmic improvements beat micro-optimizations.

iamcreasy 6 months ago

Cool! Please post another flame graph after applying the patch.

jas39 6 months ago

What a poor write-up, neither defining the problem nor the solution with any clearity. Did anyone come away with any information that can be used for anything?

niffydroid 6 months ago

Why use this method over just having a script do a clone or pull elsewhere? Got is about distribution right?

James_K 6 months ago

Perhaps nested for loops should be some kind of compiler warning.

fHr 6 months ago

absolute chads, nice!

KlausWinter 6 months ago

[dead]

tcdent 6 months ago

TLDR if you only add objects to an array that are already not contained in said array you won't have to iterate back through it to remove the duplicates you created.

wild that this a pattern like this would be part of git-core, but I guess we all overlook stuff on a regular basis

rubit_xxx21 6 months ago

[flagged]

tomhow 6 months ago

We detached this subthread from https://news.ycombinator.com/item?id=44203252 and marked it off-topic.
divbzero 6 months ago

This is a valid concern, but checking for uniqueness with a hash is fairly straightforward and the new implementation reuses a strset struct that has been in use since 2020:
https://github.com/git/git/blame/v2.49.0/strmap.h#L205-L218
herpderperator 6 months ago

If the requirement is to check uniqueness, what assumptions could possibly cause a bug? In this case, why does it matter if the uniqueness is tested with a nested for loop or with a map? There are many identical ways to check uniqueness, some being faster than others.
- rubit_xxx22_ 6 months ago
  
  [flagged]
  
  nightpool 6 months ago
  
  Why are you making a new account for each comment? You seem to be deliberately avoiding HN's moderation system
  
  rubit_xxx22___ 6 months ago
  
  [flagged]
  
  Dylan16807 6 months ago
  
  > I don’t want something gathering all my thoughts historically together and tying it to something else; nothing good comes from that; I’m not writing a serial novel.
  Yeah but you should want your thoughts on a single post to tie together.
  > Many years ago I had a user with thousands of karma points. I used to get really annoyed with other users downvoting my valid and thoughtful comments because it affected my karma. Despite attempts to rally the community around getting rid of downvoting, that never happened.
  Sorry you had that reaction. While I get annoyed by downvotes sometimes, I've never cared about losing some points from the mostly useless pile.
  
  rubit_xxx22____ 6 months ago
  
  [flagged]
  
  bayindirh 6 months ago
  
  You don't have to enter any e-mail address to get an HN account. You login from a (Firefox) incognito window and get your cookies deleted the moment the window is closed.
  Why you're so afraid to let your ideas and views collect under a single account? Are they that controversial or are you weary of your own thoughts and don't want to see them again, or are you afraid to own your views as yours?
  We're talking (mostly) tech here, and nobody is forced to comment.
  
  rowanG077 6 months ago
  
  [flagged]
  
  bayindirh 6 months ago
  
  This is why I said "We are discussing (mostly) tech here". I don't agree that creating a throwaway for every comment is "superior". It's basically spamming and it's even noted in the guidelines.
  Nazi Germany & Jews issue is different. There's an aspect of forcing, and this is unethical and wrong on so many levels, and I'll just leave the subject here.
  OTOH, from my perspective if you're afraid that you're writing a sensitive comment, you can create a throwaway. That's justifiable IMHO, but creating three accounts to discuss maps vs. loops, now that's different.
  If we're talking about being ridiculous, bringing up Nazi Germany vs. Jews issue to a technical discussion is more ridiculous than the alleged ridiculousness of me asking the OP about their fears. To close, my questions was not to belittle or shame the OP, they were genuine. I'm not that person who jabs for giggles.
  
  rowanG077 6 months ago
  
  I don't know why you think tech comments are safe. Discussion any topic has a ton of side channel information even if you somehow believe tech is totally safe topic.
  
  friggles 6 months ago
  
  You assume the homogeneity and conformity of your thinking will save you. There were plenty of Germans that did that also. Germany lost the war.
  It’s fine to be fearless. But don’t persecute someone for trying to protect themselves.
  
  bayindirh 6 months ago
  
  You assume that people who say the same thing, think the same way. Your assumptions about me lost you the argument.
  As I said, I asked genuine questions. They might be blunt and unpopular questions, but they are questions, and it's totally OP's decision to answer me or not.
  I respect them in every case.

prymitive 6 months ago

Well done, next maybe make the web ui usable because right now there’s absolutely no distinction between the UI itself and the user content, which IMHO combined with action buttons in the middle of the page makes for a really poor ux.

jeffbee 6 months ago

Are there any reimplementations of git, by professional programmers using real tools? The source in question — object.c — is "banging rocks together" material.

umanwizard 6 months ago

Ignoring your inaccurate flamebait and answering the underlying question: there are several reimplementations of git, including "got" (Game of Trees) from the OpenBSD developers, jgit (reimplementation in Java), the Haskell git package, the gitoxide rust crate (and I assume several more half-finished Rust hobby projects), and probably others; that's just what I found with a cursory google.
landr0id 6 months ago

https://github.com/GitoxideLabs/gitoxide
gonzus 6 months ago

https://github.com/libgit2/libgit2
Vilian 6 months ago

Git, but as it was intended to be used
kgeist 6 months ago

We use go-git in one of our Go projects that relies on Git for version control of its data.
fHr 6 months ago

lmao
protonfckff 6 months ago

[dead]