ianlevesque 10 years ago

> Until recently, we kept copies of repository data using off-the-shelf, disk-layer replication technologies—namely, RAID and DRBD. We organized our file servers in pairs. Each active file server had a dedicated, online spare connected by a cross-over cable.

With all the ink spilled about creative distributed architectures, it's really humbling to see how far they grew with an architecture that simple.

  • viraptor 10 years ago

    I'm not surprised, really. I've never worked at GH scale, but I learned early on that simple solutions just work. Want a DB cluster? Why not active-passive M-M instead. Want an M-M setup? Why not dual servers and block replication.

    Complicated things fail in complicated ways (looking at you, MySQL NDB Cluster), while simple solutions just work. They may be less efficient, but you'd better have a great use case before spending time on a new, fancy clustering solution - and an even better idea of how to handle its state / monitoring / consistency.

    • cbsmith 10 years ago

      For the real use cases for the fancy clustering solution, the benefits can be huge... and complicated things built with failure in mind actually fail in quite simple ways. DGit is an example in itself. Sure it heavily leverages the complexity of git's versioning/branching/distributed vcs mechanism to get the job done, but compared to a SAN or RAID + DRBD, the failure scenarios are much more straightforward to deal with.

      • viraptor 10 years ago

        I agree. As long as you have people who can build it tailored to the service and handle all the errors. GH can afford to do that almost from scratch. It also works if you're sure that this is exactly the solution you need and are familiar with the failure scenarios.

        But from what I've seen in a few places, a lot of people jump to cluster solutions without either a real need or enough people to support it.

        • cbsmith 10 years ago

          > But from what I've seen in a few places, a lot of people jump to cluster solutions without either a real need or enough people to support it.

          Totally.

justinsb 10 years ago

The interesting bit here will be how they reconcile potentially conflicting changes between the replicas. It is pretty easy to replicate the data, because git is content-addressable and can do garbage collection - I think even rsync would work. The challenge is that when e.g. the master branch is updated, you essentially have a simple key-value database that you must update deterministically. I look forward to learning how GitHub chose to solve that challenge!
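One common way to make that key-value update deterministic is a compare-and-swap: update the ref only if it still points at the SHA you last read, which is the semantics of `git update-ref <ref> <new> <old>` on a single node. A toy single-node sketch of that rule (all names invented, nothing here is GitHub's actual mechanism):

```python
import threading

class RefStore:
    """Toy ref database: ref name -> commit SHA, updated via compare-and-swap."""

    def __init__(self):
        self._refs = {}
        self._lock = threading.Lock()

    def compare_and_swap(self, ref, old_sha, new_sha):
        """Update `ref` to `new_sha` only if it currently points at `old_sha`
        (None means the ref must not exist yet). Returns True on success."""
        with self._lock:
            if self._refs.get(ref) != old_sha:
                return False  # someone else won the race; caller must retry
            self._refs[ref] = new_sha
            return True

store = RefStore()
assert store.compare_and_swap("refs/heads/master", None, "aaa111")
# A stale writer that still believes master is unset loses the race:
assert not store.compare_and_swap("refs/heads/master", None, "bbb222")
assert store.compare_and_swap("refs/heads/master", "aaa111", "bbb222")
```

The hard part, of course, is making that compare-and-swap hold across three replicas rather than one lock.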

  • LukeShu 10 years ago

    rsync would only work well if there aren't any pack files (which reduce the disk space used, and access times). Pack files break the simplicity of the basic content-addressed store by compressing multiple objects together.
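    For context, this is standard git behavior, nothing DGit-specific: a loose object's filename is just a hash of a small header plus its content, which is what makes per-object syncing straightforward until pack files bundle many objects into a single file. A sketch of how git names a loose blob:

```python
import hashlib

def loose_blob_path(content: bytes) -> str:
    """Return the loose-object path git would use for a blob with `content`.
    Git hashes the header 'blob <size>\\0' followed by the raw bytes."""
    header = b"blob %d\x00" % len(content)
    sha = hashlib.sha1(header + content).hexdigest()
    # Loose objects live at .git/objects/<first 2 hex chars>/<remaining 38>
    return "objects/%s/%s" % (sha[:2], sha[2:])

# The empty blob's well-known id is e69de29bb2d1d6434b8b29ae775ad8c2e48c5391:
print(loose_blob_path(b""))  # → objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the name is derived from the content, two replicas that independently store the same object agree on its path; a pack file, by contrast, is one big delta-compressed file whose bytes differ depending on when and how it was repacked.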

    • justinsb 10 years ago

      Right, but it doesn't really matter if you have objects stored twice (or more) I believe. The next GC will clean it up anyway.

  • lotyrin 10 years ago

    The replica count seems to be three - allowing quorum with a single lost host - and the repository goes read-only when quorum is lost.

    • im_down_w_otp 10 years ago

      That doesn't do anything to resolve consistency issues. The only time that quorum, pinned to primary replicas (i.e. disallowing sloppy failovers), helps you with consistency is when all reads and writes to those replicas are funneled through a serialized reader/writer. Then that serialized process can ensure "Read Your Own Writes" consistency using majority quorum (again, pinned to primary replicas).

      So they're either doing strongly coordinated writes to make sure replication changes are consistent (serializer, consensus, or chain), they're encoding other causal information in the data and have some method that deterministically picks which replica should dominate the others, or they're silently losing updates on conflicts.

      It would be cool to know what they're doing.

      • prirun 10 years ago

        The article said "Writes are synchronously streamed to all three replicas and are only committed if at least two replicas confirm success."

        • im_down_w_otp 10 years ago

          Somehow missed that. Thank you.

          So quorum + 2PC is what that sounds like to me. Suffice it to say, that's not a safe protocol without some other system guarantees in place.
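          The quoted behavior (commit only when at least two of three replicas confirm) boils down to counting acknowledgements. A toy one-round sketch of just that part - not a full two-phase commit, and with all names invented:

```python
def quorum_write(replicas, ref, old_sha, new_sha, needed=2):
    """Stream a ref update to every replica; commit only if `needed` ack.
    Each replica is a callable that returns True or raises ConnectionError."""
    acks = 0
    for replica in replicas:
        try:
            if replica(ref, old_sha, new_sha):
                acks += 1
        except ConnectionError:
            pass  # an unreachable replica simply contributes no ack
    return "committed" if acks >= needed else "rejected"

def healthy(ref, old_sha, new_sha):
    return True  # pretend the replica applied the update

def down(ref, old_sha, new_sha):
    raise ConnectionError("replica unreachable")

# Two of three acks: the push commits. One of three: it is refused.
assert quorum_write([healthy, healthy, down], "refs/heads/master", "a", "b") == "committed"
assert quorum_write([healthy, down, down], "refs/heads/master", "a", "b") == "rejected"
```

As the parent notes, acknowledgement counting alone says nothing about what happens when a replica acked, crashed, and comes back with divergent state; those are the "other system guarantees" the protocol would need.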

  • jvoorhis 10 years ago

    I'm not sure how GH solved this, but consistently updating refs in that simple k-v store seems to be the main challenge.

    I wonder whether the receive-pack operation offers a natural boundary for transactions?

drewm1980 10 years ago

I always imagined they had some kind of huge database holding all of the git objects to avoid excessive duplication. Now it sounds like they duplicate objects for every trivial fork, times three! I fork everything I am interested in just to make sure it stays available, and thought that was only costing github an extra file link each time...

  • idorosen 10 years ago

    Objects don't need to be duplicated for forks. See the objects/info/alternates file. They're talking about redundancy for the object storage backend, which is independent of how many forks/clones/repositories. In fact, assuming they solve collision detection somehow, they could store all objects for all repositories in one object store and have all repos be thin wrappers with objects/info/alternates files forwarding to that object store repository...
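    For anyone unfamiliar with the mechanism: objects/info/alternates is a plain-text list of extra object directories that git searches when an object isn't found locally. A simplified model of that lookup chain, reduced to loose objects only (paths invented for the demo):

```python
import os, tempfile

def find_object(repo_objects_dir, sha):
    """Look for a loose object in the repo itself, then in every directory
    listed in objects/info/alternates, the way git falls back for forks."""
    search = [repo_objects_dir]
    alternates = os.path.join(repo_objects_dir, "info", "alternates")
    if os.path.exists(alternates):
        with open(alternates) as f:
            search += [line.strip() for line in f if line.strip()]
    for objdir in search:
        path = os.path.join(objdir, sha[:2], sha[2:])
        if os.path.exists(path):
            return path
    return None

# Demo: the fork has no objects of its own, only an alternates pointer.
base = tempfile.mkdtemp()
upstream = os.path.join(base, "network.git", "objects")
fork = os.path.join(base, "fork.git", "objects")
sha = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
os.makedirs(os.path.join(upstream, sha[:2]))
open(os.path.join(upstream, sha[:2], sha[2:]), "w").close()
os.makedirs(os.path.join(fork, "info"))
with open(os.path.join(fork, "info", "alternates"), "w") as f:
    f.write(upstream + "\n")
assert find_object(fork, sha) is not None  # resolved via the upstream store
```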

  • wereHamster 10 years ago

    You are wrong. Before, they used a single git repository for all user repositories in a network, so the git objects in the original repository and all its forks were properly deduplicated. Deduplication across different networks does not make much sense.

    As for overprovisioning, they previously had 4x as much disk space provisioned as necessary: 2 disks in RAID per machine, times 2 machines (active plus hot spare). Now they only need 3x, for the three copies.

  • siong1987 10 years ago

    http://githubengineering.com/counting-objects/

    Under "Your own fork of Rails", you will see how it actually works. The answer to your question is "no, they don't store 3 copies of the same repo".

    • seanp2k2 10 years ago

      Awesome post, thanks for sharing! Great to see that they committed their optimizations upstream too!

rctay89 10 years ago

Compare this approach with Google's - GitHub sticks to the git-compatible 'loose' repository format on vanilla file storage, while Google uses packs on top of their BigTable infrastructure and requires changes to repo access at the git level [1].

[1] https://www.eclipsecon.org/2013/sites/eclipsecon.org.2013/fi...

Interesting how GitHub is sounding like Google and Amazon. They're probably hitting the scale where it makes sense to build internal APIs and infrastructure abstractions to support their operations, e.g. Bigtable and S3. In fact, DGit sounds like another storage abstraction like Bigtable and S3, albeit a limited one - e.g. a git repo must be stored fully on a single server (based on my cursory reading of GitHub's description of DGit), whereas in Bigtable the data is split into tablets that comprise the table and might be stored in different places, which would allow higher utilization of resources.

venantius 10 years ago

Is there any plan to open source this?

Camillo 10 years ago

"dgit" is not the best of names, since git itself is already distributed, as they note. It would have been more accurate to call it "replicated git", or "rgit". I guess they just wanted to be able to pronounce it "digit".

  • newjersey 10 years ago

    I agree.

    I'll add that a person who pronounces git as JIT is probably a git. DGit sounds like "the git" more than "d JIT".

Artemis2 10 years ago

Is this what's been causing the very poor availability of GitHub today?

https://status.github.com/messages

  • simoncion 10 years ago

    off-the-wall (and highly unlikely) suggestion: GH unleashed an aggressive Chaos Monkey today for the purpose of testing DGit's reliability claims in production.

  • eridius 10 years ago

    Unlikely. According to the article they've been rolling this out for months. It's not like they flipped a switch today to turn it on.

ryao 10 years ago

Not to try to make this sound less awesome than it is, but what happens if a proxy fails? Are the proxies now the weak point in GitHub's architecture?

mmckeen 10 years ago

The design of this seems very similar to GlusterFS, which has a very elegant design. It just acts as a translation layer for normal POSIX syscalls and forwards those calls to daemons running on each storage host, which then reproduce the syscalls on disk. This seems like very much the same thing, except using git operations.

  • notacoward 10 years ago

    Thanks for the compliment. While I as a GlusterFS developer would like to see us get more credit for the ways in which we've innovated, this is not such a case. The basic model of forwarding operations instead of changed data was the basis for AT&T UNIX's RFS in 1986 and NFS in 1989. Even more relevantly, PVFS2 already did this in a fully distributed way before we came along. I'd like to make sure those projects' authors get due credit as well, for blazing the path that we followed.

    • mmckeen 10 years ago

      I didn't say that GlusterFS was the first to do it, just that I've always liked the simplicity of the design. It makes the internals extremely easy to reason about when hacking on them.

      • notacoward 10 years ago

        Glad you like it. We enjoy it too. :)

nwmcsween 10 years ago

I don't understand why they didn't just use Ceph. Ceph has all the features DGit was invented to provide.

  • krakensden 10 years ago

    "Just" is the worst word in the English language.

  • notacoward 10 years ago

    No, not really. I love my cousins over in Ceph-land, that's for sure, but asynchronous but fully ordered replication across data centers is not in their feature set. At the RADOS level replication is synchronous, so it's not going to work well with that kind of latency. At the RGW level it's async, but loses ordering guarantees that you'd need for something like this. Replicating the highest-level operations you can tap into - in this case git operations rather than filesystem or block level - is the Right Thing To Do.
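    As a generic illustration of the ordering guarantee being discussed - not Ceph's or DGit's actual protocol - a replica consuming a numbered log of high-level operations must apply them strictly in sequence, buffering anything delivered out of order:

```python
class OpLogReplica:
    """Apply numbered operations strictly in order, buffering gaps.
    Ops here are (seq, ref, sha) tuples standing in for git-level updates."""

    def __init__(self):
        self.refs = {}
        self.next_seq = 0
        self.pending = {}  # out-of-order deliveries waiting for their turn

    def deliver(self, seq, ref, sha):
        self.pending[seq] = (ref, sha)
        # Drain every op we can now apply in sequence order.
        while self.next_seq in self.pending:
            ref, sha = self.pending.pop(self.next_seq)
            self.refs[ref] = sha
            self.next_seq += 1

replica = OpLogReplica()
replica.deliver(1, "refs/heads/master", "bbb")  # arrives early: buffered
assert replica.refs == {}
replica.deliver(0, "refs/heads/master", "aaa")  # gap filled: both apply
assert replica.refs["refs/heads/master"] == "bbb"
```

An async replication layer that drops this ordering can leave a ref pointing at an older SHA than one its peers have already served, which is exactly the hazard described above.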

    • nwmcsween 10 years ago

      RADOS provides async APIs; couldn't you hook it up to send git deltas between replicated copies? Best of both worlds.

samuel1604 10 years ago

Is there a whitepaper actually showing how this works?

  • brazzledazzle 10 years ago

    I think your question may have been covered at the end:

    >Over the next month we will be following up with in-depth posts on the technology behind DGit.

glasz 10 years ago

good lord. what a feat. tbh i'd have fucked this up.

systems 10 years ago

OK, I don't get this... shouldn't server availability problems be solved using traditional server-availability technologies, like cloud technologies?

Why mess with git?

  • shadowmint 10 years ago

    Like what?

    I'm going to tentatively suggest this is one of those 'hard' problems that throwing buzz words like 'cloud technologies' at doesn't solve.

    What replication tech would you imagine solves this issue of distributing hundreds of thousands of constantly updated repositories?

    • justinsb 10 years ago

      It's actually a relatively easy problem (compared to say a full POSIX filesystem) - I mentioned elsewhere in these comments a blog post where I implemented a fairly good solution. The objects are essentially immutable and content-addressable so you can get away with very relaxed semantics here. You need a reliable mapping of names to SHAs, but this is also comparatively easy (a key-value store).
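      The relaxed semantics fall out of content addressing: a put is idempotent because the key is derived from the value, so object replication can be retried or raced freely, and only the name-to-SHA map needs real coordination. A minimal sketch of that property:

```python
import hashlib

class ObjectStore:
    """Content-addressed store: the key is the SHA-1 of the value, so duplicate
    or concurrent puts of the same content are harmless no-ops."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes) -> str:
        sha = hashlib.sha1(content).hexdigest()
        self._objects.setdefault(sha, content)  # idempotent: re-put is a no-op
        return sha

    def get(self, sha: str) -> bytes:
        return self._objects[sha]

store = ObjectStore()
a = store.put(b"tree data")
b = store.put(b"tree data")  # a retry, or a second replica writing the same object
assert a == b and store.get(a) == b"tree data"
```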

      For example, you can easily satisfy this with S3 and DynamoDB - I think the latest version of the cloudata project I was blogging about actually does that now.

      • summner 10 years ago

        They've addressed that in the post. To make git work comfortably, especially on bigger repositories, you need fast local access to all blobs. IIRC, with something like libgit2 it's relatively easy to implement what you describe, but making it perform well while doing git log or diffs is a completely different story.

        • justinsb 10 years ago

          It isn't clear that they wouldn't get that same performance by caching the blobs locally. In my experience, you would, with much higher reliability and scalability.