GKE Sandbox: Independent operating system kernel to each container

cloud.google.com

158 points by alpb 7 years ago

For anyone who hasn't seen this before. There is a pretty good gVisor Architecture Guide that explains how this works pretty well via a few diagrams [1]. Lots more info on these pages too [2, 3].

> gVisor intercepts application system calls and acts as the guest kernel, without the need for translation through virtualized hardware. gVisor may be thought of as either a merged guest kernel and VMM, or as seccomp on steroids. This architecture allows it to provide a flexible resource footprint (i.e. one based on threads and memory mappings, not fixed guest physical resources) while also lowering the fixed costs of virtualization. However, this comes at the price of reduced application compatibility and higher per-system call overhead.

From what I understand, basically a user-space program that wraps your container and intercepts all system calls. You can then allow/deny/re-wire them (based on a config). So, you have pretty much complete control over what your apps can do.

This for me, is sort of the key takeway from the blog post too: "because we use gVisor to increase the security of Google's own internal workloads, it continuously benefits from our expertise and experience running containers at scale in a security-first environment". So, Google's using something like this internally too for their own workloads, which should be a pretty good sign this works in real life.

[1] https://gvisor.dev/docs/architecture_guide/

[2] https://github.com/google/gvisor

[3] https://gvisor.dev/

prattmic 7 years ago

> From what I understand, basically a user-space program that wraps your container and intercepts all system calls. You can then allow/deny/re-wire them (based on a config).
gVisor actually intercepts and implements the system calls in the user-space kernel. Two specific goals of gVisor are that (1) system calls are never simply allowed and passed through to the host kernel, and (2) you don't need to write a policy configuration for your application; just put your application inside gVisor and go. These are significant differences over simply using something like seccomp on its own (what the architecture guide calls "Rule-based execution").
Some of this is covered in our security model: https://gvisor.dev/docs/architecture_guide/security/#princip...
- saagarjha 7 years ago
  
  Reimplementing system calls is non-trivial, especially ones that have complex interactions with others (for example, the system calls related to process management). How do you prevent errors when translating this, and how do you implement features that ostensibly require calls to the OS anyways?
  
  prattmic 7 years ago
  
  For sure, implementing Linux is no easy task, and there is no magic bullet. For compatibility testing, we have extensive system call unit tests [1] and also run many open source test suites. Language runtime tests (e.g., Python, Go, etc) are particularly useful. We also perform continuous fuzzing with Syzkaller [2].
  > how do you implement features that ostensibly require calls to the OS anyways?
  gVisor's kernel is a user-space program, so it can and does make system calls to the host OS. Some examples:
  * An application blocks trying to read(2) from a pipe. gVisor ultimately implements blocking by waiting on a Go channel. The Go runtime will ultimately implement this with a futex(2) call to the host OS. * An application reads from a file that is ultimately backed by a file on the host (provided by the Gofer [3]). This will result in a pread(2) system call to the host.
  The purpose here isn't to avoid the host completely (that's not possible), but to limit exposure to the host. gVisor can implement all the parts of Linux it does on a much smaller subset of host system calls. Anything we don't use is blocked by a second-level seccomp sandbox around the kernel. e.g., the kernel cannot make obscure system calls, or even open files or create sockets on the host (those operations are controlled by an external agent).
  [1] https://github.com/google/gvisor/tree/master/test/syscalls/l...
  [2] https://github.com/google/syzkaller
  [3] https://gvisor.dev/docs/architecture_guide/overview/
  
  mav3rick 7 years ago
  
  How is this different than a nicerUI over a seccomp filter for your container?
- WestCoastJustin 7 years ago
  
  Awesome, thanks. I need to dig into this a little and just run a few demos / labs. This makes sense though. I really like your comments on this thread too (https://news.ycombinator.com/item?id=16976392).
  
  bradfitz 7 years ago
  
  App Engine now uses gVisor too:
  https://cloud.google.com/blog/products/gcp/introducing-app-e...
blaisio 7 years ago

It's basically the same thing as Wine - Wine provides the Windows API and implements it using Linux syscalls. Gvisor implements the Linux API using Linux syscalls, but with an extra authorization layer. I think people are just so gung ho about VMs that they forgot this was possible and easy (I did).
This is also similar to what Microsoft is doing in Windows with the WSL. This is another example of how we're really just in a big technology cycle. Dynamically typed -> statically typed -> dynamically typed; bare metal -> API wrappers -> VMs -> containers -> API wrappers. Soon we'll probably be back to bare metal.
- ryacko 7 years ago
  
  There are web hosts offer Raspberry PIs, but they tend to be more expensive than VMs. I'm guessing colocation costs are dominant.
thesandlord 7 years ago

> So, Google's using something like this internally too for their own workloads
A public example of this is Cloud Run [1, 2]
[1] https://news.ycombinator.com/item?id=19616832 [2] https://cloud.google.com/run/docs/reference/container-contra...
- WestCoastJustin 7 years ago
  
  Ah, cool, thanks. I didn't know they were running that under the hood. Yeah, I've checked out Cloud Run via a screencast I did on it a few weeks back [1]. I really like the concept and am looking forward to seeing the evolution of it!
  [1] https://sysadmincasts.com/episodes/69-cloud-run-with-knative
  
  spyspy 7 years ago
  
  The newest generation of AppEngine runs on this as well. In fact Cloud run and 2nd Gen GAE are exactly the same under the hood afaik. It allowed Google to ditch the custom APIs and toolchains they forced apps to use in order to keep their infra secure. Fun fact: Cloud Run and GAE both run code in Google's main search clusters, rather than their separate Google Cloud infra.
  
  cameronbrown 7 years ago
  
  > Cloud Run and GAE both run code in Google's main search clusters, rather than their separate Google Cloud infra
  What's the reasoning behind this?
  
  asciimike 7 years ago
  
  Cloud Run/App Engine PM
  Run and GAE run directly on Borg (which is the shared infrastructure that underpins all Google services, including Cloud products), rather than on VMs.
  Search/Ads/Maps/etc. run on Borg as well, but there's significant isolation between all those products.
  
  derefr 7 years ago
  
  That's the "what", but what's the "why"? Why run these in the main Borg cluster, rather than running them in the (separate, if I'm understanding you) Borg cluster that GCP uses as its substrate?
  Is it that the GCP Borg cluster is just big enough for GCP's control-plane, and then the rest of GCP is all Borg-less VM hypervisor boxes (running ESXi or what-have-you), so these gVisor-on-Borg workloads wouldn't have anywhere to "live" in the GCP cluster?
  If that is the issue, then I would have (naively) expected the solution to that to be adding a second, GCP-scale data-plane Borg cluster per zone, just for client workloads; rather than inviting these client workloads to co-mingle with Google's own workloads in the non-GCP part of the DC.
  
  tweenagedream 7 years ago
  
  Isolation is often done in software, Google has invested a lot of effort in making sure the distinct services that they run e.g. Youtube transcoding on the same machine search is running on don't interfere with each other. Whether through cpu constrains or some other priority levels. These are features of borg.
  https://scholar.google.com/scholar?lr&ie=UTF-8&oe=UTF-8&q=La...
  
  jkaplowitz 7 years ago
  
  I know nothing about the decisions behind where Cloud Run and GAE run, but even customer GCE VMs run on top of Borg, not just the control plane. GAE predates most or all of GCP, and there weren't separate GCP clusters when it got launched.
  (Used to work for Google including the GCP team, but haven't worked for them for over 4 years and I'm not speaking for them now. I'm reasonably sure this is all already public info.)
  
  spyspy 7 years ago
  
  likely just due to the age of appengine vs. all other gcp products
  
  sayhello 7 years ago
  
  Used to work on 2nd gen AppEngine.
  I helped ship the runtimes!
  Yes, this is due to age. GAE and Run depend on pieces of infrastructure going back a long time.
equalunique 7 years ago

My introduction to gVisor was the talk by Emma Haruka Iwao at InfoQ NY 2018: https://www.youtube.com/watch?v=Ur0hbW_K66s
I learned of the talk because Brian Cantrill referenced it during his very deep dive into operating systems, C, and Rust given later at the some event: https://www.youtube.com/watch?v=HgtRAbE1nBM
roryrjb 7 years ago

Isn't this the point of seccomp on Linux and pledge on OpenBSD (and others I'm sure I'm just more familiar with these two), but without this much overhead? Also I'd be interested to know, based on this quote in the post "There’s a saying among security experts: containers do not contain" how Solaris/illumos' Zones and FreeBSD Jails compare.
- WestCoastJustin 7 years ago
  
  > re: seccomp
  This thread has a good answer: https://news.ycombinator.com/item?id=16976392
  > "containers do not contain"
  Is sort of troll bait. They do contain. That is why everyone is using them. Sure, there will be exploits to break out of them, just like with VMs, and even CPU bugs now.
  Here is a good example of someone who broke out of a container on the play-with-docker.com site using a custom kernel module [1]. This allowed a container escape but you could say this was a bug since that wasn't the intent. So, you'd patch it. So, I get the joke in that people are extremely creative and will find ways around everything.
  [1] https://www.cyberark.com/threat-research-blog/how-i-hacked-p...
  
  013a 7 years ago
  
  That's fair, but at the same time: If the end-state is "containers should contain, they're secure, any insecurities are bugs" then why do we see so many defense-in-depth strategies like gVisor pop up which provide legitimate value to consumers?
  At what point are we just reinventing the VM hypervisor, but worse because every single one of these systems already has a VM hypervisor running somewhere? It seems likely to me that in the not-so-distant future the "Container" terminology won't actually mean anything because we'll figure out the engineering difficulty behind merging the best parts of VMs with the best parts of Containers, and managed systems like Fargate or even GKE don't really need both a VM hypervisor and a Container hypervisor when they're so similar.
  
  lima 7 years ago
  
  gVisor is a special kind of hypervisor, basically - it has a production-ready KVM backend.
  The main difficulties with VM-backend containers are storage passthrough and memory overcommit.
  
  wahern 7 years ago
  
  Memory overcommit is addressed by virtio memory ballooning (https://www.linux-kvm.org/page/Projects/auto-ballooning). Even OpenBSD supports this as both guest and host.
  For storage, there's already virtio block devices, not to mention PCI passthrough. But if you mean direct file system access, virtio-fs (https://virtio-fs.gitlab.io/) is just about ready to roll.
  There's still the issue that you're running an entire extra kernel. Not sure that's much slower than using Go; it's probably faster if what was described about bouncing on futexes elsethread is true.
  gVisor sounds like the kind of solution that makes sense for Google but not something that would survive in the wider community. The concept sounds great, but using Go sounds horrible, though I'm sure Go made prototyping the concept super simple--specifically goroutines reify execution flow in a nice way, but so would stackful coroutines in C or even Rust, which is easy to implement if you don't need to worry about deep recursion.
  
  d1zzy 7 years ago
  
  One big problem with KVM based VMs that gVisor fixes is that KVM is a (complex) piece of host kernel software. There have been many security incidents in the past related to KVM and there will be more for sure. With gVisor the "virtualization logic" runs purely in user space (and may itself be further isolated, like any other regular user space process, within the host environment). This means that any bugs in gVisor will, at most, impact the isolation unit where it runs in the host space, as opposed to KVM where bugs in KVM would impact the entire machine (including other customer workloads on that machine).
  The non-security related issues you listed, specialized interfaces to allow I/O to bypass the generic hardware virtualization layer, are IMO hacks (even the name of "para-virtualization" given to such mechanisms should be a tell). Because it would be to inefficient to do almost any I/O we care about to perform fast (network and storage) through the overall machine virtualization interface, we poke holes in that interface, specialized ones, that will allow us to carry requests and replies from the guest to the host more directly/efficiently. As a software engineer that seems like a hack. When something like gVisor comes along which provides much better security for the host environment and allows to quickly handle syscall level I/O by design I much prefer that approach over a VM. The drawback of gVisor is one similar to Wine: having to write bug for bug compatibility with the ABI supported (Linux x64 in this case). However, different from Wine, the Linux ABI surface is extremely small vs to what Wine has to reimplement to run even the simplest Windows applications and, most of all, with gVisor there's direct access to the source code of the ABI that it needs to implement making development much easier than something like Wine.
  
  ian-lewis 7 years ago
  
  There's a blub on "why Go?" on the website if you're interested.
  https://gvisor.dev/docs/architecture_guide/ > gVisor is written in Go in order to avoid security pitfalls that can plague kernels. With Go, there are strong types, built-in bounds checks, no uninitialized variables, no use-after-free, no stack overflow, and a built-in race detector. (The use of Go has its challenges too, and isn’t free.)
  Using a memory-safe language was a conscious design decision. https://twitter.com/LazyFishBarrel/status/112900096574140416...
  
  pjmlp 7 years ago
  
  I kind of agree with you, but actually like that gVisor exists and is written in Go.
  As it kind of proves a point about systems level software being written in Go.
  I would rather see VM/unikernels take off instead.
  
  wahern 7 years ago
  
  I can't find a link to the thread (in 5 minutes of Googling), but IIRC in a thread discussing the 2nd iteration[1] of a patch to fix the recent runc container breakout exploit this year one of the developers responsible for the patch flat out stated that Go was a poor choice for runc and has resulted in too much pain and ugly hacks. For example, because namespaces are per thread in Linux and you can't control how Go threads (goroutines, cgo stacks) migrate across kernel threads, the most basic task of simply creating and entering a namespace is complicated. (Neither Go nor Linux are amenable to providing a mechanism to alleviate the issue.) And then there's the issue of memory management--too much bloat and lack of fine-grained control as compared to a managed memory environment. These things don't usually matter, but when they do matter they really matter. They can become the primary source of complexity.
  [1] The original fix was to exec runc from a memfd-backed copy so writing to /proc/self/exe in the container didn't poison the binary outside the container. But the change in memory usage broke some existing workloads in the wild which had low memory resource limits. I think the second iteration used O_TMPFILE on tmpfs, or at least that was what was under discussion.
  
  lima 7 years ago
  
  > Here is a good example of someone who broke out of a container on the play-with-docker.com site using a custom kernel module [4]. This allowed a container escape but you could say this was a bug since that wasn't the intent. So, you'd patch it. So, I get the joke in that people are extremely creative and will find ways around everything.
  That one was an extremely obvious misconfiguration - running with --privileged=true. There dozens of ways to abuse that, probably much easier than using a custom kernel module.
  Yes, containers do contain, but the attack surface is MUCH larger than a virtual machine or something like gVisor. Just look at the constant stream of Linux local privilege escalations.
  
  ian-lewis 7 years ago
  
  Rather that the contain/don't-contain dichotomy, what's more important is gVisor's design principle that it always has 2 layers of isolation from the host and doesn't rely on any one bug in the Linux kernel, sentry, or elsewhere in order to break out of the sandbox. This leaves you less exposed to 0-day attacks and lags in patching kernels.
  You can't get that from normal Linux containers due to their fundamental design.
  
  WestCoastJustin 7 years ago
  
  Well, it happened and on a pretty popular site too. So if they got it wrong how many other people do. This is a core reason folk should check out gVisor. Not sure why the downvotes as this is a pretty good example use-case?
  
  lima 7 years ago
  
  gVisor has unsafe modes of operation, too. What I'm saying is that this is not a good example of "Container breakout", as it was just a misconfiguration, not an exploit.
  "people are extremely creative and will find ways around everything" is not an excuse - it's a matter of risk management and threat modelling.
  Escaping from a VM or gVisor is much, much harder than escaping from Linux namespaces ("containers") due to the MUCH smaller amount of attack surface/amount of exposed code. Using Linux containers in an untrusted multi-tenant environment is very dangerous, especially if you're a high profile cloud provider, which is why all of these projects exist.
  
  raesene9 7 years ago
  
  So, play with Docker are trying to do something very niche and not something that almost anyone else would try in production, which is running Docker inside Docker, which they do in order to produce the very cool service they do.
  Their breach isn't really a good indicator as I can't think of any/many reasons that most companies would try and do that...
  
  windexh8er 7 years ago
  
  > Their breach isn't really a good indicator as I can't think of any/many reasons that most companies would try and do that...
  There are a bunch of legitimate reasons to run Docker in Docker. The most obvious is in a build pipeline. For example Jenkins does Docker builds in containers all the time.

raesene9 7 years ago

This is a really interesting add-on to GKE and I'm glad to see vendors starting to offer a variety of container runtimes on their platforms.

That said, I'm really not a fan of the opening line where it references the old trope of "containers don't contain"

The idea that it's trivial to break out of any Docker style container just doesn't reflect reality.

Have there been vulns that allow for container breakout, sure there have, but every piece of software (including gVisor) has had vulnerabilities in it.

What you can say about gVisor is that it likely presents a smaller attack surface in its default configuration than a runc style Docker container.

However, of course, there's nothing to stop people tightening up on the defaults and still using runc.

As an aside for anyone who thinks container breakouts are trivially easy, you can go to https://contained.af and win yourself some money :)

amscanne 7 years ago

(I'm a co-author of the blog post)
I generally agree re: trope, but it's useful because I'm not sure the core idea is widely understood outside security circles. Many people assume that containers provide a strong isolation boundary, and while a break-out is not trivial, providing more isolation in some cases is important, as you allude.
While one option is certainly to provide a locked down policy, monitor the flow of kernel CVEs, and patch constantly, this may not be feasible for many organizations if a) they lack the technical expertise or b) don't know the workloads they're running a priori and can't apply a fixed policy.
So different container runtimes are about providing additional tools for defense-in-depth. (VMs are fantastic tool for this, but it's also nice to have tools that play well in containerized infrastructure other than custom security policies.) None of these tools will be perfect of course, hopefully they can make it easier to improve on the status quo.
Re: contained.af, this is a great example of the workloads problem. If you have a known workload where you can essentially disable all capabilities and access to system resources (e.g. no network), there are many options for securing that workload. They aren't all generalizable.
- raesene9 7 years ago
  
  Oh I'd agree and gVisor provides (IMO) a smaller attack surface than a default runc container.
  With that said both options, and indeed hypervisor based isolation, are generally one security flaw away from a breakout vulnerability, so the only difference in that respect is the incidence of those flaws.
  My experience of people's expectations of container isolation is perhaps somewhat different to yours, which is what prompted my initial comment.
  It's all too common (in my experience) to see container isolation dismissed using that "containers don't contain" trope, and for me that feels frustrating as the real picture is much more nuanced than that.
  It's all about choosing the right isolation technology for a) a given workload and b) a given threat model/attack surface.
  There are tradeoffs (both in terms of performance, and in terms of flexibility) in replacing the runc layer with a different container runtime. Sometimes those will make sense, other times not so much :)
  All that said I'm very excited to see more options here, as it'll give everyone the choice of what mechanism works for them for specific workloads.
longtermsec 7 years ago

The idea that it's trivial to break out of any Docker style container just doesn't reflect reality.
-- not just being contrarian here, actually, the reality is that it might be trivial. and it was demonstratively trivial for a long time (see CVE-2019-5736)
As for contained.af -- its not a good indicator, it mostly indicates that the reward doesnt meet the market price for demonstrating an escape from a set of hardened namespaces (which is going to cost more than an escape from "any docker container").
- raesene9 7 years ago
  
  So the runc vuln, only applied if you were a) running as root in the container and b) hadn't enabled user namespacing. (Also for completeness, it didn't work on RHEL based distros that applied their standard SELinux policy (IIRC))
  Also not specifically a Docker vulnerability, it was a runc issue which also affected other Linux containerization software (e.g. lxc)
  But despite all that, that's just an example of what I was talking about, all software has vulns, including runc, including gvisor.
  Stating that "containers don't contain" implies that it's not just a specific bug, but that architecturally the process is flawed (at least IMHO), which I would suggest is at the least an over-simplification.
  as to contained.af, well if it was indeed "trivial" then surely not a large reward would be required :)
  
  longtermsec 7 years ago
  
  so a) and b) are common in practice. these were not obscure boundary conditions or a corner case. and it was very trivial to exploit.
  "all software has vulns" is a slippery slope is my overarching point. you can't use that to say that the the security risks and isolation are comparable to gvisor. gvisor does away with a very significant amount of attack surface in the linux kernel and reimplements it in golang, which eliminates many bug classes.
  for a realistic risk assessment you should consider the linux kernel as a bottomless barrel of memory management bugs, which are exploitable from within a container, whereas gvisor will have a much more finite set of bugs
  On our team we've got extensive experience in finding compromises in this area, particularly in kernels, and that is why I am adamant that one should not think what docker provides meets the bar for best practices in a security critical environment. Something like gvisor would much more fit the bill.
  
  raesene9 7 years ago
  
  The original point I was making what that dismissing container isolation with the trope "containers don't contain" is overly simplistic, not that I thought that docker/runc containers with a default profile had as small an attack surface as gVisor.
  Generally the security of a piece of software isn't considered fundamentally flawed just because it has a security bug, otherwise pretty much every piece of software would be in that bucket by now. As such dismissing containers using that trope based on a bug which wasn't discovered when the trope was coined (by Dan Walsh IIRC) doesn't seem appropriate.
  There have been (AFAICR) three breakouts that would affect a default Docker installation in the last 3-4 years (Dirty C0w, WaitID, and the runc issue). That doesn't feel like a particularly high incidence, and gVisor has had at least one in it's shorter lifespan...
  If it's always trivial to breakout of docker/containerd/runc containers as (if I'm understanding you correctly) you appear to be implying and which is what appears to be implied by the trope, then I imagine people will be making good money from bug bounties for a long time as a lot of companies are creating platforms which execute semi or untrusted code in runc containers.
  
  longtermsec 7 years ago
  
  I'm not sure that it is overly simplistic, I think the statement that "containers do not contain" is an intentional oxymoron that points to some ground truths. These ground truths are that a process in a container is running in the same kernel, and although namespaces are meant to isolate some set of resources from other processes, and there are still very many shared resources that might not be isolated at all. This means a lot of attack surface, and exploiting the kernel will grant access to the other processes on the system.
  In terms of quantity, 4 is not an accurate picture. I haven't sat down to analyze CVEs (https://www.cvedetails.com/product/47/Linux-Linux-Kernel.htm...), but say out of 50 practically exploitable kernel memory corruption bugs/year 4-5 new bugs every year are reachable from some common namespace configuration for a container. And this just marks what is publicly disclosed, which is a subset of the vulnerabilities attackers know about.
  Bounties arent the only outlet for these, see: VEP.
  
  raesene9 7 years ago
  
  So (as I'm sure you know) linux container isolation isn't just a product of namespaces, but namespaces+capabilities+cgroups+(SELinux/Apparmor)+seccomp-bpf. Each one of those layers provides some aspect of isolation and for a Linux kernel exploit to succeed in escaping a container it needs to bypass/compromise each one (or as in the case of the runc vulnerability occur prior to the sandbox being fully established).
  So just taking Linux kernel bugs as a metric doesn't really apply.
  That's why I gave the list I did, as those are the only ones which I'm aware of which can bypass all the layers of isolation in a standard Linux container.
  If the ground truth "containers don't contain" applies, then it appears you're saying that Linux is innately and architecturally unsuitable for multi-user/process use, which seems like a fairly bold statement given its prevalence...
  After all, all a container is, is a Linux process with Linux isolation mechanisms applied to it...
  
  longtermsec 7 years ago
  
  bingo. one should always assume that userland access on a linux box is a short step away from full system privileges and active exploits are ready for use by an attacker.
  docker has started adding hardening with SELinux+Seccomp because theres a realisation that the linux kernel bugs keep coming, but this is just a bandaid. the other problem with this approach is that in practice a hardened config is too restrictive for real-user use and has real maintenance cost so most will never use them (as argued by others in this thread for why the gvisor approach is superior). AppArmor is very poorly maintained, buggy, and not a practical solution
  
  raesene9 7 years ago
  
  For me, that comes down to threat model.
  Should every organization assume that every attacker has access to Linux 0-days that they can use to privesc on a box?
  My opinion is that that's not a realistic assessment for every attacker.
  Do some attackers have that? I'm sure they do, but not every company should assume that every attacker will be able to do that.
  And all this goes back again to the original point. The trope "containers don't contain" is overly simplistic and not appropriate for every companies threat mode.
  
  ryacko 7 years ago
  
  If you do cybersecurity work and Zerodium bug bounties for your stack are less than your yearly wages, you are honor-bound to offer your resignation and request that the company use your salary towards bug bounties.
  Fortunately zerodays aren't commonly used.

muricula 7 years ago

GKE Sandbox/gVisor syscall performance is at least 100x worse than virtualization[0], which is huge. Why shouldn't I just run everything in a VM/lxc container instead? Is it worth proxying everything through your syscall broker when I can just trust my hypervisor to be a security boundary instead? [0]: https://gvisor.dev/docs/architecture_guide/performance/

amscanne 7 years ago

(I am co-author of the post)
System calls are important, but only one factor. The linked doc is an attempt to clarify and delineate various costs. There are number of platform options (the platform is what does syscall interception), and I don't believe any of them are 100x so to say "at least" is a bit disingenuous. You may have confused the "runsc-kvm" number with "using a VM". "runsc-kvm" is the system call performance of gVisor using the kvm platform, which is not a full VM [1]. In general the syscall cost in a VM depends entirely on the guest OS, since there is no VMEXIT for this operation.
VMs are a valid choice depending on your workload, and this is providing an additional tool that provides an easy control for containerized infrastructure. You can use what works for you. Native containers certainly work as well, but you'll probably want to consider additional security controls of some form if you're really running untrusted stuff in there.
[1] https://github.com/google/gvisor/tree/master/pkg/sentry/plat...
- muricula 7 years ago
  
  You're right, that's not the chart I wanted to see. I'm just dubious that reimplementing lots of the Linux kernel in Go while paying the cost of the ptrace interception is worthwhile. It seems like you're just adding a lot of attack surface (admittedly managed code > native code) with a large perf impact. Do you have any docs on how the kvm-runsc platform works? Skimming the files, I don't see some of the bits necessary for a bluepill style hypervisor, so I'm not sure why parts are named bluepill in there. I also don't see a lot of the linux kernel paravirt vdev code I would expect, and you seem to imply that you're not telling KVM to enable syscall trapping for the guest.
  
  amscanne 7 years ago
  
  I'm not sure what you mean by KVM syscall trapping for the guest. The bluepill refers to the fact that the Sentry runs transparently in VMX non-root ring 0 and regular host ring 3.
  I'm not sure what to provide re: docs -- the code is all there, reasonably documented and there are discussions on the public groups of how the KVM platform works. I feel a bit like you're coming in with a specific set of ideas and skimming files (e.g. the performance guide and the code itself) in order to confirm an existing understanding, but it's just not working.
  I'd love more precise criticisms re: adding to the attack surface, but otherwise I'm not sure how I can help.
  
  muricula 7 years ago
  
  I'm very skeptical about the platform and don't have the time to devote to reading the codebase or having conversations as I would like. The TL;DR is that the syscall interception technique seems expensive and I wonder if you will write all sorts of logic bugs in the sentry broker. It seems like you folks care about security, and have some good ideas, but if you really care about hostile multi-tenant containers, why not stick the container in a VM and call it a day?
  
  yoshiat 7 years ago
  
  I replied in other comments but our talk at Next'19 [1] includes a story by one of our customers, which may help understand the use cases. In a nutshell, GKE Sandbox should allow sharing the resources of GKE Nodes (VMs) among multiple tenants.
  [1] https://www.youtube.com/watch?v=TQfc8OlB2sg
danbeaulieu 7 years ago

These kernels and micro vmms (kata, firecracker, etc) are aimed at workloads that have already been containerized or are being deployed to container orchestration systems.
For some cases, having something that is compatible with kubernetes is worth the performance penalty, especially if your workload isn't syscall heavy.
lifty 7 years ago

where did you get the 100x number? Haven't seen it on the page you provided. Also, I checked previously gVisor and the performance indeed worse, but nowhere even close to 100x worse.
- muricula 7 years ago
  
  Look at the syscalls chart. It's hard to tell exactly what the numbers are on the log log scale on the syscalls chart, and it's probably not the graph I want anyway, but it looks like their runsc-kvm platform clocks in at 1k ns/syscall while their ptrace platform looks close to 100k ns/syscall. The fact that it's log log is telling alone.
  
  amscanne 7 years ago
  
  It's just log, not log log. The chart generates from .csv hosted on the site [1] and the benchmark tools are all open.
  The ptrace numbers are 20x and the KVM platform is actually lower than the Docker default case (though that doesn't mean everything is faster, as system call time is only one factor). As I note above, I think you're confused about what the KVM platform is -- it's not a VM.
  [1] https://gvisor.dev/performance/syscall.csv
  
  lifty 7 years ago
  
  Missed that one. The gVisor calls are closer to 40k but the difference is indeed big.

whalesalad 7 years ago

Isn't this server-side react rendering? What are we doing?

We started with virtual machines and then thought, no, we can share a kernel and do this without the overhead. Now we want each of our containers to have their own kernel. This is full curcle... why not just fire up a VM? Am I missing something?

Firecracker doesn't have the product vision behind it to do this, but at some point we will have a microvm technology with the ergonomics of containers and then we'll be WAY closer to true portability and better security.

amscanne 7 years ago

(I'm a co-author of the blog post)
Many functions of the kernel are still effectively shared: memory management (e.g. reclaim, swap), thread scheduling, etc. The application is simply limited in its ability to interact with the shared kernel, and functionality related to system APIs is isolated. Arguably I think this is closer to the ergonomics of containers, but with compatibility and performance trade-offs.
lemoncucumber 7 years ago

One reason I can see is the same reason that linuxkit [1] is a really interesting way of putting together Linux OS images: the container ecosystem has produced some tools that are really useful for building and packaging up Linux userspaces, and being able to reuse those tools in other ways is valuable.
With linuxkit you give up some of the niceties of image layer caching, but with an approach like this you get the best of both worlds -- the isolation of VMs but the tooling and usability of containers.
[1] https://github.com/linuxkit/linuxkit
dastbe 7 years ago

(I work at AWS)
Have you seen https://github.com/firecracker-microvm/firecracker-container... ? The team here is working to make firecracker as seamless as possible for running containerized applications in a microvm.

nullwasamistake 7 years ago

The only advantage I see to containers over VM is RAM sharing. Beyond that, hardware VM's are better performing and much more secure.

gVisor is just another flavor of containers that replaces kernel interfaces with a Go shim layer to reduce the attack surface in return for worse performance.

If somebody could hack ram sharing/overcommit into traditional VM's all this container nonsense could be dispensed with. Containers are a virtualization layer just like the old days when we used the JVM to run "safe" applets on client machines. Like the JVM, the attack surface will always be huge and security issues nearly endless.

bogomipz 7 years ago

I have read that all containers at Google run inside of a VM and indeed that article mentions that gVisor is in use in things like App Engine and their internal workloads.

So if containers on GKE were already being spun up inside lightweights VMs what does allowing customer's to select the gVisor runtime offer beyond whatever Google's existing lightweight VM already provides?

thesandlord 7 years ago

Other way around: Everything at Google runs inside a container, including the VMs
gVisor lets you run multiple untrusted workloads on the same VM, in this case a GKE node.
- bogomipz 7 years ago
  
  What would running a VM inside a container provide in terms of security and isolation that just running a VM would not?
  This ACM article from a few years ago written by folks that worked on Borg/Omega/Kubernetes states:
  >"The isolation is not perfect, though: containers cannot prevent interference in resources that the operating-system kernel doesn't manage, such as level 3 processor caches and memory bandwidth, and containers need to be supported by an additional security layer (such as virtual machines) to protect against the kinds of malicious actors found in the cloud."[1]
  Also see slide 13 of Joe Beda's talk from a five years ago shows the container running in a VM not the other way around:
  https://speakerdeck.com/jbeda/containers-at-scale?slide=13
  [1] https://queue.acm.org/detail.cfm?id=2898444
  
  thesandlord 7 years ago
  
  (I work for GCP)
  It looks something like this:
  your container -> Compute Engine VM (GKE Node) -> container -> Borg
  The container on top of Borg is used for scheduling and management. Joe's talk has a slide on this. As a GCP customer, you never have to worry about this or care about it, as it is an implementation detail.
  >"The isolation is not perfect, though: containers cannot prevent interference in resources that the operating-system kernel doesn't manage, such as level 3 processor caches and memory bandwidth, and containers need to be supported by an additional security layer (such as virtual machines) to protect against the kinds of malicious actors found in the cloud."
  As a GCP customer using GKE, your applications are separated from other GCP customer using VMs.
  However, if you want to run your OWN untrusted workloads, then in the past you would have to spin up a separate VM for untrusted workload A and a one VM for untrusted workload B.
  This sucks in terms of resource utilization. It would be better in many cases if you could run workload A and B on the same VM. That's where gVisor comes into play.
  your untrusted container -> gVisor -> Compute Engine VM (GKE Node) -> container -> Borg
  I hope this makes sense!
  
  bogomipz 7 years ago
  
  Thanks for the explanation, this makes sense yes.
  >"The container on top of Borg is used for scheduling and management."
  Is this the "open source node container manager" box on slide 13 then? I'm guessing this is the Borg's version of the kubelet then?
  https://speakerdeck.com/jbeda/containers-at-scale?slide=13
  
  yoshiat 7 years ago
  
  That's a very old slide :) I "guess" the slide deck was talking about https://cloud.google.com/compute/docs/containers/deploying-c...
  
  bogomipz 7 years ago
  
  I see. So is "the container on top of Borg is used for scheduling and management" the Borg equivalent of the K8S kubelet then?
  
  yoshiat 7 years ago
  
  As you can see from the Borg paper [1] and the name, "borglet" is the most closest component to "kubelet".
  [1] https://pdos.csail.mit.edu/6.824/papers/borg.pdf
  
  yoshiat 7 years ago
  
  (Co-author of the post)
  The fact that gVisor is being used in multiple services at Google is probably the confusing part. In case of GKE Sandbox, the users here are external and using Cloud (specifically GKE). The target use case is to add defense in depth to their pods running on potentially shared GKE Nodes (VMs) for Multi-Tenancy. Our talk at Next'19 [1] includes a story by one of our customers, which may help understanding the use cases.
  [1] https://www.youtube.com/watch?v=TQfc8OlB2sg
  
  bogomipz 7 years ago
  
  Thanks for the link that does make the use case clear i.e multitenancy/SaaS. Am I correct in assuming though that when someone creates a K8S cluster via GKE that the containers that make up their cluster such as the kubelets and masters are all running in VM underneath?
  
  yoshiat 7 years ago
  
  Yes, exactly. https://cloud.google.com/kubernetes-engine/docs/concepts/clu...

metta2uall 7 years ago

I quite like this defense-in-depth approach, but it's disappointing that it will only be available as part of the probably expensive GKE Advanced. I would have thought safety features should be standard..

dilyevsky 7 years ago

I think either way control plane is free now?
- metta2uall 7 years ago
  
  Well gVisor doesn't use the control plane. It is free, but I wouldn't think it has a high cpu or memory load, and Google would make a lot of profit on the nodes.
  
  dilyevsky 7 years ago
  
  I know but they may conceivably just charge fixed fee for enabling that option on the nodepool.
  > it has a high cpu or memory load, and Google would make a lot of profit on the nodes.
  They currently solve that problem by having their node VMs melt down at like 50% utilization so you have to run everything with huge padding.

andrewstuart 7 years ago

Sandboxed containers with kernels - so what's the difference now between this and fully isolated virtual machine?

Another approach might be to make virtual machine technology more like containers. Then the two shall meet.

edoo 7 years ago

I didn't dig into the implementation details but the term para-hvm came to mind, not quite para virtual but not quite full hvm. Perhaps if security is a real issue then HVM is the only real choice.

conroy 7 years ago

For those more familiar with Kubernetes and gVisor, would this allow me to build a CI/CD service that runs untrusted user code?

raesene9 7 years ago

Well like all things in security, that kind of depends :)
What gVisor does is provide a smaller attack surface to a containerized process, when compared with a "traditional" Docker container using standard Docker setup (you can, of course harden Docker containers considerably from base, if you are so inclined).
However it doesn't affect anything outside of that interface so, for example, if your CI/CD process is running on a network that has other insecure services on them, then gVisor alone won't really help you if malicious code is executed inside a container allowing an attacker to start probing the environment from the perspective of that container.
snug 7 years ago

> gVisor provides a virtualized environment in order to sandbox untrusted containers.
So yes

fulafel 7 years ago

gVisor is easy to run in on your local dev laptop Docker too. It's a nice alternative to running Docker in a VM, if you prefer a security boundary between random Docker containers you get off Docker Hub and your host machine.

After you built or downloaded the gVisor single binary to /usr/local/bin/ or wherever, just put a snippet provided in the gVisor README in the Docker settings file ("runtime": {...}), and bob's your uncle.

ronsor 7 years ago

At this point, why not just use a virtual machine? We've come full circle!

jacques_chester 7 years ago

Mostly the performance characteristics. A virtual machine presenting as a machine needs an operating system to be useful. Most operating systems have long-engrained assumptions about the nature of the world, such as:
"There is a time when I go from power-off to power-on, and it is rare, so I may perform expensive operations then to amortise their cost over running time".
or
"While running, time does not skip and hardware does not change".
The practical upshot being that the OS needs to be booted from scratch in a number of scenarios.
But it's not the OS that provides value. It's a means to an end, and that end is to run software. Most software written to run on OSes also have engrained assumptions, such as "I will come to be launched on a fully-booted system".
Containers move the virtualisation up from hardware to the OS API surface. Because the cost of booting is now amortised over all containers running on the system, the original assumptions of both OS designers and software designers become, approximately, true again.
So you're right, we came full circle, but not to a point that means "use fully-dressed VMs again".