hintymad a year ago

A big lesson I learned from PyTorch vs other frameworks is that productivity trumps incremental performance improvement. Both Caffe and MXNet marketed themselves as being fast, yet apparently being faster here and there by some percentage simply didn't matter that much. On the other hand, once we make a system work and make it popular, the community will close the performance gap sooner than competitors expect.

Another lesson is probably old but worth repeating: investment and professional polishing matters to open source projects. Meta reportedly had more than 300 (?) people working on PyTorch, helping the community resolve issues, producing tons of high-quality documentation and libraries, and marketing the project nicely at all kinds of conferences and in the media.

  • vikinghckr a year ago

    What I really find interesting here is that PyTorch, a library maintained by Facebook, is winning the market share and mindshare due to a clean API, whereas TensorFlow, maintained by Google, is losing due to an inferior API. In general, Google as a company emphasizes code quality and best practices far more than Facebook. But the story was reversed here.

    • roenxi a year ago

      Google's stereotype is the company that can handle and manage complexity - things like Kubernetes, indexing the internet, etc. They are clumsy at persuading people to use their products and have a patchy history of launching platforms that people want to use. Google+ and Google cloud vs AWS spring to mind, Kubernetes is a good platform but challenging to learn. Chrome is an unusual aberration where they did a great job. Android is maybe also a strong counterexample, not sure what the state of play was there.

      But whatever their code quality may be, luring in devs to their API has seen hits and misses.

      • wartijn_ a year ago

        They did a pretty good job of persuading people to use Search, AdSense, YouTube, Gmail, Maps and a bunch of other products. Singling out a couple of the less successful ones to claim a trillion-dollar company is “clumsy at persuading people to use their products” seems like a pretty bad take.

        • roenxi a year ago

          You'll note that all the examples you're citing were released more than a decade ago (search wasn't even this century!). And YouTube was an acquisition. They were basically the work of a different Google to the one that is walking around now.

          The stereotype of Google back then was very different. People would quote things like "don't be evil".

        • fsociety a year ago

          The only products you listed which weren’t acquisitions are Search and Gmail. Search is exactly the problem Google was made to handle, and Gmail... well, I don’t love Gmail, but it is a good enough product, I guess?

    • ldjkfkdsjnv a year ago

      The same thing is true of React vs Angular. Angular is kind of garbage, whereas React is elegant.

    • ShamelessC a year ago

      Google historically has a culture of writing libraries with other Google employees as the target audience.

      • pjmlp a year ago

        Given the way the Android NDK and AGK experience works, I always have the feeling it is maintained as a 20% project and that none of the developers have ever used anything from the competition.

    • rawrawrawrr a year ago

      Google has a culture of encouraging their developers to build complex things, which is important for promotion.

    • niels_bom a year ago

      Couldn’t these organizations be big enough to have different technical cultures from team to team or product to product? If that’s the case, comparing FAANG companies to each other is like comparing Asia to Europe and then taking North Korea and Portugal as examples.

    • miohtama a year ago

      An interesting question. Maybe Google lacks the culture to work with external developers (think Android) while Facebook has some of it.

    • solarkraft a year ago

      > In general, Google as a company emphasizes code quality and best practices far more

      That doesn't automatically have to mean better DX. For example, the API can be clean but too low-level to conveniently accomplish typical cases (I'm not familiar with this particular example).

    • aschleck a year ago

      It's worth noting that Google is also working on JAX these days, and that's picking up steam; one reason is that its API is super clean.

      • chem83 a year ago

        And ML compilers in the form of IREE/MLIR, which, if successful in delivering hardware and framework optionality, should make the choice of ML framework a far less consequential decision.

        • rawrawrawrr a year ago

          Not really, because people will still use the dominant framework, since that's where all the research, examples, and libraries are.

  • SleekEagle a year ago

    Exactly, especially in the age of ridiculously rapid development that we have found ourselves in over the past few years. This is exactly why TensorFlow is dying

  • t-vi a year ago

    > Another lesson is probably old but worth repeating: investment and professional polishing matters to open source projects

    While I'm not going to disagree that it matters, to my mind the main thing about early (0.x, 1.x) PyTorch's success was a clear vision (most visible to me from Soumith and Adam) that put the targeted users first (as in your productivity comment), awesome execution on it, and creating a community where I, as an outside contributor with modest skill and no AI track record, and likely many others, felt welcome.

    Now, the many great people employed by Meta, but also by other players like MS, NVidia, AMD, Intel, ... to work on PyTorch have enabled all the things that make PyTorch 2.0 nicer and more broadly applicable than 1.0. (And to my mind it's quite a non-trivial accomplishment to enable other corporate players to join the party.)

  • wging a year ago

    > Meta reportedly had more than 300 (?) people working on PyTorch,

    How much did this change after the big Meta layoffs? I think I know people who are no longer there, but I haven't talked to them about it yet.

ansk a year ago

So this looks like a further convergence of the tensorflow and pytorch APIs (the lower-level APIs at least). Tensorflow was designed with compilable graphs as the primary execution model and as part of their 2.0 release, they redesigned the APIs to encompass eager execution as well. Pytorch is coming from the other end, with eager execution being the default and now emphasizing improved tools for graph compilation in their 2.0 release. The key differentiator going forward seems to be that tensorflow is using XLA as their compiler and pytorch is developing their own toolset for compilation. As someone who cares far more about performance than API ergonomics, the quality of the compiler is the main selling point for me and I'll gladly switch over to whatever framework is winning in the compiler race. Does anyone know of any benchmarks comparing the performance of pytorch's compilers with XLA?

  • brrrrrm a year ago

    Some context/history:

    For compiler people reading this, a lot of common compiler terms have been entirely reinvented in the context of machine learning frameworks. An ML "graph" refers almost exactly to the dataflow graph (DFG) of a program. TensorFlow 1.0 only exposed a DFG, which is well known to be far simpler to apply optimizations to (assuming you have a linear algebra compiler).

    PyTorch integrates with Python (an interpreted language) and does not expose an underlying DFG. This is labeled "eager" and means that compilation of PyTorch requires optimization over both the control flow graph (CFG) and DFG. Python by default exposes neither of these things in a standard way. Some ML workloads simplify easily to a DFG (torch FX can handle this), but the general case does not. Although TorchScript (a subset of Python) tackled the CFG in 1.0, the team is now taking it further and compiling the Python bytecode itself (with TorchDynamo), which means you don't need to change any code and still get compilation speedups! That's why 2.0 is significant.
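
    Concretely, the user-facing side of this is tiny. A minimal sketch (the model and shapes here are made up, but torch.compile is the actual 2.0 entry point):

      import torch
      import torch.nn as nn

      # any ordinary eager-mode model, no graph annotations required
      model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))

      # TorchDynamo captures the Python bytecode and hands the extracted
      # graph(s) to a backend (TorchInductor by default)
      compiled = torch.compile(model)

      x = torch.randn(32, 64)
      out = compiled(x)  # first call triggers compilation, later calls reuse it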

    Of course, all of this requires a linear algebra compiler to actually do the optimizations which is why things like AITemplate (for inference) and TorchInductor (which calls into a bunch of other compilers for training) exist for PyTorch. TensorFlow's linear algebra compiler is XLA.

    • brrrrrm a year ago

      edit: TorchInductor calls into GCC/Triton depending on the target device. I mistook it for the various backends TorchDynamo supports (including TVM)

  • amelius a year ago

    How much performance can typically be squeezed out by going from the plain Python API to the graph-based solution?

    • ansk a year ago

      This varies quite a bit based on the type of model. The graph-based approach has two benefits: (1) removing overhead from executing python between operations and (2) enabling compilers to make optimizations based on the graph structure. The benefit from (1) is relatively modest for models which run a few large ops in series (e.g. image classifiers and most feedforward models) but can be significant for models with many ops that are smaller and not necessarily wired up sequentially (e.g. RNNs). In my experience, I've had RNN models run several times faster in tensorflow's graph mode than in its eager mode. The benefit from (2) is significant in almost any model since the typical "layer" building block (matmul/conv/einsum->bias->activation) can be fused together which improves throughput on GPUs. In my experience compilation can offer performance increases from 1.5x to 3x, but I don't know if this holds generally. Also note that the distinction between graph and eager execution can be somewhat blurry, as even an "eager" API could be calling a fused layer under the hood.
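
      If you want a rough sense of the effect on your own model, the simplest check is just to time eager vs. compiled. A toy sketch (the model and sizes are made up; on GPU you'd also want torch.cuda.synchronize() around the timers):

        import time
        import torch
        import torch.nn as nn

        # a toy stack of matmul -> bias -> activation blocks, the pattern compilers like to fuse
        model = nn.Sequential(*[nn.Sequential(nn.Linear(1024, 1024), nn.GELU()) for _ in range(8)])
        compiled = torch.compile(model)
        x = torch.randn(256, 1024)

        def bench(fn, iters=50):
            fn(x)  # warm-up (includes compilation for the compiled version)
            start = time.perf_counter()
            for _ in range(iters):
                fn(x)
            return (time.perf_counter() - start) / iters

        print("eager:   ", bench(model))
        print("compiled:", bench(compiled))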

    • PartiallyTyped a year ago

      It depends.. Using Jax to compile down to XLA, I often saw >2 orders of magnitude improvements. This however was roughly 6 months ago.
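
      For reference, the JAX side of that is basically one decorator. A minimal sketch with a made-up loss function:

        import jax
        import jax.numpy as jnp

        @jax.jit  # trace once, compile with XLA, reuse the compiled executable afterwards
        def loss(w, x, y):
            pred = jnp.tanh(x @ w)
            return jnp.mean((pred - y) ** 2)

        w = jnp.ones((128, 1))
        x = jnp.ones((1024, 128))
        y = jnp.zeros((1024, 1))
        print(loss(w, x, y))  # first call compiles, later calls are pure XLA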

quietbritishjim a year ago

> Today, we announce torch.compile, a feature that pushes PyTorch performance to new heights and starts the move for parts of PyTorch from C++ back into Python.

I'll admit I don't know enough about PyTorch to know what torch.compile is exactly. But does this mean some features of PyTorch will no longer be available in the core C++ library? One of the nice things about PyTorch had been that you could do your training in Python then deploy with a pure C++ application.

  • fddr a year ago

    The `torch.compile` API itself will not be available from C++. That means that you won't get the pytorch 2.0 performance gains if you use it via the C++ API.

    There's no plan to deprecate the existing C++ API; it should keep working as it is. However, a common theme of all the changes is implementing more of pytorch in python (explicitly the goal of primtorch), so if this plan works out, that could happen in the long run.

  • danieldk a year ago

    One of the nice things about PyTorch had been that you could do your training in Python then deploy with a pure C++ application.

    Or even train in C++ or Rust without much loss in functionality.

    • synergy20 a year ago

      Rust really has not had any presence in AI training engines yet; it's probably 100% C++.

      • danieldk a year ago

        I was referring to the libtorch library, which you can use through the tch crate. It is possible to make such rich bindings because so much of Torch is exposed through the C++ API. When more functionality is moved to Python, it becomes harder to use that functionality from the C++ interface and downstream bindings.

  • synergy20 a year ago

    Facebook did a similar thing with its original PHP codebase: it uses HHVM to 'compile' PHP (now called Hacklang) to gain performance. It seems to be doing a similar thing with Python here.

    • pjmlp a year ago

      If I remember correctly, the JIT version of the PHP compiler proved itself against the AOT compilation to native code via translation to C++, so they went with the more productive variant.

amelius a year ago

One thing I'm noticing lately is that these DL libraries and their supporting libraries are getting unwieldy, both large and difficult to version-manage.

In my mind, DL is doing little more than performing some inner products between tensors, so I'm curious why we should have libraries such as libcudnn, libcudart, libcublas, torch, etc. containing gigabytes of executable code. I just checked and I have 2.4GB (!!) of cuda-related libraries on my system, and this doesn't even include torch.

Also, going to a newer minor version of e.g. libcudnn might cause your torch installation to break. Why isn't this backward compatible?

  • modeless a year ago

    The complexity of deep learning algorithms is low but the complexity of the hardware is high. The problem solved by these gigabytes of libraries is getting peak utilization for simple algorithms on complex and varied hardware.

    CuDNN is enormous because it embeds precompiled binaries of many different compute kernels, times many variations of each kernel specialized for different data sizes and/or fused with other kernels, and again times several different GPU architectures.

    If you don't care about getting peak utilization of your hardware you can run state of the art neural nets with a truly tiny amount of code. The algorithms are so simple you don't even need any libraries, it's easy enough to write everything from scratch even in low level languages. It's a fun exercise. But it will be many orders of magnitude less efficient so you'll have to wait a really long time for it to run.
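
    For example, a bare-bones forward pass for a small fully connected net (a toy sketch in plain Python, no libraries, deliberately naive) fits in a few lines:

      import math, random

      def matvec(W, x):  # plain nested loops, no vectorization at all
          return [sum(w * xj for w, xj in zip(row, x)) for row in W]

      def relu(v):
          return [max(0.0, a) for a in v]

      def forward(layers, x):
          for W, b in layers[:-1]:
              x = relu([h + bi for h, bi in zip(matvec(W, x), b)])
          W, b = layers[-1]
          return [h + bi for h, bi in zip(matvec(W, x), b)]

      def rand_layer(n_in, n_out):  # random init, e.g. for an MNIST-shaped 784 -> 128 -> 10 net
          return ([[random.gauss(0, 1 / math.sqrt(n_in)) for _ in range(n_in)] for _ in range(n_out)],
                  [0.0] * n_out)

      net = [rand_layer(784, 128), rand_layer(128, 10)]
      logits = forward(net, [0.0] * 784)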

    • amelius a year ago

      Ok, is there any way to trim down the amount of code used without reducing the performance of my particular application on my particular machine?

      I have the feeling that it's an all-or-nothing proposition. Either you have a simple CPU-only algorithm, or you have several gigabytes of libraries you don't really need.

      Also, in some applications I would be willing to give up 10% of performance if I could reclaim 90% of space.

      • modeless a year ago

        CuDNN is only for Nvidia GPUs, and those machines generally have decent-sized disks and decent network connections, so nobody cares about a few GBs of libraries. There are alternatives to CuDNN with much smaller binary size. Maybe they can match or beat it or maybe not, depending on your model and hardware. But you'll have to do your own work to switch to them, since most people are happy enough with CuDNN for now.

        The real problem with deep learning on Nvidia is the Linux driver situation. Ugh. Hopefully one day they will come to their senses.

        • amelius a year ago

          It's not just disk size. Also memory size, and loading speed.

          Yes, I agree about the driver situation.

          • modeless a year ago

            The disk size of the shared library is not indicative of RAM usage or loading speed. Shared libraries are memory mapped with demand paging. Only the actually used parts of the library will be loaded into RAM.

    • claytonjy a year ago

      While I think you raise important points about the dominance of hardware optimizations, I think you're massively overstating the simplicity of the algorithms.

      Sure, it's easy to code the forward pass of a fully connected neural network, but writing code to train a useful modern architecture is a very different endeavor.

marban a year ago

From what I see, still no 3.11 support. Same for TensorFlow, which won't ship it before Q1 '23.

  • minimaxir a year ago

    It's not a huge deal, as the speed improvements in 3.11 likely wouldn't trickle down to the core PyTorch level.

    • black3r a year ago

      But if you do some data pre-processing or post-processing in Python, that would be affected by the 3.11 speed improvements... or if you have a PyTorch-based model integrated into a bigger application as just one of many features; there are still some devs who prefer monoliths over microservices...

  • joelfried a year ago

    Who is still supporting Windows 3.11?

    • sairahul82 a year ago

      It’s python 3.11 :)

      • robertlagrant a year ago

        Guido works for Microsoft, so Python 95 will be out soon.

        • sgt a year ago

          Speaking of Guido, be sure to check out the podcast with Lex Fridman. Guido is such a down to Earth guy.

        • itgoon a year ago

          I'm going to skip Python ME.

        • fatneckbeardz a year ago

          i hear that it has sockets built right in. no more Trumpet.

chazeon a year ago

How does PyTorch compare to JAX and its stack?

  • PartiallyTyped a year ago

    I find that Jax tends to result in messy code unless you build good abstractions. I personally don't like Flax and Haiku; I prefer stax and Equinox, as they are more transparent about what is happening, feel a lot less like magic, and are more pythonic (explicit is better than implicit, etc.).

    PyTorch is far more friendly for deep learning stuff, but sometimes all you want is pure numerical computations that can be vmapped across tensors, and this is where jax shines imho.

    Personal Example: I needed to sample a bunch of datapoints, make distributions out of them, sample, and then compute the density of each sample across distributions. Doing this with pytorch was rather slow, I was probably doing something wrong with vectorization and broadcasting, but I didn't have the time to figure it out.

    With jax, I wrote a function that produces the samples, then I vmapped the evaluation of a sample across all distributions, then vmapped over all samples. Took a couple of minutes to implement and seconds to execute.
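
    In code it looked roughly like this (a from-memory sketch, with Gaussians standing in for the real distributions):

      import jax
      import jax.numpy as jnp
      from jax.scipy.stats import norm

      # one distribution per datapoint, parameterised by (mean, std)
      means = jnp.linspace(-1.0, 1.0, 100)
      stds = jnp.full((100,), 0.5)

      def sample(key):
          return means + stds * jax.random.normal(key, means.shape)

      def logdensity(x, mean, std):  # one sample under one distribution
          return norm.logpdf(x, mean, std)

      # inner vmap: one sample across all distributions; outer vmap: all samples
      across_dists = jax.vmap(logdensity, in_axes=(None, 0, 0))
      all_pairs = jax.jit(jax.vmap(across_dists, in_axes=(0, None, None)))

      samples = sample(jax.random.PRNGKey(0))  # shape (100,)
      log_p = all_pairs(samples, means, stds)  # shape (100, 100)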

    PyTorch also has the advantage of a far more mature ecosystem; libraries like Lightning, Accelerate, Transformers, Evaluate, and so on make building models a breeze.

    • zone411 a year ago

      > Personal Example: I needed to sample a bunch of datapoints, make distributions out of them, sample, and then compute the density of each sample across distributions. Doing this with pytorch was rather slow, I was probably doing something wrong with vectorization and broadcasting, but I didn't have the time to figure it out.

      You probably were not doing anything wrong. I spent a lot of time trying to be clever in order to parallelize things like this and it just wasn't possible without doing CUDA extensions. But it is now! PyTorch now has vmap through functorch and it works.
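
      For anyone who hasn't tried it, the functorch version has basically the same shape as the JAX one. A rough sketch (assumes PyTorch 1.13+ with the functorch package; the log-density is written out by hand to keep it vmap-friendly):

        import math
        import torch
        from functorch import vmap

        means = torch.linspace(-1.0, 1.0, 100)
        stds = torch.full((100,), 0.5)
        samples = means + stds * torch.randn(100)

        def logdensity(x, mean, std):
            var = std ** 2
            return -0.5 * ((x - mean) ** 2 / var + torch.log(2 * math.pi * var))

        # inner vmap: one sample across all distributions; outer vmap: all samples
        all_pairs = vmap(vmap(logdensity, in_dims=(None, 0, 0)), in_dims=(0, None, None))
        log_p = all_pairs(samples, means, stds)  # shape (100, 100)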

      • PartiallyTyped a year ago

        I found that there are still some limitations with functorch's vmap, but I can't recall what they were.

        • carbocation a year ago

          Probably not your issue, but one kind of annoying bit is that the inputs need to be tensors. I ended up calling partial on the function I was messing around with and then vmapping the partial, which seemed to work.
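
          Something like this, in case it helps anyone hitting the same thing (a toy sketch; the config dict is just a stand-in for whatever non-tensor argument you have):

            from functools import partial
            import torch
            from functorch import vmap

            def loss(x, scale, cfg):  # cfg is a plain dict, not a tensor
                return (scale * x ** 2).sum() * cfg["weight"]

            xs = torch.randn(32, 8)  # batch of inputs
            scale = torch.ones(8)

            # bind the non-tensor argument first, then vmap over the tensors only
            batched_loss = vmap(partial(loss, cfg={"weight": 0.5}), in_dims=(0, None))
            out = batched_loss(xs, scale)  # shape (32,)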

          • PartiallyTyped a year ago

            Pretty similar actually, I needed to pass in some parameters as tuples, lists, etc. Partial would have probably worked, but tbh I wasn't in the mood to try harder at 2am.

  • staunch a year ago

    PyTorch and JAX are both open-source libraries for developing machine learning models, but they have some important differences. PyTorch is a more general-purpose library that provides a wide range of functionalities for developing and training machine learning models. It also has strong support for deep learning and is used by many researchers and companies in production environments.

    JAX, on the other hand, is designed specifically for high-performance machine learning research. It is built on top of the popular NumPy library and provides a set of tools for creating, optimizing, and executing machine learning algorithms with high performance. JAX also integrates with the popular Autograd library, which allows users to automatically differentiate functions for training machine learning models.

    Overall, the choice between PyTorch and JAX will depend on the specific requirements and goals of the project. PyTorch is a good choice for general-purpose machine learning development and is widely used in industry, while JAX is a better choice for high-performance research and experimentation.

    https://chat.openai.com/chat

    • whimsicalism a year ago

      I was reading this and thinking it was a pretty terrible answer - glad it is just generated by an AI and not you personally so I'm not insulting you.

      JAX is basically numpy on steroids and lets you do a lot of non-standard things (like a differentiable physics simulation or something) that would be harder with Pytorch.

      They are both "high-performance."

      Pytorch is more geared towards traditional deep learning and has the utilities and idioms to support it.

      • brap a year ago

        I’m not sure why, but I realized it was AI from the very first sentence, not exaggerating. It’s just not something someone on HN would write.

        • windsignaling a year ago

          Yup. Reminds me of an article you'd find in the top 10 Google search results...

        • eastWestMath a year ago

          It reminded me of the sort of lazy Wikipedia regurgitation that a lot of undergrads used to give when I was teaching. So it is a bit jarring to see a response like that in a non-compulsory setting.

      • dekhn a year ago

        jax is not numpy on steroids. jax is "use python idiomatically to generate optimized XLA code for evaluating functions both forward and backward."

        • whimsicalism a year ago

          Probably the primary use of jax is `jax.numpy` which is XLA accelerated and differentiable numpy.

          I'll admit that saying "basically numpy on steroids" might have been an overreduction. It is a system for function transformations that is built on XLA and oriented towards science & ML applications.

          It's not just me saying stuff like this.

          François Chollet (creator of Keras): "[jax is] basically Numpy with gradients. And it can compile to XLA, for strong GPU/TPU acceleration. It's an ideal fit for researchers who want maximum flexibility when implementing new ideas from scratch."
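
          The "Numpy with gradients" part really is about this short. A trivial sketch:

            import jax
            import jax.numpy as jnp  # near drop-in replacement for numpy

            def f(x):
                return jnp.sum(jnp.sin(x) ** 2)

            x = jnp.arange(4.0)
            print(f(x))                      # plain numpy-style evaluation
            print(jax.grad(f)(x))            # ...and its gradient
            print(jax.jit(jax.grad(f))(x))   # ...compiled with XLA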

          • dekhn a year ago

            Yes- and that gradient part is a key detail that makes it more than "numpy on steroids". numpy on steroids would be a hardware accelerator that took numpy calls and made them return more quickly, but without the command-and-control and compile-python-to-xla aspects.

            • whimsicalism a year ago

              Well clearly I meant steroids of the gradient-developing variety.

              I think you are being far too pedantic about what a biological compound would analogously do to a software library, especially given that I mention the differentiability property in the same sentence you are taking issue with.

              • dekhn a year ago

                OK, actually as long as it's gradient-developing steroids, I'll allow it.

      • uoaei a year ago

        Can someone comment more on what makes JAX that much better for differentiable simulations than PyTorch?

        I'm working on a new module for work and none of my colleagues have much experience developing ML per se. I'm trying to decide whether to force their hand by implementing v1 in PyTorch or JAX and differentiable physics simulations is a likely future use case. Why is PyTorch harder?

        • patrickkidger a year ago

          At least prior to this announcement: JAX was much faster than PyTorch for differentiable physics. (Better JIT compiler; reduced Python-level overhead.)

          E.g for numerical ODE simulation, I've found that Diffrax (https://github.com/patrick-kidger/diffrax) is ~100 times faster than torchdiffeq on the forward pass. The backward pass is much closer, and for this Diffrax is about 1.5 times faster.

          It remains to be seen how PyTorch 2.0 will compare, of course!

          Right now my job is actually building out the scientific computing ecosystem in JAX, so feel free to ping me with any other questions.

          • adgjlsfhk1 a year ago

            If you care about performance of differentiable physics you shouldn't use Python. Diffrax is almost OKish, but is missing a ton of features (e.g. good stiff solvers, arbitrary precision support, events for anything other than stopping the simulation, the ability to control the linear solves that are needed for large problems). For simple cases it can come close to the C++/Julia solvers, but for anything complicated, you either won't be able to formulate the model, or you won't be able to solve it efficiently.

            • patrickkidger a year ago

              > If you care about performance

              This definitely isn't true. On any benchmark I've tried, JAX and Julia basically match each other. Usually I find JAX to be a bit faster, but that might just be that I'm a bit more skilled at optimising that framework.

              Anyway I'm not going to try and debunk things point-by-point, I'd rather avoid yet another unpleasant Julia flame-war.

        • whimsicalism a year ago

          Because the `jax.numpy` operations & primitives are almost 1:1 with numpy, many working scientists who already have experience working with numpy will be able to figure out jax faster.

          It is also easier to rewrite existing code/snippets (say you were working on a non-differentiable simulator before) into jax if you already have them in numpy than to do the whole rewrite in pytorch.

          I will say that I think pytorch has improved its numpy compatibility a lot in recent years; functions that I was convinced didn't exist in pytorch (like eigh) apparently actually do.

    • satvikpendem a year ago

      It seems to use the same type of template for comparisons:

      React and Vue are both JavaScript libraries for building user interfaces. The main difference between the two is that React is developed and maintained by Facebook, while Vue is an independent open-source project.

      React uses a virtual DOM (Document Object Model) to update the rendered components efficiently, while Vue uses a more intuitive and straightforward approach to rendering components. This makes Vue easier to learn and use, especially for developers who are new to front-end development.

      React also has a larger community and ecosystem, with a wider range of available libraries and tools. This can make it a better choice for larger, more complex projects, while Vue may be a better fit for smaller projects or teams that prefer a more lightweight and flexible approach.

      Overall, the choice between React and Vue will depend on your specific project requirements and personal preferences. It's worth trying out both to see which one works better for you.

    • cube2222 a year ago

      It's funny, because already after the first sentence it felt like ChatGPT, probably because I've played with it a lot these past few days, and sure enough I found a disclaimer at the end.

      That said, the answer isn't really useful, as it's very generic, without anything concrete (other than the mention of Autograd) imo.

      Though a follow up question might improve on that.

anigbrowl a year ago

I get that Nvidia is the favorite GPU (because CUDA) and that library maintainers want to chase the latest and best to do the most. But I don't get why support for older hardware (including CUDA stuff) is just deprecated or abandoned, nor why support for other GPU architectures is lacking across many popular ML libraries.

A lot of this is down to driver availability and software stacks...but is all of it? Game engines/engineers seem to be able to be productive on a wide variety of GPU hardware, why do so many ML libraries just not provide any support at all? Sure 75% of potential performance is less satisfying than 100%, but it's also infinitely better than 0%. How come every ML library doesn't at least have an OpenCL fallback or the like?

  • qayxc a year ago

    NVIDIA invested billions in their software infrastructure over the past 15 years (CUDA was first released in 2007) and basically brute forced their way into academia back in 2008, sponsoring labs around the world to get GPGPU going.

    Both mindshare and market share (outside of super computers) are just overwhelming at this point. Their market share in the consumer market is ~86% as of this quarter. The data centre market is quite fragmented when it comes to accelerators, but in AI training, NVIDIA is still the market leader.

    > Game engines/engineers seem to be able to be productive on a wide variety of GPU hardware, why do so many ML libraries just not provide any support at all?

    That's a different kettle of fish. Game engines rely on the graphics driver's implementations of low- and mid level APIs like Vulkan or Direct3D.

    The brunt of the work is also often performed by the middleware (mostly Unreal Engine and Unity, or in-house engines like Frostbite) that has been in development for decades, with most games focusing on high-level optimisation wrt. the middleware used.

    ML-frameworks on the other hand need to optimise compute kernels as well as data flow between host CPU and accelerator (e.g. GPU). This involves hand-tuning algorithms to best match specific GPU architectures, while most shaders used in games are basically the same across all GPU vendors and it's the vendors themselves who do the fine tuning and per-game optimisations in their graphics drivers (hence the obscene sizes of GPU drivers these days).

    While that's a good enough approach for games, it's simply not possible to do the same for ML-models. There's just too much flexibility (no API that dictates which calls do what, when, and how) to make general ML-optimisations at the driver level.

    > How come every ML library doesn't at least have an OpenCL fallback or the like?

    OpenCL is horrible to work with and stopped being properly supported by vendors (e.g. newer versions are rarely being implemented and optimised). The difference between CUDA and OpenCL from an implementor's perspective is that CUDA works seamlessly with surrounding C++ code and compute kernels can be embedded in the host CPU code base. OpenCL on the other hand is modelled after the ancient OpenGL 2.x paradigm and requires tedious setup and careful integration (checking capabilities and all that jazz). OpenCL is basically dead at this point.

    There are alternatives to CUDA, but most frameworks rely heavily on the highly optimised libraries NVIDIA ships (e.g. CuDNN) and don't have the resources to implement the functionality themselves. Some hardware vendors offer proprietary backends, like Apple or Intel and you just have to wait for them to catch up. AMD has ROCm, but that's more of a drop-in replacement that aims at running CUDA code on AMD cards.

    • orbital-decay a year ago

      > NVIDIA invested billions in their software infrastructure over the past 15 years (CUDA was first released in 2007) and basically brute forced their way into academia back in 2008, sponsoring labs around the world to get GPGPU going.

      Worth mentioning that CUDA didn't appear in a vacuum either - GPGPU has been a thing since the first shaders in consumer GPUs, which were also introduced by NVidia in the NV20. CUDA was the result of years of ad-hoc attempts at GPU programming. So they really did build the entire field up from the ground.

    • anigbrowl a year ago

      Thanks much for this in-depth explanation. I've been struggling with this for a while as I am low on the learning curve with a lot of ML stuff (partly due to headaches with finding a GPU that was affordable and properly supported in order to develop elementary competence). I think I'm just going to get an eGPU with a recent & decent CUDA card rather than waiting for a utopia of interoperability and backwards compatibility.

      • mdda a year ago

        Google Colab gives you a $free GPU (usually a 16GB T4) preloaded with frameworks, ready to run. Later, you might be tempted by the Pro(+) version, but there's plenty of scope to move up the learning curve before spending any money.

        • anigbrowl a year ago

          I should check that out. Jetbrains just integrated remote management for code and notebooks into their IDEs and this seems like the perfect way to test. Thanks for the tip!

ngcc_hk a year ago

Meta is sort of strange. They bet a lot on unusual languages and invest a lot in others. And unlike Google, which just tries this and that (which also has a point, as we need innovation; just do not assume it will still exist a few months or years down the road). It's a shame about the company's strange taste in other things. At least it is not as disruptive as Twitter and only hurts itself.

algon33 a year ago

The FAQ restates the content of point 14 in point 13. Point 14 is about why your code might be slower when using 2.0; point 13 should be about how to keep up with PT 2.0 developments. Someone should fix that.

belval a year ago

> We believe that this is a substantial new direction for PyTorch – hence we call it 2.0. torch.compile is a fully additive (and optional) feature and hence 2.0 is 100% backward compatible by definition.

How about just calling it PyTorch 1.14 if it's backward compatible? Version numbering shouldn't be used as a marketing gimmick.

  • js2 a year ago

    Dismissive comments like this make me not want to read HN anymore and in addition it’s against the HN guidelines:

    It’s snarky. It’s incurious. It’s neither thoughtful nor substantive. It’s flame bait. It’s a shallow dismissal. It doesn’t teach anything. It’s the most provocative thing to complain about.

    https://news.ycombinator.com/newsguidelines.html

    I’m sorry I had to leave this comment, so let me also try to respond thoughtfully:

    Assuming that PyTorch is using semantic versioning: semver requires that the major version MUST change when making a backwards-incompatible API change:

    > Major version X (X.y.z | X > 0) MUST be incremented if any backwards incompatible changes are introduced to the public API. It MAY also include minor and patch level changes. Patch and minor versions MUST be reset to 0 when major version is incremented.

    This requirement does NOT preclude changing the major version when making backwards-compatible changes.

    PyTorch has not violated semver here. It is absolutely compatible with semver to bump the major version for marketing reasons.

    https://semver.org/

    • belval a year ago

      Personal attack aside, from your own link:

      > Given a version number MAJOR.MINOR.PATCH, increment the:

      > MAJOR version when you make incompatible API changes

      > MINOR version when you add functionality in a backwards compatible manner

      > PATCH version when you make backwards compatible bug fixes

      > Additional labels for pre-release and build metadata are available as extensions to the MAJOR.MINOR.PATCH format.

      You can point towards some other details, but it doesn't change the fact that for the overwhelming majority of people, the quote above is what semver is. Besides, my original comment does not say "they broke semver"; it says they shouldn't bump the major version if they don't make a backwards-incompatible change, because afterwards the mental model of "Can I use version X.Y.Z?" is broken.

      When TensorFlow moved to 2.0, it was because they were changing from graph and session definitions to eager mode. That makes sense; it means the underlying API and how downstream users interact with it changed. These are just newer features that, while very useful, have limited bearing on downstream users.

    • lostmsu a year ago

      Frankly, they are not really showing benchmarks, and given my experience with hyped torch.jit I don't expect much.

  • pdntspa a year ago

    They're saying it represents a change in direction and is a pretty big feature, traditionally that's been a good reason to increment a major version number.

  • posharma a year ago

    Is this really the biggest problem that needs to be solved in AI?

    • belval a year ago

      Not sure I understand the question. Is versioning the biggest problem? No, but it costs nothing to keep semver and prevent production headaches later.

      If you meant inference speed then yeah it's a very big problem so it's good that they are addressing it.

      • mi_lk a year ago

        What exact production headaches are you expecting from bumping the number from 1.13 -> 2.0, while all existing code keeps working as before?

        And how would that be different from bumping 1.13 to 1.14, if they had named it 1.14 instead?

        • belval a year ago

          The soft kind. Major versions are deeply ingrained as "possible backward-compatibility issues" in most engineers' brains. If you handle model development, evaluation, and deployment yourself, then sure, you won't have any issues, but in a bigger organization you have to get people to switch, and that version number will mean that everyone will ask the same "hang on, this is a major version change?!" question every step of the way.

    • whimsicalism a year ago

      No? What would have given you that impression?

      Oh, I see. You were trying to be dismissive.