Launch HN: Deepsilicon (YC S24) – Software and hardware for ternary transformers

189 points by areddyyt 4 months ago

Hey Hacker News! We’re Abhi and Alex from deepsilicon (https://deepsilicon.com). We are building software and hardware for training and inferencing ternary transformer models. Here's a video of the software: https://www.youtube.com/watch?v=VqBn-I5D6pk.

Transformer-based models are getting bigger every generation, making the inference hardware requirements more and more expensive. Running large transformer models on device is even more challenging. Usually, they require trillions of FLOPs to run at decent speeds and use too much energy and space.

Our solution is to train ternary transformer models. There are two advantages to using ternary values. The first is that the weights can now be stored in two bits (or even fewer) instead of 16 bits. This represents an almost 8x compression ratio for every weight matrix in the transformer model (slightly less because of the float16 scaling value and extra norm, but that’s negligible). The second advantage is a reduction in the arithmetic intensity. If we do a dot product between ternary values and INT8 values, we either add the INT8 if the ternary value is 1, subtract the INT8 if the ternary value is -1, or do nothing if the ternary value is 0. There are numerous ways to take advantage of this change in arithmetic, from lookup tables to bit mask reductions. As for why ternary and not quaternary/binary, ternary hits a sweet spot of compression and (symmetric) representational value for weights in our experiments.
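To make the add/subtract/skip arithmetic and the "almost 8x" figure concrete, here is a minimal pure-Python sketch. The function names are illustrative only, and the loop is a readable reference, not one of the optimized kernels discussed in the post. It assumes one FP16 scale per weight matrix.

```python
def ternary_dot(weights, activations):
    """Dot product of ternary weights (-1, 0, 1) with integer activations.

    No multiplications are needed: add on +1, subtract on -1, skip on 0.
    """
    if len(weights) != len(activations):
        raise ValueError("length mismatch")
    acc = 0
    for w, x in zip(weights, activations):
        if w == 1:
            acc += x
        elif w == -1:
            acc -= x
        elif w != 0:
            raise ValueError(f"not a ternary weight: {w}")
    return acc


def compression_ratio(n_weights, bits_before=16, bits_after=2, scale_bits=16):
    """Storage before / storage after, counting one FP16 scale per matrix."""
    return (n_weights * bits_before) / (n_weights * bits_after + scale_bits)


weights = [1, -1, 0, 1]
activations = [12, -7, 100, 3]              # values in INT8 range
result = ternary_dot(weights, activations)  # 12 - (-7) + 0 + 3
ratio = compression_ratio(4096 * 4096)      # slightly below 8
```

The ratio stays just under 8 because the single scale value is amortized over millions of weights.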

Currently, hardware is not really optimized for extreme low bit-width matrix operations (whether multiplication or otherwise). We’ve tried various implementations of kernels on both CPUs/GPUs (really only NVIDIA GPUs). We don’t even come close to the theoretical maximum speed for our kernels, and a large part of the failure is because the architecture of existing hardware isn’t optimized for the operations we want them to do. Creating custom silicon for ternary LLMs can accelerate inference by implementing and designing algorithms/circuits that only work for ternary LLMs. Unlike most hardware companies, which need silicon to show improvements, we can already show improvements to active VRAM usage and throughput with our custom kernels on existing hardware. This sets pretty impressive lower bounds for custom silicon.

We originally started working on this after reading the BitNet paper from Microsoft, and were disenchanted that we couldn't run SOTA models on our consumer hardware (3090 and 3070M). Both Alex and I worked on research at Dartmouth: I worked more on the ML/model architecture side, while Alex worked on randNLA CUDA kernels to accelerate training. The research experience, and the opportunity to talk to professors, made us realize that if we could pull off ternary transformers, it could solve the large-scale inference problem on the edge and in the cloud.

First, we must either retrain or pretrain a model with our custom linear layers based on the Bitnet 1.58 layers (we’re working on open sourcing our framework for training, data labelling, and synthetic data generation here: https://github.com/deepsilicon/Sila). The model is trained with FP16 weights, but the weights are quantized and the quantization function is detached from the computational graph to allow gradients to flow, and the loss is measured w.r.t. the quantized weights. Once the model converges, we can inference the model with our custom kernels written for CPUs or GPUs (we are working on Inferentia and TPU support). The end goal is to create purpose-built custom silicon to work with the ternary weights, where we can have better compression, throughput, latency, and energy improvements compared to our kernels on existing hardware.
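The training trick described above (quantize in the forward pass, but let gradients flow as if the quantizer were not there) is commonly called a straight-through estimator. Below is a minimal sketch with a hand-written gradient for a single linear neuron. It assumes the absmean quantizer described in the BitNet b1.58 paper. The function names are hypothetical, and the actual deepsilicon layers may differ.

```python
def quantize_ternary(weights, eps=1e-8):
    """Absmean ternary quantizer: scale by mean |w|, round, clip to [-1, 1]."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def train_step(latent, x, target, lr=0.05):
    """One SGD step for y = scale * sum(q_i * x_i) with a straight-through gradient."""
    codes, scale = quantize_ternary(latent)
    # Forward pass and loss use the quantized weights, not the latent ones.
    y = scale * sum(q * xi for q, xi in zip(codes, x))
    err = y - target
    loss = 0.5 * err * err
    # Straight-through estimator: the quantizer is treated as the identity in
    # the backward pass, so the gradient w.r.t. the effective weight
    # (err * x_i) is applied directly to the latent weight.
    new_latent = [w - lr * err * xi for w, xi in zip(latent, x)]
    return new_latent, loss


latent = [0.9, -0.8, 0.05, 0.1]  # full-precision "shadow" weights kept during training
codes, scale = quantize_ternary(latent)
new_latent, loss = train_step(latent, x=[1.0, 2.0, 3.0, 4.0], target=0.0)
```

Note that the latent weights that quantize to 0 still receive updates, which is what lets them cross the rounding threshold later in training. In an autograd framework the same effect is usually obtained by detaching the difference between the quantized and latent weights.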

We know this is a highly challenging problem due to technical and market difficulties. Plenty of hardware companies have tried to accelerate inference, but most are not profitable. The biggest problem in the ML hardware market, perplexingly, is software. It's challenging to convince companies to switch to some new hardware when their entire infrastructure and software stack has been configured for some other hardware. On the technical side, we must support various deployment options and model architectures to make large-scale custom silicon production worthwhile. This is compounded by the fact we want to have a single line of code handle everything, abstracting what we're doing away from the ML engineers. So, we need to handle everything on the technical side: compiling the right kernels for your platform, generating the right bindings for ONNX/TensorRT, tuning the kernels, setting the mode to training or inference, etc.

We’d love to hear your opinions about ASICs for transformer inference - and if you know anyone who might be interested in deploying these models, my email is abhi@deepsilicon.com. We can’t wait to hear what you all think!

danjl 4 months ago

In my experience, trying to switch VFX companies from CPU-based rendering to GPU-based rendering 10+ years ago, a 2-5x performance improvement wasn't enough. We even provided a compatible renderer that accepted Renderman files and generated matching images. Given the rate of improvement of standard hardware (CPUs in our case, and GPU-based inference in yours), a 2-5x improvement will only last a few years, and the effort to get there is large (even larger in your case). Plus, I doubt you'll be able to get your HW everywhere (i.e. mobile) where inference is important, which means they'll need to support their existing and your new SW stack. The other issue is entirely non-technical, and may be an even bigger blocker -- switching the infrastructure of a major LLM provider to a new upstart is just plain risky. If you do a fantastic job, though, you should get acqui-hired, probably with a small individual bonus, not enough to pay off your investors.

  • areddyyt 4 months ago

    We're targeting the edge market first, such as NVIDIA's Jetson line, because it's far less supported/focussed on. In our experience, whenever we did training runs on H100 clusters with x86, any pip package would be easily installable, and a wide array of software just worked. This is not the case in Jetson, where we constantly have to rebuild packages from source, and in general, NVIDIA will only release a better board every five years. As for the second part of your question, we agree. Much of our work has been trying to make switching to our software layer straightforward (a single line of code). The ideal endgame is that, given an ONNX file, we can parse the generated node tree and determine if our hardware supports all the nodes. Of course, this is assuming we have a large enough share of the market using our software, so we know what operations we need to support on the hardware side of things.
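The ONNX coverage check described above can be sketched as a set difference over operator types. This is a simplified illustration: `SUPPORTED_OPS` is a made-up set, and the op list is hard-coded rather than read from a file. With the `onnx` package, the op types would typically come from `node.op_type` for each node in `model.graph.node`.

```python
# Hypothetical set of operators a backend can run; not a real product spec.
SUPPORTED_OPS = {"MatMul", "Add", "Softmax", "LayerNormalization"}


def unsupported_ops(op_types, supported=SUPPORTED_OPS):
    """Return the sorted operator types in a graph that the backend lacks."""
    return sorted(set(op_types) - set(supported))


# Operator types as they might appear in an exported transformer block.
graph_ops = ["MatMul", "Add", "Softmax", "Gelu", "MatMul"]
missing = unsupported_ops(graph_ops)
fully_supported = not missing
```

A non-empty result tells the caller which nodes would have to fall back to another device.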

    • danjl 4 months ago

      I cannot see any way of building HW profitably for the Jetson market. You are really competing with Raspberry PI, not Jetson, IMO. I mean, I'm no expert, but I would suggest doing a deep dive on your business plan if you intend to target the small hardware world rather than spending any time designing HW or SW. Then reduce your estimate by at least half since doing anything in that embedded/edge world has many more technical issues.

      • areddyyt 4 months ago

        In general, Jetson has quite a large market. Vehicle companies use automotive-rated Jetson Orins, and defense companies also use Jetson Orins to power ML applications on the edge (Anduril). Many of the companies we currently talk to are robotics companies that are forced to use Jetsons because they are both the least of the bad options and the only edge compute provider with enough juice to run larger transformer models.

        • danjl 4 months ago

          And the auto and Defense markets are so easy to enter! /s

          Both of these markets have long lead times, tight HW build times, and move incredibly slowly. They are not the kind of markets that like using stuff from new companies with no history. Again, I'm no expert, but I'd say you need to be concentrating on sales and market research now.

          • hedgehog 4 months ago

            With respect it doesn't sound like you know much about any of these businesses. This startup is extremely early, the road to silicon is long, and there is a lot of external change and learning by doing that will happen between here and there. This is them getting started and based on my related work experience I think it's pretty interesting.

          • areddyyt 4 months ago

            We are not under the illusion these markets are easy to enter. Still, we believe providing an effortless and compatible experience for edge ML computing is a strong competitive advantage. We have not met anyone who likes using Jetsons yet, unlike A100/H100s in the server market.

            Edit: I should note that if it weren't for Dusty and his docker image generating GitHub repo for Jetson, we would have spent weeks trying to get our kernels and optimized models shipped to customers.

          • autoconfig 4 months ago

            What's your point? Is it that one shouldn't attempt to enter a market just because it's difficult? Or are you trying to educate the founders about something obvious that they likely have already spent 1000x more time thinking about than you?

            • motoxpro 4 months ago

              This 1000%. Just because a business in a tangential area didn't work, doesn't mean innovation shouldn't happen

    • danjl 4 months ago

      I think the only way this could work is if you had the backing of one of the major LLM providers who decided that your ideas are worth doing a PoC. That way you actually have a client on board before you spend all the money. I know you guys probably like the designing of the HW and SW, and maybe the implementation of both, but really, what you need now is to do sales.

      • lumost 4 months ago

        There are multiple ways to run a business like this.

        1. Go deep on the tech, there are funders who will want equity stakes in risky startups because they operate in adjacent markets. It's often cheaper to invest 1MM on a startup than internal R&D activities. If it has promising results, those same investors may ramp up their spend or pivot to an acquisition strategy.

        2. Get early customers, if you have 1-10 large enterprises with a committed spend - then you are likely golden. However as nice as this option sounds, there are few avenues to get this type of commitment. If you are in the fortunate position of knowing the exec/founding/investor team of a large LLM provider - it's possible. But easier said than done.

        3. Build it and they will come, business strategies take time to develop - maybe that time is poorly spent. Build the best version of your product and someone might take it up. There are a few investors who will take a flyer on this type of founder mentality. Benefit to the investor is that they can get a much larger equity stake/board position in exchange for the early creative freedom. If it works out, the investor can get a lot of alpha. A card which handled LLM inference at 1/100th the cost of an H100 could produce quite a bit of value for the right buyer.

        • threeseed 4 months ago

          The most realistic and likely scenario is:

          4. Do the technical work to get it a little bit beyond just an idea and then get acqui-hired by a large company who has the resources to push this.

          So if I was them I would be doing thought experiments on how this technology could benefit a whole range of businesses e.g. gaming consoles, televisions etc. Not many people would've guessed LG acquiring Palm for example.

      • areddyyt 4 months ago

        Agreed. We don't plan on making hardware until there is enough demand from customers to make it economically viable.

    • acidhax 4 months ago

      I'm currently working on a portable computer vision project using Pi/Jetson with some Luxonis camera modules and I completely see where you're headed. In the long-game I think you could capture hw accelerated robotics CV.

    • ben_ja_min 4 months ago

      Why not target the enthusiast first? The buzz created around something interesting an "amateur" cooked up may be what you need. The investment involved with creating dev hardware should be minimal, correct?

      • simne 4 months ago

        I may be wrong, but from a few other enthusiast niches I conclude that the number of enthusiasts is too small to fund hardware development. You need millions of sales, but most real projects have made only thousands.

        And this has long been known: even the Raspberry Pi was born for another market; fortunately, it was not simply killed but was converted to target enthusiasts, and even now it is an incomplete project.

  • jroesch 4 months ago

    I have been working in DL inference for 7+ years now (5 of which at a startup), which makes me comparatively ancient in the AI world at this point. The performance rat race/treadmill is never-ending, and to your point a large (i.e., 2x+) performance improvement is not enough of a "painkiller" for customers unless there is something that is impossible for them to achieve without your technology.

    The second problem is distribution: it is already hard enough to obtain good enough distribution with software, let alone software + hardware combinations. Even large silicon companies have struggled to get their HW into products across the world. Part of this is due to the actual purchase dynamics and cycle of people who buy chips: many design products and commit to N-year production cycles of products built on certain hardware SKUs, meaning you have to both land large deals and have opportune timing to catch them when they are even shopping for a new platform. Furthermore, the people with existing distribution, i.e., the Apple, Google, Nvidia, Intel, AMD, Qualcomms of the world, already have distribution and their own offerings in this space and will not partner with/buy from you.

    My framing (which has remained unchanged since 2018) is that for a silicon platform to win you have to beat the incumbents (i.e., Nvidia) on the 3Ps: Price (really TCO), Performance, and Programmability.

    Most hardware accelerators may win on one, but even then it is often theoretical performance because it assumes their existing software can/will work on your chip, which it often doesn't (see AMD and friends).

    There are many other threats that come in this form, for example if you have a fixed function accelerator and some part of the model code has to run on CPU the memory traffic/synchronization can completely negate any performance improvements you might offer.

    Even many of the existing silicon startups have been struggling with this since the middle of the last decade; the only thing that saved them is the consolidation to Transformers, but it is very easy for a new model architecture to come out and require everyone to rework what they have built. This need for flexibility is what has given rise to the design ethos around GPGPU, as flexibility in a changing world is a requirement, not just a nice-to-have.

    Best of luck, but these things are worth thinking deeply about as when we started in this market we were already aware of many of these things but their importance and gravity in the AI market have only become more important, not less :)

    • areddyyt 4 months ago

      We've spent a lot of time thinking about these things, in particular, the 3Ps.

      Part of making the one line of code work is addressing programmability. If you're on Jetson, we should load the CUDA kernels for Jetson. If you're using a CPU, we should load the CPU kernels. If it's a CPU with AVX512, we should load the appropriate kernels with AVX512 instructions, etc.

      The end goal is that when we introduce our custom silicon, one line of code should make it far easier to bring customers over from Jetson/any other platform because we handle loading the correct backend for them.

      We know this will be bordering impossible, but it's critical to ensure we take on that burden rather than shifting it to the ML engineer.
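The backend selection described in this comment can be sketched as a small dispatch function. Everything here is hypothetical: the backend names, the function, and the rule that "CUDA on aarch64 means Jetson" (a simplification, since other ARM systems with NVIDIA GPUs exist). In practice the inputs might come from `platform.machine()`, a CUDA availability check, and the CPU flags reported by the OS (`avx512f` is the flag name Linux uses in `/proc/cpuinfo`).

```python
def select_backend(has_cuda, machine, cpu_flags):
    """Pick a kernel backend name from a few platform facts.

    has_cuda:  whether a CUDA device is usable
    machine:   architecture string, e.g. "x86_64" or "aarch64"
    cpu_flags: set of CPU feature flags, e.g. {"avx2", "avx512f"}
    """
    if has_cuda and machine in ("aarch64", "arm64"):
        return "cuda-jetson"   # simplifying assumption: ARM + CUDA is a Jetson
    if has_cuda:
        return "cuda"
    if "avx512f" in cpu_flags:
        return "cpu-avx512"
    return "cpu-generic"


jetson = select_backend(True, "aarch64", set())
server = select_backend(True, "x86_64", {"avx512f"})
laptop = select_backend(False, "x86_64", {"avx2"})
```

Keeping the decision in one place is what lets the user-facing call stay a single line regardless of the platform.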

      • danjl 4 months ago

        Why start a company to make this product? Why not go work at one of the existing chip manufacturers? You'd learn a ton, get to design and work on HW and/or SW, and not have to do the million other things required to start a company.

        • areddyyt 4 months ago

          We were waiting for a Bitnet-based software and hardware stack, particularly from Microsoft, but it never came. We were essentially nerd-sniped into working on this problem, then we realized it was also monetizable.

          On a side note, I deeply looked into every company in the space and was thoroughly unimpressed with how little they cared about the software stack to make their hardware seamlessly work. So, even if I did go to work at some other hardware company, I doubt a lot of customers would utilize the hardware.

          • danjl 4 months ago

            I recommend getting a job at NVIDIA. They care deeply about SW. It is a great place to learn about HW and the supporting SW. There is much to learn. Maybe you will learn why you are unimpressed with their SW offerings. For me, the hard part was the long lead time (8+ years) from design to customers using the product. One of the things that always amazed me about NVIDIA was that so many of the senior architects, who have no financial need to keep working (true for more than a decade), are still working there because they need the company to do what they love.

            • areddyyt 4 months ago

              I think there is a comment somewhere here where I comment on NVIDIA, but I think NVIDIA is the best hardware company for making good software. We had a very niche software issue for which NVIDIA maintained open-source repos. I don't think NVIDIA's main advantage is its hardware, though; I think it's the software and the flexibility it brings to its hardware.

              Suppose that Transformers die tomorrow, and Mamba becomes all the rage. The released Mamba code already has CUDA kernels for inference and training. Any of the CSPs or other NVIDIA GPU users can switch their entire software stack to train and inference Mamba models. Meanwhile, we'll be completely dead in the water with similar companies that made the same bet, like Etched.

              • danjl 4 months ago

                You said (implied?) that your reason for starting a company was that you were waiting for somebody (MS) to build your favorite tech, and you realized it was monetizable. Finding a gap is a great start. But, if money is your goal, it is far easier to make money working at a company than starting one. Existing companies are great places to learn about technology, business, and the issues that should really drive your desire to start something yourself.

                • areddyyt 4 months ago

                  I don't think I ever implied we started this for money. We started working on the technology because it was exciting and enabled us to run LLMs locally. We wouldn't have started this company if someone else came along and did it, but we waited a month or two and didn't see anyone making progress. It just so happens that hardware is capital intensive, so making hardware means you need access to a lot of capital through grants (which Dartmouth didn't have for chip hardware) or venture capital (which we're going for now). I'm not sure where you got the idea we're doing this solely for money when I explicitly said "We were essentially nerd-sniped into working on this problem"

                  • danjl 4 months ago

                    Glad to hear money isn't your focus. Your comment "...then we realized it was also monetizable" was the reason for my interpretation. It's also a very common rationale. I don't know what "nerd-sniped" means, so...

                    Good luck with the VCs. I hope you all stay friends through the challenging process.

              • simne 4 months ago

                > I think NVIDIA is the best hardware company for making good software

                I must agree with you. For a long time I thought Intel was the best, but unfortunately I can't say that anymore.

                I must admit, I still don't understand how it happened, but now NVIDIA is the best.

                • areddyyt 4 months ago

                  100%.

                  When performing performance optimization on CPUs, I was impressed with Intel's suite of tools (like VTUNE). NVIDIA has some unbelievable tools, like Nsys and, of course, its container registry (NGC), which I think surpasses even Intel's software support.

  • gchadwick 4 months ago

    Is GPU rendering used today for VFX? From a quick google it seems that yes, GPU-based rendering is definitely an option, even if there are various reasons to still prefer CPU. So in your case, was what you were aiming to do really pointless, or did your particular solution simply fail to succeed?

    You're right that as a small player it's very hard to gain traction, even if the tech is fantastic, because it's risky to switch your tech stack over. Though if you do do a good job with the tech, I'd say you have a decent chance of an acquisition from a bigger player who wants a ready-made (or 90% of the way there) solution they can make their own. Perhaps you can call this an acqui-hire, but I think you're significantly underplaying the potential upside of this exit. Imagine this startup is seen as having a great ternary transformer solution and ternary transformers are the way to go: you could get multiple large players eyeing up an acquisition to get ahead, pushing the price up.

    My feeling is custom ASICs for ternary transformers is a great area to look at. There is a genuine chance of providing a significant step up from GPUs in terms of power efficiency and potentially performance. Plenty of risk of course, ternary models might just not perform as well as the full fat equivalents and building custom silicon, especially as a start-up, comes with all kinds of issues.

    • jsheard 4 months ago

      > Is GPU rendering used today for VFX?

      Yes by small studios with the agility to change their workflow without too much friction, and whose projects are small enough to fit into the constraints of GPU renderers, but largely not by huge studios who already have in-house CPU farms and whose projects need hundreds of gigs of RAM to render anyway.

      • kridsdale3 4 months ago

        The Unreal Engine I hear is getting a lot of work these days.

cs702 4 months ago

Watching the video demo was key for me. I highly recommend everyone else here watches it.[a]

From a software development standpoint, usability looks great, requiring only one import,

  import deepsilicon as ds

and then, later on, a single line of Python,

  model = ds.convert(model)

which takes care of converting all possible layers (e.g., nn.Linear layers) in the model to use ternary values. Very nice!

The question for which I don't have a good answer is whether the improvement in real-world performance, using your hardware, will be sufficient to entice developers to leave the comfortable garden of CUDA and Nvidia, given that the latter is continually improving the performance of its hardware.

I, for one, hope you guys are hugely successful.

---

[a] At the moment, the YouTube video demo has some cropping issues, but that can be easily fixed.

  • areddyyt 4 months ago

    Thank you!

    CUDA and Nvidia are practically impenetrable on the server side. To be very concrete, we did training for our models on AWS with ParallelCluster. We used P5 instances (8xH100) that were scheduled with SLURM. A problem we ran into, however, was that our training jobs were containerized. Thankfully, pyxis and enroot exist to run containerized jobs on SLURM. And who else but Nvidia develops and maintains those plugins? For practically any weird niche use case, Nvidia seems to have some software solution - but only on x86.

    Jetson is a whole other beast. There is no guarantee any pip package you install has an aarch64/arm64 wheel. For example, we could not use torch_tensorrt to compile to TensorRT via Torch Inductor. Why? Because the Bazel build system was only configured to build for Jetpack 4.6 or Jetpack 5.1, and we were using Jetpack 6. While Nvidia provides docker images for x86 systems that come with torch_tensorrt installed, their L4T (Linux for Tegra) images do not. Instead, we had to manually write out a new workspace file and compile for Jetpack 6 to provide TensorRT compiling support.

    tl;dr: Nvidia and CUDA have a great walled garden on x86, not so much on their edge computing devices

    • cs702 4 months ago

      My understanding is that, so far, most deployments of AI on edge devices are on mass-market mobile and entertainment devices relying on software and hardware tightly controlled by a handful of mega-corporations, such as Apple (iOS), Google (Android), Samsung (phones, TVs, etc.), and Tesla (proprietary in-car chips for FSD), and so on. Aren't those mega-corporations, not Nvidia, the ones who have the actual walled gardens on AI edge computing?

      Do you think otherwise?

      • areddyyt 4 months ago

        You're absolutely right about mobile devices (Apple, Google, etc.). However, most companies, with the exception of Tesla, do use Nvidia for edge computing capabilities. We know for a fact that most of the automotive industry uses automotive rated Orins (the 32GB unified RAM SKU) [1] and Anduril also use Orins. Our primary GTM is with robotics companies, and we have not met a single robotics company not using Jetson, I'm not exaggerating.

        [1] Particularly vehicles with advanced self driving capabilities. Qualcomm is another large vendor of hardware for vehicles (though they have even worse support)

        • cs702 4 months ago

          > Our primary GTM is with robotics companies, and we have not met a single robotics company not using Jetson, I'm not exaggerating.

          Huh. That's a really good sign. I'm rooting for you!

  • areddyyt 4 months ago

    Video cropping issues should be fixed!

0xDA7A 4 months ago

I think the part I find most interesting about this is the potential power implications. Ternary models may perform better in terms of RAM and that's great, but if you manage to build a multiplication-free accelerator in silicon, you can start thinking about running things like vision models in < 0.1W of power.

This could have insane implications for edge capabilities, robots with massively better swarm dynamics, smart glasses with super low latency speech to text, etc.

I think the biggest technical hurdle would be simulating the non linear layers in an efficient way, but you can also solve that since you already re-train your models and could use custom activation functions that better approximate a HW efficient non linear layer.

  • areddyyt 4 months ago

    The non-linear layers, particularly the softmax(QK^T), will be crucial to getting ultra-low latency and high throughput. We're considering some custom silicon just for that portion of every transformer block

jacobgorm 4 months ago

I was part of a startup called Grazper that did the same thing for CNNs in 2016, using FPGAs. I left to found my own thing after realizing that new better architectures, SqueezeNet followed by MobileNets, could run even faster than our ternary nets on off-the-shelf hardware. I’d worry that a similar development might happen in the LLMs space.

  • areddyyt 4 months ago

    It's always possible, but transformers have been around since 2017 and don't seem to be going anywhere. I was bullish on Mamba and researched extended context for structured state-space models at Dartmouth. However, no one cared. The bet we're taking is that Transformers will dominate for at least a few more years, but our bet could be wrong.

nicoty 4 months ago

Could the compression efficiency you're seeing somehow be related to 3 being the closest natural number to the number e, which also happens to be the optimal radix choice (https://en.wikipedia.org/wiki/Optimal_radix_choice) for storage efficiency?

  • areddyyt 4 months ago

    We don't achieve peak compression efficiency because more complex weight unpacking mechanisms kill throughput.

    To be more explicit, the weight matrix's values belong to the set of -1, 0, and 1. When using two bits to encode these weights, we are not effectively utilizing one possible state:

    10 => 1, 01 => 0, 00 =>-1, 11 => ?

    I think selecting the optimal radix economy will have more of a play on custom silicon, where we can implement silicon and instructions to rapidly decompress weights or work with the compressed weights directly.
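To illustrate the trade-off in this comment, here is a small sketch comparing the 2-bit encoding given above (10 => 1, 01 => 0, 00 => -1, with 11 unused) against a denser base-3 packing that stores five weights per byte (3**5 = 243 fits in 8 bits). The function names and byte layout are assumptions for illustration. The base-3 scheme is a generic technique, not something the thread says deepsilicon uses.

```python
import math

ENCODE_2BIT = {1: 0b10, 0: 0b01, -1: 0b00}          # mapping from the comment
DECODE_2BIT = {code: w for w, code in ENCODE_2BIT.items()}


def pack_2bit(weights):
    """Pack four ternary weights per byte, first weight in the lowest bits."""
    if len(weights) % 4:
        raise ValueError("length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for j, w in enumerate(weights[i:i + 4]):
            byte |= ENCODE_2BIT[w] << (2 * j)
        out.append(byte)
    return bytes(out)


def unpack_2bit(data):
    """Inverse of pack_2bit; the unused code 0b11 raises KeyError."""
    return [DECODE_2BIT[(byte >> (2 * j)) & 0b11] for byte in data for j in range(4)]


def pack_base3(weights):
    """Pack five ternary weights per byte as one base-3 number (max 242)."""
    if len(weights) % 5:
        raise ValueError("length must be a multiple of 5")
    out = bytearray()
    for i in range(0, len(weights), 5):
        value = 0
        for w in reversed(weights[i:i + 5]):
            value = value * 3 + (w + 1)   # map -1, 0, 1 to digits 0, 1, 2
        out.append(value)
    return bytes(out)


def unpack_base3(data):
    """Inverse of pack_base3; needs a divide/modulo per weight."""
    weights = []
    for value in data:
        for _ in range(5):
            value, digit = divmod(value, 3)
            weights.append(digit - 1)
    return weights


weights = [1, 0, -1, 1, 0, 0, -1, -1, 1, 1, 0, 1, -1, 0, 1, -1, 0, 0, 1, -1]
packed2 = pack_2bit(weights)     # 2.0 bits per weight
packed3 = pack_base3(weights)    # 1.6 bits per weight
entropy_bound = math.log2(3)     # about 1.585 bits per weight
```

The 2-bit form unpacks with a shift and a mask, while the base-3 form needs a division and modulo per weight, which is the kind of unpacking cost the comment says hurts throughput on existing hardware.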

nostrebored 4 months ago

What do you think about the tension between inference accuracy and the types of edge applications used today?

For instance, if you wanted to train a multimodal transformer to do inference on CCTV footage I think that this will have a big advantage over Jetson. And I think there are a lot of potentially novel use cases for a technology like that (eg. if I'm looking for a suspect wearing a red hoodie, I'm not training a new classifier to identify all possible candidates)

But for sectors like automotive and defense, is the accuracy loss from quantization tolerable? If you're investing so much money in putting together a model, even considering procuring custom hardware and software, is the loss in precision worth it?

  • areddyyt 4 months ago

    Great question. So a little bit of background about quantization (apologies if you are already familiar).

    There are two types of quantization (generally), post training quantization (PTQ) and quantization aware training (QAT).

    PTQ almost always suffers from some kind of accuracy degradation. This is because usually the loss is measured with respect to the FP16/BF16 parameters, and so the weights and distribution are selected to minimize the loss with those weights. Once the quantization function is applied, the weights and distribution change in some way (even if it's by a tiny amount), resulting in your model no longer being at a minimum.

    We do QAT to get around the problem of PTQ. We actually quantize the weights during the forward pass of training, and measure the loss with respect to the quantized weights. As a result, once we converge the model, we have converged the ternary weights as well, and the accuracy it achieved at the end of training is the accuracy of the quantized model. At ~3B parameters the accuracy on downstream task performance between FP16 and ternary weights is identical.

henning 4 months ago

I applaud the chutzpah of doing a company where you develop both hardware and software for the hardware. If you execute well, you could build yourself a moat that is very difficult for would-be competitors to breach.

sidcool 4 months ago

Congrats on launching. This is inspiring.

transfire 4 months ago

Combine it with TOC, and then you’d really be off to the races!

https://intapi.sciendo.com/pdf/10.2478/ijanmc-2022-0036#:~:t...

  • areddyyt 4 months ago

    Funnily enough, our ML engineer, Eddy, did a hackathon project working with Procyon to make a neural network with a photonic chip. Unfortunately, I think Lightmatter beat us to the punch.

    Edit: I don't think the company exists in its current form anymore

Havoc 4 months ago

> This represents an almost 8x compression ratio for every weight matrix in the transformer model

Surely you’d need more ternary weights though to achieve same performance outcome?

A bit like a Q4 quant is smaller than a Q8 but also tangibly worse so the “compression” isn’t really like for like

Either way, excited about more ternary progress.

  • areddyyt 4 months ago

    We do quantization-aware training, so the model should minimize the loss w.r.t. the ternary weights, hence no degradation in performance.

stephen_cagle 4 months ago

Is one expectation of moving from a 2^16-state parameter to a tristate one that the tristate parameter will only need to learn those of the 2^16 states that were actually significant? I.e., can we prune the "extra" bits from the 2^16 that did not really affect the result?

mikewarot 4 months ago

Since you're flexible on the silicon side, perhaps consider designing things so that the ternary weights are loaded from an external configuration ROM into a shift register chain, instead of being fixed. This would allow updating the weights without having to go through the whole production chain again.

tejasvaidhya 4 months ago

There’s more to it. https://x.com/NolanoOrg/status/1813969329308021167

I will be archiving the full report with more results soon.

  • areddyyt 4 months ago

    I should note that our linear layers are not the same as Microsoft's. In fact, we think Microsoft made a mistake in the code they uploaded. When I have time later today, I'll link to where I think they made a mistake.

    I've been following TriLLM. They've achieved great results, and I'm really impressed with the llama.cpp contributors already getting the models integrated.

99112000 4 months ago

An area worth exploring is IP cameras, imho.

1. They are everywhere and aren't going anywhere. 2. The network infrastructure to ingest and analyze thousands of cameras producing video footage is very demanding. 3. Low power and low latency scream ASIC to me.

  • areddyyt 4 months ago

    There was another founder that said this exact same thing. We'll definitely look into it especially as we train more ViTs.

bjornsing 4 months ago

Have you tried implementing your ternary transformers on AVX(-512)? I think it fits relatively well with the hardware philosophy, and being able to run inference without a GPU would be a big plus.

  • areddyyt 4 months ago

    Our CPU implementation for x86/AMD64 utilizes AVX-512 or AVX2 instructions where possible. We're experimenting with support for ARM with NEON.

marmaduke 4 months ago

What kind of code did you try on the CPU for, say, ternary GEMM? I imagine ternary values map nicely to vectorized mask instructions, and that much of the tiling etc. from the usual GEMM carries over.

dnnssl2 4 months ago

What is the upper bound on the level of improvement (high performance networking, memory and compute) you can achieve with ternary weights?

maratc 4 months ago

Is there a possibility that this can run on specialized hardware which is neither a CPU nor a GPU, e.g. NextSilicon Maverick chips?

lappa 4 months ago

Great project, looking forward to seeing more as this develops.

Also FYI, your mail server seems to be down.

  • areddyyt 4 months ago

    Thank you, and good catch.

    We recently acquired deepsilicon.com, and it looks like the forwarding hasn't been registered yet. abhi@deepsilicon.net should work.

ccamrobertson 4 months ago

Congrats, always cool to see YC founders working on silicon!

luke-stanley 4 months ago

The most popular interfaces (human, API and network) I can imagine are ChatGPT, an OpenAI-compatible HTTP API, the Hugging Face Transformers API and models, llama.cpp / Ollama / Llamafile, and PyTorch, plus USB-C, USB-A, RJ45 and HDMI/video(?). If you can run a frontier model, or a comparable one, with a ChatGPT clone like Open UI over a USB or LAN interface, working on private data quickly, securely and competitively with a used 3090, it would be super badass. It should be easy to plug in and use for chat, API access, fine-tuning, or raw primitives via PyTorch or a very similar compatible API. I've thought about this a bit. There's more I could say but I've got to sleep soon... Good luck, it's an awesome opportunity.

  • areddyyt 4 months ago

    Have you sat in on my conversations with my cofounder?

    The end plan is to have a single chip and flush all weights onto the chip at initialization. Because we are a single line of code that is Torch compatible (hence HF compatible), every other part of the codebase shouldn't change.

    • luke-stanley 4 months ago

      I've not but that sounds cool! I would point out though, in terms of mind share, how memorable, and how relatable and useful the products are: it might help to have ways that directly show the application for the kinds of people buying GPUs for inference and training or using cloud for this that would love to not have to fight their ATX case in a hot sweaty corner while repeatedly dropping screwdrivers and calculating how much RAM they need to buy for the 405B while llama.cpp is recompiling again... I think people would throw money at that. I'd be happy to listen in or have a chat some time!

anirudhrahul 4 months ago

Can this run Crysis?

  • 0xDA7A 4 months ago

    Can this run Doom?

    • campers 4 months ago

      Can it generate Doom at runtime?

Taniwha 4 months ago

Yeah, I've been thinking about this problem for a while at the gate level. I think the problem essentially breaks down to a couple of pop counts and a subtract; it's eminently pipelineable.
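
A rough software sketch of one reading of that decomposition, not a description of anyone's hardware. It assumes the ternary weights are stored as two bitmasks (one for +1, one for -1) and the activations are unsigned integers split into bit planes; all function names here are hypothetical. Each bit plane then costs two pop counts and one subtract:

```python
def pack_ternary(weights):
    """Split ternary weights into two bitmasks.

    Bit i of `pos` is set when weights[i] == +1, bit i of `neg` when it is -1.
    """
    pos = neg = 0
    for i, w in enumerate(weights):
        if w == 1:
            pos |= 1 << i
        elif w == -1:
            neg |= 1 << i
        elif w != 0:
            raise ValueError(f"not a ternary weight: {w!r}")
    return pos, neg

def popcount(word):
    # Number of set bits in a non-negative integer.
    return bin(word).count("1")

def bit_plane(activations, k):
    """Pack bit k of every activation into one word (bit i <- activation i)."""
    word = 0
    for i, a in enumerate(activations):
        if (a >> k) & 1:
            word |= 1 << i
    return word

def ternary_dot(pos, neg, activations, bits=8):
    """Dot product of ternary weights with unsigned `bits`-bit activations."""
    total = 0
    for k in range(bits):
        plane = bit_plane(activations, k)
        # Two pop counts and a subtract per bit plane, weighted by 2**k.
        total += (popcount(pos & plane) - popcount(neg & plane)) << k
    return total

weights = [1, -1, 0, 1, -1, 0, 1, 1]
activations = [3, 0, 7, 255, 12, 1, 128, 64]
pos, neg = pack_ternary(weights)
print(ternary_dot(pos, neg, activations))  # → 438
```

With 1-bit activations the loop collapses to a single plane, which is the literal "two pop counts and a subtract" case. The planes are independent of each other, so they can be computed in parallel or in pipeline stages.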

hy3na 4 months ago

Ternary transformers existed long before you guys: TerDiT, vision ones, etc. Competing in the edge inference space is likely going to require a lot of capex and opex, plus breaking into markets like defense that are hard asf without connections and a strong team. Neither of you guys are chip architects either, and taping out silicon requires a lot of foresight about changing market demands. Good luck, hopefully it works out.

felarof 4 months ago

Very interesting!

_zoltan_ 4 months ago

you might want to redo the video as it's cropped too much, and maybe it's only me but it's _really_ annoying to watch like this.

  • areddyyt 4 months ago

    Oops, good catch. Will re-upload shortly.

  • dang 4 months ago

    Thanks! We've updated the YouTube link at the top to the fixed version.