Show HN: We made glhf.chat – run almost any open-source LLM, including 405B

glhf.chat

161 points by reissbaker a month ago

Try it out! https://glhf.chat/

Hey HN!

We’ve been working for the past few months on a website to let you easily run (almost) any open-source LLM on autoscaling GPU clusters. It’s free for now while we figure out how to price it, but we expect to be cheaper than most GPU offerings since we can run the models multi-tenant.

Unlike Together AI, Fireworks, etc., we’ll run any model that the open-source vLLM project supports: we don’t have a hardcoded list. If you want a specific model or finetune, you don’t have to ask us for it: you can just paste the Hugging Face link in and it’ll work (as long as vLLM supports the base model architecture, we’ll run anything up to ~640GB of VRAM, give or take a little for some overhead buffer).
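
To get an intuition for the ~640GB cap: weight memory is roughly parameter count × bytes per parameter. Here’s a rough back-of-the-envelope sketch (my own arithmetic, not their actual sizing logic; the overhead factor is a guess):

  # Rough sketch, not glhf's sizing logic: does a model fit under ~640GB?
  BYTES_PER_PARAM = {"fp16": 2, "fp8": 1, "int4": 0.5}

  def fits(num_params: float, dtype: str, cap_gb: float = 640.0) -> bool:
      weights_gb = num_params * BYTES_PER_PARAM[dtype] / 1e9
      overhead = 1.2  # guessed headroom for KV cache and activations
      return weights_gb * overhead <= cap_gb

  print(fits(70e9, "fp16"))   # True:  ~140GB of weights
  print(fits(405e9, "fp16"))  # False: ~810GB of weights
  print(fits(405e9, "fp8"))   # True:  ~405GB of weights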

Large models will take a few minutes to boot, but if a bunch of people are trying to use the same model, it might already be loaded and not need boot time at all. The Llama-3-70b finetunes are especially nice, since they’re basically souped-up versions of the 8b finetunes a lot of people like to run locally but don’t have the VRAM for. We’re expecting the Llama-3.1 finetunes to be pretty great too once they start getting released.

There are some caveats for now — for example, while we support the Deepseek V2 architecture, we actually can only run their smaller “Lite” models due to some underlying NVLink limitations (though we’re working on it). But for the most part if vLLM supports it, we should too!

We figured Llama-3.1-405B Launch Day was a good day to launch ourselves too — let us know in the comments if there’s anything you want us to support, or if you run into any issues. I know it’s not “local” Llama, but, well, that’s a lot of GPUs…

vlugorilla a month ago

Please, accept cryptocurrency payments! I've found very few LLM providers that can be paid with crypto, and most of them are of bad quality.

  • azinman2 a month ago

    I’m curious about your motivation. To me this is the easiest way for scams/bad actors to generate a lot of not-good-for-society content.

    • codetrotter a month ago

      And the US Dollar in cash is the easiest way for drug dealers to get payment for the drugs they are pushing on the street. Yet, the answer to that is not to demand that society goes cash-less. Money will always be used by people you disagree with. And you should not try to use the money itself as a way of limiting what people do. For if you do, one day it could be you who finds yourself unable to spend your money the way you want, because someone else, more powerful than you, disagrees with how you spend it.

      Using money that is yours should never be illegal. Prosecute the bad people for the bad thing they did in the first place to get the money / the bad thing that someone paid them money for.

      • newaccount74 a month ago

        > Prosecute the bad people for the bad thing they did in the first place

        How do you expect to do that, without following the money?

        If someone shows up with a suitcase full of money, it's extremely unlikely that they earned this money in a legal way and paid taxes.

        Same thing with crypto. Theoretically it's possible that people use it for legal means, in practice 99% of people do not.

        • throw101010 a month ago

          It's much easier to follow someone using a public ledger than having to beg/subpoena banks for records with the broken/outdated AML/KYC/KYB system, with bank secrecy in many places, with banks participating in illicit activities too sometimes, etc.

          > Theoretically it's possible that people use it for legal means, in practice 99% of people do not.

          This is a completely made-up statistic, and it shows your bias and/or ignorance about the topic. The lower estimates from industry sources like Chainalysis (0.15-0.62%) contrast sharply with higher academic estimates (23-46%) because the latter tend to include illicit activities that happen off-chain but get "washed" on-chain, which also explains their huge range of estimates... it is hard to quantify, but nobody serious has ever come close to 99%.

          Something for which there are estimates close to 90% is the volume of transactions happening on centralized exchanges, and since these are required pretty much everywhere to follow AML procedures, just like traditional banks (sometimes even more intrusive ones), it is just as easy, if not easier, to prosecute criminals who use them... with the added bonus of having a public ledger with records of their activity on-chain.

        • zorrn a month ago

          It's either a bad person or MrBeast.

      • 9dev a month ago

        > cash is the easiest way for drug dealers to get payment for the drugs they are pushing on the street. Yet, the answer to that is not to demand that society goes cash-less.

        That is exactly what many are proposing, or at least a core argument of many anti-cash proponents.

        Crypto is not money. Money is coupled to a value; crypto is as much a currency as paintings are: completely arbitrary.

        • ImHereToVote a month ago

          What value is money coupled to?

          • 9dev a month ago

            Effectively, human work. Money is an abstraction over the relative value we assign to physical goods; instead of bartering, we collectively agree on using money as an intermediate form. The financial system, then, layers much more abstraction over the value of the things producers make, and thereby over human work.

            Of course it's way more complex than that, but that's the basic difference between real-world money and crypto currencies.

            • ImHereToVote a month ago

              I'm not a crypto bro by any means, but I think crypto satisfies those criteria.

              If you want to know the actual answer, however, it is debt. Or, more specifically, assets and liabilities on various bank ledgers. Those banks organically set the value of monetary systems. This system would fall apart without a forced taxation system for a given currency.

              The actual value is debt and taxation. I'm not complaining, who am I to criticize? But that is the answer.

              One could conceive of such a crypto ledger system, and that would be a CBDC.

              • 9dev a month ago

                Doesn't that come down to the same thing effectively? Debt and taxation are details of a system of trusted authorities built around the economic system of human work.

                And if we consider a CBDC to still count as cryptocurrency, in my opinion we've just shifted the goal posts from a decentralised, anonymous, revolutionary, grass-roots form of digital currency to the technical underpinnings of the digitisation process of the global banking system.

                • ImHereToVote a month ago

                  The current centralization is what keeps it "stable". This stability makes the currency lose its value in a very gradual way. Only housing, property, education, and healthcare are gaining value at an outlier rate.

                  Decentralization will incentivize hoarding. Why spend crypto when it might go to the moon tomorrow? Maybe that is a feature, maybe not.

                  I don't judge.

            • 0xedd a month ago

              Wrong. For all intents and purposes, you could replace any cryptocurrency with whatever FIAT in your sentence and it would still apply. I don’t hunt money and wear it as a necklace (lion teeth). I don’t mine it out of a gold mine. It is given to me in exchange for work.

              Given. Exchange.

              If you and I agree that we only accept little tin circles, what’s the difference? When you say money, what are you referring to? Just the US bank note? Some brass coins? Any FIAT?

              Today’s FIAT, basically just a fancy IOU note, has nothing to do with what currency used to be. Currency WAS representative of human work. Directly. The oldest known form is lion teeth. You had to WORK to get them. It was direct proof of work. And it was important to keep that link so that no arbitrary value was taken from or added to it.

              The next major leap of currency was IOU notes. Notes that banks respected between each other. One bank gives you some paper that says "trust me, bro, I'm worth 2 lion teeth". You go to another bank and can exchange that paper back to 2 lion teeth.

              At one point in time, the US bamboozled the entire world and declared the US dollar a new kind of gold bar (gold being what it was previously tied to; proof of work). Tying all of FIAT to it. A value that can more easily be arbitrarily changed. And, since then, propaganda reigns supreme and people like you are born. Praising a nonsensical paper as some kind of immovable artefact of mankind.

              Language evolved. Law evolved. Currency didn't evolve. And, for some reason, you are fighting its evolution.

              Don't vote.

              • 9dev a month ago

                This is simply not correct. A fiat currency is ultimately bound to the performance of the global economy, so it sure ties into gold mines or the lion teeth gathered by hunters, if you’re so inclined to keep it savage.

                The dollar is fundamentally a promise of the US government that they owe you a given amount of money. As long as the US government exists, a dollar is going to have a worth. You may debate the virtue of the US government all you want, but at the same time, there's nobody to give you any such guarantee for a Bitcoin, or an NFT. The moment the market settles for a new plaything, these binary numbers you praise will be entirely worthless. So when I say money, I'm referring to the trustworthiness of its issuer.

                The only thing Crypto has brought us is a gambling system rich people use to get richer, and poor people use to lose money; lots of brainlessly burned electricity; a neat way to collect ransomware payouts; and a bunch of Ponzi schemes.

                People like you like to feel smarter than anybody else, and I know you'll stay committed to cryptocurrency regardless of any reason. So I will drop out of this discussion here, as it's fruitless.

          • BoorishBears a month ago

            Government backing, our shared delusion, unicorn dust, take your pick: whatever it is, it's something stable enough that it doesn't experience quintuple digit deflation in a decade.

            • exe34 a month ago

              yes, runaway inflation is much better!

              • BoorishBears a month ago

                Your tone implies you're unironically trying to compare 3% inflation to 20,000% deflation. That'd be embarrassing.

                • logicchains a month ago

                  And you’re comparing annualised inflation to the deflation over a decade, which is equally embarrassing. Regardless, all other things being equal, anyone who has their own financial interests at heart would rather hold a currency that experienced 20,000% deflation since its creation than one that lost 99% of its value to inflation since its creation. Because the best predictor of future performance is past performance.

                  • BoorishBears a month ago

                    Did you just try to browbeat me for not breaking down the deflation over a time period... then compare 10 years of BTC movement to some arbitrary period of USD movement?

                    Then top it off by claiming someone would want a currency that deflates 20,000%?

                • exe34 a month ago

                  no, no, I love 10% inflation, it's really good for employers, they can charge customers more because of "inflation", and they don't have to pay me any more, because their costs have gone up so much because of... "inflation".

                • exe34 a month ago

                  there are plenty of places where they chop a couple of zeros off their currency every few years.

      • mathnmusic a month ago

        > you should not try to use the money itself as a way of limiting what people do

        Are you saying that anti-money-laundering rules should be scrapped?

        • herbst a month ago

          If they don’t work, are fully ineffective, and generally only catch small fish, I guess we could scrap them.

    • BoorishBears a month ago

      You’re being downvoted, but as someone running an AI generation site, I can’t even imagine the kind of bottom-of-the-barrel filth you’d attract without even the tiny deterrent of KYC.

      You wouldn't end up being "just another generation site with crypto", you'd become a magnet for everyone who's generating things they're scared of having their legal identity tied to.

      • nunobrito a month ago

        Please stop with FUD.

        No, I don't want to upload my passport and personal information to some site.

        No, I don't want to use Paypal or credit cards

        No, I'm not a criminal. I value my privacy, as many others do.

        • BoorishBears a month ago

          FUD implies I'm not certain of the outcome.

          I had to stop supporting logged-out users because 99% of the CSAM generated on the site came from people trying to skirt the basic login requirements.

          Sorry, but your privacy doesn’t trump my not wanting to run a child porn site, and the overwhelming majority of the public manages to live through having to log in and pay for things with traceable money.

          And before the conversation goes there: is it a perfect filter? No. But a 99% reduction (realistically near 100% reduction) in unwanted behavior is what I'd call very effective.

          • nunobrito a month ago

            Before we can continue this conversation, please submit your passport pictures to the hackernews team.

            Because otherwise hackernews will be filled with spam. We don't want hackernews to be running a child porn site and this will reduce 99% of that risk.

            Giving away your personal privacy isn't the only option and you know that.

        • ShayNehmad a month ago

          The moment you say "Please stop with FUD" I instantly assume that the reason you’re actually worried about FUD is that it’s directly linked to the value of your tokens.

  • saurik a month ago

    genai.orchid.com (Note: I am not just affiliated with this company, but am "in charge of technology" for it... however, I haven't actually used this demo we are building, except to understand how to better generalize it into our overall platform; I thereby cannot vouch for the "quality", though I will happily vouch for it not stealing your money or anything.)

imiric a month ago

Congrats on the launch!

Your service looks interesting, but I think you need to be more transparent about your infrastructure. To which "inference providers" do you proxy, and when? Who is hosting the GPU clusters?

Also, a privacy policy and ToS document are pretty important, even at this stage.

  • reissbaker a month ago

    Good callout! Currently:

    1. If Together.ai has the model, we proxy to them, since they're faster than us. We might switch to Fireworks for the Llama-3.1 models since they offer them at lower cost; in general that's why I didn't specify the inference providers, since we'll probably change and optimize that quite a bit. (Also would be interesting to try using Groq, since they're so fast.)

    2. If the model isn't hosted somewhere else, e.g. a lot of the Llama 3 finetunes, we run the models on our own GPU clusters hosted on Fly.io. This will probably change in the future as well, since some models would really benefit from NVLink (which Fly doesn't support currently).
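
    In pseudocode, the routing described above amounts to something like this (my paraphrase as a sketch; the table entries and names are hypothetical, not their actual code):

      # Hypothetical sketch of the routing logic described above.
      # Which always-on providers host which models (illustrative entries only).
      PROXIED = {
          "meta-llama/Meta-Llama-3.1-405B-Instruct": "fireworks",
          "meta-llama/Meta-Llama-3-70B-Instruct": "together",
      }

      def route(model: str) -> str:
          # Prefer a third-party host: it's already warm, so no boot time.
          if model in PROXIED:
              return PROXIED[model]
          # Otherwise boot the model under vLLM on our own rented GPUs.
          return "own-cluster"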

tspng a month ago

Congratulations on the launch!

Unfortunately, I got an error signing up: "You have reached your limit of 500 users. If you need more users, please use a Production instance."

  • ssl-3 a month ago

    More worryingly: It presents this message after it has collected my personal login credentials.

  • billycao a month ago

    This is now fixed, and signups should be working again! Had a misconfiguration with our auth provider, even though we were on the paid plan.

    Thanks for testing! :)

    - Billy

  • reissbaker a month ago

    Whoops, sorry, we were indeed using the Clerk free tier (which claimed that we had a cap of 10k users, but apparently not?). Fixing that now.

    • tspng a month ago

      AFAIK 500 is the hard limit for dev instances; 10k is the limit of the free tier for production environments.

      Btw, there is a stray semicolon on the signup page after the cardbox ;)

      • reissbaker a month ago

        Yes, we have sadly learned this the hard way... Also we can't actually migrate users automatically from dev to prod. T_T

        Signups are re-enabled, and I'm writing a script now to re-associate everyone's old chat history. Sorry for the launch day struggles!

  • jdthedisciple a month ago

    Same here, I guess they are using a free trial for the identity provider which is limited to 500 users.

langcss a month ago

A quick play and the UX is great. If you are allowing anyone to use 405B by API, then you beat Groq to that :) they have limited access to it.

Probably need to know more about you. Main concerns are "can this be used for production" and "will you be around in a year" and "will you likely have 429s if you get too popular?" and of course likely pricing.

  • billycao a month ago

    Thanks for the feedback!

    We're working on fleshing out ToS, privacy policy, and company specifics, but just to answer your first question, I'm Billy Cao, an ex-Google eng, and Matt Baker is ex-Airbnb, ex-Meta.

    Re: concerns, our infra will scale relatively well (several qps per model, probably), but we're still in the early stages of fleshing things out and getting feedback. :)

    Feel free to drop us a line at hi@glhf.chat if you wanted to chat specifics!

    - Billy

    • langcss a month ago

      Thanks Billy!

      Nothing specific... just wanted to get the feel. Out of interest are you bare metal or using AWS or something? Or both?

      • billycao a month ago

        Appreciate the feedback! We currently use fly.io as our cloud GPU provider, but we're actively investigating other providers due to various limitations (like NVLink support).

nunobrito a month ago

Registration limit: "You have reached your limit of 500 users. If you need more users, please use a Production instance."

  • billycao a month ago

    Whoops! We had a hiccup with our auth provider and have just pushed the fix.

    Sign up should be working again! Thanks for testing! :)

    - Billy

jdthedisciple a month ago

Probably a really dumb question and I missed it somehow, but who is paying for this if it's free?

  • reissbaker a month ago

    Currently my credit card is paying for it :P

    We weren't really sure how to price it, so we're using the beta period for now to figure out what mix of models people are using and trying to figure out reasonable pricing based on that, and also ironing out various bugs and sharp edges. Then we'll start charging for it; personally I'd prefer to have it be usage-based pricing rather than the monthly subscriptions that ChatGPT and Claude use, so that you can treat it more like API access for those companies and don't have to worry about message caps.

    • solarkraft a month ago

      I’m a bit curious: how much are you investing, and how set are you on making it profitable (vs. just a cool playground)?

      It currently reads to me like you’re just having fun with it at the moment.

      • reissbaker a month ago

        It's relatively cheap to run at the moment, so I haven't had to invest too much! We're serious about making it work, though, which includes getting it to be profitable.

Tepix a month ago

Looks promising! However, no privacy policy is a total show stopper.

  • billycao a month ago

    2nd dev Billy here.

    Great point. Right now we don't log or store any chat messages for the API (only what models people are choosing to run). We do store messages for the web UI chat history, and we only share them with inference providers (currently together.ai) per request for popular models, but I know some hand-waved details from an HN comment don't suffice.

    We'll get on that ASAP. :)

tcsenpai a month ago

That's really amazing. How can you afford to run it for free? Tbh I'd more willingly pay for this than Perplexity Pro.

  • billycao a month ago

    Thanks a bunch! For launch/beta, we're paying for the GPU time ourselves. :) Quoting Matt below:

    "We weren't really sure how to price it, so we're using the beta period for now to figure out what mix of models people are using and trying to figure out reasonable pricing based on that, and also ironing out various bugs and sharp edges. Then we'll start charging for it; personally I'd prefer to have it be usage-based pricing rather than the monthly subscriptions that ChatGPT and Claude use, so that you can treat it more like API access for those companies and don't have to worry about message caps."

    Open to feedback here! :)

    - Billy

  • tcsenpai a month ago

    Thanks for the info! I think implementing both a usage-based tier (for example, on API access) and a flat fee (say, for the web app) could be a way to go, to increase flexibility, but that's a personal opinion. I like Perplexity's model, which is like that: I can use the web app for quick daily searches and the API for more programmatic access with more granular price control. Of course, it's up to you; anyway, I find this to be a great way to do it.

sweca a month ago

If you guys can beat Together AI pricing on Llama 3.1 405B, you will be my favourite AI provider

  • campers a month ago

    I noticed Fireworks.ai is only $3/million tokens compared to $5 for together.ai for Llama 3.1 405B.

    And in Google Vertex it's free at the moment!

    • sweca a month ago

      Thanks for the recommendation! Fireworks pricing still looks amazing: $3 for input AND output, vs. $3 and $15 for the comparable Claude 3.5 Sonnet from Anthropic.

SuperHeavy256 a month ago

I tried loading Mistral's Mamba Codestral. Didn't work. Got the error: "Application error: a client-side exception has occurred (see the browser console for more information)."

renewiltord a month ago

It worked very well. Llama-3.1-405B is really impressive! But then the whole thing stopped working, as more people started using it, I suppose. Can't wait for the paid offering.

codetrotter a month ago

I was chatting with the 405B model and got this after a few messages:

Application error: a server-side exception has occurred (see the server logs for more information).

Digest: 1138364229

android521 a month ago

Do you have API access? Most people need to build their own apps with an API and will choose you if you have cheaper hosting/API prices.

  • billycao a month ago

    Hey there! I'm Billy, the other dev working on glhf.chat :)

    We do have API support! We expose an OpenAI-compatible API. You can see details when logged in at https://glhf.chat/users/settings/api

    Just like our web UI it supports feeding in any huggingface user/repo.

    (Also available via the user menu)

    Let us know if you have any questions/feedback!
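
    For anyone curious what that looks like: with the official OpenAI Python client, it should be roughly the following (a sketch on my part; the base URL and model-naming scheme are assumptions, so check the API settings page for the actual values):

      from openai import OpenAI

      # Sketch only: base_url and the model-name format are assumptions;
      # see https://glhf.chat/users/settings/api for the real values.
      client = OpenAI(
          api_key="YOUR_GLHF_API_KEY",
          base_url="https://glhf.chat/api/openai/v1",  # assumed endpoint
      )
      response = client.chat.completions.create(
          model="hf:meta-llama/Meta-Llama-3.1-405B-Instruct",  # any HF user/repo
          messages=[{"role": "user", "content": "Hello!"}],
      )
      print(response.choices[0].message.content)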

GTP a month ago

I just tried opening the link, and got a 500 error. Too many connections from hn?

  • billycao a month ago

    Might have been a transient error as we were deploying some auth provider fixes. Should be good to go now! :) - Billy

akadeb a month ago

Could you clarify what running models multi-tenant means?

  • billycao a month ago

    Borrowing Matt's words from our reddit thread:

    It means that we can spin up a single model server and use it for multiple people, effectively splitting the cost. Whereas if you try to rent the GPUs yourself on something like Runpod, you'll end up paying much more since you're the only person using the model.

    - Billy

  • swalsh a month ago

    vLLM is nice because you can serve multiple streams at once. I believe the technique is called continuous batching (built on vLLM's PagedAttention), but I'm working from memory so I might be wrong. When scaling up inference, this feature is super key: it's the difference between 70 tokens a second and 500 tokens a second on the same compute.

    My assumption is that this ability is what they're calling multi-tenant.
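
    As an illustration, vLLM's offline API makes the batching visible (a minimal sketch using vLLM's public Python API; the model choice is arbitrary):

      from vllm import LLM, SamplingParams

      # Minimal vLLM sketch: one model server handles many prompts at once,
      # which is what makes splitting a GPU across users ("multi-tenant") cheap.
      llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
      params = SamplingParams(temperature=0.7, max_tokens=128)
      prompts = [f"Write a haiku about GPU #{i}." for i in range(32)]

      # vLLM schedules all 32 requests together (continuous batching)
      # rather than generating them one after another.
      for output in llm.generate(prompts, params):
          print(output.outputs[0].text)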

jcuenod a month ago

Seems like setting max_tokens crashes your endpoint.

tcsenpai a month ago

Giving 500 error here, is it me?

  • tcsenpai a month ago

    Nvm, it self-healed

lxe a month ago

Why AWQ instead of exl2?

  • reissbaker a month ago

    Unfortunately vLLM (the open source inference engine we're built on) doesn't support exl2. That being said I want to add additional backends so we can support exl2 as well!

tardismechanic a month ago

In case you were wondering, it supposedly stands for Good Luck Have Fun.

It's in the title, which is not immediately visible on mobile.

  • imafish a month ago

    Any gamer would know this by heart ;)

codetrotter a month ago

Aaaand it's gone :(

I only get 503 Error response now when I try to load the page.

kcb a month ago

1) Give access to expensive hardware for free

2) ...

3) Profit

ec109685 a month ago

Pretty bait and switchy to let the person sign up and then demand an invite code.

  • reissbaker a month ago

    Sorry, the invite code was at the top of the post! (It was "405B" without the quotes.) I just disabled the invite code system entirely, since you weren't the only one who missed it; anyone should be able to sign up now. Definitely didn't mean to bait and switch, we want you to use it!

  • brianjking a month ago

    In fairness, the invite code is at the top of this post.

    What concerns me more is the lack of any privacy policy or anything about training on inputs/outputs and much more.

    The lack of a system prompt seems weird too.

    How does this compare to OpenRouter.ai?

    https://openrouter.ai/models/meta-llama/llama-3.1-405b-instr...

    • reissbaker a month ago

      If you want a custom system prompt, try our API! (Click on your profile picture in the corner, then "API settings").

      Yeah we should add a privacy policy / ToS: good point.

      • latchkey a month ago

        and a contact us link.

        (let's talk about access to GPUs with a lot more vram and no nvlink issues... mi300x, email in my profile)

        • reissbaker a month ago

          There is a contact us page! Linked from the profile dropdown. I'll reach out to you though.

          • latchkey a month ago

            Ah. I haven't logged in. =)

shmatt a month ago

attempt 1 with llama 3.1 405b

what version llama are you

I am an instance of Llama 2, an artificial intelligence model developed by Meta AI.

how many parameters

I am a Llama 2 model with 7 billion parameters.

Attempt 2

which llama are you exactly

I am a variant of the LLaMA model, specifically LLaMA 1.5B.

Here's a brief overview of the different LLaMA models:

LLaMA 7B: This is the largest and most powerful LLaMA model, with 7 billion parameters. It's the most advanced and capable model in the LLaMA family.

LLaMA 3B: This model has 3 billion parameters and is a smaller but still very capable version of the LLaMA model.

LLaMA 1.5B: This is the model I'm based on, with 1.5 billion parameters. It's a smaller and more efficient version of the LLaMA model, but still very knowledgeable and capable.

LLaMA 650M: This is the smallest LLaMA model, with 650 million parameters. It's a more lightweight and efficient version of the model, but still useful for many tasks.

is 3.1 this bad at hallucinations? or are you just serving something else

This just keeps on giving

when does your training data end

" My training data is a massive corpus of text that was available to me up to a certain point in time. My training data is cut off at December 2021."

  • Onawa a month ago

    People continually asking the models what their abilities and knowledge are demonstrates a disconnect between the capabilities of the raw models and what people expect them to be capable of. Many of the answers that models give regarding their capabilities actually come from the system prompts provided by the chat-interface providers.

    • Zambyte a month ago

      I asked mistral-nemo:12b-instruct-2407-q8_0

      > Who are you?

      On my own machine, without a system prompt, and it said:

      > I am a language model developed through a collaboration between Mistral AI and NVIDIA.

      Oftentimes this information will be included in the instruct fine-tuning.

      The same from qwen2:7b-instruct-q8_0:

      > I am a large language model created by Alibaba Cloud. I'm called Qwen.

      All I did was change the model and re-run the prompt in Open WebUI, so if there was a system prompt, it would have been used for both.

  • mattnewton a month ago

    You need a system prompt. The raw model isn't going to have seen much data about itself in the training corpus: it hadn't been released yet when the corpus was assembled!

    It'd be like asking a child who has never been spoken to what their name is. How would the child know until someone told them?

  • reissbaker a month ago

    Hmm, yeah, models typically don't know much about themselves and hallucinate when asked. We can probably fix this with a system prompt; it's the real 3.1-405B, though. Try asking it for some code and comparing the output with llama-3-8b!
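
    Something as simple as prepending a system message would probably do it (a sketch of the standard chat-completions shape, not their actual prompt; the wording is hypothetical):

      # Hypothetical system prompt to curb identity hallucinations.
      messages = [
          {
              "role": "system",
              "content": (
                  "You are Llama 3.1 405B Instruct, served via glhf.chat. "
                  "If asked about yourself, state this rather than guessing."
              ),
          },
          {"role": "user", "content": "What version of Llama are you?"},
      ]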

    • reaperman a month ago

      What quantization is used for it? I see you can run models that use up to approximately 640GB but the full non-quantized 405B uses ~810GB of VRAM (native weights for that model are fp16 before quantization).

      • reissbaker a month ago

        * Models we run on our own infra (the ones with boot times) always run at whatever quantization the repo uses — typically unquantized, although if you paste in a link to a 4-bit AWQ repo of course that's what we'll run.

        * For always-on proxied models, we're dependent on the inference provider. Typically we use Together AI, but for Llama 3.1 we found some show-stopping context length related bugs that caused us to switch to Fireworks.

        Since we proxy 405B to Fireworks, that one runs at FP8 (that's what Fireworks uses). This may change in the future and in general we prefer to match the underlying repo quantization when possible; proxying is a bit of a hack in order to keep more models warmed up and reduce boot times (and to run 405B for now, although we're working on getting larger clusters and supporting distributed multi-node inference). As we grow, hopefully we'll be able to reduce our dependency on third party proxies.

        I'm also curious to try running models on AMD, since that would allow for much larger single-node models — for example, an MI300X has 192GB per card, so 8x of those would run 405B at fp16. Obviously there are some support issues with AMD so TBD how well it'll work but it's an interesting option.