Launch HN: Langfuse (YC W23) – OSS Tracing and Workflows to Improve LLM Apps
Hey HN, we are Marc, Clemens, and Max – the founders of Langfuse. Langfuse leverages traces, evaluations, prompt management, and metrics to help developers debug and improve LLM applications. Here is a full walkthrough: https://www.youtube.com/watch?v=2E8iTvGo9Hs
With Langfuse, you can instrument your app and start ingesting traces, thereby tracking LLM calls and other relevant logic in your app such as retrieval, embedding, or agent actions. Langfuse then helps to analyze traces and use features such as evaluations or prompt management to make improvements to your app.
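For illustration, a minimal Python sketch of what that instrumentation can look like with the SDK's @observe decorator and the OpenAI drop-in integration (the retrieval helper, model name, and answer logic are placeholders, not part of Langfuse):

    # Minimal sketch: requires `pip install langfuse openai` and the environment
    # variables LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST, OPENAI_API_KEY.
    from langfuse.decorators import observe
    from langfuse.openai import openai  # traced drop-in replacement for the OpenAI SDK

    @observe()  # creates a nested span for the retrieval step
    def retrieve_context(question: str) -> str:
        # hypothetical retrieval step; swap in your vector store lookup
        return "Langfuse is an open-source LLM engineering platform."

    @observe()  # creates the root trace for this request
    def answer(question: str) -> str:
        context = retrieve_context(question)
        completion = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": f"Answer using this context: {context}"},
                {"role": "user", "content": question},
            ],
        )
        return completion.choices[0].message.content

    print(answer("What is Langfuse?"))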
You can sign up to try Langfuse Cloud (https://cloud.langfuse.com/ – we have a generous free tier) or self-host Langfuse (https://langfuse.com/self-hosting) within a couple of minutes.
In the 15 months since our “Show HN” (https://news.ycombinator.com/item?id=37310070), thousands of teams adopted the project (including teams like KhanAcademy, Twilio, and Samsara) and we hit all of the scaling limits that we anticipated in the original Show HN thread. On our v1/v2 setup, we frequently exhausted IOPS on Postgres and had our Node.js container grind to a halt during tokenizations. Since then, we migrated our Cloud infrastructure from Vercel/Supabase to Porter and then to AWS & ClickHouse. Last week, we put the finishing touches on the Langfuse v3.0.0 release (https://github.com/langfuse/langfuse/releases/tag/v3.0.0), which unlocks the major scalability improvements we have made over the past half year; we are happy to share them with the OSS ecosystem today.
Langfuse v3 addresses three challenges we encountered as an LLM observability platform: a) handling high ingestion throughput with large events (long strings, multimodal images/audio/video), b) providing fast analytical, table, and single-item reads across the product, and c) serving prompts quickly and reliably in the critical path of users’ applications. Langfuse is used by thousands of active self-hosting deployments, so at every point we needed to prioritize stability, fully automated migrations/upgrades, and infrastructure components that self-hosters can deploy freely on any cloud vendor.
The v3 release adds a ClickHouse database next to Postgres, blob storage for events, and introduces a worker container as well as queues and caches (Redis) for data ingestion.
The Langfuse SDKs were originally written to send incremental updates for a single trace to our backend, which then upserted the tracing data in Postgres. Handling these updates while guaranteeing backwards compatibility with older SDK versions was a challenge. Our new ingestion pipeline writes all events to S3 and sends a reference to the file via Redis to our worker container. From there, we read all events with the same ID (including all previously ingested ones) and merge them into a final event. We insert the merged row into ClickHouse, which automatically replaces the existing data for the same ID. Re-merging all event updates lets us keep a high-throughput pipeline by converting updates into insert-only records.
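As a toy illustration of the idea (not the actual worker code), merging partial updates for the same ID into one insert-only record might look like this; the field names are made up for the example:

    from collections import defaultdict

    # Hypothetical partial updates for the same observation ID, arriving over time.
    events = [
        {"id": "obs-1", "name": "generation", "input": "Hi"},
        {"id": "obs-1", "output": "Hello!", "usage": {"total_tokens": 12}},
        {"id": "obs-1", "level": "DEFAULT", "end_time": "2024-12-09T12:00:01Z"},
    ]

    def merge_events(events: list[dict]) -> list[dict]:
        """Merge all updates per ID into one final record (last write wins per field)."""
        merged = defaultdict(dict)
        for event in events:
            merged[event["id"]].update(event)
        return list(merged.values())

    # Each merged record is inserted as a fresh row; a ClickHouse engine such as
    # ReplacingMergeTree then keeps only the latest row per ID.
    print(merge_events(events))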
We ran many iterations to optimize our sorting keys in ClickHouse and use skip indexes efficiently, and we rewrote almost all of our queries and API endpoints to make optimal use of the schema. Using a specialized analytical database required a more database-centric application design than a Swiss-army-knife database like Postgres.
The new infrastructure delivers dramatic performance gains: dashboards now respond within 400ms (95th percentile) instead of timing out on large projects and lookback windows, and tables load up to 90% faster, displaying data within 800ms even for the largest projects.
Finally, to serve prompts from prompt management with low latency and high availability, we use caches heavily and decoupled our infrastructure. For sensitive paths, we use dedicated deployments to avoid “noisy neighbors” within the same server. We also improved client-side caching in our SDKs: they now prefetch prompts and revalidate them in the background, resulting in zero added latency when retrieving a prompt at runtime.
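A minimal sketch of what that looks like from the Python SDK (the prompt name and variable are illustrative; the optional cache_ttl_seconds argument follows the SDK docs as I understand them):

    from langfuse import Langfuse

    # Reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST from the environment.
    langfuse = Langfuse()

    # Served from the local cache after the first fetch; the SDK revalidates the
    # cached prompt in the background, so this stays off the request's critical path.
    prompt = langfuse.get_prompt("movie-critic", cache_ttl_seconds=60)

    # Fill in the variables defined for this prompt in prompt management.
    print(prompt.compile(movie="Dune 2"))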
If you have any questions or feedback, please join us in this HN thread, or in the future on our Discord and GitHub Discussions. While Langfuse v3 is scalable, we tried hard to keep it easy to get started with Langfuse and self-host it in your own infrastructure (https://langfuse.com/self-hosting).
PS: Here (https://langfuse.com/blog/2024-12-langfuse-v3-infrastructure...) is a more in-depth blog post on how we built Langfuse V3.
PPS: if you find these problems exciting, we are hiring (https://langfuse.com/join-us) in Berlin!
(unsolicited review) we've been happy adopters of LangFuse at AINews (https://smol.ai/news). i've been tracking the llm ops landscape (https://www.latent.space/p/braintrust) for a while and it's very nice to have an open source solution that is so comprehensive and intuitive!
reflections/thoughts on where this field goes next:
1. i wonder if there are new ops solutions for the realtime apis popping up
2. retries for instructor like structured outputs mess up the traces, i wonder if they can be tracked and collapsible
3. chatgpt canvas like "drafting" workflows are on the rise (https://www.latent.space/p/inference-fast-and-slow) and again its noisy to see in a chat flow
4. how often do people actually use the feedback tagging and then subsequently finetuning? i always feel guilty that i dont do it yet and wonder when and where i should.
appreciate your constructive feedback!
> i wonder if there are new ops solutions for the realtime apis popping up
This is something we have spent quite some time on already, both on designs internally and talking to teams using Langfuse with realtime applications. IMO the usage patterns are still developing and the data capturing/visualization needs across teams are not aligned. What matters: (1) capture streams, (2) for non-text provide timestamped transcripts/labels, (3) capture the difference between user-time and api-level-time (e.g. when catching up on a stream after having categorized the input first).
We are excited to build support for this, if you or others have ideas or a wishlist, please add them to this thread: https://github.com/orgs/langfuse/discussions/4757
> retries for instructor like structured outputs mess up the traces, i wonder if they can be tracked and collapsible
Great feedback. Being able to retroactively downrank llm calls to be `debug` level in order to collapse/hide them by default would be interesting. Added thread for this here: https://github.com/orgs/langfuse/discussions/4758
> chatgpt canvas like "drafting" workflows are on the rise (https://www.latent.space/p/inference-fast-and-slow) and again its noisy to see in a chat flow
Can you share an example trace for this or open a thread on github? Would love to understand this in more detail as I have seen different trace-representations of it -- the best yet was a _git diff_ on a wrapper span for every iteration.
> how often do people actually use the feedback tagging and then subsequently finetuning? i always feel guilty that i dont do it yet and wonder when and where i should.
Have not seen finetuning based on user-feedback a lot as the feedback can be noisy and low in frequency (unless there is a very clear feedback loop built into the product). More common workflow that I have seen: identify new problems via user feedback -> review them manually -> create llm-as-a-judge or other automated evals for this problem -> select "good" examples for fine-tuning based on a mix of different evals that currently run on production data -> sanitize the dataset (e.g. remove PII).
Finetuning has been more popular for structured output and SQL generation (clear feedback loop / retries at run-time if the output does not work). Many teams fine-tune on all outputs that passed this initial run-time gate (model distillation) without further quality controls on the training dataset. They usually then run evals on a test dataset to verify whether the fine-tuned model hits their quality bar.
im too lazy to send traces for now haha. maybe in future when it REALLY bothers me.
good luck keep going.
Thread is filled with positive reviews.. Little odd
Felt the same way. While Langfuse could be great, it oddly looks like solicited “review”ish comments from existing Langfuse users. Just gotta be careful.
> Make sure your friends don't post booster comments. That's not allowed on HN. Our readers have a nose for this, and will sniff them out and flame you. That will damage your reputation—and ours—and we may have to bury your thread.
https://news.ycombinator.com/yli.html
[flagged]
This is actually one of the more interesting LLM observability platforms I've seen. Beyond addressing scaling issues, where do you see yourself going next?
Positioning/roadmap differs between the different projects in the space.
We summarized what we strongly believe in here: https://langfuse.com/why
TLDR: open APIs, self-hostable, LLM/cloud/model/framework-agnostic, API-first, unopinionated building blocks for sophisticated teams, simple yet scalable instrumentation that is incrementally adoptable
Regarding roadmap, this is the near-term view: https://langfuse.com/roadmap
We work closely with the community, and the roadmap can change frequently based on feedback. GitHub Discussions is very active, so feel free to join the conversation if you want to suggest or contribute a feature: https://langfuse.com/ideas
What are other potential platforms?
This is a good long list of projects, although it is not narrowly scoped to tracing/evals/prompt-management: https://github.com/tensorchord/Awesome-LLMOps?tab=readme-ov-...
One missing in the list below is Agenta (https://github.com/agenta-ai/agenta).
We're OSS and OTel-compliant, with a stronger focus on evals and on enabling collaboration between subject matter experts and devs.
Bunch of them: Langsmith, Lunary, Phoenix Arize, Portkey, Datadog, and Helicone.
We also picked Langfuse - more details here: https://www.nonbios.ai/post/the-nonbios-llm-observability-pi...
Thanks, this post was insightful. I laughed at the reason why you rejected Arize Phoenix; I had similar thoughts while going through their site! =)
> "Another notable feature of Langfuse is the use of a model as a judge ... this is not enabled in the free version/self-hosted version"
I think you can add LLM-as-judge to the self-hosted version of Langfuse by defining your own evaluation pipeline: https://langfuse.com/docs/scores/external-evaluation-pipelin...
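Roughly, such an external pipeline could look like the sketch below (method names follow the Python SDK as I understand it; the judge function is a placeholder for your own LLM-as-a-judge call):

    from langfuse import Langfuse

    langfuse = Langfuse()  # points at your self-hosted instance via LANGFUSE_HOST

    def judge(output: str) -> float:
        # Placeholder: call your own LLM-as-a-judge here and map its verdict to 0..1.
        return 1.0 if output else 0.0

    # Fetch recent traces from the Langfuse API ...
    traces = langfuse.fetch_traces(limit=50).data

    # ... score them and write the scores back so they show up in the UI and tables.
    for trace in traces:
        langfuse.score(
            trace_id=trace.id,
            name="llm-as-a-judge",
            value=judge(str(trace.output)),
        )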
Thanks for the pointer !
We are actually toying with building out a prompt evaluation platform and were considering extending langfuse. Maybe just use this instead.
"Langsmith appeared popular, but we had encountered challenges with Langchain from the same company, finding it overly complex for previous NonBioS tooling. We rewrote our systems to remove dependencies on Langchain and chose not to proceed with Langsmith as it seemed strongly coupled with Langchain."
I've never really used Langchain, but set up Langsmith with my own project quite quickly. It's very similar to setting up Langfuse, activated with a wrapper around the OpenAI library. (Though I haven't looked into the metadata and tracing yet.)
Functionally the two seem very similar. I'm looking at both and am having a hard time figuring out differences.
Thanks for sharing your blogpost. We had a similar journey. I installed and tried both Langfuse and Phoenix and ended up choosing Langfuse due to some versioning conflicts on the python dependency. I’m curious if your thoughts change after V3? I also liked that it only depended on Postgres but the scalable version requires other dependencies.
The thing I liked about Phoenix is that it uses OpenTelemetry. In the end we’re building our Agents SDK in a way that the observability platform can be swapped (https://github.com/zetaalphavector/platform/tree/master/agen...) and the abstraction is OpenTelemetry-inspired.
As you mentioned, this was a significant trade-off. We faced two choices:
(1) Stick with a single Docker container and Postgres. This option is simple to self-host, operate, and iterate on, but it suffers from poor performance at scale, especially for analytical queries that become crucial as a project grows. Additionally, as more features emerged, we needed a queue and benefited from caching and asynchronous processing, which required splitting into a second container and adding Redis. These features would have been blocked had we stayed with this setup.
(2) Switch to a scalable setup with a robust infrastructure that enables us to develop features that interest the majority of our community. We have chosen this path and prioritized templates and Helm charts to simplify self-hosting. Please let us know if you have any questions or feedback as we transition to v3. We aim to make this process as easy as possible.
Regarding OTel, we are considering adding a collector to Langfuse as the OTel semantics are now developing well. The needs of the Langfuse community are evolving rapidly, and starting with our own instrumentation allowed us to move quickly while the semantic conventions were still immature. We are tracking this here and would greatly appreciate your feedback, upvotes, or any comments you have on this thread: https://github.com/orgs/langfuse/discussions/2509
So we are still on V2.7 - works pretty well for us. Haven't tried V3 yet, and not looking to upgrade. I think the next big feature set we are looking for is a prompt evaluation system.
But we are coming around to the view that it is a big enough problem to warrant a dedicated SaaS, rather than piggybacking on an observability SaaS. At NonBioS, we have very complex requirements - so we might just end up building it from the ground up.
We launched Laminar a couple of months ago, https://www.lmnr.ai. Extremely fast, great DX, and written in Rust. Definitely worth a look.
Congrats on the Launch!
apologies for hijacking your launch (congrats btw!)
thanks Marc :)
I'm a maintainer of Opik, an open source LLM evaluation and observability platform. We only launched a few months ago, but we're growing rapidly: https://github.com/comet-ml/opik
Congrats Marc! We've been using Langfuse for about 6 months for our LLMOps tooling. While its SDKs are limited to Python and TypeScript, their OpenAPI specification is pretty easy to implement in any language.
The team behind it is amazing, and their product being OSS is one of the reasons we chose it. But it just keeps getting better.
We're incidentally only using part of the product because we've already implemented most of these new features (prompt caching, execution, etc.) in our app. But with the API you can decide what parts are core to your business logic and outsource the parts you don't want to deal with to Langfuse.
I appreciate that it's not an opinionated product.
Thanks for the feedback.
Being unopinionated and API-first has been a core design decision. We want to build the building blocks that everyone needs while acknowledging that most Langfuse users are very sophisticated teams that have a clear idea of what they want to achieve. Over time we will build more abstractions for common workflows to make it easier to get started but new features will always start API-first.
More on this here: https://langfuse.com/why
A happy Langfuse customer here!
We've been building an agent platform and some of our customers wanted a way to exfil OTEL traces to their own setup. Initially we tried building our own, but then realised Langfuse does exactly what we needed. So we offered it as a first-class integration (and started using it internally).
Great product, and hope you guys continue to improve it!
Thanks! Really enjoyed working with you maintainers of other projects to help them offer more native LLM observability and evaluation to their users/communities. There is a lot that goes into making the observability/eval part scalable/useful and requirements change on a weekly basis with new advancements. Same applies to other projects and it makes a lot of sense to integrate.
Overview of community integrations: https://langfuse.com/docs/integrations/overview
Packages that depend on Langfuse: https://langfuse.com/faq/all/packages-depending-on-langfuse
Very timely post/update, was just checking out your product. IMO it is one of the best solutions I've looked at. Appreciate your dedication to self hosting, for us it's not really practical to have traces with potentially sensitive customer data sitting around on some external company's server somewhere (no offense).
Thank you for the kind words! Let us know if you have any questions or feedback regarding the self-hosting documentation and experience. We collaborate with many teams that have diverse security needs, including HIPAA, PCI, and on-premises deployments on bare metal without internet access.
You guys just saved me a lot of trouble. Amazing work everyone wow.
Seems like Langfuse is becoming the standard. Whenever I talk to other builders, they're using Langfuse.
Thank you! If these builders have some feedback to share, ask them to reach out to us :)
Been using it. Happy customer. It gave me sanity into otherwise very complex LLM infrastructure. We spend 60k+ every month on LLM calls, so having the backbone to debug when things go haywire has helped a lot.
I've been using self hosted langfuse via litellm in a Jupyter notebook for a few weeks for some synthetic data experiments. It's been a nice/useful tool.
I've liked having the traces and scores in a unified browser based UI, it made sanity checking experiments way easier than doing the same thing inside the notebook.
The trace/generation retrieval API was brutally slow for bulk scanning operations, so I bypassed it and just queried the db directly. But that is the beauty of open source/self hosted code.
Thanks for the feedback, glad that you find Langfuse useful!
Can you create an issue with more details on the API performance problems? We monitor strict SLOs on the public API for Langfuse Cloud and are not aware of any ongoing issues, would love to learn more.
Awesome improvements. How does this compare to Braintrust? I've played with it a bit and we're gearing up to implement a solution during the Christmas lull.
We use various LLMs as a core part of our app but I'm looking for ways to more quickly iterate on our prompts, test different LLM outputs against each other, etc. ideally while minimizing deploys. Would Langfuse serve that purpose?
Congratulations @Marc. Been using this product for 5ish months, love the iteration and how the team reacts to feedback. The prompt versioning has been immensely valuable!
Thanks AJ, feedback on GitHub/Discord (like yours) has been very helpful to evolve prompt management from a quick addition to the core platform into one of the most-used features, for which we then actually needed to change a lot of infrastructure to make it reliable and fast (see the blog post linked in the original post).
I promise this isn’t astroturfing ;)
I happened to have been triaging LLM observability, dataset, and eval solutions yesterday at the day job, and congratulations, Langfuse was the second solution that I tried, and simple enough to get set up locally with my existing stack for me to stop looking (ye olde time constraints, and I know good-enough when I see it!)
Thanks for your and your team’s work.
thank you, that is genuinely nice to hear and motivating for our team.
we're available if you ever run into any issues (github, email etc.)
Congrats on the launch :) happy users @ Samsara.
Key to our LLM customer feedback flywheel and dataset building.
Thank you! Working with your team has been great. I love seeing you ship LLM-powered features and appreciate the feedback you have shared along the way.
Been using Langfuse OSS for almost 15 months from the start. By far the best solution. No dark patterns found in other projects such as Portkey.
All core features are fully open-source and identical to those in Langfuse Cloud, with no limitations on capabilities or scalability (e.g. all v3 infrastructure changes).
We also offer some optional commercial add-on features that can help iterate faster or support very large teams using Langfuse. However, these features are entirely optional and we do our best to be transparent about this across our docs.
Have been a very happy Langfuse user since March - dead simple to use and has helped us a lot with LLM observability and debugging - great work guys :))
thank you! if you have any ideas for improvements after having used Langfuse for a while, please contribute them via github discussions: https://langfuse.com/ideas
In this example:
Why do I need to import openai from langfuse?

This is an optional instrumentation of the OpenAI SDK which simplifies getting started and tracks token counts, model parameters, and streaming latencies.
Langfuse is not in the critical path, this just helps with instrumentation.
You can use the Langfuse Python SDK / Decorator to track any LLM (with some instrumentation code) or use one of the framework integrations.
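As a sketch of what that manual instrumentation can look like with the decorator (the model name, usage fields, and the LLM call itself are placeholders; exact parameter names should be checked against the SDK docs):

    from langfuse.decorators import observe, langfuse_context

    @observe(as_type="generation")  # record this function as an LLM generation
    def call_my_llm(prompt: str) -> str:
        # Call any LLM provider here; this stub stands in for the real client call.
        output = "stub response"
        langfuse_context.update_current_observation(
            model="my-model",                   # illustrative model name
            input=prompt,
            output=output,
            usage={"input": 10, "output": 5},   # token counts, if available
        )
        return output

    call_my_llm("Hello")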
Here is a fully-featured example using the Amazon Bedrock SDK: https://langfuse.com/docs/integrations/amazon-bedrock
Nice work, but sorry, I don’t feel comfortable either proxying my LLM calls through a 3rd party (unless the 3rd party is an LLM gateway like LiteLLM or Arch) or storing my prompts in a SaaS. For tracing, I use OTEL libraries, which are more than sufficient for my use case.
If you use an OSS Gateway already, some (e.g. LiteLLM) can natively forward logs to Langfuse: https://docs.litellm.ai/docs/proxy/logging#langfuse
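And if you use the LiteLLM Python SDK directly (rather than the proxy), forwarding is, to my understanding, a one-line callback; a sketch with placeholder credentials:

    import os
    import litellm

    # Placeholder credentials; the Langfuse callback reads these from the environment.
    os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
    os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
    os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"  # or your self-hosted URL

    litellm.success_callback = ["langfuse"]  # send successful calls to Langfuse

    response = litellm.completion(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(response.choices[0].message.content)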
We are looking into adding an Otel Collector as OTel-semantics are maturing around LLMs. For now many features that are key to LLMOPs are difficult to make work with OTel instrumentation as the space is moving quickly. Main thread on this is here: https://github.com/orgs/langfuse/discussions/2509
Looks awesome! Been using it for over a year now and it's a great product :) The new improvements seem exciting.
Are the Traces OpenTelemetry Compatible?
The Langfuse data model is closely inspired by OpenTelemetry and we plan to add a Collector soonish. Until now, the OTel-semantics for LLMs have not been very stable and exhaustive while LLM capabilities are changing frequently (think prompt caching, realtime, multi-modal).
We are tracking this here including potential tradeoffs: https://github.com/orgs/langfuse/discussions/2509
great to see how you guys worked with the community on discord over the last year to build Langfuse
Thanks! IMO, Discord is good, but GitHub Discussions is the better option for building a growing open-source community. It is indexed and makes it easier to revisit conversations weeks later. Currently we use both but have a strong preference for GitHub Discussions.
Great work! Easy to integrate :)
great product & great team, kudos & congrats! :)
Great work, guys!
great product, so easy to use. love it.
[flagged]
It would be good if you did not spam your website in every single one of your comments.
> YC > OSS
Nice try