Launch HN: Flower (YC W23) – Train AI models on distributed or sensitive data

180 points by niclane7 a year ago

Hey HN - we're Daniel, Taner, and Nic, and we're building Flower (https://flower.dev/), an open-source framework for training AI on distributed data. We move the model to the data instead of moving the data to the model. This enables regulatory compliance (e.g. HIPAA) and ML use cases that are otherwise impossible. Our GitHub is at https://github.com/adap/flower, and we have a tutorial here: https://flower.dev/docs/tutorial/Flower-0-What-is-FL.html.

Flower lets you train ML models on data that is distributed across many user devices or “silos” (separate data sources) without having to move the data. This approach is called federated learning.

A silo can be anything from a single user device to the data of an entire organization. For example, your smartphone keyboard suggestions and auto-corrections can be driven by a personalized ML model learned from your own private keyboard data, as well as data from other smartphone users, without the data being transferred from anyone’s device.

Most of the famous AI breakthroughs—from ChatGPT and Google Translate to DALL·E and Stable Diffusion—were trained with public data from the web. When the data is all public, you can collect it in a central place for training. This “move the data to the computation” approach fails when the data is sensitive or distributed across organizational silos and user devices.

Many important use cases are affected by this limitation:

* Generative AI: Many scenarios require sensitive data that users or organizations are reluctant to upload to the cloud. For example, users might want to put themselves and friends into AI-generated images, but they don't want to upload and share all their photos.

* Healthcare: We could potentially train cancer detection models better than any doctor, but no single organization has enough data.

* Finance: Preventing financial fraud is hard because individual banks are subject to data regulations, and in isolation, they don't have enough fraud cases to train good models.

* Automotive: Autonomous driving would be awesome, but individual car makers struggle to gather the data to cover the long tail of possible edge cases.

* Personal computing: Users don't want certain kinds of data to be stored in the cloud, hence the recent success of privacy-enhancing alternatives like the Signal messenger or the Brave browser. Federated methods open the door to using sensitive data from personal devices while maintaining user privacy.

* Foundation models: These get better with more data, and more diverse data, to train them on. But again, most data is sensitive and thus can't be incorporated, even though these models continue to grow bigger and need more information.

Each of us has worked on ML projects in various settings (e.g., corporate environments, open-source projects, research labs). We’ve worked on AI use cases for companies like Samsung, Microsoft, Porsche, and Mercedes-Benz. One of our biggest challenges was getting the data to train AI while remaining compliant with regulations or company policies. Sometimes this was due to legal or organizational restrictions; other times, it was the difficulty of physically moving large quantities of data, or natural concerns over user privacy. We realized issues of this kind were making it too difficult for many ML projects to get off the ground, especially in domains like healthcare and finance.

Federated learning offers an alternative — it doesn't require moving data in order to train models on it, and so has the potential to overcome many barriers for ML projects.

In early 2020, we began developing the open-source Flower framework to simplify federated learning and make it user-friendly. Last year, we experienced a surge in Flower's adoption among industry users, which led us to apply to YC. In the past, we funded our work through consulting projects, but looking ahead, we’re going to offer a managed version for enterprises and charge per deployment or federation. At the same time, we’ll continue to run Flower as an open-source project that everyone can continue to use and contribute to.

Federated learning can train AI models on distributed and sensitive data by moving the training to the data. The learning process collects only the resulting model updates, and the data stays where it is. Because the data never moves, we can train AI on sensitive data spread across organizational silos or user devices, improving models with data that could never be leveraged before.

Here’s how it works: (0) Initialize the global model parameters on the server; (1) Send the model parameters to a number of organizations/devices (client nodes); (2) Train the model locally on the data of each organization/device (client node); (3) Return the updated model parameters to the server; (4) On the server, aggregate the model updates (e.g., by averaging them) into a new global model; (5) Repeat steps 1 to 4 until the model converges.
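
To make the loop concrete, here is a minimal, illustrative sketch of that cycle in plain NumPy. This is not the Flower API; local_train is just a stand-in for real on-device training:

    import numpy as np

    def local_train(params, data):
        # Stand-in for real local training (e.g., a few epochs of SGD);
        # here we just nudge each parameter toward the local data mean.
        return [p + 0.1 * (data.mean() - p) for p in params]

    def fedavg(updates, num_examples):
        # (4) Average the client updates, weighted by local dataset size.
        total = sum(num_examples)
        return [
            sum(n * u[i] for n, u in zip(num_examples, updates)) / total
            for i in range(len(updates[0]))
        ]

    # (0) Initialize the global model parameters on the server.
    global_params = [np.zeros((2, 2)), np.zeros(2)]
    client_data = [np.random.rand(100), np.random.rand(50)]

    for _ in range(5):  # (5) Repeat until the model converges.
        # (1) Send parameters to client nodes; (2) each trains locally;
        # (3) each returns its updated parameters.
        updates = [local_train(global_params, d) for d in client_data]
        global_params = fedavg(updates, [len(d) for d in client_data])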

This, of course, is more challenging than centralized learning: we must move AI models to data silos or user devices, train locally, send updated models back, aggregate them, and repeat. Flower provides the open-source infrastructure to do this easily, and it also supports other privacy-enhancing technologies (PETs). It is compatible with PyTorch, TensorFlow, JAX, Hugging Face, Fastai, Weights & Biases, and the other tools regularly used in ML projects. The only dependency on the server side is NumPy, and even that can be dropped if necessary. Flower uses gRPC under the hood, so a basic client can easily be auto-generated, even for languages that are not supported today.
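
To give a feel for the API, here is a minimal client sketched against the NumPyClient interface from the Flower docs (exact signatures may differ across versions):

    import flwr as fl
    import numpy as np

    class MinimalClient(fl.client.NumPyClient):
        def get_parameters(self, config):
            # Return the current local model parameters as NumPy arrays.
            return [np.zeros(10)]

        def fit(self, parameters, config):
            # Train on local, private data here; return the updated
            # parameters, the number of local examples, and metrics.
            return parameters, 1, {}

        def evaluate(self, parameters, config):
            # Evaluate the global model on local data:
            # return loss, number of examples, and metrics.
            return 0.0, 1, {}

    fl.client.start_numpy_client(
        server_address="127.0.0.1:8080", client=MinimalClient()
    )

On the server side, fl.server.start_server(config=fl.server.ServerConfig(num_rounds=3)) is enough to run three rounds with the default FedAvg strategy.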

Flower is open-source (Apache 2.0 license) and can be run in all kinds of environments: on a personal workstation for development and simulation, on Google Colab, on a compute cluster for large-scale simulations or on a cluster of Raspberry Pi’s (or similar devices) to build research systems, or deployed on public cloud instances (AWS, Azure, GCP, others) or private on-prem hardware. We are happy to help users when deploying Flower systems and will soon make this even easier through our managed cloud service.

You can find PyTorch example code here: https://flower.dev#examples, and more at https://github.com/adap/flower/tree/main/examples.

We believe that AI technology must evolve to be more collaborative, open and distributed than it is today (https://flower.dev/blog/2023-03-08-flower-labs/). We’re eager to hear your feedback and experiences regarding difficulties in training, data access, data regulation, privacy, and anything else related to federated (or similar) learning methods!

guites a year ago

Hey! Glad to see flower getting attention on hn.

I've been working on a project for over a year that uses flower to train cv models on medical data.

One aspect that we see being brought up again and again is how we can prove to our clients that no unnecessary data is being shared over the network.

Do you have any tips on solving that particular problem? I.e. proving that no data apart from model weights are being transferred to the centralized server?

Thanks a lot for the project.

edit: Just to clarify, I am aware of differential privacy; I'm talking more on a "how to convince a medical institution that we are not sending its images over the network" level.

  • cpmpcpmp a year ago

    If you're concerned about data leakage, it's worth noting that model weights can very easily be used to reconstruct the original data they were trained on, so it could be misleading to claim that user data isn't being shared over the network. To avoid this, you'd need to look into techniques like Secure Aggregation or local differential privacy. Flower does provide some of this, FWIW.

    • onethought a year ago

      This doesn’t sound right: if they don’t know the structure of the NN, how can they reconstruct from the weights alone? (Perhaps the structure is communicated within the weights?)

      • aix1 a year ago

        Every agent training the model on their proprietary data has to have access to the model form in some way (otherwise how would they train it?)

        For this reason, one must assume that the model form is known to the adversary.

        With this, the question becomes: is it possible to reconstruct training data from a trained model? We already know that, at least for some image models, the answer to that question is "yes": https://arxiv.org/pdf/2301.13188.pdf

        • onethought a year ago

          That must only be true if there isn’t a one-way compression step occurring, or any approximation in the whole model.

          • aix1 a year ago

            I don't think lossy compression is sufficient. The very first example in the paper I linked to is clearly not identical to the original image (=lossily compressed) yet leaks a training image in a way that would be highly problematic in certain domains, e.g. medical imaging.

            • onethought a year ago

              I see what you are saying. Agree. It seems we need some set patterns in NN models that reliably remove reversibility without affecting loss too drastically.

  • tanto a year ago

    Hi guites, thank you! That is undoubtedly a relatable concern. We have it on our radar and plan to provide helpful material and presentations to help convince stakeholders. If you are up for a call to share your specific challenges, we could ideate with you.

    • guites a year ago

      Would love to! You can grab my email on my profile. Could you ping me over there? Thanks

  • danieljanes a year ago

    Thanks, glad you like it!

    One approach to increase transparency on the client side (and build trust with the organization where the Flower client is deployed) is to integrate a review step that asks someone to confirm the update that gets sent back to the server.

    On top of that, you should definitely use differential privacy. To quote Andrew Trask here: "friends don't let friends use FL without DP". Other approaches like Secure Aggregation can also help, depending on what kind of exposure your clients are concerned about.
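
    To illustrate what local DP amounts to in practice: clip the norm of each update and add noise before anything leaves the client. A minimal sketch with illustrative constants (not our built-in wrappers):

      import numpy as np

      def privatize_update(update, clip_norm=1.0, noise_std=0.1):
          # Bound each client's influence by clipping the L2 norm of the
          # update, then add Gaussian noise so that individual examples
          # can't be recovered from what gets sent to the server.
          flat = np.concatenate([u.ravel() for u in update])
          scale = min(1.0, clip_norm / (np.linalg.norm(flat) + 1e-12))
          return [u * scale + np.random.normal(0.0, noise_std, u.shape)
                  for u in update]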

    My general take is that the best way to solve for transparency and trust is to tackle it on multiple layers of the stack.

    • guites a year ago

      A review step sounds like a good idea. Our implementation involves very little interaction on the client side, besides setting up the datasets etc., so maybe a way to log the information sent, for later inspection, would help.

      I'll be looking into secure aggregation as I'm not fully aware of how it works. As of now we rely on differential privacy only.

      Thanks!

      • ngneer a year ago

        Cool. I saw a proposal to use TEEs for secure aggregation. OpenFL uses Gramine for that. Not sure if that provides sufficient protection, really, but worth having on the radar.

        https://arxiv.org/abs/2105.06413 https://openfl.readthedocs.io/en/latest/index.html https://gramineproject.io/

        • niclane7 a year ago

          Flower has an agreement to develop interoperable components with OpenFL. This is part of a broader plan by Intel to work with a consortium of players (which includes Flower Labs) and have the resulting code sit with the Linux Foundation. Enabling TEE support within OpenFL for secure aggregation, accessible to Flower users, is precisely the type of opportunity we seek to make possible by working with Intel on this.

          This is the official press release for those who are interested: https://www.intel.com/content/www/us/en/newsroom/news/transi...

          More broadly, in regard to your comment: our current SA support does not require hardware support, which is what we targeted first so that it can be broadly adopted by the many potential hosts of FL aggregation servers. It is suitable for most applications in need of privacy, although it still requires certain assumptions to be met, such as the number of nodes within a round, among other factors.

    • jorgeili a year ago

      What about MPC + DP? Are you planning to integrate any SMPC algorithms into Flower, or did you find limitations that kept you from doing so?

      I'm applying federated learning to the medical domain too, and I'm trying to define the best "stack" that guarantees privacy and compliance with regulations like the GDPR.

      • williamtrask a year ago

        I can’t speak for Flower’s core dev roadmap, but PySyft is in the process of integrating Flower and some Secure Enclave options which would let you do this.

        Congrats on the launch Flower team!

        • danieljanes a year ago

          Thanks! We're huge fans of the work that PySyft is doing, and we're very supportive of the Flower PySyft integration.

      • danieljanes a year ago

        Agreed that this is an interesting direction. The core Flower abstractions are "federated learning agnostic", which means they can be used for different kinds of distributed/federated workloads, not just federated learning. We'll add examples for more approaches (like SMPC) in the future; we just don't have the bandwidth to do it immediately.

JohnFen a year ago

Isn't this still moving your data to a central repository? It's encoded in a neural net rather than in a more accessible form, but it's still being moved out of your control.

  • niclane7 a year ago

    It is reasonable to think of it that way. Certainly, high-level information from the data is extracted and embedded within a model, but only the information necessary for the model being trained, whereas if the data itself were being sent, all of the information would be available. Additionally, through added protections (differential privacy being one), it is possible to engineer the federated system such that the original data cannot be reconstructed from the model.

    • dang a year ago

      Can you say more about what differential privacy is and how it works, for those of us who don't know or don't remember?

      • aix1 a year ago

        I too would be interested in understanding this better.

        Let's say we're building a medical segmentation model, which takes a patient image and outlines a tumour (or some other feature that's unique to them). I am not sure this matters here, but let's say the model is a basic 2D U-net. Image pixels in, binary pixel labels out (cancer/non-cancer).

        At a high level, how would a differentially-private setup work for training such a model across multiple institutions without pooling their patient data?

cs02rm0 a year ago

> In the past, we funded our work through consulting projects, but looking ahead, we’re going to offer a managed version for enterprises and charge per deployment or federation.

Interesting.

Flower seems to fit well for people who are sensitive about their data and don't want to hand it over to a third party, but this seems to move towards a model where they have to hand that sensitive data over to a third party.

Perhaps that still works for the bulk of users, especially commercial rather than government. It's difficult to pursue both a managed solution and simultaneously maintain an open source offering without one departing from the other.

  • popinman322 a year ago

    I didn't read it as a move towards centralizing data, but instead as working with companies to federate over their userbase or between a collection of companies.

  • wjnc a year ago

    Charge per deployment is on-prem? You bring the hardware, they send you the software.

    • tanto a year ago

      Hey wjnc, you can think of it like GitLab: you can use it on gitlab.com or deploy it yourself on-prem. The only difference is that instead of charging per user, we would charge per deployment. As with GitLab, you can decide to host it yourself.

dontreact a year ago

There is so much hype around federated learning but often the hard and insurmountable part of this is federated labeling.

For example for your cancer use case, you have to convince multiple hospitals to feed the system labels and this is a very very tall ask.

For healthcare it’s also not clear how to get a regulatory clearance if you can’t actually test the performance of the federated deployments.

So while federated learning solves some problems generated by an unwillingness to share data, it doesn’t solve all of them. Describe the use cases of your product carefully.

  • niclane7 a year ago

    Regarding federated labeling, you might be interested in some recent prototypes built on Flower that use forms of self-supervised learning. By combining SSL with federated learning, we can start to leverage unlabeled data, and this will be a big deal once it becomes commonplace. I'd suggest looking at these two research papers that build on Flower and include members of the Flower team as authors:

    https://arxiv.org/abs/2207.01975

    https://arxiv.org/abs/2204.02804

yawnxyz a year ago

Hi! As someone new to all of this — how would I interact with the trained data after it's been trained?

Is it possible to create a conversation- or QA-style interaction with it? I see there are examples for "pytorch", but as someone new, I'm not sure what that means in terms of public use cases.

I guess I'm asking is "ok I use Flower to train on a bunch of stuff... then what do I do with that?"

Thanks!

  • danieljanes a year ago

    Hi there - the data never moves if you train a model using federated learning. It stays on user devices or in organizational silos. After the training, you have the model parameters of the model on the server, without the server having ever seen a single data example.

    After the training, you can deploy the model in different ways. If you want to use it on device (or in one of the organizational silos), you can send the final model parameters there and deploy it locally. Or you just deploy the model on the server behind an API. It all depends on the use case.
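
    As a rough sketch of the on-device option, assuming a PyTorch model (the architecture and parameter values below are placeholders):

      import torch
      import torch.nn as nn

      model = nn.Linear(4, 2)  # stand-in for the model the clients trained
      # `final_parameters`: the list of NumPy arrays produced by the
      # final round of federated training (placeholder values here).
      final_parameters = [p.detach().numpy() for p in model.parameters()]
      state_dict = {
          k: torch.tensor(v)
          for k, v in zip(model.state_dict().keys(), final_parameters)
      }
      model.load_state_dict(state_dict)
      model.eval()  # ready to serve behind an API or ship to devices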

    Hope that helps, I'm happy to provide more details.

jaggirs a year ago

It has been shown that the input data can be reverse-engineered from the model weights. How do you deal with this issue?

  • technologia a year ago

    looks like they've introduced some differential privacy wrappers, the changelog points to that: https://github.com/adap/flower/blob/94a1f942abfce5dff4e9aff2...

    • niclane7 a year ago

      Yes, we have developed modular and efficient secure aggregation and differential privacy solutions that can help people dial in the amount of protection they need. We have documented an early version of the secure aggregation here: https://flower.dev/docs/secagg.html Documentation and updates on both methods will be released soon.

      • technologia a year ago

        I hate to ask a product comparison question, but why would I use this versus other projects like PySyft?

        • niclane7 a year ago

          Thanks for the question, it's very natural to ask. We are also fans of PySyft. It offers support for a very wide range of privacy-enhancing machine learning tools. But where Flower and PySyft differ is in focus. Federated learning is difficult and requires many technical moving parts all working together (e.g., secure aggregation, differential privacy, scalable simulation, device deployments, integration with conventional ML frameworks, etc.). All of these need to be tightly integrated, and in a manner that performs federated learning efficiently. This is where Flower currently excels. It offers comprehensive, extensible, and, most importantly, easy-to-use construction of the federations that bring these different parts together. We believe it offers the best user experience for federated learning currently out there. We hope that in the future many tool suites offering private machine learning (like PySyft and others) will adopt Flower components so we can all work better together.

          • williamtrask a year ago

            Can confirm that PySyft is currently in the process of integrating with Flower. Best of both worlds.

            • danieljanes a year ago

              Indeed - looking forward to this

          • technologia a year ago

            I appreciate you taking the time to break this down, I’ve spent a decent chunk of time having to roll my own stuff so when pygrid/pysyft came along it was just easier. I will say the flower components look interesting and I’ll give it a shot

    • danieljanes a year ago

      Thanks for adding this here! We added these DP wrappers, and we're working on something similar for Secure Aggregation, but I must admit that we have to document them better to make them easier for everyone to use.

brookst a year ago

Very interesting project. Your write up here does a much better job of explaining the market need and value prop than the GitHub readme.md… consider bringing some of this text over as the “why / what” story?

  • tanto a year ago

    Thank you! We'll make sure to improve the readme and add more explanation to it.

photochemsyn a year ago

This looks very interesting. I'd like to see a model trained on the complete body of scientific research literature from the past 100 years or so, I wonder if this approach could facilitate that?

  • niclane7 a year ago

    Yes, this would be exciting to see. One approach wouldn't require federated learning, however: if you had direct access to the data, you could build a conventionally trained large language model (i.e., collect all the data together in a data center). But given the context of this discussion, you are probably asking whether we could use Flower to train in a federated manner. I believe so, although we'd probably be training an LLM, which brings added complications due to its size (and other factors). Internally at Flower we have been testing methods to overcome this and are confident we can pull it off. One could imagine someone hosting a pre-trained LLM, with contributing institutions acting as nodes in the network, each performing some small part of the training based on the fraction of the literature they have access to. We plan to release LLM-based federated technology in the coming months.

    For those who are interested: the best work I've currently seen on training very large models under federated learning, which also makes very realistic assumptions about the likely underlying participating hardware, is this: https://arxiv.org/abs/2206.11239 -- although I expect more in this direction to come soon.

  • haldujai a year ago

    I'm not sure that this would be as useful as one might think at face value. When you stretch out the training corpus like that, you're going to have more noise/inaccuracies/refuted facts than correct information.

    It's also unclear how useful full scientific articles are; Microsoft/PubMedBERT interestingly showed that PMC abstracts were better than full text.

juanma91p a year ago

Great to see Flower here! We use the framework for our projects because of its modularity, scalability, and ease of use. Another important aspect of FL, on top of the already-mentioned privacy preservation, is network resource utilisation. By transferring only the weights of the model, less bandwidth is required, which can reduce network congestion. This is especially important given that more than 50 billion devices are expected to be connected and transferring data by 2030.

  • danieljanes a year ago

    Great to hear, thanks for sharing - modularity, scalability, and user friendliness are what we think a lot about :)

elijahbenizzy a year ago

Congratulations! Really excited for you!

I love how you found a niche, valuable problem, built a framework, and are seeing a lot of success. A question (and I'm far from an expert so let me know if the assumptions are wrong):

It seems to me that the federated users have to be coordinated around timing for this to work. Otherwise this could take weeks (and lots of Slack messages) for a single model to train, e.g., one team is having infra issues and doesn't get a job started, the other team is ready but then their lead goes on vacation, etc. In the internal-to-an-organization case this is probably fine (e.g., a hospital where the data has to be separated by patient/cohort), but if there are different teams managing the data then (a) have you seen this problem and (b) do you have tooling to fix it?

  • danieljanes a year ago

    Thanks, we're excited too!

    Flower tries to automate this as much as it can. In cases where multiple organizations are involved, the workload can run in a fully automated manner if that's fine for all organizations. If a review step is required, that can be integrated (either on the client side or on the server side) - the availability of reviewers will then become the bottleneck for end-to-end latency.

    In the long run, we will evolve the permissioning system to allow workloads to be automatically executed if they fall within pre-approved boundaries, or require manual review if they don't. Pre-approved boundaries could, for example, be used to configure a particular combination of models and hyperparameter ranges that are ok to run without additional (manual) approvals.
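
    Purely as a hypothetical sketch of what such a boundary might look like (nothing like this exists in Flower today):

      # Hypothetical policy: workloads within these bounds run
      # automatically; anything outside goes to manual review.
      approved_policy = {
          "models": ["resnet18", "resnet34"],
          "hyperparameters": {
              "learning_rate": {"min": 1e-4, "max": 1e-2},
              "local_epochs": {"min": 1, "max": 5},
          },
      }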

    • elijahbenizzy a year ago

      Awesome! Makes sense. I think the challenge is going to be coordinating with the various orchestration systems -- timeouts, etc.. Excited to see how you pull it off!

jleguina a year ago

This is a great project!

Have you thought about what happens at inference? Suppose I train in a federated healthcare environment using PII features from patient records. Once I get the weights back, how can I ever deploy the model if I don't have access to the same features? The models would become highly coupled to the training environments, no?

Best of luck!

techwizrd a year ago

I've been working with Flower to implement and study Federated Learning for a few years, and have just started contributing back on Slack and Github. Congrats on launching on HN!

  • tanto a year ago

    Really happy to hear that, and your support is much appreciated! I saw you answering many questions before we could do so :) Thank you for that. We are reaching out to all contributors. Let me know in Slack if you are up for a short call so we can better understand how to support you.

    • techwizrd a year ago

      Absolutely. Happy to help!

northlondoner a year ago

Many congratulations! Glad to hear about UK & EU collaborative innovation in open-source projects. Keep up the fantastic work!

Others have asked similar questions regarding comparable projects. What's your take on OpenFL from Intel? Do you think Flower is moving in a more commercial, MLOps direction? It looks like OpenFL is particularly focused on the academic imaging community.

spangry a year ago

Another interesting use case - government training models on legislatively protected data (e.g. tax data). Lots of data the government holds is governed by confidentiality restrictions built into legislation, limiting its utility. Sounds like federated learning could be a way around that.

rjtc a year ago

How is your approach different than tf federated or any of the other federated libraries out there?

  • danieljanes a year ago

    There are some similarities, but also some differences. Flower's take is that it wants to support the entire FL workflow from experimental research to large-scale production deployments and operation. Some other FL frameworks fall either in the "research" or "production deployment" bucket, but few have good support for both.

    Flower does a lot under the hood to support these different usage scenarios: it has both a networked engine (gRPC, experimental support for REST, and the possibility to "bring your own communication stack") and a simulation engine to support both real deployment on edge devices/server and simulation of large-scale federations on single machines or compute clusters.
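
    To give a feel for the simulation engine, here is a sketch along the lines of the documented API (it needs the Ray-backed extra, flwr[simulation], and details vary by version):

      import flwr as fl
      import numpy as np

      class TinyClient(fl.client.NumPyClient):
          # Trivial client: real code would load partition `cid`'s data.
          def get_parameters(self, config):
              return [np.zeros(10)]
          def fit(self, parameters, config):
              return parameters, 1, {}
          def evaluate(self, parameters, config):
              return 0.0, 1, {}

      def client_fn(cid: str):
          return TinyClient()

      # Simulate a 100-client federation on a single machine.
      fl.simulation.start_simulation(
          client_fn=client_fn,
          num_clients=100,
          config=fl.server.ServerConfig(num_rounds=3),
      )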

    This is - to the best of our knowledge - one of the drivers of our large and active community. The community is very collaborative and there are many downstream projects in the ecosystem that build on top of Flower (GitHub lists 748 dependent projects: https://github.com/adap/flower/network/dependents).

blintz a year ago

This is really cool. Federated learning seems like it could unlock a lot of value in healthcare settings.

Have you had any luck convincing hospitals / insurers / etc that this satisfies HIPAA and is safe? How do you convince them?

  • JohnFen a year ago

    What about patients? I would be very, very worried about going to a health provider that participated in this.

7e a year ago

Why not train 10000x faster using H100 secure enclaves with remote attestation? The FL window is closing, because it is a PITA to use and replacements are superior.

fedgenerativeai a year ago

This is not new at all. There is a much stronger competitor existing in the market already: FedML (https://fedml.ai). They have a much larger open-source community, and a well-managed and widely-used MLOps (https://open.fedml.ai).

  • danieljanes a year ago

    One of the creators of Flower here - I can only say that the team behind Flower appreciates the contributions of FedML to the field of federated learning. Their work helps to make federated learning more widely known, and they published significant advances in making federated learning more robust and scalable.

    In fact, we are in the process of implementing LightSecAgg, and we'd welcome their feedback once we have a working version.

  • sisso97 a year ago

    I'm very familiar with FL frameworks; I chose Flower after spending a lot of time benchmarking many of them. FedML wasn't able to handle the easiest workloads I tried, like the baseline of the Shakespeare dataset. Its simulator was the slowest of the frameworks I tried. Your platform doesn't give developers like me what they want. Flower has been the best to use since day one.

  • fednongenai a year ago

    As a founder of an FL startup, I strongly advise against using FedML, as it is one of the worst frameworks in terms of scalability. Regrettably, the FedML team also has a poor reputation due to their toxicity and suspicious behavior. If you value the trustworthiness of your product, I suggest avoiding FedML at all costs.

    • fedgenerativeai a year ago

      OMG, hard to believe that these are a Flower founder's words. I am just a federated learning user and told the truth to the audience here. I have followed the FL community for more than 2 years, and I clearly feel that Flower has been copying and following FedML's product plan without any originality at all. These nonfactual attacks further make me distrust the Flower team. You are ruining the reputation of YC. You need to gain the trust of users the way the FedML team has, not attack their reputation. Please stop doing so.

      FedML is definitely much simpler and more powerful in both research and production. In terms of scalability, I like the benchmarking results in this paper (https://arxiv.org/abs/2303.01778); it shows that FedML is much more scalable. I also tried it myself; it's faster and more scalable. In terms of "research to production", I don't see Flower supporting any MLOps functionality. Is Flower trying to copy FedML again in this direction?

      • fednongenai a year ago

        *just a federated learning user* flashing two-week-old academic junk from the FedML team. Sus af. No f'ing way I am using a privacy product from the glorious land that gave us TikTok and mass government surveillance.

        • fedgenerativeai a year ago

          That one is an important work on scalability, so I followed it closely.

          I will stop here due to your disrespectful words. Good luck.

    • ulikis2 a year ago

      LOL, starting FL wars, are we? I never tested FedML in production but it was cool for simple experiments. I concur with what you are saying though about their devs. The CTO is super sus.

  • dang a year ago

    I'm not sure what pre-existing conflict is breaking out in this subthread between accounts created for the purpose, but would all of you please stop?