TensorFlow Lite is TensorFlow’s lightweight solution for mobile and embedded devices! TensorFlow has always run on many platforms, from racks of servers to tiny devices, but as the adoption of machine learning models has grown over the last few years, so has the need to deploy them on mobile and embedded devices. TensorFlow Lite enables low-latency inference of on-device machine learning models.
Looking forward to your feedback as you try it out.
> Looking forward to your feedback as you try it out.
Thanks Rajat. We use typical Cortex-A9/A7 SoCs running plain Linux rather than Android. We would use it for inference.
1. Platform choice
Why make TFL Android/iOS only? TF works on plain Linux. TFL even uses the NDK, and it would appear the inference part could work on plain Linux.
2. Performance
I did not find any info on the performance of TensorFlow Lite; I'm mainly interested in inference performance. The tagline "low-latency inference" catches my eye, and I just want to know how low is low here. Milliseconds?
1. The code is standard C/C++ with minimal dependencies so it should be buildable on even non-standard platforms. Linux is easy.
2. The interpreter is optimized to have low overhead, and the kernels are better optimized, especially for ARM CPUs, currently. Performance varies by model, but we have seen significant improvements on most models going from TensorFlow to TensorFlow Lite. We'll share benchmarks soon.
> The code is standard C/C++ with minimal dependencies so it should be buildable on even non-standard platforms. Linux is easy.
Glad to hear that, Rajat. Since it is easy, as you say, I look forward to your upcoming release with Linux as a standard target. :-)
Also interested in answers to these two questions, as well as OpenCL performance on vanilla Linux (i.MX6 and above).
Will CoreML (or any hardware acceleration) on iOS be supported?
We want to provide a great experience across all our supported platforms, and are exploring ways to provide a simpler experience with good acceleration on iOS as well.
Whoa, this is cool. I've been waiting for this since you announced it. I was thinking about benchmarking it against other solutions. What do you think about similar frameworks like CoreML?
What tradeoffs did you make compared to the original?
A few tradeoffs we had to make:
- As mentioned below, FlatBuffers makes startup time faster while trading off some flexibility.
- Smaller code size means trading off dependencies on some libraries and broader coverage versus writing more things from scratch, focused on the use cases people care about.
Do you have performance/memory comparisons from using FlatBuffers vs. protobuf in TF? A quick writeup on how switching affected performance would be really interesting :)
FlatBuffers also uses less memory.
Using FlatBuffers, for one?
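Concretely, the startup win comes from FlatBuffers being readable in place: the model file is mmap'd and fields are decoded lazily, instead of being parsed and unpacked into objects up front the way a protobuf is. A rough sketch (the commented accessor is a placeholder for what flatc would generate from TF Lite's schema, not a real import):

    import mmap

    # Protobuf-style loading: parse and unpack the whole file into objects
    # before anything can be read (copies and allocations up front), e.g.
    #   graph = GraphDef()
    #   graph.ParseFromString(open("model.pb", "rb").read())

    # FlatBuffers-style loading: map the file and read fields in place,
    # so "loading" is essentially just the mmap call.
    with open("model.tflite", "rb") as f:
        buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # model = Model.GetRootAsModel(buf, 0)  # hypothetical flatc-generated accessor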
Hi, I have written about this before (https://news.ycombinator.com/item?id=15595689), but are there serialization fixes between cloud training and mobile?
We have had huge issues trying to figure out how to save models (freeze_graph, etc.) and load them on Android. If you look at my previous thread, it also mentions bugs, threads, and support requests where people are consistently confused.
Agreed, that is a big problem that we are working hard to solve. It isn't solved in this release, but it is high up on our task list.
hey, thanks for the reply!
petewarden (https://news.ycombinator.com/item?id=15596990) from Google is also working on this, so I'm really hopeful you guys will have something soon. This is a serious blocker for doing anything reasonable in TF.
Will this leverage the Pixel Visual Core SoC on a Pixel 2 device?
This release of TensorFlow Lite doesn't leverage the Pixel Visual Core. We will explore the different hardware options available to us in the future.
TF Lite supports the Android NN API, which allows each phone to accelerate these models by leveraging whatever custom accelerator it has.
What about using XLA to compile libraries for mobile deployment fusing only the operations needed by the model?
One nice thing about Lite is that it's a lot easier to just include the operations you need (compared to TensorFlow 'classic'), there's fusion for common patterns, and the base interpreter is only 70KB. That covers a lot of the advantages of using XLA for mobile apps. In return you have the ability to load models separately from the code, and the ops are hand-optimized for ARM.
I'm still a fan of XLA, and I expect the two will grow closer over time, but I think Lite is better for a lot of scenarios on mobile.
What about quantization? Does TensorFlow Lite perform the quantization, or is TensorFlow supposed to do it? Is it an iterative process or straightforward? Or are you training quantized models, as the NN API docs say?
The quantization is done with a special training script that is quantization-aware. We will be open-sourcing a quantized MobileNet training script to show how to do this soon.
TensorFlow Lite is an interpreter, in contrast with XLA, which is a compiler. The advantage of TensorFlow Lite is that a single interpreter can handle several models rather than needing specialized code for each model and each target platform. TensorFlow Lite's core kernels have also been hand-optimized for common machine learning patterns. The advantage of compiler approaches is fusing many operations to reduce memory bandwidth (and thus improve speed); TensorFlow Lite fuses many common patterns in its converter instead. We are of course excited about the possibility of using JIT techniques and XLA technology within the TensorFlow Lite interpreter, or as part of the TensorFlow Lite converter, as a possible future direction.
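For concreteness, converting a graph looked roughly like this in the developer preview (a sketch; the converter lived under tf.contrib.lite at the time, and the exact module path and signature may differ between releases):

    import tensorflow as tf

    # A toy graph to convert; in practice this would be a trained, frozen model.
    img = tf.placeholder(name="img", dtype=tf.float32, shape=(1, 64, 64, 3))
    out = tf.identity(img * 2.0 + 1.0, name="out")

    with tf.Session() as sess:
        # The converter fuses common patterns and writes the FlatBuffer format.
        tflite_model = tf.contrib.lite.toco_convert(sess.graph_def, [img], [out])

    with open("model.tflite", "wb") as f:
        f.write(tflite_model)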
Is it lite enough to compile with Emscripten and use via WebAssembly?
This should be possible, but we haven't tried it. We're likely going to add a simplified target that has minimal dependencies (like no Eigen) that allows building on simple platforms.
Cool. I have something else that uses Eigen in WebAssembly, so that hasn't caused any issues btw.
So it uses Bazel on Android....
Google devs, could you please get yourselves together in one room and agree on ONE BUILD SYSTEM for Android?!?
Gradle stable, CMake, ndk-build, the unstable Gradle plugin, GN, Bazel, ..., whatever someone else does with their 20%.
I keep collecting build systems just to build Google stuff for Android.
It might be a tad annoying, but the rest of TensorFlow uses Bazel, so it makes sense that TensorFlow Lite also uses it. It also probably matches the internal Google workflow better, since Google uses Blaze internally.
I thought it was to be used outside Google, not that we have to learn every single build system they happen to use inside.
If all Google teams who make use of external build systems were going to agree on one (not likely), it would be Bazel.
Why would I use this for iOS when I can use CoreML and convert TensorFlow into a CoreML model where there is already native support?
Native support for tensorflow? I don’t think so...
https://developer.apple.com/documentation/coreml/converting_...
CoreML doesn't actually support TensorFlow. Its support for TensorFlow is only through Keras, which is fine if you just want to build stock-standard models, but if you're doing crazy research implementations then that's not going to work.
It's all in the converter tool: if the converter can get the TF file into a .mlmodel properly, then it will be supported. Inside it's just a bunch of weights, layers, and parameters. We just need a proper script to translate it.
"just a bunch of weights and layers and parameters" -- I think you and the GP are agreeing. That's the definition of standard: If the model can be expressed using the currently-blessed set of layer definitions in CoreML, then yes. But if you're doing nonstandard stuff with weird control flow behavior, or RNNs that don't map into some of the common flavors, then all bets are off.
An example: some of my colleagues put a QP solver in tandem with a DNN, so that the neural network could 'shell out' to the solver as part of its learning, and it learned to solve small Sudoku problems from examples alone: https://arxiv.org/abs/1703.00443. The PyTorch code for it is one of the examples I like to use as a stress test for doing funky things in the machine learning context.
TensorFlow is a very generic dataflow library at its heart, which happens to have a lot of DNN-specific functionality as ops. It's possible to express arbitrary computations in it, whereas CoreML and similar frameworks make more assumptions that the computation will fit a particular mould, and optimize accordingly.
Looks like you are right: CoreML only supports three kinds of DNNs (feedforward, convolutional, recurrent). I suppose capsule nets wouldn't fit any of those if they were implemented in TF.
How does this differ from uTensor? Does it make uTensor redundant?
https://github.com/neil-tan/uTensor
We developed TensorFlow Lite to be small enough to target really small devices that lack MMUs, like the ARM Cortex-M MCU series, but we haven't done the actual work to target those devices. That being said, we are excited to see the ecosystem and community around machine learning expand.
Cortex-M compatibility was literally my first thought when I read this -- especially low-memory systems. Might have to hack it up myself.
Would that be a viable option to deploy TensorFlow models on serverless environments (Lambda, Functions)?
You can deploy TensorFlow model binaries as serverless APIs on Google Cloud ML Engine [1]. But I would also be interested in seeing a TensorFlow Lite implementation.
[1] https://cloud.google.com/ml-engine/docs/deploying-models
Disclaimer: I work for Google Cloud.
Thanks, @rasmi. I have some feedback for you guys: the pricing for prediction inference on GCP is not very fair. If I deploy a small model (like SqueezeNet or MobileNet), I pay almost the same price as someone deploying a large model (like ResNet or VGG). That's why I'm deploying my models on serverless environments and paying about 5 dollars per 1 million inferences.
The pricing of GCP is: $0.10 per thousand predictions, plus $0.40 per hour. That’s more than 100 dollars for 1 million inferences.
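Spelling out the arithmetic behind that number (per-prediction charge only; the hourly compute charge comes on top):

    # $0.10 per 1,000 predictions
    predictions = 1000000
    print(predictions / 1000.0 * 0.10)  # -> 100.0 dollars, before the hourly charge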
I see what you mean. To some companies, ML Engine's cost as a managed service may be worth it. To others, spinning up a VM with TensorFlow Serving on it is worth the cost savings. If you've taken other approaches to serving TensorFlow models to get around ML Engine's per-prediction cost, I'm curious to hear about them.
The main TensorFlow interpreter provides a lot of functionality for larger machines like servers (e.g. desktop GPU support and distributed support). Of course, TensorFlow Lite does run on standard PCs and servers, so using it on non-mobile/small devices is possible. If you wanted to create a very small microservice, TensorFlow Lite would likely work, and we'd love to hear about your experiences if you try this.
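If a Python binding to the interpreter is available in your build (it is exposed as tf.lite.Interpreter in later releases; the module path may differ in this preview), a minimal inference call looks roughly like this, assuming a float32 input:

    import numpy as np
    import tensorflow as tf

    # Load a converted model and run a single inference; this is roughly the
    # whole runtime surface a small inference microservice would need.
    interpreter = tf.lite.Interpreter(model_path="model.tflite")
    interpreter.allocate_tensors()

    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    x = np.zeros(input_details[0]["shape"], dtype=np.float32)
    interpreter.set_tensor(input_details[0]["index"], x)
    interpreter.invoke()
    result = interpreter.get_tensor(output_details[0]["index"])
    print(result.shape)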
Thanks for the answer. Currently I'm using AWS Lambda to deploy my TensorFlow models, but it's pretty hard and hacky. I need to strip out a considerable portion of the codebase that isn't needed for inference-only routines, both so the code loads faster and to fit within the deployment package size limit. If TensorFlow Lite already has a small footprint, it may be much easier to deploy to a serverless environment. I'll be trying it in my next deployments.
Sounds really interesting. We're excited to hear how that goes.
Is it possible that a future version may be able to leverage CoreML on iOS?
With TensorFlow and TF Lite we are looking to provide a great experience across all platforms, and are exploring ways to provide a simpler experience with good acceleration on iOS as well.
Someone could just write a CoreML conversion tool. There may already be one that converts the non-Lite version straight to CoreML.
But on iOS we still cannot use Swift with that; see https://github.com/tensorflow/tensorflow/issues/19. By the way, what about Kotlin?
UPDATE: It seems some third-party developers have built Swift-compatible APIs.
On iOS, does TensorFlow Lite utilize the GPU for inference when needed or is it CPU only?
If so, does it use OpenCL or something?
Is this the next iteration of TensorFlow for Mobile? Is on-device training something planned for the future?
Yes to your first question, from the article: ”As you may know, TensorFlow already supports mobile and embedded deployment of models through the TensorFlow Mobile API. Going forward, TensorFlow Lite should be seen as the evolution of TensorFlow Mobile, and as it matures it will become the recommended solution for deploying models on mobile and embedded devices.”
Also check out this post for more info and examples: https://research.googleblog.com/2017/11/on-device-conversati...
This is pretty Android/iPhone-only; I wish it were flexible enough to be used on other edge devices such as home routers or other embedded products where resources are constrained.
The current examples talk about Android/iPhone; however, the core runtime is pretty lightweight, with the goal of supporting all kinds of embedded products.
Do let us know if you build/run on other platforms.
I was hoping this link might be to a version of TensorFlow that sheds the heavyweight Java dependency for building. Sadly not; still Bazel-infested.
How does this compare to using XLA for AOT compilation?
XLA for AOT is useful for cases where you know exactly what architecture you are shipping to and are OK with updating the code whenever the model changes.
TF Lite addresses the segment where you need more flexibility:
- you ship a single app to many types of devices;
- you would like to update the model independently of the code itself, e.g. pushing a new model over the wire with no change to the Android APK.
Even with this generality, TF Lite is still quite fast and lightweight, as that was the focus while building it.
How do I know which handsets or tablets have "New hardware specific to neural networks processing" for the NNAPI?
Is the Lite converter also doing some sort of quantization, or is it purely for file-format conversion?
TensorFlow has supported quantization for a long time (and it is recommended for mobile devices), so it very likely does.
Quantization comes in many different forms. TensorFlow Lite provides optimized kernels for 8-bit uint quantization. This specific form of evaluation is not directly supported in TensorFlow right now (though TensorFlow can train such a model). We will be releasing training scripts that show how to set up such models for evaluation.
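As a numerical sketch of what 8-bit affine (uint8) quantization means (generic math with made-up scale/zero-point values, not the exact parameters TF Lite's converter or kernels choose):

    import numpy as np

    def quantize_uint8(x, scale, zero_point):
        # Affine quantization: q = round(x / scale) + zero_point, clamped to [0, 255].
        q = np.round(x / scale) + zero_point
        return np.clip(q, 0, 255).astype(np.uint8)

    def dequantize(q, scale, zero_point):
        # Recover an approximation of the original float values.
        return scale * (q.astype(np.float32) - zero_point)

    # Map floats in roughly [-1, 1] onto uint8.
    scale, zero_point = 2.0 / 255.0, 128
    w = np.array([-1.0, -0.25, 0.0, 0.5, 1.0], dtype=np.float32)
    qw = quantize_uint8(w, scale, zero_point)
    print(qw)                                 # [  0  96 128 192 255]
    print(dequantize(qw, scale, zero_point))  # close to the original values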
What are the minimum requirements? Would something like an ARM Cortex-M4F at 72 MHz with 512 KB of RAM work?
I would like to know how small an Inception-V3 model becomes when converted into the .tflite format.
Didn't they announce this at Google I/O, where it was supposed to be available that day?
Definitely announced at I/O, but all the language I'm finding from around that time is of the "want to" and "will" variety, like this Wired piece:
https://www.wired.com/2017/05/google-really-wants-put-ai-poc...
> “Google won't say much more about this new project. But it has revealed that TensorFlow Lite will be part of the primary TensorFlow open source project later this year”
How does this relate to other hardware beyond iOS / JavaScript, i.e. Raspberry Pi, NVIDIA Jetson, etc.? And what's the likelihood of libraries that sit on top of TF, like Keras and PyTorch, supporting this? Just some questions that spring to mind.
I'm wondering if TF has something like PyTorch's autograd. Does anyone know?
I only just briefly read the doc for autograd, but automatic differentiation is the strong default in TF if that's what you're asking.
Yes, it has had automatic differentiation from day one. There's also a new autograd-like functional API as part of eager execution. See https://research.googleblog.com/2017/10/eager-execution-impe...
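A quick graph-mode sketch of what that looks like with tf.gradients (reverse-mode automatic differentiation over the graph):

    import tensorflow as tf

    x = tf.placeholder(tf.float32)
    y = x * x + 3.0 * x

    # Builds the gradient graph for dy/dx symbolically.
    dy_dx, = tf.gradients(y, [x])

    with tf.Session() as sess:
        print(sess.run(dy_dx, feed_dict={x: 2.0}))  # 2*x + 3 at x=2 -> 7.0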
Any React Native APIs?
Not at this time. However, in principle it would be possible to create such bindings.
It's spelled "light" - you'd think Google could hire an editor.
So we'll start to see more and more battery-consuming "AI" apps in mobile devices?
And we'll start to see more battery-efficient hardware to run those apps without consuming all of your battery. :)
(I'm saying that glibly, but I'm dead serious -- look at what we've seen emerge just this year in Apple's Neural Engine, the Pixel Visual Core, rumored chips from Qualcomm, and the Movidius Myriad 2. The datacenter was the first place to get dedicated DNN accelerators in the form of Google's TPU, but the phones -- and even smaller devices, like the "clips" camera -- are the clear next spot. And this is why, for example, TensorFlow Lite can call into the Android NN API to take advantage of local accelerators as they evolve.)
Being able to run locally, if battery life is preserved, is a huge win in latency, privacy, potentially bandwidth, etc. It'll be good, though it does need advances in both the HW and the DNN techniques (things like Mobilenet, but we need far more).
Thank you.
Yes.