Ask HN: Vision Models to Parse UI
Hello. I was curious whether there are any existing vision models out there for either object classification of individual UI elements, or object detection of UI elements within their broader context.
Not that I'm aware of, but that's a very interesting idea. If you had one that maps from ui to html with inline styling, you could automate turning image mock-ups into html with inline styling.
It's not clear exactly how you would implement it, though. Maybe by recursively dividing the problem into rectangles, directed by the model? E.g. start with the full image, train the model to locate the first element of the html, and have it output an attention mask for that element along with the corresponding html tag and maybe its style. Then recursively run the model twice: once with that attention mask as input and once with the inverted mask, and have each run extract the next element within its region.
Not sure if that would work, but it seems like it might.
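Something like this, maybe. Just a sketch of the recursion: "model" here stands for a hypothetical function that takes the image plus a region mask and returns a tag, a style, and an attention mask for the most prominent element strictly inside that region.

    import numpy as np

    def parse_region(image, region_mask, model, depth=0, max_depth=8):
        # Recursively turn a screenshot region into a flat list of (tag, style, mask).
        if depth >= max_depth or not region_mask.any():
            return []
        tag, style, element_mask = model(image, region_mask)  # hypothetical call
        if tag is None:  # nothing detected in this region
            return []
        elements = [(tag, style, element_mask)]
        # Recurse into the element itself to find its children...
        elements += parse_region(image, element_mask, model, depth + 1, max_depth)
        # ...and into the rest of the region (the inverted mask) to find its siblings.
        remainder = region_mask & ~element_mask
        elements += parse_region(image, remainder, model, depth + 1, max_depth)
        return elements

    # full_mask = np.ones(image.shape[:2], dtype=bool)
    # elements = parse_region(image, full_mask, model)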
I think you need to think outside the Web for this one. The web already has many ways to access UI elements, like the DOM, CSS selectors, etc.
This is really more useful for things like desktop GUIs where you have no other option than to OCR then inject mouse clicks.
Is there an OCR that reports character coordinates? Couldn’t find a tesseract-ocr option for that, but maybe it’s in a block-detector-something source code?
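Tesseract does expose this, at least via the pytesseract wrapper: image_to_boxes gives per-character boxes (origin at the bottom-left) and image_to_data gives per-word boxes plus confidences (origin at the top-left). A minimal sketch, assuming pytesseract is installed and "ui.png" is a placeholder screenshot:

    import pytesseract
    from pytesseract import Output
    from PIL import Image

    img = Image.open("ui.png")

    # Per-character boxes, one "char x1 y1 x2 y2 page" line each.
    print(pytesseract.image_to_boxes(img))

    # Per-word boxes with confidences.
    data = pytesseract.image_to_data(img, output_type=Output.DICT)
    for text, left, top, w, h, conf in zip(
        data["text"], data["left"], data["top"],
        data["width"], data["height"], data["conf"]
    ):
        if text.strip():
            print(f"{text!r} at ({left}, {top}) size {w}x{h} conf {conf}")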
LayoutLM [1] is the closest that I have seen to what you are asking. It is applied to documents, but it essentially takes positional and visual information into account for text extraction - for example, extracting a total from the line that reads TOTAL. I think this would be the best place to start.
1. https://arxiv.org/abs/2204.08387
I too would like to see this! I'm pretty sure it can be accomplished by using screenshots as training data for an object recognition algorithm. As with a lot of machine learning, gathering and tagging the data would be tricky and overfitting to specific design systems could be a big problem.
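A rough sketch of what that could look like, fine-tuning a stock torchvision detector on screenshot annotations. The class list and the screenshot_loader are placeholders, and the whole thing assumes you already have boxes labelled per screenshot:

    import torch
    import torchvision
    from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

    UI_CLASSES = ["background", "button", "text_input", "checkbox", "dropdown", "icon"]

    # Start from a detector pretrained on natural images and swap in a UI head.
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, len(UI_CLASSES))

    optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
    model.train()
    for images, targets in screenshot_loader:  # hypothetical loader of (images, {"boxes", "labels"})
        loss_dict = model(images, targets)     # returns classification + box-regression losses
        loss = sum(loss_dict.values())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()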
But the hierarchy piece, I think, is a bit trickier.
I'm really curious to see what the comments come up with.
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding https://arxiv.org/abs/2210.03347
https://github.com/google-research/pix2struct
Thanks, will take a look at these.
Android voice access blogged about how they use a model to detect and classify buttons:
https://ai.googleblog.com/2021/01/improving-mobile-app-acces...
I'm currently working on this! Right now I'm using traditional computer vision methods (e.g. Canny edge detection), which already work quite well for most websites and applications, but I'm working towards curating a dataset for deep learning. I'm keen to chat!
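For reference, a bare-bones version of that traditional-CV approach: Canny edges plus contours to propose element boxes. The thresholds and minimum sizes are arbitrary and would need tuning per app; "screenshot.png" is a placeholder.

    import cv2

    img = cv2.imread("screenshot.png")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)

    # Dilate so the outlines of buttons/inputs close into connected regions.
    edges = cv2.dilate(edges, cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3)))

    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w > 20 and h > 10:  # drop tiny fragments
            cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

    cv2.imwrite("screenshot_boxes.png", img)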
Let me know if you want a good place to host and iterate on the dataset - I'm working on a version control system optimized for deep learning datasets. It's free to host public datasets right now; I'm just looking for feedback.
It's called Oxen and is super fast at versioning image data.
https://github.com/Oxen-AI/oxen-release
I'm interested too, if anyone knows of anything.