Ask HN: Vision Models to Parse UI
Hello. I was curious whether there are any existing vision models out there for either object classification of individual UI elements, or object detection of UI elements within their broader context.
Not that I'm aware of, but that's a very interesting idea. If you had one that maps from ui to html with inline styling, you could automate turning image mock-ups into html with inline styling.
It's not clear exactly how you would implement it, though. Maybe by recursively dividing the problem into rectangles, directed by the model? E.g. start with the full image, train the model to locate the first element of the html, and have it output an attention mask for that element along with the corresponding html tag and maybe its style. Then recursively run the model twice: once with that attention mask as input and once with the inverted mask, and have each run extract the next element within its region.
Not sure if that would work, but it seems like it might.
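Something like this, maybe. Just a sketch of the recursion: "model" here stands for a hypothetical function that takes the image plus a region mask and returns a tag, a style, and an attention mask for the most prominent element strictly inside that region.

    import numpy as np

    def parse_region(image, region_mask, model, depth=0, max_depth=8):
        # Recursively turn a screenshot region into a flat list of (tag, style, mask).
        if depth >= max_depth or not region_mask.any():
            return []
        tag, style, element_mask = model(image, region_mask)  # hypothetical call
        if tag is None:  # nothing detected in this region
            return []
        elements = [(tag, style, element_mask)]
        # Recurse into the element itself to find its children...
        elements += parse_region(image, element_mask, model, depth + 1, max_depth)
        # ...and into the rest of the region (the inverted mask) to find its siblings.
        remainder = region_mask & ~element_mask
        elements += parse_region(image, remainder, model, depth + 1, max_depth)
        return elements

    # full_mask = np.ones(image.shape[:2], dtype=bool)
    # elements = parse_region(image, full_mask, model)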
I think you need to think outside the Web for this one. The web already has many ways to access UI elements, like the DOM, CSS selectors, etc.
This is really more useful for things like desktop GUIs where you have no other option than to OCR then inject mouse clicks.
Is there an OCR that reports character coordinates? Couldn’t find a tesseract-ocr option for that, but maybe it’s in a block-detector-something source code?
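Tesseract does expose this, at least via the pytesseract wrapper: image_to_boxes gives per-character boxes (origin at the bottom-left) and image_to_data gives per-word boxes plus confidences (origin at the top-left). A minimal sketch, assuming pytesseract is installed and "ui.png" is a placeholder screenshot:

    import pytesseract
    from pytesseract import Output
    from PIL import Image

    img = Image.open("ui.png")

    # Per-character boxes, one "char x1 y1 x2 y2 page" line each.
    print(pytesseract.image_to_boxes(img))

    # Per-word boxes with confidences.
    data = pytesseract.image_to_data(img, output_type=Output.DICT)
    for text, left, top, w, h, conf in zip(
        data["text"], data["left"], data["top"],
        data["width"], data["height"], data["conf"]
    ):
        if text.strip():
            print(f"{text!r} at ({left}, {top}) size {w}x{h} conf {conf}")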
LayoutLM [1] is the closest that I have seen to what you are asking. It is applied to documents, but it essentially takes positional and visual information into account for text extraction - for example, extracting a total from the line that reads TOTAL. I think this would be the best place to start.
1. https://arxiv.org/abs/2204.08387
I too would like to see this! I'm pretty sure it can be accomplished by using screenshots as training data for an object recognition algorithm. As with a lot of machine learning, gathering and tagging the data would be tricky and overfitting to specific design systems could be a big problem.
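A rough sketch of what that could look like, fine-tuning a stock torchvision detector on screenshot annotations. The class list and the screenshot_loader are placeholders, and the whole thing assumes you already have boxes labelled per screenshot:

    import torch
    import torchvision
    from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

    UI_CLASSES = ["background", "button", "text_input", "checkbox", "dropdown", "icon"]

    # Start from a detector pretrained on natural images and swap in a UI head.
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, len(UI_CLASSES))

    optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
    model.train()
    for images, targets in screenshot_loader:  # hypothetical loader of (images, {"boxes", "labels"})
        loss_dict = model(images, targets)     # returns classification + box-regression losses
        loss = sum(loss_dict.values())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()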
But the hierarchy piece, I think, is a bit trickier.
I'm really curious to see what the comments come up with.
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding https://arxiv.org/abs/2210.03347
https://github.com/google-research/pix2struct
Thanks, will take a look at these.
Android voice access blogged about how they use a model to detect and classify buttons:
https://ai.googleblog.com/2021/01/improving-mobile-app-acces...
I'm currently working on this! Right now I'm using traditional computer vision methods (e.g. Canny edge detection), which already work quite well for most websites and applications, but I'm working towards curating a dataset for deep learning. I'm keen to chat!
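For reference, a bare-bones version of that traditional-CV approach: Canny edges plus contours to propose element boxes. The thresholds and minimum sizes are arbitrary and would need tuning per app; "screenshot.png" is a placeholder.

    import cv2

    img = cv2.imread("screenshot.png")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)

    # Dilate so the outlines of buttons/inputs close into connected regions.
    edges = cv2.dilate(edges, cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3)))

    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w > 20 and h > 10:  # drop tiny fragments
            cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

    cv2.imwrite("screenshot_boxes.png", img)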
Let me know if you want a good place to host and iterate on the dataset - I'm working on a version control system optimized for deep learning datasets. It's free to host public datasets right now; I'm just looking for feedback.
It's called Oxen and is super fast at versioning image data.
https://github.com/Oxen-AI/oxen-release
I'm interested too, if anyone knows of anything.