I remember this from way back in the '00s as part of the Web 2.0/semantic web era. I was a big fan of Dan Cederholm's design work (https://simplebits.com - he did the design for the site).
I do like the principle of trying to use semantic HTML, but I don't think that things like this ever got the kind of mass adoption that would give them staying power. Still a nice nostalgia trip to see it here.
Hah, I just had to chime in here; blast from the past for me as well! Dan Cederholm was my guiding light when I was learning CSS and semantic HTML. I used to View Source and study the code behind SimpleBits :)
Agree with the mass adoption part. Back then I was so hopeful for what the web might become... The days before social media blew up.
The IndieWeb people use Microformats[1] extensively for things like Webmention[2] and such. Seems quite neat, though maybe I'd prefer the tags to be data attributes instead of classes (a rough sketch of the class-based markup is below).
[1]: https://indieweb.org/microformats
[2]: https://indieweb.org/Webmention
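To make the class-based approach concrete, here is a minimal sketch of a blog post marked up as an h-entry with a nested h-card for the author. The class names (h-entry, p-name, u-url, dt-published, e-content) come from the microformats2 vocabulary; the surrounding page structure and values are invented for illustration:

    <article class="h-entry">
      <h1 class="p-name">Notes on semantic HTML</h1>
      <!-- the author, marked up as a nested h-card -->
      <p class="p-author h-card">
        <a class="p-name u-url" href="https://example.com">Jane Doe</a>
      </p>
      <time class="dt-published" datetime="2024-01-15">15 January 2024</time>
      <div class="e-content">
        <p>Post body goes here; microformats parsers key off the class names above.</p>
      </div>
    </article>

Webmention receivers and feed readers parse those classes into a JSON structure, which is what makes the reuse of plain class attributes workable in practice.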
W3C was way too optimistic about XML namespaces leading to the creation of infinitely extensible vocabularies (XHTML2 was DOA, and even XHTML1 couldn't break past the tag-soup-compatible minimum).
This was the alternative – simpler, focused, fully IE-compatible.
W3C tried a proper Semantic Web again with RDF, RDFa, and JSON-LD. HTML5 tried Microdata, a compromise between the extensibility of RDF and the simplicity of Microformats, but nothing really took off.
Eventually HTML5 gave up on it and took the position that invisible metadata should be avoided. Page authors (outside the bunch who have Valid XHTML buttons on their pages) tend to implement and maintain only the minimum needed for human visitors, so on the Web invisible markup has a systemic disadvantage. It rarely exists at all, and when it does it can be invalid, out of date, or, most often, SEO spam.
Schema.org metadata (using Microdata, RDFa, or JSON-LD) is actually quite common; search engines rely on it for "rich" SERP features. With LLMs able to sanity-check the metadata for basic consistency with the page contents, SEO spam will ultimately be at a disadvantage: it becomes easier and cheaper to penalize/ignore spam while still rewarding sites that include accurate data (a minimal example of the markup is sketched below).
The schema.org vocab is being actively maintained; the latest major version came out last March, with the latest minor release in September.
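For instance, the JSON-LD flavour usually ends up as a small script block in the page. A minimal sketch for an article (the property names come from the schema.org Article type; the values here are made up):

    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "Article",
      "headline": "Microformats, then and now",
      "author": { "@type": "Person", "name": "Jane Doe" },
      "datePublished": "2024-01-15"
    }
    </script>

Search engines read this alongside the visible page and can use it for rich result features, provided it stays consistent with the visible content.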
I could never understand why microformats were so popular among fediverse people when XML is right there. Seems like unnecessary fragmentation.
Google.
For a while, Google gave you good boy points for including microformats, and they still offer tests and validators [0] to tell you what the crawlers get out of your page. Supposedly microformats would not just give you a better SEO ranking but also help Google connect people (like the fediverse does) to accounts, so that you could surface things relevant to a person by searching for the person.
[0] https://developers.google.com/search/docs/appearance/structu...
If you go back to the time when they were invented, many semantic elements, like article or footer, didn't exist in HTML. People tried to find conventions, and efforts like microformats were an attempt to standardize those when the best solution (updating the HTML standard) was difficult. In terms of timing, it's worth looking at the arc of Firefox, WHATWG, the advent of Safari and Chrome, and the use of tables for layout.
Google was a driver in practice. Accessibility and better web experiences were important to those involved. The reality was that people interested in this area were at the bleeding edge: many people still held onto tables for site layout, and Flash was still a default option for some in the period when microformats emerged.
ARIA and accessibility microformats were separate from the ones the fediverse was excited about (and thus the GP was talking about): things like hCard for identifying people, places, and things. Accessibility is useful to many people, but hCard et al. were probably never really useful to anybody other than Google. Still, many of us back then were obsessive-compulsive about using them in the hope that one day computers would be better able to understand authoritative information about identities and the relationships between them. I still have microdata on my personal web page.
If there’s one thing semantic web folks like, it’s fragmentation.
I don't think anyone desires fragmentation. It's just the reality of the space. People were exploring options but didn't have support from the key stakeholders who were the browser makers (IE was at its peak) and Google. Firefox and WHATWG advanced some of the ideas in time.
People always mention RDF when the semantic web comes up. It's really important to understand where the W3C was in the early 2000s and that RDF was driven by those with an academic bent. No one working with microformats was interested in anything beyond the RDF basics, because the rest was too impractical for web devs to use. Part of this was complexity (OWL, anyone?), but the main part was browser and tool support.
> People always mention RDF when the semantic web comes up.
There's nothing wrong with RDF itself; the modern plain-text and JSON serializations are very simple and elegant. Even things like OWL are being reworked now with efforts like SHACL and ShEx (see e.g. https://arxiv.org/abs/2108.06096 for a description of how these relate to the more logical/formal, OWL-centered point of view).
This is dead. If you need something similar, look at RDFa instead.
In an era of LLMs, do you think RDFa still has a place?
Whatever relevance it has, it's more than microformats.
I personally think semweb-related technologies could play a significant and productive role in synthetic data generation, but that's a whole different conversation that is beyond the current era of LLMs.
You got downvoted, but I also think it's largely a pointless "technology" today. LLMs have been able to find stuff and extract meaning without needing any kind of specialized tags or schema, so it really is a crutch at best.
And there is the fact that people/orgs lie or maintain things poorly, so you cannot trust any open tag/schema any more than the data it is supposed to describe. You end up in a conundrum where your data may be sketchy, but on top of that the schema/tags may be even worse. At the end of the day, if you have to trust one thing, it's got to be the data that is readily visible to everyone, because it's more likely to be accurate, and that's actually what's relevant.
To top all of that, tagging your stuff with RDFa just makes it easier for Google and others to parse your pages and then exploit the data/information without ever sending anyone to your site. If you are Wikipedia, that's mostly fine, but almost anyone else would benefit from receiving the traffic for a chance at a conversion/transaction.
All those metadata things are really an idealized academic endeavor; they may make sense for a few noncommercial projects, but the reality is that most websites need to find a way to pay the bills, and making it easier for others to exploit your work is largely self-defeating.
And yes, LLMs didn't even need that to vacuum up most of the web and still derive meaning, so at the end of the day it's mostly a waste of time...
You can use JSON-LD to cleanly embed RDF into LLM-friendly JSON. Not sure if there's a direct Markdown equivalent, but the Turtle plain-text syntax for RDF is very simple, and modern LLMs should be able to handle it just fine.
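For a sense of how compact these serializations are, here's one statement in Turtle (the subject IRI and the foaf:name property are just placeholder choices for the example):

    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    # one triple stating a person's name
    <https://example.com/#me> foaf:name "Jane Doe" .

and the same triple as JSON-LD:

    {
      "@context": { "foaf": "http://xmlns.com/foaf/0.1/" },
      "@id": "https://example.com/#me",
      "foaf:name": "Jane Doe"
    }

Both parse to the same RDF graph, so whichever shape is friendlier to the consumer can be used.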
The purpose of RDF is to enable reasoning ("I'm sharing this with you, and I'm also sharing how to reason about this thing I shared with you, so we can all reason about it the same way").
If you show me an LLM that can take in serialized RDF and perform reasoning on it, I will be surprised. Once the LLM takes in the RDF serialization, it is a dead end for that knowledge: in principle, you can't rely on anything the LLM does with it.
In a world of LLMs, it makes much more sense to put the semweb technologies alongside the training step instead. You create ontologies and generate text from triples, then feed the generated text as training for the model. This is good because you can tweak and transform the triples-to-text mechanism in all sorts of ways (you can tune the data while retaining its meaning).
It doesn't make much sense to do it now, but if (or when?) training data becomes scarce, converting triples to text might be a viable approach for synthetic data, much more stable than having models themselves generate text.
> If you show me an LLM that can take in serialized RDF and perform reasoning on it, I will be surprised
It's easy: you just ask the LLM to convert your question to a SPARQL query, then you sanity-check it and run it on your dataset. The RDF input step is just so the LLM knows what your schema looks like in the first place.
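As a rough sketch of the workflow (not verified against a live endpoint, and an LLM may of course produce something different): a question like "which authors have the most books in the dataset?" could come back as a query along these lines, using DBpedia-style dbo: terms:

    PREFIX dbo: <http://dbpedia.org/ontology/>
    # count books per author, most prolific first
    SELECT ?author (COUNT(?book) AS ?books)
    WHERE {
      ?book a dbo:Book ;
            dbo:author ?author .
    }
    GROUP BY ?author
    ORDER BY DESC(?books)
    LIMIT 10

The human review step is the point: the query is small enough to eyeball, and the reasoning then happens in the SPARQL engine rather than inside the model.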
I don't understand how this can work.
Can you make me a quick demonstration using a publicly available model and the dbpedia SPARQL endpoint?
https://dbpedia.org/sparql
Only if you think the transition to a zero-click Internet is happening too slowly.
Is zero click a real problem for actual site owners, or is it just affecting SEO job security?
I imagine it is if the content you paid to create and host is only monetised by a search engine.
RDFa/Microdata is more interesting for people who sell objects instead of content. E.g. marking up that a page is about a kitchen cabinet that is 60 cm wide and white might lead to more sales in the long run, as people who are looking for 60 cm wide cabinets might get to your page instead of one about a 36 inch wide one.
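A minimal sketch of what that could look like with schema.org Microdata; the itemprop names (name, color, width) and the QuantitativeValue type come from schema.org, while the page structure and values are invented, and "CMT" is assumed here to be the UN/CEFACT unit code for centimetres:

    <div itemscope itemtype="https://schema.org/Product">
      <h1 itemprop="name">Kitchen cabinet, 60 cm, white</h1>
      <meta itemprop="color" content="White" />
      <!-- width expressed as a QuantitativeValue so the unit is explicit -->
      <div itemprop="width" itemscope itemtype="https://schema.org/QuantitativeValue">
        <meta itemprop="value" content="60" />
        <meta itemprop="unitCode" content="CMT" />
      </div>
      <p>A white kitchen cabinet, 60 cm wide.</p>
    </div>

The machine-readable width sits right next to the visible text, so it is at least cheap to keep the two consistent.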
That's an oddly specific search, and even Google doesn't have any kind of tool for queries like that. What is more likely is that you'll find companies specialized in selling cabinets, and they'll have a browser/search to narrow the choice by given dimensions. There is not a lot of benefit for them in exposing all that data to various search engines; best case scenario, they end up competing with a bunch of other brands on a generic search engine page where they have absolutely no control over how things are presented, etc...
And even before thinking about that, you can simply put the dimensions in a description, which some do (like Ikea), and Google is definitely able to pick up on that; no RDFa was ever needed. As far as I can tell, LLMs can work that out just fine as well.
The problem with the metadata discussion is that if the metadata is actually useful, there is no reason it isn't useful to humans as well; so instead of trying to make the human work for the machine, it is much better to make the machine understand humans.
That doesn't sound relevant to zero click.
It sounds harsh, but maybe that was never a good business model in the first place. And I fully realize that this includes most news sites. In order for the web to grow, I think we need to figure out some way to get past banner ads as the only way to make money. It's been 30 years.
It would be good to know why it was never a good business model.
It is VERY real, sadly
That's a terribly naive question.
Great idea, and I always used the markup, but is it used anywhere outside of some extremely niche services?