swyx 4 hours ago

somebody once said that we are mining "low-background tokens" the way we mined low-background (radiation) steel post-WW2, and i couldn't shake the concept out of my head

(wrote it up in https://www.latent.space/i/139368545/the-concept-of-low-back... - but ironically, repeating something somebody else said online is kinda what i'm willingly participating in here, and it's unclear why human-origin tokens should be that much higher-signal than ai-origin ones)

  • mwidell an hour ago

    Low-background steel is no longer necessary.

    "...began to fall in 1963, when the Partial Nuclear Test Ban Treaty was enacted, and by 2008 it had decreased to only 0.005 mSv/yr above natural levels. This has made special low-background steel no longer necessary for most radiation-sensitive uses, as new steel now has a low enough radioactive signature."

    https://en.wikipedia.org/wiki/Low-background_steel

    • juvoly an hour ago

      Interesting. I guess that, analogously, we might find that X years after some future ban on AI content production, we could similarly start ignoring the low-background token issue?

      • actionfromafar 20 minutes ago

        We detonated a rather low number of atmospheric bombs, while we are carpet-bombing the internet every day with AI marketing copy.

  • jrjfjgkrj 3 hours ago

    every human generation built upon the slop of the previous one

    but we appreciated that; we called it "standing on the shoulders of giants"

    • bigiain 2 hours ago

      > we called it "standing on the shoulders of giants"

      We do not see nearly so far though.

      Because these days we are standing on the shoulders of giants that have been put into a blender and ground down into a slippery pink paste and levelled out to a statistically typical 7.3mm high layer of goo.

      • _kb an hour ago

        The secret is you then have to heat up that goo. When the temperature gets high enough things get interesting again.

    • groestl an hour ago

      We do have two optimization mechanisms, though, which reduce noise with respect to the optimization functions: evolution and science. They are implicitly part of "standing on the shoulders of giants": you pick the giant to stand on (or it is picked for you).

      Whether the optimization functions align with human survival, and thus whether our whole existence is not itself slop, we're about to find out.

    • rebuilder 2 hours ago

      That's because the things we built on weren't slop

    • kgwgk 2 hours ago

      Nothing conveys better the idea of a solid foundation to build upon than the word ‘slop’.

    • hoppp 2 hours ago

      You can't build on slop because slop is a slippery slope

    • walrusted 2 hours ago

      the only structure you can build with slop is a burial mound

    • teiferer 2 hours ago

      Because the pyramids, the theory of general relativity and the Linux kernel are all totally comparable to ChatGPT output. /s

      Why is anybody still surprised that the AI bubble made it that big?

      • jrjfjgkrj an hour ago

        for every theory of relativity there is the religious nonsense and superstition of the medieval ages or of today

        • JumpCrisscross an hour ago

          > for every theory of relativity there is the religious nonsense and superstition of the medieval ages or of today

          If Einstein came up with relativity by standing on "the religious non-sense and superstitions of the medieval ages," you'd have a point.

tkgally 4 hours ago

Somewhat related, the leaderboard of em-dash users on HN before ChatGPT:

https://www.gally.net/miscellaneous/hn-em-dash-user-leaderbo...

  • a5c11 an hour ago

    Apparently, it's not only the em-dash that's distinctive. I went through the leader's comments and spotted that he also uses the curly apostrophe "’" instead of the straight one.

    • baiwl 27 minutes ago

      Just to be clear this is done automatically by macOS or iOS browsers when configured properly.

    • kuschku 34 minutes ago

      I (~100 in the leaderboard, regardless of how you sort) also frequently use ’ (unicode apostrophe) instead of ' :D

  • maplethorpe 4 hours ago

    They should include users who used a double hyphen, too -- not everyone has easy access to em dashes.

    • gblargg 3 hours ago

      Does AI use double hyphens? I thought the point was to find people who used proper em dashes but weren't AI.

      • jader201 2 hours ago

        Anytime I do this — and I did it long before AI did — they are always em dashes, because iOS/macOS translates double dashes to em dashes.

        I think there may be a way to disable this, but I don’t care enough to bother.

        If people want to think my posts are AI generated, oh well.

        • JumpCrisscross an hour ago

          > Anytime I do this — and I did it long before AI did — they are always em dashes

          It depends on whether you put spaces before and after the dashes--which, to be clear, are meant to be there--or whether you don't.

          • oniony an hour ago

            I cannot remember ever reading a book where there was a space around the dashes.

            • kuschku 30 minutes ago

              That depends on the language — German puts spaces around the —, whereas English afaik usually doesn't.

              (Similarly, French puts spaces before and after ? and !, while English and German only put spaces afterwards.)

            • LoganDark 28 minutes ago

              Technically, there are supposed to be hair spaces around the dashes, not regular spaces. They're small enough to be sometimes confused for kerning.

          • fragmede 43 minutes ago

            What, no love for our friend the en-dash?

            - vs – vs —

            • chickensong 25 minutes ago

              I once spent a day debugging some data from an English doc, written by someone in Japan, that had been pasted into a system and caused problems. It turned out to be an en-dash issue that was basically invisible to the eye. No love for the en-dash!

        • teiferer 2 hours ago

          There is also the difference in whether you put spaces around em-dashes.

    • bigiain an hour ago

      That would false positive me. I have used double dashes to delimit quote attribution for decades.

      Like this:

      "You can't believe everything you read on the internet." -- Abraham Lincoln, personal correspondence, 1863

    • venturecruelty 3 hours ago

      Oof, I feel like you'll accidentally capture a lot of getopt_long() fans. ;)

      • Kinrany 3 hours ago

        Excluding occurrences with asymmetrical whitespace around them might be enough.
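
        A rough sketch of what that heuristic could look like, in Python (purely illustrative; the function name and rules here are made up, not how the leaderboard actually works):

          import re

          def dashlike_double_hyphens(text: str) -> int:
              # Count "--" occurrences whose surrounding whitespace is symmetric
              # (a space on both sides, or on neither side), so that getopt-style
              # " --long-option" uses are excluded.
              count = 0
              for m in re.finditer(r"--", text):
                  before = text[m.start() - 1 : m.start()]
                  after = text[m.end() : m.end() + 1]
                  space_before = before == "" or before.isspace()
                  space_after = after == "" or after.isspace()
                  if space_before == space_after:
                      count += 1
              return count

          print(dashlike_double_hyphens("pass --verbose to the tool"))  # 0
          print(dashlike_double_hyphens("it works -- mostly"))          # 1
          print(dashlike_double_hyphens("it works--mostly"))            # 1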

permo-w 3 hours ago

besides training future models, is this really such a big deal? most of the AI-generated text content is just replacing content-farm SEO spam anyway. the same stuff that any half-aware person wouldn't have read in the past is now slightly better written, using more em dashes and more instances of the word "delve". if you're consistently being caught out by this stuff then you likely need to improve your search hygiene; nothing so drastic as this is needed

the only place I've ever had any issue with AI content is r/chess, where people love to ask ChatGPT a question and then post the answer as if they wrote it, half the time seemingly innocently, which (call me racist, but) I suspect is mostly due to the influence of the large and young Indian contingent. otherwise I really don't understand where the issue lies: follow the exact same rules you do for avoiding SEO spam and you will be fine

  • never_inline 3 minutes ago

    A colleague sent me a confident, ChatGPT-formatted bug report.

    It misidentified what the actual bug was.

    But the tone was so confident, and he replied to my later messages using ChatGPT itself, which insisted I was wrong.

    I don't like this future.

  • Cadwhisker 3 hours ago

    In the past, I'd find one wrong answer and I could easily spot the copies. Now there's a dozen different sites with the same wrong answer, just with better formatting and nicer text.

    • finaard 2 hours ago

      The trick is to only search for topics where there are no answers, or only one answer leading to that blog post you wrote 10 years ago and forgot about.

  • darkwater 36 minutes ago

    > besides for training future models, is this really such a big deal? most of the AI-gened text content is just replacing content-farm SEO-spam anyway.

    Yes, it is, because of the other side of the coin. If you were writing human-generated, curated content, previously you would just do it in your small patch of the Internet, and search engines (Google...) would probably pick it up anyway because it was good-quality content. You just didn't care about the SEO-driven shit anyway. Now your nicely hand-written content is going to be fed into LLM training, and it's going to be used - whether you want it or not - in the next generation of AI slop content.

  • pajamasam 3 hours ago

    SEO spam was often at least somewhat factual and not completely generated garbage. Recipe sites, for example, usually have a button that lets you skip the SEO stuff and get to the actual recipe.

    Also, the AI slop is covering almost every sentence or phrase you can think of to search. Before, if I used more niche search phrases and exact searches, I was pretty much guaranteed to get specific results. Now, I have to wade through pages and pages of nonsense.

  • zwnow an hour ago

    Yes, it is a big deal. I can't find new artists without fearing that their art is AI-generated; same for books and music. I also can't post my stuff to the internet anymore because I know it's going to be fed into LLM training data. The internet is mostly dead to me, and thankfully I've lost almost all interest in being on my computer as much as I used to be.

  • system2 3 hours ago

    Yes indeed, it is a problem. Now the good old sites have turned into AI-slop sites because they can't fight the spammers by writing slowly with humans.

tobr 3 hours ago

For images, https://same.energy is a nice option that, having been abandoned a few years ago but still functioning, seems to naturally not have crawled any AI images. And it’s all around a great product.

zkmon an hour ago

Most college courses and school books haven't changed in decades. Some reputed colleges keep courses on Pascal and Fortran instead of Python or Java, just because changing might affect their reputation for being classical or pure, or simply to match the style of their campus buildings.

anticensor 4 hours ago

You should call it Predecember, referring to the eternal December.

  • unfunco 4 hours ago

    September?

    • littlestymaar 3 hours ago

      ChatGPT was released exactly 3 years ago (on the 30th of November) so December it is in this context.

      • permo-w 3 hours ago

        surely that would be eternal November then

        • littlestymaar 3 hours ago

          No, being released on Nov 30th means November was still before the slop era.

          • retsibsi 2 hours ago

            In the end the analogy doesn't really work, because 'eternal September' referred to what used to be a regular, temporary thing (an influx of noobs disrupting the online culture, before eventually leaving or assimilating) becoming the new normal. 'Eternal {month associated with ChatGPT}' doesn't fit because LLM-generated content was never a periodic phenomenon.

          • permo-w 2 hours ago

            to be honest, GPT-3, which was pretty solid and extremely capable of producing webslop, had been out for a good while before ChatGPT, and even GPT-2 had been used for blogslop years before that. maybe ChatGPT was the beginning of the public becoming aware of it, but it was going on well beforehand. and, as the sibling commenter points out, the analogy doesn't quite fit structurally either

          • AlecSchueler 2 hours ago

            Yes, and this site is for everything before the slop era, hence eternal November.

GaryBluto 4 hours ago

Why use this when you can use the before: syntax on most search engines?

  • aDyslecticCrow 11 minutes ago

    Doesn't actually do anything anymore in Google or Bing.

defraudbah an hour ago

ChatGPT also only returns content created before the ChatGPT release, which is why I still have to google, damn it!

  • fragmede an hour ago

    Click the globe icon below the input box to enable web searching by ChatGPT.

RomanPushkin 2 hours ago

For that purpose I do not update my Ruby book on LeanPub. I just know that one day people are gonna read it more, because human-written content will be gold.

1gn15 4 hours ago

Does this filter out traditional SEO blogfarms?

  • JKCalhoun 4 hours ago

    Yeah, might prefer AI-slop to marketing-slop.

    • al_borland 4 hours ago

      They are the same. I was looking for something and tried AI. It gave me a list of stuff. When I asked for its sources, it linked me to some SEO/Amazon affiliate slop.

      All AI is doing is making it harder to know what is good information and what is slop, because it obscures the source, or people ignore the source links.

      • venturecruelty 3 hours ago

        I've started just going to more things in person, asking friends for recommendations, and reading more books (should've been doing all of these anyway). There are some niche communities online I still like, and the fediverse is really neat, but I'm not sure we can stem the Great Pacific Garbage Patch-levels of slop, at this point. It's really sad. The web, as we know and love it, is well and truly dead.

phplovesong an hour ago

The slop is getting worse: there is so much LLM-generated shit online that new models are now getting trained on the slop. Slop training slop training slop. We have gone full circle in just a matter of a few years.

progman32 4 hours ago

Not affiliated, but I've been using kagi's date range filter to similar effect. The difference in results for car maintenance subjects is astounding (and slightly infuriating).

voiper1 2 hours ago

Of course my first thought was: Let's use this as a tool for AI searches (when I don't need recent news).

cryptozeus 2 hours ago

Technically you can ask ChatGPT to return the same result by asking it to filter by year.

ETH_start an hour ago

I'm grateful that I published a large body of content pre-ChatGPT so that I have proof that I'm not completely inarticulate without AI.

pknerd 2 hours ago

Something being generated by humans does not mean it is high quality.

  • Krssst 2 hours ago

    Yes, but AI-generated is always low quality so it makes sense to filter it out.

    • IshKebab 2 hours ago

      I wouldn't say always... Especially because you probably only noticed the bad slop. Usually it is crap though.

  • a5c11 an hour ago

    At least when reading human-made material you can spot the author's uncertainty on some topics. Usually, when someone doesn't have knowledge of something, they don't try to describe it. AI, however, will try to convince you that pigs can fly.

johng 5 hours ago

I don't know how this works under the hood but it seems like no matter how it works, it could be gamed quite easily.

  • qwertygnu 4 hours ago

    True, but there are probably many ways to do this, and unless AI content starts falsifying tons of its metadata (which I'm sure would have other consequences), there's definitely a way.

    Plus, other sites that link to the content could also give away its date of creation, which is out of the AI content's control.

    • layman51 3 hours ago

      I have heard of a forum (I believe it was Physics Forums) which was very popular in the older days of the internet where some of the older posts were actually edited so that they were completely rewritten with new content. I forget what the reasoning behind it was, but it did feel shady and unethical. If I remember correctly, the impetus behind it was that the website probably went under new ownership and the new owners felt that it was okay to take over the accounts of people who hadn't logged on in several years and to completely rewrite the content of their posts.

      I believe I learned about it through HN, and it was this blog post: https://hallofdreams.org/posts/physicsforums/

      It kind of reminds me of why some people really covet older accounts when they are trying to do a social engineering attack.

      • joshuaissac 37 minutes ago

        > website probably went under new ownership

        According to the article, it was the founder himself who was doing this.

  • cryzinger 4 hours ago

    If it's just using Google search "before <x date>" filtering I don't think there's a way to game it... but I guess that depends on whether Google uses the date that it indexed a page versus the date that a page itself declares.

    • madars 4 hours ago

      The date displayed in Google Search results is often the self-described date from the document itself. Take a look at this "FOIA + before Jan 1, 1990" search: https://www.google.com/search?q=foia&tbs=cdr:1,cd_max:1/1/19...

      None of these documents were actually published on the web by then, including a Watergate PDF bearing a date of Nov 21, 1974 - almost 20 years before the PDF format was released. Of course, the WWW itself only started in 1991.

      Google Search's date filter is useful for finding documents about historical topics, but unreliable for proving when information actually became publicly available online.
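
      For reference, a rough sketch of how such a date-capped query URL can be built (the tbs=cdr / cd_max parameters are the ones visible in the URL above; the helper itself is only an illustration):

        from urllib.parse import urlencode

        def date_capped_search_url(query: str, max_date: str) -> str:
            # max_date uses the M/D/YYYY form seen in the cd_max parameter above,
            # e.g. "11/30/2022" for the day ChatGPT was released.
            params = {"q": query, "tbs": f"cdr:1,cd_max:{max_date}"}
            return "https://www.google.com/search?" + urlencode(params)

        print(date_capped_search_url("foia", "1/1/1990"))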

      • littlestymaar 3 hours ago

        Are you sure it works the same way for documents that Google indexed at the time of publication? (Because obviously for things that existed before Google, they had to accept the publication date at face value).

        • madars 2 hours ago

          Yes, it works the same way even for content Google indexed at publication time. For example, here are chatgpt.com links that Google displays as being from 2010-2020, a period when Google existed but ChatGPT did not:

          https://www.google.com/search?q=site%3Achatgpt.com&tbs=cdr%3...

          So it looks like Google uses inferred dates over its own indexing timestamps, even for recently crawled pages from domains that didn't exist during the claimed date range.

  • CGamesPlay 4 hours ago

    "Gamed quite easily" seems like a stretch, given that the target is definitionally not moving. The search engine is fundamentally searching an immutable dataset that "just" needs to be cleaned.

    • johng 2 hours ago

      How? They have an index from a previous date and nothing new has been added since that date? A whole copy of the internet? I don't think so... I'm guessing, like others, that it's based on the date the user/website/blog lists in the post. Which they can change at any time.

      • fragmede 2 hours ago

        Yes, they do. It's called Common Crawl, and it's available from your chosen hyperscaler vendor.
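
        A rough sketch of what pulling records from one of those pre-ChatGPT snapshots might look like (assuming the Common Crawl CDX index API; the crawl ID and fields below are illustrative, not a vetted recipe):

          import json
          from urllib.parse import urlencode
          from urllib.request import urlopen

          # Any crawl snapshot dated before ChatGPT's release would do here.
          INDEX = "https://index.commoncrawl.org/CC-MAIN-2022-21-index"
          params = urlencode({"url": "example.com/*", "output": "json", "limit": "5"})

          with urlopen(f"{INDEX}?{params}") as resp:
              for line in resp:
                  record = json.loads(line)
                  # The timestamp here is the crawl-time capture date, independent
                  # of whatever date the page claims for itself.
                  print(record.get("timestamp"), record.get("url"))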

hekkle 4 hours ago

[flagged]

  • VoidWhisperer 4 hours ago

    Besides this being spam, the linked leaderboard is pre-ChatGPT; it doesn't care about comments made now.

k_roy 4 hours ago

You know what's almost worse than AI-generated slop?

Every corner of the Internet now screaming about AI-generated slop whenever a single pixel doesn't line up.

It's just another generation of technology. And however little people might like it, it is here to stay. The same thing happened with airbrushing, and Photoshop, and the Internet in general.

  • maplethorpe 3 hours ago

    Is it really here to stay? If the wheels fell off the investment train and ChatGPT etc. disappeared tomorrow, how many people would be running inference locally? I suspect most people either wouldn't meet the hardware requirements or would be too frustrated with the slow token generation to bother. My mom certainly wouldn't be talking to it anymore.

    Remember that a year or two ago, people were saying something similar about NFTs — that they were the future of sharing content online and we should all get used to it. Now, they still might exist, it's true, but they're much less pervasive and annoying than they once were.

    • Daz912 2 hours ago

      >that they were the future of sharing content online

      nobody was saying that

      • sethops1 an hour ago

        People right here on HN were adamant my next house would be purchased using an NFT. And similar absurd claims about blockchain before that.

    • fragmede 2 hours ago

      Maybe you don't love your mom enough to do this, but if ChatGPT disappeared tomorrow and it was something she really used and loved, I wouldn't think twice before buying her a rig powerful enough to run a quantized downloadable model on, though I'm not current on which model or software would be best for her purposes. I get that your relationship with your mother, or your financial situation, might be different though.

  • rockskon 4 hours ago

    "You know what's almost worse than something bad? People complaining about something bad."

    • k_roy 4 hours ago

      Shrug. Sure.

      Point still stands. It’s not going anywhere. And the literal hate and pure vitriol I’ve seen towards people on social media, even when they say “oh yeah; this is AI”, is unbelievable.

      So many online groups have just become toxic shitholes because someone posts something AI-generated once or twice a week.

      • venturecruelty 3 hours ago

        US GDP growth for the last few quarters has been propped up by GPU vendors and one singular chatbot company, all betting that they can make a trillion dollars on $20-per-month "it's not just X, it's Y" Markov chain generators. We have six to twelve more months of this before the first investor says "wait a minute, we're not making enough money", and the house of cards comes tumbling down.

        Also, maybe consider why people are upset about being consistently and sneakily lied to about whether or not an actual human wrote something. What's more likely: that everyone who's angry is wrong, or that you're misunderstanding why they're upset?

        • permo-w 2 hours ago

          I feel like this is the kind of dodgy take that'll be dispelled by half an hour's concerted use of the thing you're talking about

          short of massive technological regression, there's literally never going to be a situation where the use of what amounts to a second brain with access to all the world's public information is not going to be incredibly marketable

          I dare you to try building a project with Cursor or a better cousin and then come back and repeat this comment

          >What's more likely: that everyone who's angry is wrong, or that you're misunderstanding why they're upset?

          your patronising tone aside, GP didn't say everyone was wrong, did he? if he didn't, which he didn't, then it's a completely useless and fallacious rhetorical question. what he actually said was that it's very common, and, factually, it is. I can't count the number of these types of Instagram comments I've seen on obviously real videos. most people have next to no understanding of AI and its limitations and typical features, and "surprising visual occurrence in video" or "article with correct grammar and punctuation" is enough for them to think they've figured something out

          • kuschku 17 minutes ago

            > I dare you to try building a project with Cursor or a better cousin and then come back and repeat this comment

            I always try every new technology, to understand how it works, and expand my perspective. I've written a few simple websites with Cursor (one mistake and it wiped everything, and I could never get it to produce any acceptable result again), tried writing the script for a YouTube video with ChatGPT and Claude (full of hallucinations, which – after a few rewrites – led to us writing a video about hallucinations), generated subtitles with Whisper (with every single sentence having at least some mistake) and finally used Suno and ChatGPT to generate some songs and images (both of which were massively improved once I just made them myself).

            Whether Android apps or websites, scripts, songs, or memes, so far AI is significantly worse at internet research and creation than a human. And cleaning up the work the AI did always ended up taking longer than just doing it myself from scratch. AI certainly makes you feel more productive, and it seems like you're getting things done faster, even though you're not.

        • fragmede 2 hours ago

          Fascinatingly, as we found out from this HN post, Markov chains don't work when scaled up, for technical reasons, so that whole transformers thing is actually necessary for the current generation of AI.

          https://news.ycombinator.com/item?id=45958004

      • littlestymaar 3 hours ago

        This kind of pressure is good, actually, because it helps fight against “lazy AI use” while letting people use AI in addition to their own brain.

        And that's a good thing, because as much as I like LLMs as a technology, I really don't want people blindly copy-pasting stuff from them without thinking.

      • rockskon 4 hours ago

        What isn't going anywhere? You're kidding yourself if you think every single place AI is used will withstand the test of time. You're also kidding yourself if you think consumer sentiment will play no part in determining which uses of AI will eventually die off.

        I don't think anyone seriously believes the technology will categorically stop being used anytime soon. But then again, we still keep using tech that's 50+ years old as it is.