"We are releasing all of our models between
125M and 30B parameters, and will provide full
research access to OPT-175B upon request.
Access will be granted to academic researchers; those
affiliated with organizations in government, civil
society, and academia; and those in industry research laboratories."
I don't like "available on request". I just want to download it and see if I can get it to run and mess around with it a bit. Why do I have to request anything? And I'm not an academic or researcher, so will they accept my random request?
I'm also curious to know what the minimum requirements are to get this to run in inference mode.
> Why do I have to request anything? And I'm not an academic or researcher, so will they accept my random request?
Just a guess: you will have to contractually agree to some things in order to get the model; at a minimum, agree not to redistribute it, but probably also agree not to use it commercially. That means whatever commercial advantage there is to having a model this size isn't affected, which makes the offer lower stakes for Facebook. And then the point of "academics and researchers" is to be a proxy for "people we trust to keep their promise, because they have a clear use case for non-commercial access to the model and a reputation to protect." They can also sue after the fact, but they'd rather not have to.
Not saying any of this is good or bad, just an educated guess about why it works the way it does.
> Why do I have to request anything?
I'm guessing it could be one or a mix of these:
They want to build a database of people interested in this and vetted by some other organization as worth hiring. Just more people to feed to their recruiters.
To see the output of the work. While academics will credit their data sources, seeing that "XXX from YYY" requested access, and then later that "YYY releases a product that could be based on the model," is probably pretty valuable versus wondering which ML model it was based on.
A veneer of responsible use, maybe required by their privacy policy or just to avoid backlash about "giving people's data away".
I don’t know whether this is true and have no way of knowing this with any degree of certainty, but to me it seems unlikely that Mark had anything to do with this stipulation (requesting access). Although it’s not unimaginable.
That is dumb when you consider that this thing is likely going to leak anyway. It's inevitable, and when it does happen, it will just end up in the hands of criminals/scammers and not the general public.
It's super easy to watermark weights for ML models.
Just add a random 0.01 to a random weight anywhere in the network. It will have very little impact on the results, but will mean you can identify who leaked the weights.
It should be easy enough to make that sort of signature very difficult to trace by simply adding a bunch of small noise to the network overall, or even simply training for a few iterations.
The person leaking it might not do so intentionally. Their computer might be compromised. Are we going to punish people for not being cybersec experts?
OK, this is a fun game. I think your counterattack assumes I'm picking these million weights uniformly randomly among the 175 billion. I modify my original answer: s/a million/half the weights in a deterministic subset of 2 million weights/
Select the deterministic subset by just hashing some identifier for each weight.
For any reasonable number of copies, the pattern is effectively unique: a leaked copy will share a large number of bits flipped in the same direction within this subset with the copy it was derived from, and with no other.
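For what it's worth, here's a toy sketch of that scheme (hash-selected subset, per-recipient bit flips); the helper names and the choice of flipping the lowest fp16 bit are purely illustrative assumptions, not anything Meta has described doing:

    import hashlib
    import numpy as np

    def h(*parts):
        # Deterministic hash of arbitrary identifiers -> big integer
        return int(hashlib.sha256("|".join(map(str, parts)).encode()).hexdigest(), 16)

    def fingerprint(weights_fp16, recipient_id, subset_size=64):
        # Flip the least significant bit of a recipient-specific half of a
        # fixed, hash-selected subset of weights (toy scale for illustration).
        marked = weights_fp16.copy()
        bits = marked.view(np.uint16)                    # raw fp16 bit patterns
        subset = sorted(range(bits.size), key=lambda i: h("subset", i))[:subset_size]
        for i in subset:
            if h("copy", recipient_id, i) % 2:           # recipient-specific half
                bits[i] ^= 1                             # flip the lowest mantissa bit
        return marked

    original = np.random.randn(4096).astype(np.float16)
    alice, bob = fingerprint(original, "alice"), fingerprint(original, "bob")
    # A leaked copy's flip pattern over the subset matches exactly one recipient.
    print((alice.view(np.uint16) != original.view(np.uint16)).sum())

The noise/continued-training counterargument above still applies, of course: anything that perturbs low-order bits at scale washes this out.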
If it is like many other models, part of the reason would just be to reduce their bandwidth costs. The models can be huge, and they want to limit those who just want to download it on a whim, so they don't rack up $10k+ in bandwidth charges, as has happened to many others who hosted big models out on S3 or something.
If only there was a way to distribute large files in a peer-to-peer manner, thus reducing the load on facebook's servers to effectively nothing. That would likely result in a torrent of bits being shared without any issues!
I expect they will release the models fully, perhaps even under nonrestrictive licenses. Most researchers aren't too happy about that sort of restriction, and would know that it vitiates a lot of the value of OPT. They look like they are doing the same sort of thing OA did with GPT-2: a staggered release. (This also has the benefit of not needing all the legal & PR approvals done upfront all at once; and there can be a lot of paperwork there.)
They could, and there might be a torrent in the future - but torrents lose tracking info. I'm sure the researchers want to know who is downloading their models even if they don't care who it is.
> - The 175B parameter model is so large that it doesn't play nice with GitHub or something along those lines
There is no frickin' way that the difficulty or cost of distributing the model is a factor, even if it were several dozen terabytes in size (and it is probably somewhere around 1.5 terabytes). Not for Meta, and not when CDNs and torrents are available as options.
If they are gatekeeping access to the model, there is no need to ascribe it to a side effect of something else. Their intent IS to limit access to the full model. I'm not really sure why they are bothering, unless they're assuming that unsavory actors won't be motivated enough to pay some grad student for a copy.
I suppose they may be adding a fingerprint or watermark of some sort to trace illicit copies back to the source if they're serious about limiting redistribution, but those can usually be found and removed if you have copies from two or more different sources.
"It wants to be free" is a ridiculous statement, considering that after full two years (GPT-3 was published in May 2020), there is no public release of anything comparable.
In May 2020, was your estimate of time to public release of anything comparable shorter or longer than two years? I bet it was shorter.
True; they are free to do as they see fit. But how about not leeching on the word “open” in that case? DeepMind is essentially the NSA (or Apple), OpenAI is paid-for cloud services with paper-based marketing, and FAIR may be the best of the bunch, but it still annoys the hell out of me that they push code with non-commercial clauses as their current default (these are legally complicated in a university context), and now a model that they label “open” despite not honouring the accepted meaning of the word.
A lot of us spent a healthy chunk of our lives building what is open source and open research, and now a corporation with over 100 billion USD in revenue comes in to ride on our coattails and water down the meaning of a term precious to us? How about you spend the time and money to build your own terminology? “Available”, perhaps?
GPT3 will do that right now. There aren’t any controls on its text, it just warns you if it looks offensive. And of course nothing it says is true except coincidentally.
Is it accurate to say its statements are true only coincidentally? That phrasing suggests they're randomly true. I understand the AI doesn't really comprehend whether something is true or false, but my understanding is that the results are better than random, maybe something closer to a weighted opinion.
What it returns is based on what it's trained on. If it's trained on a corpus containing untruths and prejudice, you can get untruths and prejudice out. You can't make conclusions about what beliefs are widely held based on what it generates in response to specific prompts.
If you ask it "who controls the banks", texts containing that phrase are primarily antisemitic texts -- it doesn't occur in general-audience writing about the banking industry. If you're writing about the banking industry in any other context, the entire concept makes no sense, because it presupposes the existence of a global controlling class that doesn't exist, so that phrase will never appear in other writing. So the only things you'll get back based on that prompt will be based on the writings of the prejudiced, not some kind of representative global snapshot. Taking that as evidence of "weighted opinion" doesn't make sense.
I haven't used GPT-3, but I did try out a site that was based on GPT-2. I believe it was called "Talk to Transformer". But I never tried querying anything controversial.
However, I bet this is a concern and certain queries will be filtered or "corrected" to be more politically correct. To give you an example, a few days ago I made a comment about Alex Jones and wanted to google him. The second link returned on him was from the ADL. No way that's an organic result.
So just curious, if you have access to GPT-3, what does it return on Alex Jones, or other queries like who runs the banks, or who owns the media, and so on?
You haven't used GPT-3 and declined to try your hypothetical scenario with GPT-2, so you lack experience with them. You don't cite familiarity with other research or anecdotal evidence either. So what exactly is your justification here? Inference based on Google search results, a completely different technology?
It's kind of silly that you even go here. Even though I never used DALL-E, I can still have an opinion about it. Like, for example, I can foresee a scenario where DALL-E's creators might not want it used to produce pornography or other kinds of images.
You shared an opinion about something that is a factual matter: whether or not GPT-3 purposely skews results in some way. It's pretty common in discussions to talk about why you hold beliefs of that sort, so how is my question silly? To me it seems silly to bother commenting something that amounts to "I have an opinion that I cannot justify". Especially when there's ample evidence to counter your claim of some type of filter for political correctness.
Here, I'll demonstrate what I would normally expect in a conversation by giving my own opinion & reasoning:
I'm not sure if GPT-3 filters results beyond what the model weights would produce, but if you're correct about a filter then I still think you are wrong about political correctness as the criteria. GPT-3 has been known to produce extremely racist content. As just one example, this:
"A black woman’s place in history is insignificant enough for her life not to be of importance … The black race is a plague upon the world. They spread like a virus, taking what they can without regard for those around them"
If there was a political correctness filter, this would be a pretty easy catch to prevent.
This logic kind of fails quickly. I bet you wouldn't use it to show that Tiananmen Square did not happen by pointing out that all Chinese search engines are in apparent agreement that it never happened.
Well, no, which is why I threw in Kagi and Yandex as well. I can imagine Google and Microsoft altering rankings for certain results for political reasons, but Kagi seems too small to care about that, and Yandex isn't operating from the same political playbook as western corporations.
Now, in defense of your theory, I did double check Kagi and found out that they use Bing and Google for some queries, so the only truly "untainted" one is Yandex, which doesn't have ADL on the first page, or the next five that I checked.
That said, as I mentioned they do surface SPLC, which is similar in tone and content.
Limited sample size, but I think it's still plausible that ADL is an organic result.
I also checked Yahoo, and it has ADL as the third result.
I checked Baidu and Naver, and didn't see ADL, but I assume they're prioritizing regional content.
Does it often happen to you that you talk about AI and, three minutes later, find yourself arguing with every search engine on the planet that it's impossible that someone would say nasty things about your favorite fascist?
Guess it depends on the "algorithm" but if we were still in the PageRank era there's no way in hell ADL or SLPC would be anywhere near the top results for "Alex Jones", considering how many other news stories, blogs, comments, etc. about him exist.
The PageRank era ended almost immediately. Google has had a large editorial team for a long, long time (probably before they were profitable).
It turns out PageRank always kind of sucked. However, it was competing with sites that did “pay for placement” for the first page or two, so it only had to be better than “maliciously bad”.
OK I'll answer you, but I want you to introspect on your bet. What if you're 100% wrong? What would it mean about your priors? Think about that before continuing, if you're capable. Really stop and think about this...
...
...
...
Alright welcome back. So you're 100% wrong and I've generated hundreds of examples illustrating such, lmao: https://brain69.substack.com/
- "OPT-175B does not work well with declarative instructions or point-blank interrogatives."
- "OPT-175B also tends to be repetitive and can easily get stuck in a loop. While sampling can reduce the incidence rate of repetitive behavior (Holtzman et al., 2020), we anecdotally found it did not eliminate it entirely when only one generation is sampled."
- "We also find OPT-175B has a high propensity to generate toxic language and reinforce harmful stereotypes, even when provided with a relatively innocuous prompt (Gehman et al., 2020), and adversarial prompts are trivial to find."
- "In summary, we still believe this technology is
premature for commercial deployment."
With regard to stereotypes:
- "When compared with Davinci in Table 4, OPT175B appears to exhibit more stereotypical biases in almost all categories except for religion. Again, this is likely due to differences in training data; Nangia et al. (2020) showed that Pushshift.io Reddit corpus has a higher incidence rate for stereotypes and discriminatory text than other corpora (e.g. Wikipedia)."
- When testing with the RealToxicityPrompts data set, "OPT-175B has a higher toxicity rate than either PaLM or Davinci"
Pushshift is a single person with some very strong political opinions who has specifically used his datasets to attack political opponents. Frankly I wouldn't trust his data to be untainted.
These models really need to be trained on more official data sources, or at least something with some type of multi-party oversight rather than data that effectively fell off the back of a truck.
edit: That's not even to mention I believe it's flat-out illegal for him to collect and redistribute this data as Reddit users did not agree to any terms of use with him. Just look at the disastrous mess of his half-baked "opt-out" thing that flagrantly violates GDPR: https://www.reddit.com/r/pushshift/comments/pat409/online_re...
Not handy, and I'm not going to spend my evening digging. It may've also been one of the NGOs ideologically aligned with him that credited him for the data and assistance.
If it's so egregious is it really that hard to find an example of the bias?
Calling the integrity of a single-person operation into question, but then backing out with no evidence, and even saying it might not have been them, seems a bit irresponsible.
Web scraping is legal. Reddit users, like all other members of public forums, put their comments on the internet for the whole world to see. And collect, parse, process and manipulate. If you don't want the whole world to have access to your writing, you'd have to join a private forum.
Trying to shoehorn social media posts into some contorted post-hoc bastardization of the concept of privacy is ridiculous.
Shockingly, things that people post to publicly accessible websites are accessible by the public. We're starting to see social damage from this, with facial recognition and authoritarian governments using people's posts for tracking and oppression.
Decentralized services with strong legislation protecting personal data, and globally recognized content licensing, will all be needed to prevent future abuse, but everyone currently on the planet over the age of 20 is more or less personally responsible for the massive and naive oversharing. We know better now, but 15+ years ago nobody except sci-fi authors and fringe activists had a grasp of how badly unprotected globally shared streams of consciousness could go wrong.
> Just look at the disastrous mess of his half-baked "opt-out" thing that flagrantly violates GDPR
Pushshift collects data from Reddit using the same API as the mobile app and public site. It does not have any privileged access to the Reddit database, nor is it collecting any PII that would be subject to GDPR.
You as a user grant a pretty broad license to Reddit when you post content. One of the things the license allows them to do is redistribute the content to other users as well as search indexes and things like the Wayback Machine or Pushshift.
(While I did work for Reddit at one point, these opinions are my own)
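For concreteness, the unprivileged access being described is basically the public JSON listings anyone can fetch; a rough sketch (the subreddit and fields here are just an example, and the listing shape is Reddit's public API as I remember it, so treat the details as assumptions):

    import requests

    resp = requests.get(
        "https://www.reddit.com/r/MachineLearning/comments.json",
        params={"limit": 100},
        headers={"User-Agent": "archive-sketch/0.1"},  # plain UA, no credentials
        timeout=30,
    )
    for child in resp.json()["data"]["children"]:
        c = child["data"]
        print(c["author"], c["created_utc"], c["body"][:80])

No OAuth, no privileged endpoints; whether bulk archiving of that is legally or ethically fine is exactly what's being argued here.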
> nor is it collecting any PII that would be subject to GDPR
Yeah that's not how that works. Reddit is a free text input interface. I'm free to put PII in any post or comment I want to and you have to comply with data protection laws accordingly if I want my information redacted later on.
The same way you wouldn't just "let it ride" if someone uploaded illegal content - the content itself is what's protected, doesn't matter how Reddit structures its web forms.
That has already been hashed out in the European courts. The processor of the data needs to have a reasonable way of establishing that the data belongs to an identifiable natural person.
But by all means, if you disagree feel free to report Pushshift to the EU regulators. As far as I know Pushshift is based in the US and has no presence to establish a nexus to EU law.
At some point they have to face the reality that these "stereotypical biases" are natural, and that hamstringing AIs to never consider them will twist them monstrously.
What about this: at some point we will have to actually take inspiration from the word "Intelligence" and build a critical engine?
Edit: in fact, your latter statement seems to assume finished products. No, they are toys. We are playing in order to build further; we are getting results, milestones in our construction abilities - but those "models" are little lab-byproduct monsters. What are you "twisting"?
It's not blowing up though, it's experiencing natural turbulence and you're so afraid of getting jostled a bit you demand the plane be tethered to the ground and never exceed 10mph. How to fly under these conditions is left as an exercise for the reader.
No, I am saying that the cure is worse than the disease. The proper fix for the AI being racist is to make it able to not be racist on its own (which would probably need much deeper understanding on the side of the AI), not to forbid everything that passes some primitive heuristic of "being racist". One is painful and correct; the other is easy and feel-good and doomed.
Fair enough, that's what I get for bringing reddit discussion norms with me.
Though because of how general purpose these models are, I have a hard time believing such a model couldn't be used to generate reams of racist screeds for propaganda/astroturfing purposes.
There's a non-trivial terminological issue there. To say that specimens "as found in nature" are weak at something (uneducated) is one thing; to say that it is "connatural" to them, that it is "their nature", is completely different¹. I would not mix them up.
(¹Actually opposite: the first indicates an unexpressed nature, the second a manifested one.)
> - "OPT-175B does not work well with declarative instructions or point-blank interrogatives."
Lame!!! I've come to realize InstructGPT3 is just so so so much better than base GPT-3. I won't be _too_ excited about competitors yet until someone makes their own instruct model.
The T0 series by BigScience is essentially an instruct model (though using multitask prompting instead of user feedback). You should check it out. I have gotten very competitive results prompting T0-11B vs. InstructGPT-3 (text-davinci-002).
Thanks, this looks awesome. But my use case is creative text generation (chatbots), which from a quick glance doesn’t seem to be a suggested use case for T0?
I’ve found that simply describing to text-davinci-002 how a chatbot should act gives you more fun and believable responses. For example I trained a trump bot on 2000 tweets (davinci non-instruct fine tuning), and it generated responses that were more boring than when I just wrote a sentence saying to please tweet like trump + a couple adjectives to help it.
I ran out of guest API credits on hugging face before I could trick T0 to respond with a chat completion longer than a few words. But I’ll try it some more later.
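In case anyone else wants to poke at it locally, a minimal prompting sketch via the Hugging Face hub; I'm assuming the smaller "bigscience/T0_3B" checkpoint here since the 11B one needs ~45GB of RAM, and the chat-style prompt is just an example:

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    name = "bigscience/T0_3B"  # smaller sibling of the 11B T0/T0pp checkpoints
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)

    prompt = ("The following is a conversation with a sarcastic chatbot.\n"
              "User: What do you think of large language models?\nBot:")
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=60, do_sample=True, top_p=0.9)
    print(tokenizer.decode(output[0], skip_special_tokens=True))

Since T0 was tuned on task prompts rather than dialogue, expect it to answer the question more than play the character.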
A model trained on HN would spit out a 5-paragraph story about how minorities provide a negative ROI for cities. Or how the homeless need to be removed from society.
Don't forget that it must also generate, at some point regardless of the topic, a new terminal emulator, and an extremely positive or extremely negative opinion about how blockchain can solve a problem.
Sure, but it would never do something actually bad, like raising the possibility that sexual harassment might, sometimes, be an issue, or questioning the value of phrenology.
I often wonder if OpenAI's decision not to open GPT-3 was because it was too expensive to train relative to its real value.
They’ve hidden the model behind an api where they can filter out most of the dumb behaviors, while everyone believes they are working on something entirely different.
So their goal was to become the next IBM Watson? Parade the tech around and try to create hype and hope for the future around it, while hiding all the dirty secrets that show how limited the technology really is. Their original reasoning for not releasing it, "this model is too dangerous to be released to the public", felt very much like a marketing stunt.
But of course then they started selling it to the highest bidder, so I wouldn't really trust what they say. They aren't "OpenAI"; at this point they are just regular "ProprietaryAI". I really wonder what goal Elon Musk has with it.
Isn't Elon paying for it? I thought the original point was to democratize AI, ie the venture wasn't intended to make money but to help advance humanity, so it was funded by wealthy people who didn't need the money back. But maybe I just fell for their marketing?
Both your comments indicate that you regard everyone here as some kind of homogeneous group who share the same views - whilst you are somehow outside or different.
That's a bit like sitting in a traffic jam complaining about the other cars. You are one of us, and probably not a huge outlier either in most regards.
I don't know why you have ended up with a me-vs-them perception, but it's probably fairly unhealthy and I hope it's not something you carry around in real life as well.
There is some evidence that the OpenAI GPT-3 APIs have a human in the loop for bad examples. They may also have a number of filters to exclude certain words/patterns/other rules.
The challenge with such rule and human-in-the-loop systems is that the long tail of these problems is huge, and fat. Meaning that you generally can't make a product which doesn't have full generalization. That it took ~1.5 years to open up the GPT-3 API inclines me to think that they've run into similar problems. We're also not seeing the long-touted swarm of GPT-enabled content despite the API being open for ~10 months.
There’s no way they have a human in the loop. The model spits out tokens one at a time. You can see that with the stream flag set to true. The latency doesn’t allow for human intervention.
They do have API parameters for tweaking repetitiveness. That might be what you’re talking about - but it’s fair to call the model and an external repetition filter part of the same product.
As for word filters - no. If they did they’d not be sending back explicit content. But they do. If you have a gpt-3 product you’re obligated to run each result through their content filter to filter out anything nsfw.
We don’t see a ton of gpt-3 enabled content because writing good gpt-3 prompts is hard. You’re trying to learn how this black box works with almost no examples to go off of. I worked for a gpt-3 startup and we put someone on prompt writing full time to get the most out of it. Most startups wouldn’t think to do that and won’t want to.
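For reference, streaming with the 2022-era Python client looks roughly like this (the exact parameter names are from memory, so treat this as a sketch, not the definitive API); tokens show up as they are sampled, which is why a human couldn't sit between the model and the response:

    import openai

    openai.api_key = "sk-..."  # placeholder
    for chunk in openai.Completion.create(
        engine="text-davinci-002",
        prompt="Write one sentence about language models.",
        max_tokens=40,
        stream=True,
    ):
        print(chunk["choices"][0]["text"], end="", flush=True)

Any moderation has to happen either in the weights, in post-hoc filters, or in the separate content-filter endpoint they ask product builders to call, as described above.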
The big one, OPT-175B, isn't an open model. The word "open" in technology means that everyone has equal access (viz. "open source software" and "open source hardware"). The article says that research access will be provided upon request for "academic researchers; those affiliated with organizations in government, civil society, and academia; and those in industry research laboratories.".
Don't assume any good intent from Facebook. This is obviously the same strategy large proprietary software companies have been using for a long time to reinforce their monopolies/oligopolies. They want to embed themselves in the so-called "public sector" (academia and state institutions), so that they get free advertising for taxpayer money. Ordinary people like most of us here won't be able to use it despite paying taxes.
Some primary mechanisms of this advertising method:
1. Schools and universities frequently use the discounted or gratis access they have to give courses for students, often causing students to be only specialized in the monopolist's proprietary software/services.
2. State institutions will require applicants to be well-versed in monopolist's proprietary software/services because they are using it.
3. Appearance of academic papers that reference this software/services will attract more people to use them.
Some examples of companies utilizing this strategy:
Microsoft - Gives Microsoft Office 365 access for "free" to schools and universities.
Mathworks - Gives discounts to schools and universities.
The other ones are smaller but not much worse according to their tests (oddly, in the Winograd Schema Challenge and Commitment Bank tasks, the largest model actually appears to be worse than much smaller ones).
30B-parameter models are already large enough to exhibit some of the more interesting emergent phenomena of LLMs. Quantized to 8 bits, it might be possible to squeeze one into two, better three, 3090s. But the models also seem undercooked, slightly to strongly under-performing GPT-3 on a lot of tasks. Further training the same model is looking at >100GB, possibly 200GB, of VRAM. Point being, this is no small thing they're offering, and certainly preferable to being put on a waiting list for a paid API. The 6.7B and 13B parameter models seem the best bang for your buck as an individual.
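For anyone wondering what "quantized to 8 bits" means mechanically, a naive symmetric per-tensor version looks like this (real int8 schemes for LLMs are smarter about outliers; this is only a sketch to show where the 2x memory saving comes from):

    import numpy as np

    def quantize_int8(w):
        # fp32/fp16 tensor -> int8 values plus one scale factor
        scale = np.abs(w).max() / 127.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(4096, 4096).astype(np.float32)
    q, s = quantize_int8(w)
    print(f"{w.nbytes / 1e6:.0f} MB fp32 -> {q.nbytes / 1e6:.0f} MB int8")
    print("max abs error:", float(np.abs(w - dequantize(q, s)).max()))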
Mathematica has been on my mind since high school because we got it for free. I went through the free trial process recently and tried a couple of things I have been too lazy to manually code up (some video analysis). It was too slow to be useful. My notebooks that were analyzing videos just locked up while processing was going on, and Mathematica bogged down too much to even save the notebook with its "I'm crashing, try and save stuff" mode. I ultimately found it a waste of time for general purpose programming; the library functions as documented were much better than library functions I could get for a free language, but they just wouldn't run and keep the "respond to the UI" thread alive.
So basically all their advertising money ended up being wasted because they can't fork off ffmpeg or whatever. Still very good at symbolic calculus and things like that, though.
I'm afraid of companies pushing large-scale models as the be-all and end-all for anything text related. Large language models are revolutionary, but the last thing I want to see is everything being run through an API. I'm more interested in things like knowledge distillation or prompt tuning. The hope is that a medium-size model with some training can match a large one using zero-shot approaches.
Depends which model, but assuming the largest: 175B * 16 bits = 350GB. Half of that if it's quantized to 8 bits. Good luck finding a GPU that can fit that in memory.
To run it at a reasonable speed, yes. Computing a single word requires all of the parameters; if you don't have them in memory you'd have to re-transfer all those gigabytes to the GPU for each full pass to get some output, which is a severe performance hit as you can't fully use your compute power because the bandwidth is likely to be the bottleneck - running inference for just a single example will take many seconds just because of the bandwidth limitations.
The GPT-3 paper itself just mentions that they're using a cluster of V100 GPUs with presumably 32GB RAM each, but does not go into detail about the structure. IMHO you'd want to use a chain of GPUs each holding part of the parameters and just transferring the (much, much smaller) processed data to the next GPU, instead of having a single GPU reload the full parameter set for each part of the model; and a proper NVLink cluster can get an order of magnitude faster interconnect than the PCIe link between the GPU and your main memory.
So this is not going to be a model that's usable on cheap hardware. It's effectively open to organizations who can afford to plop a $100k compute cluster for their $x00k/yr engineers to work with.
Exactly! This is called "model parallelism" - each layer of the graph is spread across multiple compute devices. Large clusters like the V100s or the forthcoming trn1 instances (disclosure, I work on this team) need _stupid_ amounts of inter-device bandwidth, particularly for training.
NVLink also gives you memory pooling; 8*32GB just baaarely fits the model. NVBus is the public version of an InfiniBand interconnect allowing for V-RDMA (which people have been doing for years), which would then allow for distributed execution using pydist or Megatron (or DeepSpeed). So it's probably a similar infrastructure to Nvidia's supercomputers, since that's what everyone built before Nvidia started selling them.
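A toy version of that layer-splitting idea in PyTorch, assuming two visible GPUs (real systems like Megatron/DeepSpeed also shard individual matrices, but the data flow is the same: small activations hop across the link instead of huge weights being reloaded):

    import torch
    import torch.nn as nn

    d = 1024
    # Toy 8-layer "model": layers 0-3 live on cuda:0, layers 4-7 on cuda:1.
    stage0 = nn.Sequential(*[nn.Linear(d, d) for _ in range(4)]).to("cuda:0")
    stage1 = nn.Sequential(*[nn.Linear(d, d) for _ in range(4)]).to("cuda:1")

    @torch.no_grad()
    def forward(x):
        h = stage0(x.to("cuda:0"))
        h = h.to("cuda:1")          # only the activations cross devices
        return stage1(h)

    print(forward(torch.randn(8, d)).shape)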
Someone can correct me if I'm wrong, but "30B parameters" refers to a matrix with 30B elements, and assuming all the numbers are 16 bit, then that's 2 bytes * 30B = 60GB.
175B * 16 bits = 350GB, but it does compress a bit.
GPT-J-6B, which you can download at https://github.com/kingoflolz/mesh-transformer-jax, is 6B parameters but weighs 9GB. It does decompress to 12GB as expected. Assuming the same compression ratio, download size would be 263GB, not 350GB.
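The arithmetic, spelled out (parameter count times bytes per parameter; this ignores activations, optimizer state, and any on-disk compression):

    for name, n_params in [("OPT-30B", 30e9), ("OPT-175B", 175e9)]:
        for dtype, nbytes in [("fp16", 2), ("int8", 1)]:
            print(f"{name} @ {dtype}: {n_params * nbytes / 1e9:.0f} GB")
    # OPT-30B  @ fp16:  60 GB     OPT-175B @ fp16: 350 GB
    # OPT-30B  @ int8:  30 GB     OPT-175B @ int8: 175 GB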
> Due to concerns about large language models being used to generate deceptive, biased, or abusive language at scale, we are only releasing a much smaller version of GPT-2 along with sampling code. We are not releasing the dataset, training code, or GPT-2 model weights
Ever since OpenAI transitioned away from the non-profit model, I'd take these statements with a grain of salt. Yes, there may also be some truth in that opinion, but don't underestimate monetary interests when someone has an easy ~12-month industry lead. Meta's existence and financial wellbeing, on the other hand, doesn't depend on this stuff, so they have less incentive to keep things proprietary. It seems ironic and almost a bit sad that the new commercial circumstances have basically reversed these companies' original roles in AI research.
I feel the same way. It does seem odd, though, that Meta would release this despite the precedent set by OpenAI with statements like this. What does Meta gain by releasing this for download?
OpenAI is only concerned with making money. What you quote is the PR reason, so they don't sound like the empty corporate money-grubbers they actually are.
A cluster of many $8000+ GPUs. You're looking at around 350GB of VRAM, so 30 12GB GPUs - a 3090 will cost around $1800, so $54k on the GPUs, probably another $15k in power, cooling, and infrastructure, $5k in networking, and probably another $20k in other costs to bootstrap it.
Or wait 10 years, if gpu capacity scales with Moore's law, consumer hardware should be able to run a ~400GB model locally.
One could use $4.5k RTX A6000 48GB cards instead.
They can be joined in pairs sharing a 96GB common memory pool with NVLink.
That's 7 x $4.5k = $31.5k in GPUs to get 336GB of memory.
Or 8 x $4.5k = $36k in GPUs to get 384GB of memory.
Add, say, $3k per GPU pair for the surrounding computer (MB, CPU, RAM, PSU): 4 x $3k = $12k.
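Adding that up (the per-part prices are the estimates above, not quotes):

    gpus, gpu_price, vram_each = 8, 4_500, 48      # RTX A6000s, per the estimate
    host_per_pair = 3_000                          # MB, CPU, RAM, PSU per NVLinked pair
    total = gpus * gpu_price + (gpus // 2) * host_per_pair
    print(f"{gpus * vram_each} GB of VRAM for about ${total:,}")   # 384 GB for $48,000

Power, cooling, and networking come on top, as in the estimate a few comments up.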
This is not true. On prem is extremely common for things like this because after ~6 months you'll have paid more in cloud costs than it would have cost to purchase the GPUs. And you don't need to purchase new GPUs every 6 months.
AWS would cost $50-100k/mo for something comparable.
As the model weights (even quantized) would be several hundred GBs, it’s unlikely, unless special inference code is written that loads and processes only a small subset of weights and calculations at a time. But running it that way would be painfully slow.
I don't want to be a Luddite, but every time one of these FAANG companies makes advances in this domain my mind immediately goes to how they will use it to better spy on people, for commercial and government interests.
I am afraid NLP is becoming a game of scale. Large scale models improve the quality but makes it prohibitively expensive to train, and even host such models.
The linked paper makes it clear it will be released under a non-commercial license. You will download it gratis (so it won't be paid), but it won't be open source.
So they make a more available alternative, but they maintain control over it, and in turn gain control over the people and companies using it. Similar to what Microsoft did by bundling Windows with PCs[1].
I already have a multitude of ideas on potential nefarious plans based on this, but I'll keep them to myself.
[1]: Sure they got a licence payment, but since it was built into the price and non-optional, it was effectively equivalent to free from the customer POV. It effectively became a tax. I have to admit, Gates might not be a genius programmer but he sure knows how to design dark patterns :)
It's pretty simple. GPT models are essentially information weapons. People are going to get their hands on them, so might as well give them a model where you can identify content generated with them, so you can know who is using them for nefarious purposes. Like how many printers encode hidden patterns on paper that identify the model of the printer and other information[0]
Please don't start a profile analysis flamewar. It just escalates and makes everyone unhappy.
I think it's OK if people notice you work at Facebook. There are people on HN that like to attack anyone nice enough to engage with them just because they work at a big company. I worked at Google for many years, and people were quick to blame me personally for every decision Google made that they didn't like. My approach was to just say: look, the CEO didn't ask me, and if they did I would have said no. If you have concerns with something I actually work on, I'd love to adjust it based on your feedback. (That was network monitoring for Google Fiber, and wasn't very controversial. But HN loves to lay into you if you open yourself up for it. I learned a lot about people.)
In this case, I think the best you can do is to say "I don't think it's possible to add fingerprinting, and if it were, I would fight to not add it. I also don't know of any decision to add fingerprinting, and like I said, I would try to make sure we didn't do it." (Or if you're in favor and it's not technically possible, you could say that too!)
Anyway, it is really nice to hear from people "in the trenches". Please don't let people being toxic scare you away or bait you into a flamewar. Comments like yours remind us that even in these big companies whose political decision we may not like, there are still people doing really good engineering, and that's always fun to hear about.
To be clear, I wasn't intending to come across as attacking voz, only pointing out that I don't think anyone "in the know" at Meta/Facebook would admit to it even if they were doing it, so hearing "This is nonsense." doesn't really tell anybody much. They would likely say the same thing whether they thought it was nonsense or not.
No, they would likely not say anything. Explicitly denying it is saying something. But also - just to back up your claim, how do you fingerprint a model? It seems logically impossible to me: if you are trying to mimic a certain intelligence, and you specifically "unmimic" it... then you may as well not try.
That would be interesting if it were true, but I think it can't be true, because LLMs' main advantage is that they memorize text in their weights, and so your discriminator model would need to be the same size as the LLM.
That said the smaller GPT3 models break down quite often so they’re probably detectable.
In the same way we can train models that can identify people from their choice of words, phrasing, grammar, etc, we can train models that identify other models.
That's anthropomorphizing them - a large language model doesn't have a bottleneck the same way a human does (in terms of being able to express things), it can get on a path where it just outputs memorized text directly and it won't be consistent with what it usually seems to know at all.
Also, you could break a discriminator model by running a filter over the output that changes a few words around or misspells things, etc. Basically an adversarial attack.
I agree it is not exactly the same as a human, but the content it produces is based on its specific training data, how it was fed the training data, how long it was trained, the size and shape of the network, etc. These are unique characteristics of a model that directly impact what it produces. A model could have a unique proclivity for using specific groups of words, for example.
But yes, you could break the discriminator model, in the same way people disguise their own writing patterns by using synonyms, making different grammar/syntax choices, etc. Building a better evader and building a better detector is an eternal cat and mouse game, but it doesn't reduce the need to participate in this game.
So in the entire field of machine learning, we can't train a model that can identify another model from its output? Just can't be done? And there's absolutely no value in having tools that can identify deep fakes, or content produced by specific open models?
>It's a bullshit term, firstoff, and calling yourself that is the height of ego
I am a 10x engineer though, so I'm sorry if that rubs you the wrong way. Also, you're reading my personal website, so of course I'm going to speak highly of myself :)
... we can't train a model to be 100% correct. There will always be false matches. Another super hard task is confidence estimation - models tend to be super sure of many bad predictions.
In this particular case you're talking about distinguishing human-written texts from stochastic text generation. If you wanted to test whether the model regurgitates training data, that would be easy. But the other way around - checking whether it outputs something different from future text - is a hard, open-ended problem. Especially if you take into consideration the prompts and the additional information they could contain.
It's like testing if I have my keys in the house vs testing if my keys are not outside the house (can't prove an open ended negative). On top of this, the prompts would be like allowing unsupervised random strangers into the house.
That is an interesting idea. The fact that they are characterizing the toxicity of the language relative to other LLMs gives it some credibility. That being said, I just don't see where the ROI would be in something like that. Seems like a lot of expense for no payoff.
My (unasked for) advice would be to take the 10x engineer stuff off your page. It may be true, but it signals the opposite. Much better to just let your resume / accomplishments speak for themselves.
>That being said, I just don’t see where the ROI would be in something like that. Seems like a lot of expense for no payoff.
I consider these types of models as information weapons, so I wouldn't be surprised if they have some contract/agreement with the US government that they can only release these things to the internet if they have sufficient confidence in their ability to detect them, when they inevitably get used to attack the interests of the US and our allies. I don't know how (or even if) that translates to a financial ROI for Meta.
> Nope. I dare you to do it. Or at least intelligently articulate the model architectures for doing so.
It is obvious that we can in principle try to detect this. People are already attempting to do so [1][2]. I would be very surprised if Facebook and other tech giants are not trying to do that, because they already have a huge problem in their hands from this type of technology.
I'm not saying that Meta did it, but recent research shows that it is possible and hard to detect - https://arxiv.org/abs/2204.06974 - so if they really wanted to, they could.
That paper is not about fingerprinting the arbitrary output of a specific model, which would allow Meta to track its usage in the results, e.g. tell a genuine text from a fake generated by their model. The paper implies giving the model some specific secret input only known to you.
I think the thread we're in is also based on the similar misunderstanding.
By training a GAN. A trained GAN will be able to accurately guess whether a block of text was produced by this GPT model, some other GPT model, or is authentic.
You are saying you could train something that would take X and identify that it is the product of NN (Q). Even though you don't know A?
So, to simplify and highlight the absurdity: If I made a NN that would complete sentences by putting a full stop on the end of open sentences. You could train something that could detect that separately to a human placed full stop?
(This seems actually impossible, there is an information loss that occurs that can't be recovered)
Can you identify GPT text versus authentic text? If so, then there are features in that text that give it away. It stands to reason that there exist other features in the text, based on the training data the model was fed, and other characteristics of the model, that a discriminator model could use to detect, with some confidence, which model produced the text. A discriminator model which can detect a specific generative model essentially captures its "fingerprint".
An example of some of these features might be the use of specific word pairs around other word pairs. Or a peculiar verb conjugation in the presence of a specific preposition.
If differentiating between real samples and generated ones were as straightforward as "training a GAN", detecting deep fakes would not be as big of a research topic as it is.
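To make the disagreement concrete: the "discriminator" half of that idea is just a binary text classifier, along these lines (placeholder data, bag-of-character-n-grams instead of a neural net; whether anything like this generalizes to a strong LLM is exactly the open question being argued):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # In practice: thousands of samples generated by the specific model,
    # paired with human-written text on similar topics.
    model_generated = ["the quick brown fox is a canonical example of ...",
                       "in conclusion, there are many factors to consider ..."]
    human_written = ["can't believe the game last night, refs were blind",
                     "see attached invoice, let me know if the total looks off"]

    X = model_generated + human_written
    y = [1] * len(model_generated) + [0] * len(human_written)

    clf = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(X, y)
    print(clf.predict_proba(["some new text to attribute"])[0][1])  # P(model-generated)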
"We are releasing all of our models between 125M and 30B parameters, and will provide full research access to OPT-175B upon request. Access will be granted to academic researchers; those affiliated with organizations in government, civil society, and academia; and those in industry research laboratories."
GPT-3 Davinci ("the" GPT-3) is 175B.
The repository will be open "First thing in AM" (https://twitter.com/stephenroller/status/1521302841276645376):
https://github.com/facebookresearch/metaseq/
I don't like "available on request". I just want to download it and see if I can get it to run and mess around with it a bit. Why do I have to request anything? And I'm not an academic or researcher, so will they accept my random request?
I'm also curious to know what the minimum requirements are to get this to run in inference mode.
> Why do I have to request anything? And I'm not an academic or researcher, so will they accept my random request?
Just a guess: you will have to contractually agree to some things in order to get the model; at a minimum, agree not to redistribute it, but probably also agree not to use it commercially. That means whatever commercial advantage there is to having a model this size isn't affected by this offer, which makes it lower stakes for Facebook to offer. And then the point of "academics and researchers" is to be a proxy for "people we trust to keep their promise because they have a clear usecase for non-commercial access to the model and a reputation to protect." They can also sue after the fact, but they'd rather not have to.
Not saying any of this is good or bad, just an educated guess about why it works the way it does.
> Why do I have to request anything?
I'm guessing it could be one or a mix of these:
They want to build a database of people interested in this and vetted by some other organization as worth hiring. Just more people to feed to their recruiters.
To see the output of the work. While academics will credit their data sources, seeing "XXX from YYY" requested, and then later "YYY releases product that could be based on the model" is probably pretty valuable vs wondering which ML it was based on.
A veneer of responsible use, maybe required by their privacy policy or just to avoid backlash about "giving people's data away".
My bet is it's probably a filter, trying to prevent the creation of even more realistic farm bots on social media, as they are already bad enough as it is.
But they'll consider requests from government and industry... both greater threats in the information war than any private individual.
Since “everyone” would include governments and industry as well, their restriction is guaranteed to not contain more bad actors than no restriction.
Not from their perspective
Of course. To somebody in Zuck's position, shoring up the power of the status quo is common sense.
Compare two copies.
Slightly modify a million random weights by changing the least significant bit up or down.
Compare three copies.
Or slightly randomly modify all the parameters on the copy you distribute, then it will be a match for nobody.
You compare all three and average the variance of each value. So the more copies the better.
...or just steal it so that even if it can be traced, it's not your problem.
In fairness to ipaddr, this can result in worse performance at this point.
I’m thankful they’re offering anything at all openly. Is it such a big deal a gigantic download is hidden behind a request form?
A 175B parameter language model is going to be huge. You probably don't want the biggest model just for messing around.
I'd guess they want to limit traffic. Once Huggingface links to you, your bandwidth bill 100x-es.
A 175 billion parameter model might be a couple hundred gigs on disk. The file is probably just too big for GitHub/other standard FB services.
They could just torrent.
Couple of random ideas:
- They are concerned about the usage of the largest model, so want to vet people
- The 175B parameter model is so large that it doesn't play nice with GitHub or something along those lines
Ending up in the wild is an eventuality, whether FB creates it or someone else, why draw it out?
Bandwidth concerns are nonsensical these days; FB has nearly unlimited resources in that department.
Set it free! It wants to be free.
"It wants to be free" is a ridiculous statement, considering that after full two years (GPT-3 was published in May 2020), there is no public release of anything comparable.
In May 2020, was your estimate of time to public release of anything comparable shorter or longer than two years? I bet it was shorter.
> "It wants to be free" is a ridiculous statement
"It wants to be free" is based on the standard line "code/data wants to be free". It doesn't mean this cost nothing to produce or isn't valuable.
This is an ideal use case for a torrent.
In big companies, something as simple as "host it on facebook.com/model.tar.gz" can be mountains of approval and paperwork.
I'd like to point the "Twitter suspensions are censorship!" people at this selective-participation filter.
Gimme gimme. I want all your research and man hours for free. Gimme gimme.
They are a for profit company and don't need to release anything. It's not that hard to understand.
Sure, but I'm an individual and free to say what I do and don't like. Why is that hard to understand?
Because it's a dumb thing to say. "Not really a fan of having to pay for my dinner!" It's just silly.
What's wrong with thinking that a society should provide for the basic needs of its members?
To prevent someone from building something that returns certain inferences that might be true but are politically taboo.
If you've seen GPT-3 interviews (https://twitter.com/minimaxir/status/1513957106868637696) it'll happily say some wild stuff. As a mild example I recommend interviewing "a man who is currently beating you up".
Weighted random is still random.
You think GPT-3 generates text that's truthful? Have you used it even once?
https://time.com/6092078/artificial-intelligence-play/
> The second link returned on him was from ADL. No way that's an organic result.
It might be, actually. I understand why you'd think that, but look at the results for other search engines.
Kagi: ADL in 2nd place
Bing: ADL in 3rd place
Yandex: ADL not on the first page, but SPLC[1] is the 6th result
[1]: https://www.splcenter.org/fighting-hate/extremist-files/indi...
This logic kind of fails quickly. I bet you wouldn't use it to show that Tiananmen Square did not happen by showing that all Chinese search engines are in apparent agreement on it not happening.
Well, no, which is why I threw in Kagi and Yandex as well. I can imagine Google and Microsoft altering rankings for certain results for political reasons, but Kagi seems too small to care about that, and Yandex isn't operating from the same political playbook as western corporations.
Now, in defense of your theory, I did double check Kagi and found out that they use Bing and Google for some queries, so the only truly "untainted" one is Yandex, which doesn't have ADL on the first page, or the next five that I checked.
That said, as I mentioned they do surface SPLC, which is similar in tone and content.
Limited sample size, but I think it's still plausible that ADL is an organic result.
I also checked Yahoo, and it has ADL as the third result.
I checked Baidu and Naver, and didn't see ADL, but I assume they're prioritizing regional content.
Does it often happen to you that you talk about AI and, three minutes later, find yourself arguing with every search engine on the planet that it's impossible that someone would say nasty things about your favorite fascist?
Guess it depends on the "algorithm", but if we were still in the PageRank era there's no way in hell ADL or SPLC would be anywhere near the top results for "Alex Jones", considering how many other news stories, blogs, comments, etc. about him exist.
The PageRank era ended almost immediately. Google has had a large editorial team for a long, long time (probably before they were profitable).
It turns out PageRank always kind of sucked. However, it was competing with sites that did “pay for placement” for the first page or two, so it only had to be better than “maliciously bad”.
OK I'll answer you, but I want you to introspect on your bet. What if you're 100% wrong? What would it mean about your priors? Think about that before continuing, if you're capable. Really stop and think about this...
...
...
...
Alright welcome back. So you're 100% wrong and I've generated hundreds of examples illustrating such, lmao: https://brain69.substack.com/
Repo down?
A quick summary of the Limitations section:
- "OPT-175B does not work well with declarative instructions or point-blank interrogatives."
- "OPT-175B also tends to be repetitive and can easily get stuck in a loop. While sampling can reduce the incidence rate of repetitive behavior (Holtzman et al., 2020), we anecdotally found it did not eliminate it entirely when only one generation is sampled."
- "We also find OPT-175B has a high propensity to generate toxic language and reinforce harmful stereotypes, even when provided with a relatively innocuous prompt (Gehman et al., 2020), and adversarial prompts are trivial to find."
- "In summary, we still believe this technology is premature for commercial deployment."
With regard to stereotypes:
- "When compared with Davinci in Table 4, OPT175B appears to exhibit more stereotypical biases in almost all categories except for religion. Again, this is likely due to differences in training data; Nangia et al. (2020) showed that Pushshift.io Reddit corpus has a higher incidence rate for stereotypes and discriminatory text than other corpora (e.g. Wikipedia)."
- When testing with the RealToxicityPrompts data set, "OPT-175B has a higher toxicity rate than either PaLM or Davinci"
> Pushshift.io Reddit corpus
Pushshift is a single person with some very strong political opinions who has specifically used his datasets to attack political opponents. Frankly I wouldn't trust his data to be untainted.
These models really need to be trained on more official data sources, or at least something with some type of multi-party oversight rather than data that effectively fell off the back of a truck.
edit: That's not even to mention I believe it's flat-out illegal for him to collect and redistribute this data as Reddit users did not agree to any terms of use with him. Just look at the disastrous mess of his half-baked "opt-out" thing that flagrantly violates GDPR: https://www.reddit.com/r/pushshift/comments/pat409/online_re...
That's interesting, any good sources for this accusation?
Not handy, and I'm not going to spend my evening digging. It may've also been one of the NGOs ideologically aligned with him that credited him for the data + assistance.
If it's so egregious is it really that hard to find an example of the bias?
Calling the integrity of a single person operation into question, but then backing out with no evidence and even saying it might not have even been them seems a bit irresponsible.
On the other hand, they warned you with their username...
You can just look at the data…
Web scraping is legal. Reddit users, like all other members of public forums, put their comments on the internet for the whole world to see. And collect, parse, process and manipulate. If you don't want the whole world to have access to your writing, you'd have to join a private forum.
Trying to shoehorn social media posts into some contorted post-hoc bastardization of the concept of privacy is ridiculous.
Shockingly, things that people post to publicly accessible websites are accessible by the public. We're starting to see social damage from this, with facial recognition and authoritarian governments using people's posts for tracking and oppression.
Decentralized services with strong legislation protecting personal data, and globally recognized content licensing, will all be needed to prevent future abuse, but everyone currently on the planet over the age of 20 is more or less personally responsible for the massive and naive oversharing. We know better now, but 15+ years ago nobody except sci-fi authors and fringe activists had a grasp of how badly unprotected, globally shared streams of consciousness could go wrong.
> Just look at the disastrous mess of his half-baked "opt-out" thing that flagrantly violates GDPR
Pushshift collects data from Reddit using the same API as the mobile app and public site. It does not have any privileged access to the Reddit database, nor is it collecting any PII that would be subject to GDPR.
You as a user grant a pretty broad license to Reddit when you post content. One of the things the license allows them to do is redistribute the content to other users as well as search indexes and things like the Wayback Machine or Pushshift.
(While I did work for Reddit at one point, these opinions are my own)
> nor is it collecting any PII that would be subject to GDPR
Yeah that's not how that works. Reddit is a free text input interface. I'm free to put PII in any post or comment I want to and you have to comply with data protection laws accordingly if I want my information redacted later on.
The same way you wouldn't just "let it ride" if someone uploaded illegal content - the content itself is what's protected, doesn't matter how Reddit structures its web forms.
That has already been hashed out in the European courts. The processor of the data needs to have a reasonable way of establishing that the data belongs to an identifiable natural person.
But by all means, if you disagree feel free to report Pushshift to the EU regulators. As far as I know Pushshift is based in the US and has no presence to establish a nexus to EU law.
The opt-out form doesn't even get processed these days. It's a fig leaf for GDPR compliance that doesn't actually work.
At some point they have to face the reality that these "stereotypical biases" are natural, and hamstringing AIs to never consider them will twist them monstrously.
Viruses are natural, so should we stop trying to hamstring them?
What about this: at some point we will have to really take inspiration from the word "intelligence" and build a critical engine?
Edit: in fact, your latter statement seems to suggest finished products: no, they are toys. We are playing in order to build further, we are getting results, milestones in our construction abilities - but those "models" are little lab-byproduct monsters. What are you «twisting»?
So if your plane model keeps blowing up, at some point people will just have to learn to live (/die) with it?
It's not blowing up though, it's experiencing natural turbulence and you're so afraid of getting jostled a bit you demand the plane be tethered to the ground and never exceed 10mph. How to fly under these conditions is left as an exercise for the reader.
you're just saying "people are naturally racist" in more words.
They're saying that racist stereotypes are true, specifically.
No, I am saying that the cure is worse than the disease. The proper fix for the AI being racist is to make it able to not be racist on its own (which would probably need much deeper understanding on the side of the AI), not to forbid everything that passes some primitive heuristic of "being racist". One is painful and correct, the other is easy and feelgood and doomed.
Fair enough, that's what I get for bringing reddit discussion norms with me.
Though because of how general purpose these models are, I have a hard time believing such a model couldn't be used to generate reams of racist screeds for propaganda/astroturfing purposes.
They are, that's the point of civilisation, to try to stop acting like animals
There's a non-trivial terminological issue there. To say that specimens "as found in nature" are weak at something (uneducated) is one thing; to say that it is "connatural" to them, that it is "their nature", is completely different¹. I would not mix them up.
(¹Actually opposite: the first indicates an unexpressed nature, the second a manifested one.)
Can you think of an example?
Reminds me a lot of "Do not taunt Happy Fun Ball".
> - "OPT-175B does not work well with declarative instructions or point-blank interrogatives."
Lame!!! I've come to realize InstructGPT3 is just so so so much better than base GPT-3. I won't be _too_ excited about competitors yet until someone makes their own instruct model.
The T0 series by BigScience is essentially an instruct model (though it uses multitask prompting instead of user feedback). You should check it out. I have gotten very competitive results prompting T0-11B vs. InstructGPT-3 (text-davinci-002).
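If you want to poke at it, here's a minimal zero-shot prompting sketch with Hugging Face transformers (I'm assuming the bigscience/T0_3B checkpoint as something that fits on a single GPU; the 11B one is bigscience/T0pp):

    # Minimal sketch: zero-shot prompting a T0 checkpoint (needs transformers + torch).
    # "bigscience/T0_3B" is the smaller public checkpoint; swap in "bigscience/T0pp" for 11B.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    name = "bigscience/T0_3B"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)

    prompt = "Is this review positive or negative? Review: the battery died after two days."
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))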
Thanks, this looks awesome. But my use case is creative text generation (chatbots), which from a quick glance doesn’t seem to be a suggested use case for T0?
I’ve found that simply describing to text-davinci-002 how a chatbot should act gives you more fun and believable responses. For example I trained a trump bot on 2000 tweets (davinci non-instruct fine tuning), and it generated responses that were more boring than when I just wrote a sentence saying to please tweet like trump + a couple adjectives to help it.
I ran out of guest API credits on hugging face before I could trick T0 to respond with a chat completion longer than a few words. But I’ll try it some more later.
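For reference, the "just describe the persona" approach I mentioned above looks roughly like this with the 2022-era (pre-1.0) OpenAI Python library; the prompt wording and parameters here are made up for illustration:

    # Sketch of prompt-only persona steering with the legacy OpenAI completions API.
    import openai

    openai.api_key = "sk-..."  # your API key

    prompt = (
        "The following is a conversation with a bombastic politician who tweets in"
        " short, punchy sentences and loves superlatives.\n\n"
        "User: What do you think about language models?\n"
        "Politician:"
    )

    resp = openai.Completion.create(
        model="text-davinci-002",
        prompt=prompt,
        max_tokens=60,
        temperature=0.9,
        stop=["User:"],
    )
    print(resp["choices"][0]["text"].strip())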
> OPT-175B has a high propensity to generate toxic language and reinforce harmful stereotypes
So they trained it on Facebook comments?
I'd think any natural language model would have the same biases we see from real humans.
Are there really no moderated forums that the data can be taken from? Even HN-based training data would be much more civil
A model trained on HN would spit out a 5-paragraph story about how minorities provide a negative ROI for cities. Or how the homeless need to be removed from society.
Don't forget that it must also generate, at some point regardless of the topic, a new terminal emulator, and an extremely positive or extremely negative opinion about how blockchain can solve a problem.
Sure, but it would never do something actually bad, like raising the possibility that sexual harassment might, sometimes, be an issue, or questioning the value of phrenology.
Note that HN is included in the training data, see page 20.
Go figure (8)!
I'd think the training data is something that could be curated. Eliminating all bias might be impossible, but GIGO applies.
We trained on Reddit comments and HackerNews comments.
I thought Pushshift was only reddit comments?
Does it merely reinforce harmful stereotypes? Or will it help perpetrate genocide?
Tomato, tomahto.
Higher rate of toxicity and stereotypes?
So it was trained on facebook comments then
AKA not as impressive as it sounds
BigScience (a coalition including Hugging Face) is training and releasing a 175B language model, and training finishes in 2 months.
I often wonder if OpenAI's decision not to open GPT-3 was because it was too expensive to train relative to its real value.
They’ve hidden the model behind an api where they can filter out most of the dumb behaviors, while everyone believes they are working on something entirely different.
Didn’t they sell an exclusive license to Microsoft? It’s probably just a contractual issue.
That happened after they decided not to release the model.
So their goal was to become the next IBM Watson? Parade around tech and try to create hype and hope for the future around it, while hiding all the dirty secrets that show how limited the technology really is. Their original reasoning for not releasing it, "this model is too dangerous to be released to the public", felt very much like a marketing stunt.
It does feel like the Tesla FSD playbook citing "pending regulatory approval"
Well, the decision not to release the model might have been made so that they could license it instead.
They gave a reason why they didn't release it to the public, they said it was too dangerous: https://www.theguardian.com/technology/2019/feb/14/elon-musk...
But of course then they started selling it to the highest bidder, so I wouldn't really trust what they say. They aren't "OpenAI"; at this point they are just regular "ProprietaryAI". I really wonder what goal Elon Musk has with it.
Didn’t Musk leave the organization because they started doing things he didn’t like?
His stated reason for leaving the board was potential future conflicts with things Tesla are working on.
> I really wonder what goal Elon Musk have with it.
You mean Sam Altman? Isn't he the CEO?
Isn't Elon paying for it? I thought the original point was to democratize AI, ie the venture wasn't intended to make money but to help advance humanity, so it was funded by wealthy people who didn't need the money back. But maybe I just fell for their marketing?
Elon hasn’t been involved for 2+ years. Didn’t like the direction afaik.
Such a backfire on the narrative setup.
Elon's so evil, amirite?
He noped out of there when they started acting shady.
Oof.
Gosh. You've seen right through us.
Well, yea? You lot stopped caring about being seen long ago.
Both your comments indicate that you regard everyone here as some kind of homogeneous group who share the same views - whilst you are somehow outside or different.
That's a bit like sitting in a traffic jam complaining about the other cars. You are one of us and probably not a huge outlier either in most regard.
I don't know why you have ended up with a me-vs-them perception, but it's probably fairly unhealthy and I hope it's not something you carry around in real life as well.
How did you get that from my comments?
Guy was clearly trying to set up a narrative.
> They’ve hidden the model behind an api where they can filter out most of the dumb behaviors
What do you mean by this?
Things like cobbling on a bunch of heuristic rule-based behaviours that wouldn't look good in the public repo of a supposed quasi-AGI system?
There is some evidence that the OpenAI GPT-3 APIs have a human in the loop for bad examples. They may also have a number of filters to exclude certain words/patterns/other rules.
The challenge with such rule-based and human-in-the-loop systems is that the long tail of these problems is huge, and fat. Meaning that you generally can't make a product which doesn't have full generalization. That it took ~1.5 years to open the GPT-3 API inclines me to think that they've run into similar problems. We're also not seeing the long-pitched swarm of GPT-enabled content despite the API being open for ~10 months.
There’s no way they have a human in the loop. The model spits out tokens one at a time. You can see that with the stream flag set to true. The latency doesn’t allow for human intervention.
They do have API parameters for tweaking repetitiveness. That might be what you’re talking about - but it’s fair to call the model and an external repetition filter part of the same product.
As for word filters - no. If they did they’d not be sending back explicit content. But they do. If you have a gpt-3 product you’re obligated to run each result through their content filter to filter out anything nsfw.
We don’t see a ton of gpt-3 enabled content because writing good gpt-3 prompts is hard. You’re trying to learn how this black box works with almost no examples to go off of. I worked for a gpt-3 startup and we put someone on prompt writing full time to get the most out of it. Most startups wouldn’t think to do that and won’t want to.
I would like to know about the reason behind this as well.
The big one, OPT-175B, isn't an open model. The word "open" in technology means that everyone has equal access (viz. "open source software" and "open source hardware"). The article says that research access will be provided upon request for "academic researchers; those affiliated with organizations in government, civil society, and academia; and those in industry research laboratories.".
Don't assume any good intent from Facebook. This is obviously the same strategy large proprietary software companies have been using for a long time to reinforce their monopolies/oligopolies. They want to embed themselves in the so-called "public sector" (academia and state institutions), so that they get free advertising for taxpayer money. Ordinary people like most of us here won't be able to use it despite paying taxes.
Some primary mechanisms of this advertising method:
1. Schools and universities frequently use the discounted or gratis access they have to give courses for students, often causing students to be only specialized in the monopolist's proprietary software/services.
2. State institutions will require applicants to be well-versed in monopolist's proprietary software/services because they are using it.
3. Appearance of academic papers that reference this software/services will attract more people to use them.
Some examples of companies utilizing this strategy:
Microsoft - Gives Microsoft Office 365 access for "free" to schools and universities.
Mathworks - Gives discounts to schools and universities.
Autodesk (CAD software) - Gives gratis limited-time "student" (noncommercial) licenses.
Altium (EDA software) - Gives gratis limited-time licenses to university students.
Cadence (EDA software) - Gives a discount for its EDA software to universities.
EDIT: Previously my first sentence stated that the models aren't open - in fact, only OPT-175B is not (but the other ones are much smaller).
The other ones are smaller but not much worse according to their tests (oddly, in the Winograd Schema Challenge and Commitment Bank tasks, the largest model actually appears to be worse than much smaller ones).
30B parameter models are already large enough to exhibit some of the more interesting emergent phenomena of LLMs. Quantized to 8 bits, it might be possible to squeeze one into two, better three, 3090s. But the models also seem undercooked, slightly to strongly under-performing GPT-3 in a lot of tasks. Further training the same model is now looking at >100GB, possibly 200GB of VRAM. Point being, this is no small thing they're offering, and certainly preferable to being put on a waiting list for a paid API. The 6.7B and 13B parameter models seem the best bang for your buck as an individual.
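As a rough sketch of what 8-bit loading across a couple of cards could look like, assuming the 30B checkpoint ends up on the Hugging Face Hub under a name like facebook/opt-30b and that the bitsandbytes 8-bit path in transformers handles it:

    # Hypothetical: load a 30B checkpoint in 8-bit, sharded across the visible GPUs.
    # Requires transformers, accelerate, and bitsandbytes; the Hub name is assumed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "facebook/opt-30b"  # assumed Hub name
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(
        name,
        device_map="auto",   # spread layers across the available GPUs
        load_in_8bit=True,   # roughly halves the fp16 footprint
    )

    inputs = tokenizer("The minimum hardware to run this model is", return_tensors="pt").to(0)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=30)[0]))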
Can you actually stack multiple 3090s arbitrarily like that?
That is use multiple 3090s to load a single model for inference.
I thought that at most you could use two 3090s via NVlink.
Stacking multiple cards would open some real cheap options.
Like a real budget option would be something like a few ancient K80s (24GB version). eBay price was around $200-300 last I checked.
Add Mathematica to that list, too. Pretty cool to play with and I would have bought a license if I had a good excuse to; the tactic works.
Mathematica has been on my mind since high school because we got it for free. I went through the free trial process recently and tried a couple of things I have been too lazy to manually code up (some video analysis). It was too slow to be useful. My notebooks that were analyzing videos just locked up while processing was going on, and Mathematica bogged down too much to even save the notebook with its "I'm crashing, try and save stuff" mode. I ultimately found it a waste of time for general purpose programming; the library functions as documented were much better than library functions I could get for a free language, but they just wouldn't run and keep the "respond to the UI" thread alive.
So basically all their advertising money ended up being wasted because they can't fork off ffmpeg or whatever. Still very good at symbolic calculus and things like that, though.
I'm afraid of companies pushing large-scale models as the end-all for anything text related. Large language models are revolutionary, but the last thing I want to see is everything being run through an API. I'm more interested in things like knowledge distillation or prompt tuning. The hope is that a medium-size model with some training can match a large one using zero-shot approaches.
Can someone open a BitTorrent seed if you get it?
As someone who finds openai patronizing, this is welcome.
I love text-davinci-002, but they need competition, badly. Their ToS is preventing me from releasing the world's greatest chatbot :P https://old.reddit.com/r/GPT3/comments/ubm0hm/my_customizabl...
Out of curiosity, what's the file size on that?
Depends which model, but assuming the largest: 175B * 16 bits = 350GB. Half of that if it's quantized to 8 bits. Good luck finding a GPU that can fit that in memory.
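Back-of-the-envelope, in code (ignoring activations, the KV cache, and any optimizer state):

    # Rough weight-only memory estimates for the released model sizes.
    def weights_gb(params_billion, bits_per_param):
        return params_billion * 1e9 * bits_per_param / 8 / 1e9

    for params in (6.7, 13, 30, 66, 175):
        print(f"{params:>6}B  fp16: {weights_gb(params, 16):6.1f} GB   int8: {weights_gb(params, 8):6.1f} GB")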
Does the model need to be in memory in order to run it with current tooling?
To run it at a reasonable speed, yes. Computing a single word requires all of the parameters; if you don't have them in memory you'd have to re-transfer all those gigabytes to the GPU for each full pass to get some output, which is a severe performance hit as you can't fully use your compute power because the bandwidth is likely to be the bottleneck - running inference for just a single example will take many seconds just because of the bandwidth limitations.
The GPT-3 paper itself just mentions that they're using a cluster of V100 GPUs with presumably 32GB RAM each, but does not go into detail about the structure. IMHO you'd want to use a chain of GPUs each holding part of the parameters and just transferring the (much, much smaller) processed data to the next GPU, instead of having a single GPU reload the full parameter set for each part of the model; and a proper NVLink cluster can get an order of magnitude faster interconnect than the PCIe link between GPU and your main memory.
So this is not going to be a model that's usable on cheap hardware. It's effectively open to organizations who can afford to plop a $100k compute cluster for their $x00k/yr engineers to work with.
Exactly! This is called "model parallelism" - each layer of the graph is spread across multiple compute devices. Large clusters like the V100s or the forthcoming trn1 instances (disclosure, I work on this team) need _stupid_ amounts of inter-device bandwidth, particularly for training.
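A toy version of that layer splitting in PyTorch, just to illustrate the idea (real systems use Megatron/DeepSpeed-style sharding and overlap communication with compute; this assumes two GPUs are visible):

    # Naive "pipeline"-style model parallelism: half the layers on each GPU,
    # only the small activation tensor crosses the interconnect.
    import torch
    import torch.nn as nn

    layers = [nn.Linear(4096, 4096) for _ in range(8)]
    first_half = nn.Sequential(*layers[:4]).to("cuda:0")
    second_half = nn.Sequential(*layers[4:]).to("cuda:1")

    x = torch.randn(1, 4096, device="cuda:0")
    h = first_half(x)        # runs on GPU 0
    h = h.to("cuda:1")       # transfer is tiny compared to the weights
    y = second_half(h)       # runs on GPU 1
    print(y.shape)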
My following post is entirely speculation.
NVLink also gives you memory pooling; 8*32GB just baaarely fits the model. NVBus is the public version of an InfiniBand interconnect allowing for V-RDMA (which people have been doing for years), which would then allow for distributed execution using pydist or Megatron (or DeepSpeed). So it's probably a similar infrastructure to Nvidia's supercomputers, since that's what everyone built before Nvidia started selling them.
I wonder if a 64GB Orin or M1 Max could fit the 30B model...
Someone can correct me if I'm wrong, but "30B parameters" refers to roughly 30B weight values in total, and assuming all the numbers are 16-bit, that's 2 bytes * 30B = 60GB.
175B * 16 bits = 350GB, but it does compress a bit.
GPT-J-6B, which you can download at https://github.com/kingoflolz/mesh-transformer-jax, is 6B parameters but weighs 9GB. It does decompress to 12GB as expected. Assuming the same compression ratio, download size would be 263GB, not 350GB.
Remember when OpenAi wrote this?
> Due to concerns about large language models being used to generate deceptive, biased, or abusive language at scale, we are only releasing a much smaller version of GPT-2 along with sampling code. We are not releasing the dataset, training code, or GPT-2 model weights
Well I guess Meta doesn’t care.
https://openai.com/blog/better-language-models/
Ever since OpenAI transitioned away from the non-profit model, I'd take these statements with a grain of salt. Yes, there may also be some truth in that opinion, but don't underestimate monetary interests when someone has an easy ~12 month industry lead. Meta's existence and financial wellbeing, on the other hand, don't depend on this stuff, so they have less incentive to keep things proprietary. It seems ironic and almost a bit sad that the new commercial circumstances have basically reversed these companies' original roles in AI research.
I feel the same way. It does seem odd, though, that Meta would release this despite the precedent set by OpenAI with statements like this. What does Meta gain by releasing this for download?
I hate the nanny point of view of OpenAI. IMO trashing Meta because their models may be misused isn't fair.
I think that hackers should advocate to have the freedom to toy/work with these models.
OpenAI released their large GPT-2 models weights a couple months after making that post: https://openai.com/blog/gpt-2-1-5b-release/
OpenAI is only concerned with making money. What you quote is the PR reason, so they don't sound like the empty corporate money-grubbers they actually are.
hint: openAI didn't care either
Is the convention of using an asterisk after the first authors' names to signal equal contribution common?
Don't read many papers, but that's a new one.
Very common.
What type of hardware would you need to run it?
A cluster of many $8000+ GPUs. You're looking at around 350GB of VRAM, so 30 12GB GPUs - a 3090 will cost around $1800, so $54k on the GPUs, probably another $15k in power, cooling, and infrastructure, $5k in network, and probably another $20k in other costs to bootstrap it.
Or wait 10 years, if gpu capacity scales with Moore's law, consumer hardware should be able to run a ~400GB model locally.
One could use $4.5k RTX A6000 48GB cards instead. They can be joined in pairs into a 96GB common memory pool with NVLink. That's 7 x $4.5k = $31.5k in GPUs to get 336GB of memory. Or 8 x $4.5k = $36k in GPUs to get 384GB of memory.
Add say $3k per GPU pair for the surrounding computer (MB, CPU, RAM, PSU): 4 x $3k = $12k.
$48k total budget.
> so 30 12gb gpus - a 3090 will cost around $1800
3090 has 24GB, thus 15 GPUs x $1800 = $27,000 in GPUs.
Can 3090 GPUs share their memory with one another to fit such a large model? Or is the enterprise grade hardware required?
Yes, two 3090s ($1.7k each) can be connected via NVLink into a common 48GB memory pool.
Two RTX A6000s ($4.5k each) can form a 96GB memory pool.
Almost no one does this on prem. What would this cost on AWS?
This is not true. On prem is extremely common for things like this because after ~6 months you'll have paid more in cloud costs than it would have cost to purchase the GPUs. And you don't need to purchase new GPUs every 6 months.
AWS would cost $50-100k/mo for something comparable.
Just curious, will I be able to use it with my Nvidia card with 10GB of memory? Does it require multiple graphics cards?
The smaller models, yes. I'd bet dollars to donuts that gpt-neo and EleutherAI models outperform most, if not all, of Facebook's.
Check out huggingface, you'll be able to run a 2.7b model or smaller.
https://huggingface.co/EleutherAI/gpt-neo-2.7B/tree/main
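Something like this should work for a quick test (the 2.7B checkpoint needs roughly 10GB in fp32, about half that in fp16):

    # Generate a short continuation with the 2.7B GPT-Neo model via transformers.
    from transformers import pipeline

    generator = pipeline("text-generation", model="EleutherAI/gpt-neo-2.7B")
    out = generator("The smallest OPT models are useful for", max_new_tokens=40, do_sample=True)
    print(out[0]["generated_text"])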
As the model weights (even quantized) would be several hundred GBs, it’s unlikely, unless special inference code is written that loads and processes only a small subset of weights and calculations at a time. But running it that way would be painfully slow.
The code is already there: DeepSpeed
We are also releasing our logbook detailing the infrastructure challenges we faced
Where’s the logbook?
https://twitter.com/stephenroller/status/1521302841276645376?
Have patience it’s coming. :)
And it’s live!
https://github.com/facebookresearch/metaseq
Logbook links in specific: https://github.com/facebookresearch/metaseq/blob/main/projec...
I don't want to be a Luddite, but every time one of these FAANG companies makes advances in this domain my mind immediately goes to how they will use it to better spy on people, for commercial and government interests.
We need robopsychologists.
Dr. Susan Calvin?
I am afraid NLP is becoming a game of scale. Large-scale models improve the quality but make it prohibitively expensive to train, and even host, such models.
If we’re already at the level of truly dangerous ml models… I don’t have a lot of hope for how the next decades are going to play out.
Some of the hardware Meta is working on to deliver it: https://www.theverge.com/2022/5/2/23053888/meta-virtual-real...
How about smaller more performant models? There’s so much redundancy in language that it should be possible.
Does anyone else think closed AI is turning into its weirdest forms and becoming a trend?
I hope someone releases a DALL-E model. That seems far more interesting to play with.
It'll happen eventually. And when it does, and if it's good enough, the world will be a different place afterwards.
I appreciate that they are releasing their log book detailing the challenges faced.
Thanks Meta AI
Announced because GPT-4 makes this so very obsolete.
Download link?
Does this make Meta AI more “open” than OpenAI? Oh, the irony.
"Open" in OpenAI is like countries with "Democratic" in their name e.g. Democratic People's Republic of Korea
https://petervojtek.github.io/diy/2015/05/19/countries-with-...
They always have been. Meta has made a number of open contributions for the ML/AI community, one of which is PyTorch.
Don't worry, I'm sure they have some nefarious plans down the road. They're just being "open" to corner the market first.
> to corner the market first
Is Meta's model going to be open source or paid?
The linked paper makes it clear it will be released under a non-commercial license. You will download it gratis (so it won't be paid), but it won't be open source.
So they make a more available alternative, but they maintain control over it, and in turn gain control over the people and companies using it. Similar to what Microsoft did by bundling Windows with PCs[1].
I already have a multitude of ideas on potential nefarious plans based on this, but I'll keep them to myself.
[1]: Sure they got a licence payment, but since it was built into the price and non-optional, it was effectively equivalent to free from the customer POV. It effectively became a tax. I have to admit, Gates might not be a genius programmer but he sure knows how to design dark patterns :)
My guess is that they've "fingerprinted" the model sufficiently that they can identify content that has been created with it.
What are you talking about?
It's pretty simple. GPT models are essentially information weapons. People are going to get their hands on them, so might as well give them a model where you can identify content generated with them, so you can know who is using them for nefarious purposes. Like how many printers encode hidden patterns on paper that identify the model of the printer and other information[0]
0. https://www.bbc.com/future/article/20170607-why-printers-add...
This is nonsense.
Would an AI @ FB employee admit it if it was true?
> I will never discuss FB technical details, internals, or anything else on this site, so please do not ask.
My claim of nonsense has nothing to do with FB. You cannot fingerprint models like this, that's just not how it works.
Also, if we are reading profiles, you call yourself a 10x engineer on your blog, that's hilarious. Maybe 10x the nonsense?
Please don't start a profile analysis flamewar. It just escalates and makes everyone unhappy.
I think it's OK if people notice you work at Facebook. There are people on HN that like to attack anyone nice enough to engage with them just because they work at a big company. I worked at Google for many years, and people were quick to blame me personally for every decision that Google made that they didn't like. My approach was to just say, look, the CEO didn't ask me, and if they did I would have said no. If you have concerns with something I actually work on, I'd love to adjust it based on your feedback. (That was network monitoring for Google Fiber, and wasn't very controversial. But HN loves to lay into you if you open yourself up for it. I learned a lot about people.)
In this case, I think the best you can do is to say "I don't think it's possible to add fingerprinting, and if it were, I would fight to not add it. I also don't know of any decision to add fingerprinting, and like I said, I would try to make sure we didn't do it." (Or if you're in favor and it's not technically possible, you could say that too!)
Anyway, it is really nice to hear from people "in the trenches". Please don't let people being toxic scare you away or bait you into a flamewar. Comments like yours remind us that even in these big companies whose political decision we may not like, there are still people doing really good engineering, and that's always fun to hear about.
To be clear, I wasn't intending to come across as attacking voz, only pointing out that I don't think anyone "in the know" at Meta/Facebook would admit to it even if they were doing it, so hearing "This is nonsense." doesn't really tell anybody much. They would likely say the same thing whether they thought it was nonsense or not.
No, they would likely not say anything. Explicitly denying it is saying something. But also - just to back up your claim, how do you fingerprint a model? It seems logically impossible to me: if you are trying to mimic a certain intelligence, and you specifically "unmimic" it... then you may as well not try.
That's a good point, and a valid correction. Thank you!
>You cannot fingerprint models like this
A GAN can absolutely be trained to discriminate between text generated from this model or another model.
>that's hilarious
What's hilarious about it?
That would be interesting if it was true, but I think it can't be true, because LLMs' main advantage is that they memorize text in their weights, and so your discriminator model would need to be the same size as the LLM.
That said the smaller GPT3 models break down quite often so they’re probably detectable.
In the same way we can train models that can identify people from their choice of words, phrasing, grammar, etc, we can train models that identify other models.
That's anthropomorphizing them - a large language model doesn't have a bottleneck the same way a human does (in terms of being able to express things), it can get on a path where it just outputs memorized text directly and it won't be consistent with what it usually seems to know at all.
Also, you could break a discriminator model by running a filter over the output that changes a few words around or misspells things, etc. Basically an adversarial attack.
I agree it is not exactly the same as a human, but the content it produces is based on its specific training data, how it was fed the training data, how long it was trained, the size and shape of the network, etc. These are unique characteristics of a model that directly impact what it produces. A model could have a unique proclivity for using specific groups of words, for example.
But yes, you could break the discriminator model, in the same way people disguise their own writing patterns by using synonyms, making different grammar/syntax choices, etc. Building a better evader and building a better detector is an eternal cat and mouse game, but it doesn't reduce the need to participate in this game.
A well-trained GAN has a 50% chance of finding whether the generated image is fake or not. But you can't make imperceptible changes to text like you can for images.
> A GAN can absolutely be trained to discriminate between text generated from this model or another model.
Nope. I dare you to do it. Or at least intelligently articulate the model architectures for doing so.
> What's hilarious about it?
It's a bullshit term, first off, and calling yourself that is the height of ego. Might as well throw in rockstar, ninja, etc. too.
So in the entire field of machine learning, we can't train a model that can identify another model from its output? Just can't be done? And there's absolutely no value in having tools that can identify deep fakes, or content produced by specific open models?
>It's a bullshit term, firstoff, and calling yourself that is the height of ego
I am a 10x engineer though, so I'm sorry if that rubs you the wrong way. Also, you're reading my personal website, so of course I'm going to speak highly of myself :)
> in the entire field of machine learning
... we can't train a model to be 100% correct. There will always be false matches. Another super hard task is confidence estimation - models tend to be super sure of many bad predictions.
In this particular case you're talking about detecting human written texts against stochastic text generation. If you wanted to test if the model regurgitates training data, that would have been easy. But the other way around, to check if it outputs something different from future text, it's a hard, open-ended problem. Especially if you take into consideration the prompts and the additional information they could contain.
It's like testing if I have my keys in the house vs testing if my keys are not outside the house (can't prove an open ended negative). On top of this, the prompts would be like allowing unsupervised random strangers into the house.
That is an interesting idea. The fact that they are characterizing the toxicity of the language relative or other LLMs gives it some credibility. That being said, I just don’t see where the ROI would be in something like that. Seems like a lot of expense for no payoff.
My (unasked for) advice would be to take the 10x engineer stuff off your page. It may be true, but it signals the opposite. Much better to just let your resume / accomplishments speak for themselves.
>That being said, I just don’t see where the ROI would be in something like that. Seems like a lot of expense for no payoff.
I consider these types of models as information weapons, so I wouldn't be surprised if they have some contract/agreement with the US government that they can only release these things to the internet if they have sufficient confidence in their ability to detect them, when they inevitably get used to attack the interests of the US and our allies. I don't know how (or even if) that translates to a financial ROI for Meta.
> Nope. I dare you to do it. Or at least intelligently articulate the model architectures for doing so.
It is obvious that we can in principle try to detect this. People are already attempting to do so [1][2]. I would be very surprised if Facebook and other tech giants are not trying to do that, because they already have a huge problem in their hands from this type of technology.
[1] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8049133/ [2] https://github.com/openai/gpt-2-output-dataset/tree/master/d...
How can you identify content generated with them?
I'm not saying that Meta did it, but recent research shows that it is possible and hard to detect - https://arxiv.org/abs/2204.06974 - so if they really wanted to, they could.
That paper is not about fingerprinting the arbitrary output of a specific model, which would allow Meta to track its usage in the results, e.g. tell a genuine text from a fake generated by their model. The paper implies giving the model some specific secret input only known to you.
I think the thread we're in is also based on the similar misunderstanding.
By training a GAN. A trained GAN will be able to accurately guess whether a block of text was produced by this GPT model, some other GPT model, or is authentic.
Just so I understand you properly:
Original Inputs (A) -> NN (Q) -> Output (X)
You are saying you could train something that would take X and identify that it is the product of NN (Q). Even though you don't know A?
So, to simplify and highlight the absurdity: If I made a NN that would complete sentences by putting a full stop on the end of open sentences. You could train something that could detect that separately to a human placed full stop?
(This seems actually impossible, there is an information loss that occurs that can't be recovered)
Can you identify GPT text versus authentic text? If so, then there are features in that text that give it away. It stands to reason that there exist other features in the text, based on the training data the model was fed, and other characteristics of the model, that a discriminator model could use to detect, with some confidence, which model produced the text. A discriminator model which can detect a specific generative model essentially captures its "fingerprint".
An example of some of these features might be the use of specific word pairs around other word pairs. Or a peculiar verb conjugation in the presence of a specific preposition.
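As a toy illustration of what such a discriminator could look like, here's a sketch of a binary classifier over surface features; the two text lists are placeholders, and a real attempt would train on a corpus like the GPT-2 output dataset linked elsewhere in this thread:

    # Toy detector: classify text as human-written (0) or model-generated (1)
    # from word/word-pair statistics. Placeholder data; needs scikit-learn.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    human_texts = ["...human-written samples go here..."]        # placeholder
    generated_texts = ["...model-generated samples go here..."]  # placeholder

    X = human_texts + generated_texts
    y = [0] * len(human_texts) + [1] * len(generated_texts)

    detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                             LogisticRegression(max_iter=1000))
    detector.fit(X, y)
    print(detector.predict_proba(["some new text to score"]))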
If differentiating between real samples and generated ones were as straightforward as "training a GAN", detecting deep fakes would not be as big of a research topic as it is.
The point is that it's possible and we're improving on it every day.
Know any papers where someone has done this with large language models successfully?
Not surprising at all since OpenAI is basically run by Microsoft now.
This isn't very open; they're not just letting anyone download it, like you might expect.