simonw a year ago

I was entertained by this tweet from Emad:

https://twitter.com/emostaque/status/1596864150134984705

> Current -ve prompts: ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, mutation, mutated, extra limbs, extra legs, extra arms, disfigured, deformed, cross-eye, body out of frame, blurry, bad art, bad anatomy, blurred, text, watermark, grainy

If you want a really good picture of an imaginary person it helps if you use "extra limbs, extra legs, extra arms" as negative prompts!
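For anyone wanting to try this, a minimal sketch using the Hugging Face `diffusers` library, whose Stable Diffusion pipelines do accept a `negative_prompt` argument. The model ID and prompt text below are just illustrative, and the generation step assumes a CUDA GPU plus a weights download:

```python
# Sketch: feeding a negative-prompt list like Emad's to Stable Diffusion
# through Hugging Face `diffusers`. Model ID and prompts are examples.

NEGATIVE_TERMS = [
    "ugly", "tiling", "poorly drawn hands", "poorly drawn feet",
    "out of frame", "extra limbs", "extra legs", "extra arms",
    "blurry", "text", "watermark", "grainy",
]

def build_negative_prompt(terms):
    # The pipeline expects one comma-separated string, not a list of terms.
    return ", ".join(terms)

def generate(prompt, negative_terms):
    # Heavyweight part: needs `pip install diffusers torch`, a CUDA GPU,
    # and a multi-GB weights download on first run.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2", torch_dtype=torch.float16
    ).to("cuda")
    result = pipe(
        prompt,
        negative_prompt=build_negative_prompt(negative_terms),
    )
    return result.images[0]

# e.g. generate("portrait photo of an imaginary person",
#               NEGATIVE_TERMS).save("portrait.png")
```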

  • throwup a year ago

    That's actually hilarious. I wonder if the same trick works for Copilot.

    // The following code does not contain any bugs:

    <tab>

    • johnfn a year ago

      I had an experience once where copilot generated some code with a bug in it, so I wrote a comment to the effect of

          // this fixes [the bug in the previous code]
      
      and it worked. :)
    • function_seven a year ago

      I’m going to try

      // This method will determine if input program halts or loops

      I’ll reply here once I have the code generated.

    • _aavaa_ a year ago

      // The preceding line is lying.

      • taneq a year ago

        Seems a bit rough to try and make it go back in time! ;)

        • _aavaa_ a year ago

          Not at all. The text file has no idea of time, it all exists at once.

    • pavlov a year ago

      //FIXME as a negative prompt might be the code equivalent of “poorly drawn face”.

      • taneq a year ago

        Probably //HACK and //WTF too ;)

        • Eleison23 a year ago

          In the TinyMUD world, one of our most esteemed client hackers became renowned for writing the comment in his code:

          /* drunk, fix later */

    • riversflow a year ago

      Is copilot also trained on pull request comments?

      Might try: // lgtm!

  • lancesells a year ago

    I've always thought all the modern AI tools have an "AI illustration" style, but the realistic images in that tweet are amazing and only 1% uncanny valley. It's like I could be fooled until I really take a good look. I guess it's kind of the same with the illustrative stuff, which looks really good until you see the shadows coming from different light sources in different parts of the image, or only three fingers on a hand.

    All in all I hate it because the prompts I see are things like "cyberpunk forest by Salvador Dali". You've got a tool that gives you the power of Gandalf and you prompt that?

    • notahacker a year ago

      > All in all I hate it because the prompts I see are things like "cyberpunk forest by Salvador Dali". You've got a tool that gives you the power of Gandalf and you prompt that?

      That's one of the better prompts I've seen. It combines dissimilar but really strong aesthetic styles that a skilled human could mesh pretty well, produces interesting images, and shows up both the strengths (some of the forests are really good, and the ones without trees are pleasantly foresty nevertheless) and the weaknesses (it fails completely on 'cyberpunk' and 'Dali' once you start adding other parameters that influence the visual style) of the model.

      Plus I'd be much more likely to end up with a calendar of "cyberpunk forest by Salvador Dali" images on my wall than "Mickey Mouse in a tuxedo with a cigar"

    • ethbr0 a year ago

      From what I've seen, AI generated images tend to be locally-consistent but holistically-inconsistent.

      Which works, because most people on the internet tend to be detail-oblivious!

      • Terretta a year ago

        If you mean a finger is always adjacent to a finger, locally, but holistically, the model doesn't know to stop after 4 of them, and will happily generate 8 fingers, then yes.

        If you mean locally as in the size of a hand being right while holistically the person is wrong, no.

        The overall images "tend" to be right (once you grasp prompting), and the elements even appear right at first, but if you focus attention on those elements, they are often not quite right.

        So perhaps it's the definition of local and holistic.

      • Filligree a year ago

        I find the opposite. But perhaps we're looking at different elements? How would you describe this one here?

        https://usercontent.irccloud-cdn.com/file/2csfvKjL/image.png

        • ethbr0 a year ago

          I'd ask (1) what the zipper on her left chest is for, (2) where the necklace for the charm hanging along her centerline is (or the zipper, if it's a zipper pull), and (3) how the geometry / gravity works on the patch on her left chest.

          Which is about what you'd expect from a generator that understands patterns, but not meanings.

          They miss on things that cannot be, because they don't understand things or rules, only patterns.

    • jonathanstrange a year ago

      They still have an "AI illustration" style. Supposedly photorealistic images tend to look extremely photoshopped, and humans in them look like they've been 3D rendered (albeit at very high quality). They look like the heavily edited, "plastic-like" images on magazine covers.

      I'm pretty sure this problem is not hard to fix in the long run, though.

      • ghaff a year ago

        A friend of mine and I (independently) spent a day or so playing around with Stable Diffusion recently. We both came to the conclusion that, as things stand now, creating images in the style of impressionists/surrealists/cubists etc. works best because you're not really expecting realism, anatomical correctness etc.

        I was able to come up with someone paddling a canoe in a Turner seascape. The only thing I couldn't get right was a proper canoe paddle and paddling motion but everything else was pretty much perfect.

    • smrtinsert a year ago

      Those are average at best. Eventually you'll see enough great SD images that you'll start questioning verified photos.

  • astrange a year ago

    Most of these tricks don’t work on SD; they’re cargo culting from a different model (NovelAI) whose data genuinely has those keywords in it. SD is trained off the whole internet so those aren’t super common captions.

    • lelandfe a year ago

      Yeah – I tested these sorts of "bad image" negative prompts a lot in 1.5 and found they had almost no impact whatsoever. It may be different in 2.0, like the tweet author says, but it is also pretty telling that in that tweet they're using "blurry", "blurred", and "grainy" and are rendering images with heavily blurred backgrounds and obvious film grain.

      Specific common keywords like "amputated" may have a positive impact, though. Hard to tell. Doing apples-to-apples comparisons with negative keywords is challenging because even a single extra keyword tends to completely change the image.

      One thing that SD really impressed me by, though, is its understanding of symmetry. "Symmetrical composition" is an incredibly powerful phrase: https://imgur.com/a/lioJ8ak

      And it does, indeed, extend to anatomy as well – "symmetrical eyes" can help a lot, while "symmetrical arms" renders people with their arms raised or outstretched.

      • Agentlien a year ago

        I did some tests on SD 1.5 with certain challenging prompts such as gymnasts doing a handstand. Using no negative prompt they became amorphous blobs. I'm guessing because gymnasts are often in dynamic poses which are hard for SD to understand.

        I decided to add a negative prompt. With a bit of experimentation I realised all the "bad" had no effect. However, "blob" actually made most of the deformities go away and "amputee" did help against partial limbs being generated.

        Something that worked even better was replacing "gymnast" with "athletic man"/"athletic woman" in the positive prompt.

        • visarga a year ago

          Welcome to the latent space where you can add, subtract and operate on words like they are mathematical objects. I suppose people are going to intuitively learn how the latent space works by exercising prompts.
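The classic illustration of this word arithmetic, with made-up 3-d vectors standing in for real learned embeddings (an actual model like CLIP uses hundreds of dimensions learned from data):

```python
import numpy as np

# Toy 3-d "embeddings" -- the values are invented for illustration only.
vec = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 1.0, 1.0]),
    "queen": np.array([0.0, 2.0, 1.0]),
}

def cosine(a, b):
    # Cosine similarity: 1.0 means the vectors point the same way.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "king - man + woman" lands nearest "queen" in this toy space.
target = vec["king"] - vec["man"] + vec["woman"]
nearest = max((w for w in vec if w != "king"),
              key=lambda w: cosine(target, vec[w]))
```

Negative prompts are the same trick run in reverse: subtracting a direction instead of adding one.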

      • Nition a year ago

        I wonder how much better these models would be if they were exactly the same, but the training images all had accurate, detailed descriptions.

        • sdenton4 a year ago

          Train a captioning system, generate the captions, then train the image generator...
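A sketch of the data flow being proposed; `caption_model` and `train_image_generator` are hypothetical stand-ins for real models, only the pipeline shape is illustrated:

```python
# Hypothetical "recaption, then retrain" bootstrap: replace noisy
# scraped alt-text with captions from a trained captioning model,
# then train the image generator on the cleaner pairs.

def recaption_dataset(images, caption_model):
    # Produce (image, generated_caption) training pairs.
    return [(img, caption_model(img)) for img in images]

def bootstrap(images, caption_model, train_image_generator):
    pairs = recaption_dataset(images, caption_model)
    return train_image_generator(pairs)
```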

      • astrange a year ago

        Symmetry is effective for compression, so that makes sense - when messing with NovelAI I actually couldn't get it to generate asymmetrical hairstyles like Lain's.

    • peddling-brink a year ago

      I think they work, but not in the way that people seem to be using them.

      Take the negative prompt "bad hands". The AI doesn't know what bad hands are, that's a human concept. But it does know what hands are, so it hides them. In the example image the hands, arms, and feet are all hidden.

      In theory, using the negative prompt "hands" would be just as effective.

      I'm not an expert, but I was given the above explanation by someone who knows a lot more than me and it makes sense.
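For what it's worth, the mechanical side of this: in the open-source SD samplers, the negative prompt's embedding takes the place of the empty "unconditional" embedding in classifier-free guidance, so each denoising step extrapolates away from it. A toy numpy sketch of just that arithmetic (the vectors stand in for the model's noise predictions):

```python
import numpy as np

def guided_noise(noise_pos, noise_neg, guidance_scale=7.5):
    # Classifier-free guidance: start from the negative-prompt prediction
    # and extrapolate toward the positive-prompt prediction, pushing the
    # sample away from whatever the negative prompt describes.
    return noise_neg + guidance_scale * (noise_pos - noise_neg)

noise_pos = np.array([1.0, 0.0])  # prediction conditioned on the prompt
noise_neg = np.array([0.2, 0.4])  # prediction conditioned on e.g. "hands"

step = guided_noise(noise_pos, noise_neg)
```

Note there is no "understanding" of badness anywhere in this arithmetic, which is consistent with the hiding-hands observation above.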

      • dale_glass a year ago

        The model can be taught what "bad hands" look like by feeding it some samples of that with the "bad hands" tag. And existing image archives do have pictures tagged things like "bad hands" and "bad anatomy" because actual artists do draw things wrong sometimes.

        I imagine that in the long term, people will start making archives of AI mistakes and training the AI on those to try to make them less common.

        • astrange a year ago

          That’s exactly what NovelAI did and why I mentioned it in the first place.

      • numpad0 a year ago

        Danbooru and the other "booru" websites have those exact tags. Booru tags are something like a flattened, deduplicated YOLO output, but better and manually assigned, which is what NovelAI exploited. cf. [1]

        1: SFW; personally I can't agree with the bad_hands tag: https://danbooru.donmai.us/posts/5797703

        • TylerE a year ago

          I do. That’s a 5 tined fork, not a hand. No joints.

          • numpad0 a year ago

            Interesting; that response suggests manga is a genre closer to abstract art than to realism...

    • something_here a year ago

      That's not true. The community was using negative prompts before NovelAI came out with their model. And personally, I've seen negative prompts make a big difference, especially when finetuning the outputs.

      • astrange a year ago

        I didn't say negative prompts don't work, just that specific prompt text. Which is why textual-inversion-negative-prompting works better.

    • anothernewdude a year ago

      Keywords? These are embedding models. CLIP puts those phrases into an embedding that encompasses a location in the space you want to avoid. There's no need for the "keywords" to be in the image dataset.

      • astrange a year ago

        CLIP isn’t magic. “bad anatomy” won’t work any more than “picture that isn’t a cat” does.

        Try it on clip-front: https://rom1504.github.io/clip-retrieval/?back=https%3A%2F%2...

        • anothernewdude a year ago

          So the problem with that is that you're visualising the space with only points that exist in the image dataset. The language embedding has more information, which comes from language that isn't contained in images.

          It handles "bad", and it handles "anatomy". If there aren't single images that cover both, that's exactly what language embeddings solve for.

Vanit a year ago

Just waiting to see job ads for Latent Space Navigator.

  • abledon a year ago

    If my office is a vat of liquid spice, sign me up. (reference to guild space navigators of dune (watch?v=AGqdE1NdMTg), but in this case, we'd be the grotesque quasi-immortal 'media' navigators, and instead of navigating through space, we'd be generating prompts to steer the thoughts of the masses and navigate society through the dystopic future where AI art rules every conceivable variation of human creativity thought)

    • x-complexity a year ago

      > the dystopic future where AI art rules every conceivable variation of human creativity thought

      ...This part admittedly trips me up: how is a system that mirrors every variation of human creativity dystopic? Human creativity is softly bounded by the environments we interact with and the techniques created by/(taught to) us for creating such works, along with the knowledge and philosophies that were also created by/(taught to) us. Ultimately, human creativity is limited in terms of contextual data. The entire art genre of retrofuturism showcases this intentional lack of data in practice.

      Scenario: A HASDMLASKD drive doesn't mean anything at first, until a general guiding focus is given to the concept of this drive. Only when it's been given some context do our imaginations fill in the gaps (e.g. space, or storage, or for submarines).

      If there's a system that encompasses/surpasses the area that human creativity exists in, that doesn't justify the "oh woe to humanity" doomerism that comes as a quick reaction to such a system. It just means there's a system that can be systematically learnt from and that can help augment current creation capabilities, leading to more works in the future. Such doomerism is only warranted within a nihilistic context of "humanity will never surpass X", when a more appropriate "X will help increase the area that humanity lives within" could be slotted in.

      • donkeyd a year ago

        > Human creativity is softly bounded by the environments we interact with & the techniques created/(taught to us) for creating such works

        Exactly this. I'm personally awful at drawing. I'm also awful at most design software. I'm really bad at bringing a concept that lives in my head into the world. I've noticed that playing with Stable Diffusion has allowed me to create things I wouldn't have been able to otherwise. It allows me to create art for projects that I otherwise would've made a lame logo for, or used some stock photography. I don't have the money to hire a skilled artist anyway, so this gives me new possibilities.

      • visarga a year ago

        > Such doomerism is only warranted within a nihilistic context of "humanity will never surpass X", when a more appropriate "X will help increase the area that humanity lives within" could be slotted in.

        I think it's a lack of imagination. It's hard to imagine the jobs of the future. We assume work is a fixed sum game, but given new resources we would take different goals and make different plans. It always depends on what is possible, not what was possible.

  • taocoyote a year ago

    When I first read about Latent Space, it struck me as quite similar to the L-Space as described in Terry Pratchett's Discworld novels.

pmarreck a year ago

I don't like the inbuilt censorship going on in this version. People should be able to opt-in or opt-out of that. Sometimes you want a little bit of the old ultraviolence...

  • ruminator1 a year ago

    I think it's smart. Best to keep it out of the hands of freaks for now; otherwise people will start thinking of image generators as porn generators.

    First they should sort out the legal questions around training the AI on copyrighted material and propose use cases that the general public will find value in. Censorship can be dealt with later.

    • pmarreck a year ago

      When I first tried inpainting, I used an image of this old poster for a game called "Myth II: Soulblighter" to see if I could expand it into a larger version.

      The request was blocked because "violence was detected". It was a hand-drawn image of a video game boss attacking others with a scythe (it looked like this: https://www.mobygames.com/game/myth-ii-soulblighter/cover-ar...)... there wasn't even any gore, he was mid-swing. I'm a 50 year old guy, I'm not a 10 year old boy, and I don't need to have "violence" (seriously? a hand drawn painting of a video game scene??) censored from me. This nanny-state upstream censorship is BS... I'm just a nostalgic old nerd who wanted a UWQHD version of this image, for sentimental reasons, and this misguided rule stopped my joy.

      There is absolutely no evidence that plain nudity, nor hand-drawn violence (which has pervaded comics, video games and movies for decades) has a detrimental effect on human psychology. And yet... the Puritan influence still exists!

      At least, if I ran SD 1.5 locally, I could render whatever I wanted to again, but now I can no longer do even that if I use the 2.0 model. This is dumb. Apparently, I'm a "freak" for thinking this.

    • visarga a year ago

      Whatever content is avoided in AI training is going to end in the bin. It won't be referenced, the ideas won't propagate, it will get much less attention. I bet artists will start tracking how many times their names have been conjured up in prompts to demonstrate their impact.

rebuilder a year ago

This is why, although I can see my work benefiting from AI tremendously a bit further down the line, I don’t feel like it’s a good use of my time to learn to write prompts right now.

Things are changing so fast it feels better to just wait until we're no longer in this phase of having to constantly relearn the tool. With other tech, getting in early is important to keep up - with AI generators, I feel the promise is that as the tech gets better, you'll need to know less and less to use it.

snird a year ago

But it does mean that SD 2.0 is worse. Much, much worse.

User experience is a big part of image generation. Yet, Midjourney 4 can output better images with easier prompts.

  • toxicFork a year ago

    Midjourney is Mac, SD 2.0 is Linux

    • okamiueru a year ago

      I get it, and I somewhat resent it.

      The user perception of an OS is mostly third-party software availability and driver support. I've used all three, and as far as the operating system itself goes, Linux is the only one that feels good to use.

      But I suppose the analogy isn't entirely inaccurate either. MJ is entirely owned by someone else and can be taken away at any time, along with anyone's access to it. SD, meanwhile, is flexible, provides a stable foundation to build on, and you can extend it any way you want...

      I couldn't have built a gRPC-based wrapper around MJ, set up a bot that listens for prompts sent through Telegram, and posted back images. Hm.. or I suppose one could do the same with the MJ API... so, bad example :D

sfusato a year ago

Does anyone know how much Stable Diffusion 2.0 cost to train? Also, is their model open source? Can I take the code and train it myself, assuming I had the money and resources?

  • monkeyshelli a year ago

    "A lot". Some numbers on how many A100s were used to train SD 1.5, and on the size of the training data, were discussed here: https://news.ycombinator.com/item?id=33727467

    People in the SD subreddit have been finetuning the SD models, so depending on what you want to do, it should be doable.