> it was difficult to find images where the entire llama fit within the frame
I had the same trouble. In my experiment I wanted to generate a Porco Rosso-style seaplane illustration. Sadly, none of the generated pictures had the whole airplane in them; the wingtips or the tail always got cut off.
I found this method to be a reliable workaround: I downloaded the image I liked the most, used image editing software to extend it in the direction I wanted, and filled the new area with a solid colour. Then I cropped a 1024x1024 rectangle so that it contained about 40% generated image and 60% solid colour, uploaded the new image, and asked DALL-E to infill the solid area while leaving the previously generated area unchanged. I selected the extension I liked best, downloaded it, and merged it with the rest of the picture. I repeated the process as required.
You need a generous amount of overlap so the network can figure out which parts are already there and how best to fit the rest. It's a good idea to look at the image segment you need infilled: if you, as a human, can't figure out what you are seeing, then the machine won't be able to figure it out either. It will generate something, but it will look out of context once merged.
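The extension-and-infill steps above can be sketched with Pillow; the 40/60 split, the rightward extension direction, and the solid fill colour are illustrative assumptions:

```python
# Sketch of the manual outpainting workflow: extend the canvas, fill the
# new area with a solid colour, and crop a 1024x1024 tile that is roughly
# 40% existing image and 60% fill for DALL-E to infill.
from PIL import Image

TILE = 1024                # DALL-E's edit canvas size
KEEP = int(TILE * 0.4)     # ~40% overlap of already-generated pixels

# Stand-in for the downloaded generation (normally Image.open("generated.png")).
src = Image.new("RGB", (1024, 1024), (40, 80, 160))
w, h = src.size

# 1. Extend the canvas to the right; the new strip is a solid colour.
extended = Image.new("RGB", (w + TILE - KEEP, h), (255, 0, 255))
extended.paste(src, (0, 0))

# 2. Crop the 1024x1024 tile straddling the boundary; this is what gets
#    uploaded, with a request to infill the solid area only.
tile = extended.crop((w - KEEP, 0, w - KEEP + TILE, h))

# 3. Paste the infilled tile back at the same offset (using the
#    un-infilled tile here as a stand-in) and repeat as required.
extended.paste(tile, (w - KEEP, 0))
```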
The other trick I found: I wanted to make my picture into a canvas print, so I needed a higher-resolution image, higher even than what I could reasonably hope for with the above extension trick. What I did is upscale the image (I used bigjpg.com, but there may be better solutions out there). After that I had a big image, but of course it didn't have many small-scale details. So I sliced it up into 1024x1024 rectangles, uploaded the rectangles to DALL-E, and asked it to keep the borders intact but redraw their interiors. This second trick worked particularly well on an area of the picture that showed a city under the airplane: it added nice small details like windows, doors, and textured roofs without disturbing the overall composition.
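The slicing step can be sketched the same way; the 4096x4096 upscaled size is an assumption for illustration:

```python
# Slice a large upscaled image into 1024x1024 tiles; each tile would be
# uploaded to DALL-E with a request to keep the borders but redraw the
# interior, then pasted back at the same offset.
from PIL import Image

TILE = 1024
big = Image.new("RGB", (4096, 4096), (200, 200, 200))  # stand-in for the upscale

tiles = []
for top in range(0, big.height, TILE):
    for left in range(0, big.width, TILE):
        tiles.append((left, top, big.crop((left, top, left + TILE, top + TILE))))

# After re-detailing each tile, reassemble the full image:
result = Image.new("RGB", big.size)
for left, top, t in tiles:
    result.paste(t, (left, top))
```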
I was testing to see how close I could get to replicating a t-shirt graphic concept I saw.
I had been using ~"A telephoto shot of A neglected police car from the 1980s Viewed from a 3/4 angle sits in the distance. The entire vehicle is visible but it is overgrown with grass and flowery vines"
This process sounds great, though it seems like DALLE needs to offer tools to do this automagically.
These models are trained on pairs of images and caption text, so they work better with text inputs that resemble descriptions of paintings than with simple descriptions or with William-Gibsonian hyperspecified description-text, though it's tempting to try the latter two.
Good question! All of them had the same postfix ", studio ghibli, Hayao Miyazaki, in the style of Porco Rosso, steampunk". I used this for all the generations in the hopes of anchoring the style.
With the prefix of the prompt I described the image. I started the extension operations with "red seaplane over fantasy mediterranean city" but then I quickly realised that this was making the network generate floating cities in the sky for me. :D So then I varied the prompt. "red seaplane on blue sky" in the upper regions and "fantasy mediterranean city" in the lower ones.
I went even more specific and used "mediterranean sea port, stone bridge with arches" prefix for a particular detail where I wanted to retain the bridge (which I liked) but improve on the arches. (which looked quite dingy)
(I have just counted and it seems I have used 27 generations for this one project.)
Anytime! I have uploaded the image in question: the initial prompt with first generated images, the extended raw image, and then the one with the added details on the city.
I think "fitting the entire X within the image" is not done on purpose. The results are more aesthetically pleasing when the subject is large, even if a part of it is missing.
Here is "llama in a jersey dunking a basketball like Michael Jordan, shot from below, tilted frame, 35°, Dutch angle, extreme long shot, high detail, dramatic backlighting, epic, digital art": https://imgur.com/a/7LoAtRx
Here is "Llama in a jersey dunking a basketball like Michael Jordan, screenshots from the Miyazaki anime movie", much worse: https://imgur.com/a/g99G7Bn
Craiyon has stepped up a lot in its understanding recently. The image quality is still not the best, but if you ignore the blurriness, the scary faces, and the weird shapes, it can sometimes be better than Dall-E.
Wow. Those models, particularly Imagen, are of an entirely separate calibre. None of that psychedelic foggy memory swirl that is characteristic of the space. I can see why Google Research is hesitant to release them.
After using all of the different models extensively, Stable Diffusion is currently state-of-the-art.
Images are more artistic and less clip art-like than Dall-E, but also don’t have a house style like Midjourney. It’s stunningly good - and open source.
What’s really cool is that the devs have worked hard to optimise the model, so after being trained on 1000 A100s it’ll run happily on an 8gb graphics card or M2 Mac.
I'm usually very much a skeptic when it comes to "revolutionary" tech. I think the blockchain is crap. I think fully self-driving cars are still a long way away. I think that VR and the metaverse are going to remain gimmicks in the foreseeable future.
But this DALL-E thing, it's really blowing my mind. That and deep fakes, now that's sci-fi tech. It's both exciting and a bit scary.
The idea that in the not so far future one will be able to create images (and I presume later, audio and video) of basically anything with just a simple text prompt is rife with potential (both good and bad). It's going to change the way we look at art, it's also going to give incredibly powerful creative tools to the masses.
For me the endgame would be an AI sufficiently advanced that one could prompt "make an episode of Seinfeld that centers around deep fakes" and you'd get an episode virtually indistinguishable from a real one. Home-made, tailor-made entertainment. Terrifyingly amazing. See you in a few decades...
I'm in the exact same head space as you here. This is the most revolutionary thing I've seen in my entire life and I can't even begin to imagine what this is going to look like in 20 years. Looking at r/dalle2 literally blows my mind. Makes me want to give up my cushy full stack job and go all-in on ML.
/r/weirddalle is also great for some inspiration, though most of the entries are memes generated by Dall-e Mini/Craiyon. I often find art styles and modifiers that I never considered, like "Byzantine mosaic" or "Kurzgesagt video thumbnail".
“In working with DALL·E 2, it’s important to be specific about what you want without over-stuffing or adding redundant words.”
I found this to be the most important point from this piece. Often people don't really know what they want when it comes to creative work, let alone how to convey it to some omniscient algorithm. In spite of that, it's a delight to see something you love come out of an unspecific prompt, in a way you won't find with anything you receive from a human.
Dall.E 2 never ceases to amaze me.
For anyone interested in learning about what Dall.E 2 can do, the author also links to the Dall.E 2 prompt book (discussed in this post https://news.ycombinator.com/item?id=32322329).
> DALL·E 2 struggles to generate realistic faces. According to some sources, this may have been a deliberate attempt to avoid generating deepfakes.
That might be true, but after experimenting with DALL·E 2 last week (and spending more than $15), I have a different theory.
My tests focused on how well it could create art works around three common themes: still life, landscape, and portrait. For the first two categories, almost all the results were works that would not have looked out of place in a museum or art gallery. In contrast, with the prompt of “A painting of a young woman sitting in a chair” and variations, while DALL·E 2 produced convincing clothing, furniture, background, etc., the faces were mostly horrible. I started adding “from the rear” and “turned to the side” to the prompt just to get the face out of the picture.
I came to suspect that DALL·E 2 is bad at faces not because the developers made it that way but because human beings are uniquely hardwired to recognize faces. Most people are able to recognize and remember hundreds of faces, and we are very sensitive to minor changes in their configurations (i.e., facial expressions). When we look at a painting of a person sitting in a chair, we don’t care if aspects of the chair, the person’s clothing, etc. are not precisely accurate; a slight distortion of the face, however, can ruin the entire work. DALL·E 2 does not seem to have been trained to have the same sensitivity to faces that humans have.
If anyone is interested, the works that DALL·E 2 created for me are at [1]; video slideshows with musical accompaniment are at [2].
It's only small faces that are distorted, and they are often heavily distorted. It's not an "uncanny valley" effect; they look like disfigured pieces of meat and skin. It's the same in dalle-mini.
Dalle2 can clearly generate super-realistic faces without any problem, as you can see from most of the posts at r/dalle2.
The issue with small faces might be architectural if there is context-aware upscaling going on in the network, where a face needs to start larger than some smallest scale or it won't survive that process. That in turn might be an issue of too little training. A small face in a photo in the training data won't generate as much error gradient if it goes wrong as a larger face, but as you suggest we as viewers are much more prone to scrutinize faces even though they are small.
That's probably not the reason. Generating faces was one of the first things GANs were ever used for. They can make near perfect faces because the internet is flooded with images of faces, often high quality celebrity shots.
The reason it can't do faces well is very likely the filters being applied to try to stop people making pictures of real people. This is probably also the explanation for the random misses where it paints pictures of something that's not a llama. OpenAI is rewriting queries to make them more "diverse", i.e. acceptable to leftist ideology, and their rewriting logic seems to be completely broken. There have been many reports of people requesting something without any humans in it at all and discovering black/asian/arab people cropping up in it. At least earlier versions of the filter involved simply stuffing words onto the end, as proven by people requesting "Person holding a sign that says " and getting back signs saying "black female" etc.
Man asks for a cowboy + a cat and gets a portrait of an Asian girl. Gwern comments with an explanation:
"tldr: it's the diversity stuff. Switch "cowboy" to "cowgirl", which would disable the diversity stuff because it's now explicitly asking for a 'girl', and OP's prompt works perfectly."
Big discussion thread where people discuss the problem and (of course) the censorship that tries to hide what's happening:
"I once tried some food photography and received a cheese with a guys face for no reason."
"This has been mentioned on this sub multiple times, but those threads have consistently been removed by the mods - as will this one."
"There was a thread about that prompt and, yes, the person did get diverse [sumo wrestlers]"
"Been doing women images and seeing the article decided to try narrowing the results to "caucasian woman". Still gave me diversity. Whether you want it, or not, you're getting diversity"
I ran into this too. When I got my invite, I told a friend I would learn how to talk to DALL-E by having it make some concept art for the game he was designing. I ran through all of my free credits, and most of the first $15 bucket and never really got anything usable.
Even when I re-used the exact prompts from the DALL-E Prompt Book, I didn't get anything near the level of quality and fidelity to the prompt that their examples did.
I know it's not a scam, because it's clearly doing amazing stuff under the hood, but I went away thinking that it wasn't as miraculous as it was claimed to be.
I suspect that many of the "impressive" examples that we see from tools like this have been carefully selected by human curators. I'm sure it's not at the level of "monkeys + typewriters = Shakespeare [if you're sufficiently selective]", but the general idea is still applicable.
Most of DALL-E2 output is great out of the box, the selection process is just fine tuning the results to create something the human in front of the computer likes. DALL-E2 can't mindread, so the image produced might not match what the human had in mind.
There is however one thing to be aware of: the titles posted on /r/dalle2/ and other places are often not the prompts that DALL-E2 got. Instead they are a fun description of the image written by a human after the fact. Random example:
Which is quite a bit less impressive, as the actual prompt doesn't really match the image very well. And if you put "Chased by an amongus segway" into DALL-E2, you won't get an image of that quality either.
I wouldn't at all agree that most Dall-E output is great out of the box. It has areas that it's good at and areas it's poor at.
Here's a result for the prompt of "Woman with green skin, leaves instead of hair, wearing a simple dress, far shot, digital art, hyper-realistic, 8k, ultrahd," for example (all four images)
You will note that none of them are even basically fulfilling the prompt, as well as all four being, in my estimation, ugly and uninteresting. That's not unusual for prompts that involve some element of the fantastic -- though there are corners of less-realistic digital art that it does do well.
You will still note that none of them are far shots, that no depicted character actually has fully green skin, and one of the four has nothing even remotely like leaves for hair. I mean, is it better? Sure. They're less ugly, though none of them are what I'd call great results. But they also aren't really doing a basically competent job of fulfilling the prompt, much less producing a particularly striking or interesting images.
And my point is, outside of a few areas, this is what you get from Dall-E. Lots of misses, and if you're willing to put time into it and work on your results, a few hits. Don't get me wrong, I've gotten stuff from Dall-E that I think is great (I really like this "watercolor painting" for example: https://labs.openai.com/s/AQ7Wy5VHBWcLL5bJ5LbU5SuW) but I think it misrepresents Dall-E to suggest that most of the time it produces basically good images.
I'd say more like, "If you put time and attention into learning its quirks, in its best areas, it'll produce like one in ten images that are basically good."
And, I mean, on some level that's incredible. You can produce 10 images in about three minutes in Dall-E and get some great stuff. But I think people mostly see the top 10% of what Dall-E produces.
I think Dalle's ability to produce good images out of the gate is pretty limited, but I've found that using the fill-in feature along with existing images from google and photoshop, I can pretty much get anything I conceptualize with about 20 minutes of work and like 10 prompts.
It's not fully removing humans from the equation, but you can take something that used to take days and make it a 20 minute operation.
If you're not tired of the whole affair, you should try MidJourney. It's good at different things from DALL-E, but I do feel it produces higher quality pictures on average.
The images remind me of one of my dreams, where logic and reasoning are thrown out and the pure gist of the thing is taken. I wonder if it's because it's built on vector operations and calculus to find the closest or fuzzy match for essentially everything it determines, sans cognition, that things tend to come out fuzzy or quasi-close but not quite there. Very entertaining post.
I have my own API key as well, though without DALL-E 2 access just yet, but it seems similar in terms of prompting text in stages to get what you want. It feels kind of like negotiating with it in some way.
A lot of dream scenery seems to throw logic and reasoning out of the window. Even small sensory inputs can make a huge difference to a dream sequence, and in many cases they don't make sense even in the context of the dream.
I haven't personally experienced any hallucinations myself, but some DALL-E images seem awfully similar to what some people describe.
I know that comparisons between brains and machine learning (including neural networks) are superficial at best, but I still wonder if DALL-E is mimicking, in its own way, a portion of our larger brain processing 'pipeline'.
Spot on, like the more basic part of a raw dream feed without rhyme or reason. Maybe even laying the groundwork for an experience architecture's input when that day finally comes, who knows.
The first thing I noticed was that it had no distinct features of a basketball; it looks more like a bowling ball with the swirly things on it. Kind of adds to your dream thought.
>the ball is positioned in such a way that the llama has no real hope of making the shot
I love that we're at the level where the physical "realism" of correctly representing quadrupeds playing basketball is a thing now. I suppose the next-level AI will be expected to model a full 3D environment with physical assumptions based on the prompt and then run the simulation.
That's the only way to get reliably usable output.
There's a lot of "80% there but not quite" in the current version, which makes it more of a novelty than a useful content generator.
The problem with moving to 3D is that there are almost no 3D data sources that combine textures, poses (where relevant), lighting, 3D geometry, and (ideally) physics.
They can be inferred to some extent from 2D sources. But not reliably.
Humans operate effortlessly in 3D and creative humans have no issues with using 3D perceptions creatively.
But as far as most content is concerned, it's a 2D world. Which is why AI art bots know the texture of everything and the geometry of nothing.
AI generation is going to be stuck at nearly-but-not-quite until that changes.
While not a full solution, there are a lot of freely available 3D models that could be used as a starting point. I'd love a dalle2 for 3D model generation, even if no textures, lighting, or physics were there.
Boom... Your consciousness is deleted as the DALL-E 4 output for "Evolved monkey person at a computer, wasting time" is delivered to the dinosaur that paid for it.
My current move is creating initial versions of images with Midjourney, which seems to be a bit more "free-spirited" (read: less _literal_, more flexible), and then using DALL-E's replace tool to fill in the weird-looking bits. It works pretty well, but it's a multi-step process and requires you to pay for both Midjourney and DALL-E.
Same prompts generated by Midjourney for comparison. I'd say a lot worse, but Midjourney is good at other things like sci-fi art.
Film still of a llama in a jersey dunking a basketball like Michael Jordan, low angle, shot from below, tilted frame, 35°, Dutch angle, extreme long shot, high detail, indoors, dramatic backlighting.
-- That pic came from playing with "variations" mode - the prompt was: “portrait photo, california beach with female model wearing hat and sunglasses, studio, lens flare, colourful, 4k, high definition, 35mm, HD” --
I picture in a few years we will be playing around with a code generation tool, and people will be drawing similar conclusions. "You have to be really specific about what you like. If you just say 'chat tool', it will allow you to chat to one other person only."
Not a professional graphic designer, but did some graphic design classes and photography/photo editing classes in college, and still do it as a hobby.
Things that at one time took days can now be done in minutes with some skillful use of Dall-E + Photoshop. IMO, any image editing software that incorporates a similar technology will take over the market and it'll be one of the most important features in any graphic designer's toolkit.
A talented graphic designer who can also use a dall-e like tool is worth at least 5x the pay of one who can't (although I don't think we're going to get a "prompt engineer" title, it's really not that difficult a skill to pick up for people who already do image editing).
This is really good fun, actually. Spent some time fucking around with it and it can make some impressive photorealistic stuff like "hoverbus in san francisco by the ferry building, digital photo".
I mostly use it and Midjourney for material for my DnD campaign, but I'm going to need to do a little more work to make the whole thing coherent. Only tried it once and it was okay.
The interesting part is that it can do things like "female ice giant" reasonably whereas google will just give you sexy bikini ice giant for stuff like that which is not the vibe of my campaign!
My two cents: the techniques OP uses are absolutely valid, but I've found much more success "sampling" styles and poses from existing works.
Rather than trying to perfectly describe my image, I like to use references where the source material already has what I want. With minimal direction, these prompts get impressively close:
It's fun to play around with it, but like the author found, what you get is often strange or useless. I also find 1k images too small to do much with but I realize making 4k images would be cost prohibitive. I also wish it could generate vector images as well as pixel images. That would be fun to use.
Hi, author here - that's a great point. When I first saw those results and how inaccurate they were, I thought there was a chance it was returning me an overfitted actual input image from training. Most likely not, but they were so realistic (and I was used to just seeing llamas until this point), that I thought I'd play it safe.
I'm not sure why anyone bothers. StyleGAN2 profile photos are literally all over social media and they're good enough to fool the human reviewers every time I report them.
Wow, the blogs posted here are awesome; the octopus and this llama especially.
I can't seem to get it to work myself. I think it's not very good at real things. I tried fitness-related images, and everything came out weird. It's probably better with fantasy-type stuff, since it has to be less accurate.
Hi, author here - I didn't specify that part, which is exactly why I love that image. The full prompt was "Action photo of a llama in a jersey dunking a basketball like Michael Jordan, dramatic backlighting, anime key visuals." (link to the image: https://labs.openai.com/s/5bVuPDdnv2O6xgxuleBlTZPj)
At least 10% of web dev today is being good at search prompts for Google. (And that's not necessarily a bad thing, it's just about finding the right tool or pattern for your specific problem)
I suspect this is a joke, but I did find that it was a little overzealous with the filtering. I was trying to get someone (not a specific person) shouting or with an angry expression, and a few prompts I came up with were blocked. Not banned though.
I kept getting a scene with "two people holding hands" blocked, it allowed "two people kissing" and then when I tried "and wife" instead of "two people" it banned me. (They unbanned me when I emailed them though.)
Oddly, the ones it blocked were more sfw than several others it allowed, but of course I don’t know what the outputs would’ve been…
I tried a number of these generators a week ago (or so), all with the same prompt: "A child looking longingly at a lollipop on the top shelf" with pretty abysmal (and sometimes horrifying) results. I'm not sure if my expectations are too high, but maybe I was doing it wrong?
A lot of these posts showing up on HN. I wonder - is it because it is so new, or is it because the ways in which we are to use this technology are so nascent that we are discovering how to use it more precisely daily?
I believe it’s for a few reasons. First, it is jaw-droppingly incredible for most people in tech who have at least a hint of how most ML works. Second, the AI image generation field is racing ahead, in academics and in newly trained models, so there’s lots of news. Third, some really great models like Dall-e have been opened for wider access, and lots of everyday users are discovering their capabilities and doing blog write-ups, which are not news but are surely interesting to most.
The fact that it's a derivative of an existing work is noteworthy, but I gave it absolutely no guidance on the topic. If I suggest something, it will give it a go with similar fervor. eg https://imgur.com/a/N1qWaSV
I want to see a few iterations of describing an image with AI, generating it, describing it again, generating it... Like when passing a piece of text through Google translate back and forth.
There was a tool that could find the "equilibrium" called Translation Party. I don't think it works anymore. I'd love to see one that goes back and forth between DALL-E and an image description algorithm.
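A toy sketch of such a loop, run until the caption stops changing; `generate_image` and `describe_image` are hypothetical stand-ins for a text-to-image model and a captioning model, not real APIs:

```python
# Iterate prompt -> image -> caption -> ... until a fixed point is
# reached (or a previously seen caption recurs, i.e. a cycle).
def find_equilibrium(prompt, generate_image, describe_image, max_rounds=10):
    seen = set()
    image = None
    for _ in range(max_rounds):
        image = generate_image(prompt)
        caption = describe_image(image)
        if caption == prompt or caption in seen:
            return prompt, image   # equilibrium (or cycle) found
        seen.add(caption)
        prompt = caption
    return prompt, image           # gave up after max_rounds
```

Whether such a loop converges at all would depend entirely on how the two models round-trip each other's output, much like Translation Party's back-and-forth translations.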
According to internet popular belief, you'd end up with a picture of a certain ignominious dictator that unfortunately destroyed Europe in the 1940's. [1]
If you think it’s hard to get an AI to render what’s in your mind, try another human artist. Specifying something visually complex with an assumption that it’ll be precisely what you’re imagining is shockingly hard. I’m not surprised prompt creation is so complex. At least with the AI bots the turn around time for iteration is tight. That said humans likely iterate fewer times, but each iteration takes a long time.
Purely economic take: I’m sure that as knowledge builds over time, people will get more efficient at prompt generation, but the $15 in credits ignores the cost of the time spent to build the final prompt. I wonder how this compares to a junior graphic designer in terms of TCO.
The future is graphic designers who can proficiently use Dall-E. You can't get what you want easily with just a prompt, but you can also have it modify existing photos, so Dall-E + Photoshop is very powerful
I've tried out a couple of prompts from the post in Stable Diffusion and as expected the results were much weaker. It has drawn some alpacas and basketballs with little relation between the objects.
I've been playing with Stable Diffusion a lot, and in my experience its results are much weaker than what's shown in this post. The artistic pictures it generates are beautiful, often more beautiful than the Dalle-2 ones. But it has a real problem understanding the basic concepts of anything beyond the simplest task like "draw a character in this or that style". And explaining the situation in detail doesn't help; the AI just stumbles over basic requests.
Stable Diffusion seems to have a much shallower understanding of what it draws and can only produce good results for things very similar to the images it learned from.
For example, it could generate really good Dutch still life paintings for me, with fruits, bottles, and all the regular objects expected in that genre. But when I asked it to add some unusual objects to the painting (like a Nintendo Switch, or a laptop), it couldn't grasp the concept and just added more garbled fruit. Even though the system definitely knows what a Switch looks like.
The results in the post are much more impressive. I doubt that Dalle-2 saw many similar images in training, but in all of the styles and examples it clearly understood how a llama would interact with a basketball, what their relative sizes are, and so on. On the surface, results from different engines might look similar, but to me this is an enormous difference in quality and sophistication.
Stable Diffusion has a smaller text encoder than Dalle 2 and other models (Imagen, Parti, Craiyon) so that it can fit into consumer GPUs. I believe StabilityAI will train models based on a larger text encoder; since the text encoder is frozen and does not require training, scaling it is relatively cheap.
For now this is the biggest bottleneck with Stable Diffusion; the generator itself is really good, and the image quality alone is incredible (managing to outperform Dalle 2 most of the time).
Is it hard to reimplement that algorithm? I want to see what people would do with a porn-enabled image generator. Hopefully Pornhub is already hiring data scientists.
NFTs are just numbers on a blockchain. The picture is a canard. In the US I don’t think you can copyright DALL-E images as they aren’t created by a human, so you spend money to make them and anyone else can use them.
Machine learning just glues together existing things, which is how art is created. As amusing as these pictures are, it's us humans who bring meaning to them, both when producing what these algorithms use as input and when consuming their output. We are the actual magic behind DALL-E.
An AGI wouldn't need us to this extent, or at all. An AGI would also be able to come up with new ways to represent ideas, even ways that are foreign to us.
This tells us little about AGI. It might seem like it does but this is an incredibly narrow specific set of technologies. They work together to produce some startling results (with many limitations) but this is just another narrow application.
I suspect AGI, depending on how it's defined, will be with us in some form in the next few decades at most. Just a hunch. This is nothing to do with that mission though, imho. Maybe you can read into it something like, "we are solving lots of discrete problems like this, maybe we can somehow glue them together into a higher level program"? That might give you something AI-esque. My guess is that 'true' AGI will have an elegant solution rather than a big bag of stuff glued together.
Yesterday I saw one of Gandalf eating samples at Costco. I was laughing hysterically for a minute. AI is not supposed to have a sense of humor; that was supposed to be the last province of the human. But it has been quite a while since a human made me laugh like that.
If I write a Python script that cuts together a bunch of pictures and the output makes you laugh, the script hardly deserves all the credit. It's us humans that create meaning.
I saw that on reddit. The face was horrific and not at all human-like. It didn’t have a sense of humour - it just took a prompt and mashed some things together, but the prompt was funny and the image was horrifying. Not even uncanny valley stuff, but “Gandalf was in a bad motorcycle accident and will never look like a human again” bad.
And this AI doesn't. Your anecdote is totally unrelated to the idea of AGI in the gp post. The fact that it made you laugh is a happenstance. It was not "trying" to make you laugh.
It’s only unrelated if there’s no proto-AGI going on. Many images give me a moment of doubt, even though I absolutely know that I’m looking at nothing more than the output of a pile of model weights, says I the pile of neurons.
https://archive.ph/RwY42
> it was difficult to find images where the entire llama fit within the frame
I had the same trouble. In my experiment I wanted to generate a Porco Rosso style seaplane. illustration. Sadly none of the generated pictured had the whole of the airplane in them. The wingtips or the tail always got left off.
I found this method to be a reliable workaround: I have downloaded the image I liked the most. Used an image editing software to extend the image in the direction I wanted it to be extended and filled the new area with a solid colour. Cropped a 1024x1024 size rectangle such that it had about 40% generated image, and 60% solid colour. Uploaded the new image and asked DALL-E to infill the solid area while leaving the previously generated area unchanged. Selected from the generated extensions the one I liked the best, downloaded it and merged it with the rest of the picture. Repeated the process as required.
You need a generous amount of overlap so the network can figure out which parts are already there and how best to fit the rest. It's a good idea to look at the image segment you want infilled: if you as a human can't figure out what you are seeing, then the machine won't be able to figure it out either. It will generate something, but it will look out of context once merged.
The other trick I found: I wanted to make my picture a canvas print, so I needed a higher-resolution image, higher even than what I could reasonably hope for with the above extension trick. So I upscaled the image (I used bigjpg.com, but there might be better solutions out there). After that I had a big image, but of course it didn't have many small-scale details. So I sliced it into 1024x1024 rectangles, uploaded the rectangles to DALL-E, and asked it to keep the borders intact but redraw their interiors. This second trick worked particularly well on an area of the picture that showed a city under the airplane. It added nice small details like windows and doors and textured roofs without disturbing the overall composition.
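The slicing and re-assembly can also be automated. A rough sketch with Pillow (the function names and tile bookkeeping are my own invention; the per-tile redraw in DALL-E remains a manual step in between):

```python
from PIL import Image

def slice_tiles(img, tile=1024):
    """Split an upscaled image into tile-sized pieces, keeping each
    piece's bounding box so the redrawn versions can be pasted back."""
    tiles = []
    for top in range(0, img.height, tile):
        for left in range(0, img.width, tile):
            box = (left, top,
                   min(left + tile, img.width),
                   min(top + tile, img.height))
            tiles.append((box, img.crop(box)))
    return tiles

def merge_tiles(size, tiles):
    """Reassemble (possibly redrawn) tiles into a full-size canvas."""
    canvas = Image.new("RGB", size)
    for box, piece in tiles:
        canvas.paste(piece, box[:2])   # paste at (left, top)
    return canvas
```

Between `slice_tiles` and `merge_tiles` you would swap each tile for its DALL-E-redrawn version; as long as the borders stay intact, the seams remain invisible.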
What I did:
I had similar problems trying to get the whole of a police car overgrown with weeds.
https://imgur.com/a/U5Hl2gO
I was testing to see how close I could get to replicating a t-shirt graphic concept I saw.
I had been using ~"A telephoto shot of A neglected police car from the 1980s Viewed from a 3/4 angle sits in the distance. The entire vehicle is visible but it is overgrown with grass and flowery vines"
This process sounds great, though it seems like DALLE needs to offer tools to do this automagically.
These models are trained on pairs of images and caption text, so they work better with inputs that resemble descriptions of paintings than with bare descriptions or with William-Gibsonian hyperspecified description-text, though it's tempting to try the latter two.
https://imgur.com/a/YB5StlE
Original: https://foreveryonecollective.com/products/abolition-is-crea...
That’s right!
What prompts did you use for the infill and detail generation?
Good question! All of them had the same postfix ", studio ghibli, Hayao Miyazaki, in the style of Porco Rosso, steampunk". I used this for all the generations in the hopes of anchoring the style.
With the prefix of the prompt I described the image. I started the extension operations with "red seaplane over fantasy mediterranean city" but then I quickly realised that this was making the network generate floating cities in the sky for me. :D So then I varied the prompt. "red seaplane on blue sky" in the upper regions and "fantasy mediterranean city" in the lower ones.
I went even more specific and used "mediterranean sea port, stone bridge with arches" prefix for a particular detail where I wanted to retain the bridge (which I liked) but improve on the arches. (which looked quite dingy)
(I just counted, and it seems I used 27 generations for this one project.)
> I quickly realised that this was making the network generate floating cities in the sky for me
Maybe Dalle-2 is just secretly a studio Ghibli/Miyazaki movie fan.
MidJourney allows you to specify other aspect ratios. DALL-E's square constraint makes a lot of things more difficult than they need to be IMO.
Stable Diffusion has this too. It's a really cool feature to have and to play around with.
Wow, I've had the same trouble and these are some great tips! Thanks for sharing
Anytime! I have uploaded the image in question: the initial prompt with first generated images, the extended raw image, and then the one with the added details on the city.
https://imgur.com/a/QEU7EJ2
This is a fantastic end result. Thanks for sharing your process to get there.
I think "fitting the entire X within the image" is not done on purpose. The results are more aesthetically pleasing when the subject is large, even if a part of it is missing.
Very nice result. But the plane doesn't look very seaplane-y to me. Did you also try it with a plain plane?
I was curious to compare results with Craiyon.ai
Here is "llama in a jersey dunking a basketball like Michael Jordan, shot from below, tilted frame, 35°, Dutch angle, extreme long shot, high detail, dramatic backlighting, epic, digital art": https://imgur.com/a/7LoAtRx
Here is "Llama in a jersey dunking a basketball like Michael Jordan, screenshots from the Miyazaki anime movie", much worse: https://imgur.com/a/g99G7Bn
Craiyon has stepped up a lot in its understanding recently. The image quality is still not the best, but if you ignore the blurriness, the scary faces, and the weird shapes, it can sometimes be better than DALL-E.
Fascinating, are there any other similar products in this same category as DALL.E and Craiyon?
Not products, but Google Research already published papers about two different, better models:
https://parti.research.google/ https://imagen.research.google/
The models themselves are not public however.
Wow. Those models, particularly Imagen, are of an entirely separate calibre. None of that psychedelic foggy memory swirl that is characteristic of the space. I can see why Google Research is hesitant to release them.
After using all of the different models extensively, Stable Diffusion is currently state-of-the-art.
Images are more artistic and less clip art-like than Dall-E, but also don’t have a house style like Midjourney. It’s stunningly good - and open source.
What’s really cool is that the devs have worked hard to optimise the model, so after being trained on 1000 A100s it’ll run happily on an 8gb graphics card or M2 Mac.
wombo.ai and midjourney
I'm usually very much a skeptic when it comes to "revolutionary" tech. I think the blockchain is crap. I think fully self-driving cars are still a long way away. I think that VR and the metaverse are going to remain gimmicks in the foreseeable future.
But this DALL-E thing, it's really blowing my mind. That and deep fakes, now that's sci-fi tech. It's both exciting and a bit scary.
The idea that in the not so far future one will be able to create images (and I presume later, audio and video) of basically anything with just a simple text prompt is rife with potential (both good and bad). It's going to change the way we look at art, it's also going to give incredibly powerful creative tools to the masses.
For me the endgame would be an AI sufficiently advanced that one could prompt "make an episode of Seinfeld that centers around deep fakes" and you'd get an episode virtually indistinguishable from a real one. Home-made, tailor-made entertainment. Terrifyingly amazing. See you in a few decades...
I'm in the exact same head space as you here. This is the most revolutionary thing I've seen in my entire life and I can't even begin to imagine what this is going to look like in 20 years. Looking at r/dalle2 literally blows my mind. Makes me want to give up my cushy full stack job and go all-in on ML.
If you're interested in browsing creative prompts, I highly recommend the reddit community at r/dalle2.
Some are impressive:
And others are hilarious:
Clickable links for the lazy (it seems that the http:// is required to make it work):
http://www.reddit.com/r/dalle2/comments/uzosy1/the_rest_of_m...
http://www.reddit.com/r/dalle2/comments/vstuns/super_mario_g...
http://www.reddit.com/r/dalle2/comments/v0pjfr/a_photograph_...
http://www.reddit.com/r/dalle2/comments/wbbkbb/healthy_food_...
http://www.reddit.com/r/dalle2/comments/wlfpax/the_elements_...
Old reddit links for the old and grumpy, like me:
http://old.reddit.com/r/dalle2/comments/uzosy1/the_rest_of_m...
http://old.reddit.com/r/dalle2/comments/vstuns/super_mario_g...
http://old.reddit.com/r/dalle2/comments/v0pjfr/a_photograph_...
http://old.reddit.com/r/dalle2/comments/wbbkbb/healthy_food_...
http://old.reddit.com/r/dalle2/comments/wlfpax/the_elements_...
My favourite one is Kermit the Frog in the style of different movies.
https://www.reddit.com/r/dalle2/comments/v1sc2z/kermit_the_f...
/r/weirddalle is also great for some inspiration, though most of the entries are memes generated by Dall-e Mini/Craiyon. I often find art styles and modifiers that I never considered, like "Byzantine mosaic" or "Kurzgesagt video thumbnail".
https://www.reddit.com/r/weirddalle/top/?sort=top&t=all
“In working with DALL·E 2, it’s important to be specific about what you want without over-stuffing or adding redundant words.”
I found this to be the most important point in the piece. People often don't really know what they want when it comes to creative work, let alone how to convey it to some omniscient algorithm. In spite of that, it's a delight to get something you love from an unspecific prompt, something you wouldn't get from a human.
Dall.E 2 never ceases to amaze me.
For anyone interested in learning about what Dall.E 2 can do, the author also links to the Dall.E 2 prompt book (discussed in this post https://news.ycombinator.com/item?id=32322329).
> DALL·E 2 struggles to generate realistic faces. According to some sources, this may have been a deliberate attempt to avoid generating deepfakes.
That might be true, but after experimenting with DALL·E 2 last week (and spending more than $15), I have a different theory.
My tests focused on how well it could create art works around three common themes: still life, landscape, and portrait. For the first two categories, almost all the results were works that would not have looked out of place in a museum or art gallery. In contrast, with the prompt of “A painting of a young woman sitting in a chair” and variations, while DALL·E 2 produced convincing clothing, furniture, background, etc., the faces were mostly horrible. I started adding “from the rear” and “turned to the side” to the prompt just to get the face out of the picture.
I came to suspect that DALL·E 2 is bad at faces not because the developers made it that way but because human beings are uniquely hardwired to recognize faces. Most people are able to recognize and remember hundreds of faces, and we are very sensitive to minor changes in their configurations (i.e., facial expressions). When we look at a painting of a person sitting in a chair, we don’t care if aspects of the chair, the person’s clothing, etc. are not precisely accurate; a slight distortion of the face, however, can ruin the entire work. DALL·E 2 does not seem to have been trained to have the same sensitivity to faces that humans have.
If anyone is interested, the works that DALL·E 2 created for me are at [1]; video slideshows with musical accompaniment are at [2].
[1] http://www.gally.net/temp/dalleimages/index.html
[2] https://www.youtube.com/playlist?list=PLj4urky_8icRPzgFS_b98...
It's only small faces that are distorted, and they are often heavily distorted. It's not an "uncanny valley" effect; they look like disfigured pieces of meat and skin. It's the same in DALL-E Mini.
Dalle2 can clearly generate super-realistic faces without any problem, if you look at most of the posts at r/dalle2
The issue with small faces might be architectural if there is context-aware upscaling going on in the network, where a face needs to start larger than some smallest scale or it won't survive that process. That in turn might be an issue of too little training. A small face in a photo in the training data won't generate as much error gradient if it goes wrong as a larger face, but as you suggest we as viewers are much more prone to scrutinize faces even though they are small.
That's probably not the reason. Generating faces was one of the first things GANs were ever used for. They can make near perfect faces because the internet is flooded with images of faces, often high quality celebrity shots.
The reason it can't do faces well are very likely due to the filters being applied to try and stop people making pictures of real people. This is probably also the explanation for the random misses where it paints pictures of something that's not a llama. OpenAI is rewriting queries to make them more "diverse" i.e. acceptable to leftist ideology, and their rewriting logic seems to be completely broken. There have been many reports of people requesting something without even any humans in it at all, and discovering black/asian/arab people cropping up in it. At least earlier versions of the filter involved simply stuffing words onto the end as proven by people requesting "Person holding a sign that says " and getting back signs saying "black female" etc.
Man asks for a cowboy + a cat and gets a portrait of an Asian girl. Gwern comments with an explanation:
https://www.reddit.com/r/dalle2/comments/w7qvgl/comment/ihm6...
"tldr: it's the diversity stuff. Switch "cowboy" to "cowgirl", which would disable the diversity stuff because it's now explicitly asking for a 'girl', and OP's prompt works perfectly."
Big discussion thread where people discuss the problem and (of course) the censorship that tries to hide what's happening:
https://www.reddit.com/r/dalle2/comments/w944fa/there_is_evi...
"I once tried some food photography and received a cheese with a guys face for no reason."
"This has been mentioned on this sub multiple times, but those threads have consistently been removed by the mods - as will this one."
"There was a thread about that prompt and, yes, the person did get diverse [sumo wrestlers]"
"Been doing women images and seeing the article decided to try narrowing the results to "caucasian woman". Still gave me diversity. Whether you want it, or not, you're getting diversity"
I ran into this too. When I got my invite, I told a friend I would learn how to talk to DALL-E by having it make some concept art for the game he was designing. I ran through all of my free credits, and most of the first $15 bucket and never really got anything usable.
Even when I re-used the exact prompts from the DALL-E Prompt Book, I didn't get anything near the level of quality and fidelity to the prompt that their examples did.
I know it's not a scam, because it's clearly doing amazing stuff under the hood, but I went away thinking that it wasn't as miraculous as it was claimed to be.
I suspect that many of the "impressive" examples that we see from tools like this have been carefully selected by human curators. I'm sure it's not at the level of "monkeys + typewriters = Shakespeare [if you're sufficiently selective]", but the general idea is still applicable.
Most of DALL-E 2's output is great out of the box; the selection process is just fine-tuning the results into something the human in front of the computer likes. DALL-E 2 can't read minds, so the image produced might not match what the human had in mind.
There is however one thing to be aware of, the titles posted on /r/dalle2/ and other places are often not the prompts that DALL-E2 got. Instead they are a fun description of the image done by a human after the fact. Random example:
"Chased by an amongus segway"
* https://www.reddit.com/r/dalle2/comments/wkv7za/chased_by_an...
But the actual prompt was:
"Award winning photo of a mole driving a red off road car through a field"
* https://labs.openai.com/s/xnaoxiWeSjiQX1QyVUCHGkl1
Which is quite a bit less impressive, as the actual prompt doesn't really match the image very well. And if you put "Chased by an amongus segway" into DALL-E2, you won't get an image of that quality either.
I wouldn't at all agree that most Dall-E output is great out of the box. It has areas that it's good at and areas it's poor at.
Here's a result for the prompt of "Woman with green skin, leaves instead of hair, wearing a simple dress, far shot, digital art, hyper-realistic, 8k, ultrahd," for example (all four images)
https://imgur.com/a/f4d8N0u
You will note that none of them are even basically fulfilling the prompt, as well as all four being, in my estimation, ugly and uninteresting. That's not unusual for prompts that involve some element of the fantastic -- though there are corners of less-realistic digital art that it does do well.
It's not the whole problem, but it looks like the "digital art" bit is dominating the style in a way you maybe didn't intend.
Here is the same prompt minus "digital art."
https://imgur.com/a/aVbSxHe
You will still note that none of them are far shots, that no depicted character actually has fully green skin, and one of the four has nothing even remotely like leaves for hair. I mean, is it better? Sure. They're less ugly, though none of them are what I'd call great results. But they also aren't really doing a basically competent job of fulfilling the prompt, much less producing particularly striking or interesting images.
And my point is, outside of a few areas, this is what you get from Dall-E. Lots of misses, and if you're willing to put time into it and work on your results, a few hits. Don't get me wrong, I've gotten stuff from Dall-E that I think is great (I really like this "watercolor painting" for example: https://labs.openai.com/s/AQ7Wy5VHBWcLL5bJ5LbU5SuW) but I think it misrepresents Dall-E to suggest that most of the time it produces basically good images.
I'd say more like, "If you put time and attention into learning its quirks, in its best areas, it'll produce like one in ten images that are basically good."
And, I mean, on some level that's incredible. You can produce 10 images in about three minutes in Dall-E and get some great stuff. But I think people mostly see the top 10% of what Dall-E produces.
I think Dalle's ability to produce good images out of the gate is pretty limited, but I've found that using the fill-in feature along with existing images from google and photoshop, I can pretty much get anything I conceptualize with about 20 minutes of work and like 10 prompts.
It's not fully removing humans from the equation, but you can take something that used to take days and make it a 20 minute operation.
Yeah, the fill-in feature is great.
If you're not tired of the whole affair, you should try MidJourney. It's good at different things from DALL-E, but I do feel it produces higher quality pictures on average.
The images remind me of one of my dreams, where logic and reasoning are thrown out and only the pure gist of the thing remains. I wonder if, because it is built on vector operations and calculus that find the closest or fuzzy matches for essentially everything, sans cognition, things tend to come out fuzzy or quasi-close but not quite there. Very entertaining post.
I have my own API key as well, though without DALL-E 2 access just yet, but it seems similar in terms of prompting in stages to get what you want. It feels kind of like negotiating with it in some way.
> The images remind me of one of my dreams (...)
A lot of dream scenery seems to throw logic and reasoning out of the window. Even small sensory inputs can make a huge difference to a dream sequence, and in many cases they don't make sense even in the context of the dream.
I haven't personally experienced any hallucinations myself, but some DALL-E images seem awfully similar to what some people describe.
I know that comparisons between brains and machine learning (including neural networks) are superficial at best, but I still wonder if DALL-E is mimicking, in its own way, a portion of our larger brain processing 'pipeline'.
Spot on, like the more basic part of a raw dream feed without rhyme or reason. Maybe even laying the groundwork for an experience architecture's input when that day finally comes, who knows.
The first thing I noticed was that it has no distinct features of a basketball; it looks more like a bowling ball with the swirly things on it. Kind of adds to your dream thought.
Human dream sequences often have problems with faces, text and mirrors. You can train yourself to try to focus on these features when dreaming.
Most people in our dreams don't even have faces that we would recognize. When they do have faces, sometimes it is not even the right face.
>the ball is positioned in such a way that the llama has no real hope of making the shot
I love that we're at the level where the physical "realism" of correctly representing quadrupeds playing basketball is a thing now. I suppose the next-level AI will be expected to model a full 3D environment with physical assumptions based on the prompt and then run the simulation.
That's the only way to get reliably usable output.
There's a lot of "80% there but not quite" in the current version, which makes it more of a novelty than a useful content generator.
The problem with moving to 3D is that there are almost no 3D data sources that combine textures, poses (where relevant), lighting, 3D geometry, and (ideally) physics.
They can be inferred to some extent from 2D sources. But not reliably.
Humans operate effortlessly in 3D and creative humans have no issues with using 3D perceptions creatively.
But as far as most content is concerned, it's a 2D world. Which is why AI art bots know the texture of everything and the geometry of nothing.
AI generation is going to be stuck at nearly-but-not-quite until that changes.
Not fully, but there are a lot of freely available 3D models that could be used as a starting point. I'd love a DALL-E 2 for 3D model generation, even without textures, lighting, or physics.
Boom... Your consciousness is deleted as the DALL-E 4 output for "Evolved monkey person at a computer, wasting time" is delivered to the dinosaur that paid for it.
The goalposts are practically galloping down the field.
My current move is creating initial versions of images with Midjourney, which seems to be a bit more "free-spirited" (read: less _literal_, more flexible) and then using DALL-E's replace tool to fill in the weird looking bits. It works pretty well, but it's a multi-step process and requires you have pay for Midjourney and DALL-E.
Same prompts generated by Midjourney for comparison. I'd say a lot worse, but Midjourney is good at other things like sci-fi art.
Film still of a llama in a jersey dunking a basketball like Michael Jordan, low angle, shot from below, tilted frame, 35°, Dutch angle, extreme long shot, high detail, indoors, dramatic backlighting.
https://cdn.discordapp.com/attachments/999377404113981462/10...
Llama in a jersey dunking a basketball like Michael Jordan, screenshots from the Miyazaki anime movie
https://cdn.discordapp.com/attachments/999377404113981462/10...
-- spent a day with DALL-E - here are some of my favorites: https://imgur.com/a/uD5yjV3 --
You like your lobsters
-- they're the little lobsters we have over here (アカザエ)! - quite expensive - very good =) - https://en.wikipedia.org/wiki/Metanephrops_japonicus --
They reminded me of this little guys we have in the med https://en.wikipedia.org/wiki/Nephrops_norvegicus
Were a lot of your prompts just "attractive girl hat and sunglasses high quality photography"
-- hat pic are playing with "variations" mode - the prompt was: “portrait photo, california beach with female model wearing hat and sunglasses, studio, lens flare, colourful, 4k, high definition, 35mm, HD” --
I picture in a few years we will be playing around with a code generation tool, and people will be drawing similar conclusions. "You have to be really specific about what you like. If you just say 'chat tool', it will allow you to chat to one other person only."
https://pitch.com/v/DALL-E-prompt-book-v1-tmd33y
The DALL-E 2 prompt book. If anything, pretty neat look at how the various prompts come out and some of the art created by it.
Can't wait for 'Tell HN: how I make mid six figures as a prompt engineer'.
"We let our graphic designer go so we could onboard an AI Prompt Engineer"
"How much are we paying him?"
"About $225k plus bonus and equity"
"And how much was the graphic designer paid?"
"$55k"
"..."
It's the graphic design industry's own fault for not gradually renaming themselves as Pixel Intensity Engineers.
How long does it take the prompt engineer to make a design though?
Not a professional graphic designer, but did some graphic design classes and photography/photo editing classes in college, and still do it as a hobby.
Things that at one time took days can now be done in minutes with some skillful use of Dall-E + Photoshop. IMO, any image editing software that incorporates a similar technology will take over the market and it'll be one of the most important features in any graphic designer's toolkit.
A talented graphic designer who can also use a dall-e like tool is worth at least 5x the pay of one who can't (although I don't think we're going to get a "prompt engineer" title, it's really not that difficult a skill to pick up for people who already do image editing).
I think the more likely case is you'll get artists who sketch out a concept and use AI to generate the image, and then photoshop the rest.
Absolutely. See also: https://promptbase.com
And we're still in the early days.
WTAF
Unwillingly considering whether the easy bucks are worth the greasy feeling.
Engineer has lost all meaning.
"I started out as a patty inversion engineer at McDonalds."
This is really good fun, actually. Spent some time fucking around with it and it can make some impressive photorealistic stuff like "hoverbus in san francisco by the ferry building, digital photo".
I mostly use it and Midjourney for material for my DnD campaign, but I'm going to need to do a little more work to make the whole thing coherent. Only tried it once and it was okay.
The interesting part is that it can do things like "female ice giant" reasonably whereas google will just give you sexy bikini ice giant for stuff like that which is not the vibe of my campaign!
My two cents: the techniques OP uses are absolutely valid, but I've found much more success "sampling" styles and poses from existing works.
Rather than trying to perfectly describe my image, I like to use references where the source material has what you want. With minimal direction these prompts get impressively close:
"larry bird as a llama, dramatic basketball dunk in a bright arena, low angle action shot, from the movie Madagascar (2005)" https://labs.openai.com/s/wxbIbXa0HRwwGUqQaKSLtzmR
"Michael Jordan as a llama dunking a basketball, Space Jam (1996)" https://labs.openai.com/s/mX4T5Iak8CMO1rPAmjRb7oyH
At this point I'd experiment with more stylized/recognizable references or add a couple "effects" to polish up the results.
It's fun to play around with it, but like the author found, what you get is often strange or useless. I also find 1k images too small to do much with but I realize making 4k images would be cost prohibitive. I also wish it could generate vector images as well as pixel images. That would be fun to use.
"Image intentionally modified to blur and hide faces"
I thought this was strange. Why hide an AI generated face?
Hi, author here - that's a great point. When I first saw those results and how inaccurate they were, I thought there was a chance it was returning me an overfitted actual input image from training. Most likely not, but they were so realistic (and I was used to just seeing llamas until this point), that I thought I'd play it safe.
Also, I came across this article which suggests that at some point users were not allowed to share images generating human faces, artificial or not: https://mixed-news.com/en/openais-dall-e-2-may-now-generate-...
They’re being used to create fake profile pictures.
I'm not sure why anyone bothers. StyleGAN2 profile photos are literally all over social media and they're good enough to fool the human reviewers every time I report them.
Wow, the blogs posted here are awesome; the octopus and this llama are great.
I can't seem to get it to work myself. I think it's not very good at real things: I tried fitness-related images, and they all came out weird. It's probably better with fantasy-type stuff, since it has to be less accurate.
I recently made PromptWiki[0] to try to document useful prompts and examples.
I think we're at the beginning of exploring what these image models can do and what the best ways to work with them are.
[0] https://promptwiki.com
you should check out these amazing art studies by @proximasan, @EErratica, @KyrickYoung, and @sureailabs (twitter) https://proximacentaurib.notion.site/proximacentaurib/parrot...
Thank you!
> Tip: DALL·E 2 only stores the previous 50 generations in your history tab. Make sure to save your favourite images as you go.
This is kind of funny. DALL·E is one of the most impressive pieces of software, but such a basic feature like history is curiously underpowered.
History is much bigger than 50 now; 1,500 or so, if I recall correctly.
It's fascinating to me that in the first image, the llama's jersey has a drawing of a llama on it. I wonder if that was in the prompt?
Hi, author here - I didn't specify that part, which is exactly why I love that image. The full prompt was "Action photo of a llama in a jersey dunking a basketball like Michael Jordan, dramatic backlighting, anime key visuals." (link to the image: https://labs.openai.com/s/5bVuPDdnv2O6xgxuleBlTZPj)
> It’s important to tell DALL·E 2 exactly what you want
That’s not as easy as it sounds. Especially in the surreal cases that DALL-E is usually asked to produce.
Sometimes you don’t know what you want until you see it. Other times you do, but are not able to express in ways that the computer can understand.
I see being able to communicate efficiently with the machine as a future in demand skill
At least 10% of web dev today is being good at search prompts for Google. (And that's not necessarily a bad thing, it's just about finding the right tool or pattern for your specific problem)
Oh yeah. Knowing the keywords is what makes you an expert
I suspect this is a joke, but I did find that it was a little overzealous with the filtering. I was trying to get someone (not a specific person) shouting or with an angry expression, and a few prompts I came up with were blocked. Not banned though.
I kept getting a scene with "two people holding hands" blocked; it allowed "two people kissing", and then when I tried "man and wife" instead of "two people" it banned me. (They unbanned me when I emailed them, though.)
Oddly, the ones it blocked were more sfw than several others it allowed, but of course I don’t know what the outputs would’ve been…
I’m guessing they have a filter on the prompt text, but also one on the generated pictures.
I got blocked a few times with very non sexual prompts, and I suspect that the AI was a bit horny when it interpreted them.
I tried a number of these generators a week ago (or so), all with the same prompt: "A child looking longingly at a lollipop on the top shelf" with pretty abysmal (and sometimes horrifying) results. I'm not sure if my expectations are too high, but maybe I was doing it wrong?
DALL-E (and others) are great, almost magical, at specific types of images, and abysmal at others.
There was a thread on r/DigitalArt about people debating if you're really an artist if you're using these AI creator websites.
Some guy spent hours feeding the AI pictures he liked to get an end result he was happy with.
A lot of these posts showing up on HN. I wonder - is it because it is so new, or is it because the ways in which we are to use this technology are so nascent that we are discovering how to use it more precisely daily?
I believe it’s for a few reasons. First, it is jaw-droppingly incredible for most people in tech who have at least a hint of how most ML works. Second, the AI image generation field is racing ahead, in academia and in newly trained models, so there’s lots of news. Third, some really great models like DALL-E have been opened for wider access, and lots of everyday users are discovering their capabilities and doing blog write-ups, which are not news but are surely interesting to most.
Can I use NLP to generate input for DALL-E 2? That would be cool.
I used GPT-3 to 'write' a children's book and asked it to include descriptions of the illustrations.
https://docs.google.com/presentation/d/1y8EE_p8bw9dIEDguT1bT...
The fact that it's a derivative of an existing work is noteworthy, but I gave it absolutely no guidance on the topic. If I suggest something, it will give it a go with similar fervor, e.g. https://imgur.com/a/N1qWaSV
Your link doesn't seem to be publicly accessible.
You can, in fact, use GPT-3 to engineer prompts for DALL-E 2 in a sense.
https://twitter.com/simonw/status/1555626060384911360
I want to see a few iterations of describing an image with AI, generating it, describing it again, generating it... Like passing a piece of text through Google Translate back and forth.
There was a tool that could find the "equilibrium" called Translation Party. I don't think it works anymore. I'd love to see one that goes back and forth between DALL-E and an image description algorithm.
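That Translation Party-style equilibrium search between a generator and a captioner could be sketched like this (both model calls below are made-up stubs, not real APIs; a real version would swap in calls to DALL-E and an image-captioning model):

```python
# Sketch of a generate -> describe feedback loop that stops at a fixed
# point, in the spirit of Translation Party. The two "models" here are
# hypothetical stubs; replace them with real text-to-image and
# image-captioning calls.

def generate_image(prompt):
    """Stub text-to-image call: returns a token standing in for an image."""
    return f"<image of: {prompt}>"

def describe_image(image):
    """Stub captioner: real captioners lose detail each round, which we
    simulate by dropping the last word until a single word remains."""
    words = image[len("<image of: "):-1].split()
    return " ".join(words[:-1]) if len(words) > 1 else words[0]

def find_equilibrium(prompt, max_rounds=20):
    """Alternate generation and description until the caption stops changing."""
    for _ in range(max_rounds):
        caption = describe_image(generate_image(prompt))
        if caption == prompt:  # fixed point: further rounds change nothing
            return caption
        prompt = caption
    return prompt
```

With these stubs any prompt collapses toward its first word; with real models the loop would wander through captions until the generator and the describer agree on one.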
I tried that! Results were mixed: https://twitter.com/pamelafox/status/1542593090472386561
It needs a better text to image model, I think. Maybe you can fork it and improve?
Interesting! I really like the flute > cup > bathtub sequence. It has a real dreamlike disjointedness to it.
According to popular internet belief, you'd end up with a picture of a certain ignominious dictator who devastated Europe in the 1940s. [1]
[1] https://en.wikipedia.org/wiki/Godwin%27s_law
If you think it’s hard to get an AI to render what’s in your mind, try another human artist. Specifying something visually complex and assuming it’ll come out precisely as you imagined is shockingly hard. I’m not surprised prompt creation is so complex. At least with the AI bots the turnaround time for iteration is tight. That said, humans likely iterate fewer times, but each iteration takes a long time.
Purely economic take: I’m sure that as knowledge builds over time, people will get more efficient at prompt generation, but the $15 in credits ignores the cost of the time spent to build the final prompt. I wonder how this compares to a junior graphic designer in terms of TCO.
The future is graphic designers who can proficiently use Dall-E. You can't get what you want easily with just a prompt, but you can also have it modify existing photos, so Dall-E + Photoshop is very powerful
Love the stylistic ones. Amazing how it generates such good anime and vaporwave variants, like the neon vaporwave backboard.
I ran out of credits way too fast, so I like to see other people playing with it and their iterative process.
> It’s important to tell DALL·E 2 exactly what you want.
Sounds awfully like programming...
Is there randomization or will the same prompts produce the same image sets?
Always random. (In theory a seed is possible, but it isn't offered.)
So the services that sell DALL-E 2 prompts are useless.
There's some stability offered by specific prompts though.
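The role a seed would play is easy to illustrate with a toy sampler (everything here is invented for illustration; DALL-E 2's API exposes none of it): diffusion sampling starts from random noise, so pinning the noise seed pins the output, while the prompt alone only constrains it.

```python
import hashlib
import random

def sample_image(prompt, seed=None):
    """Toy stand-in for a diffusion sampler: the 'image' is just a hash of
    the prompt plus the starting noise. With no seed, fresh noise is drawn
    on every call, so outputs vary; with a fixed seed they are reproducible."""
    noise = seed if seed is not None else random.getrandbits(64)
    return hashlib.sha256(f"{prompt}|{noise}".encode()).hexdigest()[:12]
```

Here `sample_image("a llama", seed=42)` always returns the same value, while two unseeded calls almost surely differ, which is why a sold prompt can promise a style but never an exact picture.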
You can also play around for free on a slightly less sophisticated model here https://art.elbo.ai
I wonder how this would play out with the new Stable Diffusion
I've tried out a couple of prompts from the post in Stable Diffusion, and as expected the results were much weaker. It drew some alpacas and basketballs with little relation between the objects.
I've been playing with Stable Diffusion a lot, and in my experience its results are much weaker than what's shown in this post. The artistic pictures it generates are beautiful, often more beautiful than DALL-E 2's. But it has a real problem understanding the basic concepts of anything beyond the simplest task, like "draw a character in this or that style". And explaining the situation in detail doesn't help; the AI just stumbles over basic requests.
Seems like Stable Diffusion has a much shallower understanding of what it draws and can only produce good results for things very similar to the images it learned from. For example, it could generate really good Dutch still-life paintings for me, with fruits, bottles, and all the objects you'd expect in that genre. But when I asked it to add some unusual objects to the painting (like a Nintendo Switch, or a laptop), it couldn't grasp the concept and just added more garbled fruit, even though the system definitely knows what a Switch looks like.
The results in the post are much more impressive. I doubt that DALL-E 2 saw many similar images in training, but across all of the styles and examples it clearly understood how a llama would interact with a basketball, what their relative sizes are, and so on. On the surface, results from different engines might look similar, but to me this is an enormous difference in quality and sophistication.
Stable Diffusion has a smaller text encoder than DALL-E 2 and other models (Imagen, Parti, Craiyon) so that it can fit on consumer GPUs. I believe StabilityAI will train models based on a larger text encoder; since the text encoder is frozen and does not require training, scaling it up is almost free. For now this is the biggest bottleneck with Stable Diffusion: the generator is really good, and the image quality alone is incredible (managing to outperform DALL-E 2 most of the time).
Is it hard to reimplement that algorithm? I want to see what people would do with a porn-enabled image generator. Hopefully Pornhub is already hiring data scientists.
Serious question: do you actually own the generated image, or is the copyright still owned by whoever owns DALL-E 2?
I can't wait for access so I can put wacky but oddly relevant images into presentations.
I tried “machining a Siamese cat on the lathe” but with disappointing results.
How could all this play into "flooding" the NFT markets?
NFTs are just numbers on a blockchain. The picture is a canard. In the US I don’t think you can copyright DALL-E images as they aren’t created by a human, so you spend money to make them and anyone else can use them.
They're already using DALL-E for that 2021 fad.
I'm more curious how this will affect stock photography. Soon anyone will be able to generate the exact image they're looking for, no matter how obscure.
It's hard to flood the NFT market any further. It was almost all autogenerated art before DALL-E was publicly available.
There's always room for more garbage
That's a lot of llamas playing basketball to see in one day.
DALL-E is truly magic. It got me believing we are close to AGI.
I wonder what Gary Marcus or Filip Pieknewski think about it. Surely they must be eating crow.
Machine learning just glues together existing things, which is how art is created. As amusing as these pictures are, it's us humans who bring meaning to them, both when producing what these algorithms use as input and when consuming their output. We are the actual magic behind DALL-E.
An AGI wouldn't need us to this extent, or at all. An AGI would also be able to come up with new ways to represent ideas, even ways that are foreign to us.
When I see some of the bad pictures it produces I think we are nowhere near AGI
Most people would draw even worse pictures given the same prompts.
most neural networks would draw even worse pictures given the same prompts
This tells us little about AGI. It might seem like it does but this is an incredibly narrow specific set of technologies. They work together to produce some startling results (with many limitations) but this is just another narrow application.
I suspect AGI, depending on how it's defined, will be with us in some form within the next few decades at most. Just a hunch. This has nothing to do with that mission though, IMHO. Maybe you can read into it something like: "we are solving lots of discrete problems like this; maybe we can somehow glue them together into a higher-level program"? That might give you something AI-esque. My guess is that 'true' AGI will have an elegant solution rather than a big bag of stuff glued together.
We're pretty much just a big bag of stuff glued together.
Yesterday I saw one of Gandalf eating samples at Costco. I laughed hysterically for a minute. AI is not supposed to have a sense of humor; that was supposed to be the last province of the human. But it has been quite a while since a human made me laugh like that.
If I write a Python script that cuts together a bunch of pictures and the output makes you laugh, the script hardly deserves all the credit. It's us humans who create meaning.
I saw that on Reddit. The face was horrific and not at all humanlike. It didn’t have a sense of humour; it just took a prompt and mashed some things together. The prompt was funny and the image was horrifying. Not even uncanny valley shit, but “Gandalf was in a bad motorcycle accident and will never look like a human again” bad.
It’s still up on the dalle2 subreddit.
> AI is not supposed to have a sense of humor.
And this AI doesn't. Your anecdote is totally unrelated to the idea of AGI in the gp post. The fact that it made you laugh is a happenstance. It was not "trying" to make you laugh.
It’s only unrelated if there’s no proto-AGI going on. Many images give me a moment of doubt, even though I absolutely know that I’m looking at nothing more than the output of a pile of model weights, says I, a pile of neurons.
It's funny in the way that mad libs are funny. It's unexpected. The reason it is unexpected is because the computer is dumb, not because it is smart.
I think the humor came from the vibe, humiliation, dejection. Like seeing a beloved math teacher caught in an adult video store.
I also saw this one recently from Midjourney. Would not call the humor random.
https://www.reddit.com/r/midjourney/comments/w73rhv/prompt_t...
What was the prompt for that image?
What wrote the prompt?
But the prompt was not funny, only the image.
I don't think intelligence requires humor. It could be just a quirk of our brains.
> It got me believing we are close to AGI.
We are not. But maybe we are closer to replicating some of our internal brain workings.
I love this.