points by jjbinx007 6 months ago

It still can't generate a full glass of wine. Even in follow up questions it failed to manipulate the image correctly.

meeton 6 months ago

https://i.imgur.com/xsFKqsI.png

"Draw a picture of a full glass of wine, ie a wine glass which is full to the brim with red wine and almost at the point of spilling over... Zoom out to show the full wine glass, and add a caption to the top which says "HELL YEAH". Keep the wine level of the glass exactly the same."

  • cruffle_duffle 6 months ago

    Maybe the "HELL YEAH" added a "party implication" which shifted its "thinking" into a just-correct-enough region of latent space that it was able to actually hunt down some image of a truly full glass of wine somewhere in its training data.

    I almost wonder if prompting it "similar to a full glass of beer" would get it shifted just enough.

  • Stevvo 6 months ago

    Can't replicate. Maybe the rollout is staggered? Using Plus from Europe, it's consistently giving me a half full glass.

    • amy_petrik 6 months ago

      I am using Plus from Australia, and while I am not getting a full glass, nor am I getting a half full glass. The glass I'm getting is half empty.

      • DonHopkins 6 months ago

        Surprised it isn't fully empty for being upside down!

      • bb88 6 months ago

        That's funny. HN hates funny. Enjoy your shadowban.

        • BlobberSnobber 6 months ago

          Yeah. I understand that this site doesn’t want to become Reddit, but it really has an allergy to comedy, and it’s sad. God forbid you use sarcasm: half the people here won’t understand it, and the other half will say it’s not appropriate for healthy discussion…

    • coder543 6 months ago

      Is it drawing the image from top to bottom very slowly over the course of at least 30 seconds? If not, then you're using DALL-E, not 4o image generation.

      • uh_uh 6 months ago

        This top-to-bottom drawing – does it tell us anything about the underlying model architecture? AFAIK diffusion models don't work like that; they denoise the full frame over many steps. In the past there were attempts to slowly synthesize a picture by predicting the next pixel, but I wasn't aware of a shift to that kind of architecture within OpenAI.

        • cubefox 6 months ago

          Yes, the model card explicitly says it's autoregressive, not diffusion. And it's not a separate model; it's a native ability of GPT-4o, which is a multimodal model. They just didn't make this ability public until now. I assume they worked on the fine-tuning to improve prompt following.
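
          For intuition, here's a toy sketch of why the two approaches render so differently: diffusion updates the whole frame on every step, while an autoregressive model emits image tokens in sequence, which is why the picture fills in top to bottom. The "networks" below are random stand-ins, not anything OpenAI has published:

          ```python
          # Toy contrast between the two generation loops; the "models" are fakes.
          import numpy as np

          rng = np.random.default_rng(0)

          def diffusion_sample(steps=50, shape=(8, 8)):
              """Diffusion: start from pure noise and denoise the WHOLE frame each step."""
              x = rng.normal(size=shape)
              for _ in range(steps):
                  predicted_noise = 0.1 * x  # stand-in for the denoiser network
                  x = x - predicted_noise    # every pixel is updated simultaneously
              return x

          def autoregressive_sample(n_tokens=64, vocab_size=256):
              """Autoregressive: emit one image token at a time, conditioned on the prefix."""
              tokens = []
              for _ in range(n_tokens):
                  logits = rng.normal(size=vocab_size)  # stand-in for the transformer
                  tokens.append(int(np.argmax(logits)))
              return tokens  # a separate decoder would map tokens back to pixels

          print(diffusion_sample().shape, len(autoregressive_sample()))
          ```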

        • thesparks 6 months ago

          Apparently it's not diffusion, but tokens.

    • raxxorraxor 6 months ago

      The EU got the drunken version. And a good drunk knows never to top off a glass of wine. In that context the glass is already "full".

      But aside from that, the results would only be comparable if we could compare your prompts.

    • qingcharles 6 months ago

      You might still be on DALL-E. My account still is if I use ChatGPT.

      I switched over to the sora.com domain and now I have access to it.

      • cchance 6 months ago

        The free site even has it. Just don't turn on image generation; it works with it off. If you enable it, it uses DALL-E.

  • eitland 6 months ago

    Most interesting thing to me is that the spelling is correct.

    I'm not a heavy user of AI or image generation in general, so is this also part of the new release or has this been fixed silently since last I tried?

    • widerporst 6 months ago

      It very much looks like a side effect of this new architecture. In my experience, text looks much better in recent DALL-E images (so what ChatGPT was using before), but it is still noticeably mangled when printing more than a few letters. This model update seems to improve text rendering by a lot, at least as long as the content is clearly specified.

      However, when giving a prompt that requires the model to come up with the text itself, it still seems to struggle a bit, as can be seen in this hilarious example from the post: https://images.ctfassets.net/kftzwdyauwt9/21nVyfD2KFeriJXUNL...

      • remuskaos 6 months ago

        The periodic table is absolutely hilarious, I didn't know LLMs had finally mastered absurdist humor.

        • soco 6 months ago

          Yeah, who wouldn't love a dip in the sulphur pool. But back to the question: why can't such a model recognize letters as such? Can't it be trained to pay special attention to characters? How come it can print an anatomically correct eye but not differentiate between P and Z?

          • londons_explore 6 months ago

            I think the model has not decided if it should print a P or a Z, so you end up with something halfway between the two.

            It's a side effect of the entire model being differentiable - there is always some halfway point.
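
            A minimal numeric sketch of that intuition, using hypothetical 5x5 glyph bitmaps: averaging two targets the model can't decide between leaves the disputed pixels at 0.5, which renders as neither letter.

            ```python
            # If the model's output is a continuous blend of two glyphs,
            # undecided pixels land at 0.5 - neither P nor Z, i.e. mangled text.
            import numpy as np

            P = np.array([[1,1,1,0,0],
                          [1,0,0,1,0],
                          [1,1,1,0,0],
                          [1,0,0,0,0],
                          [1,0,0,0,0]], dtype=float)

            Z = np.array([[1,1,1,1,1],
                          [0,0,0,1,0],
                          [0,0,1,0,0],
                          [0,1,0,0,0],
                          [1,1,1,1,1]], dtype=float)

            blend = 0.5 * P + 0.5 * Z  # a differentiable model can always emit this midpoint
            print(blend)  # 0.5 wherever the two letters disagree
            ```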

  • dghlsakjg 6 months ago

    The head of foam on that glass of wine is perfect!

    • ASalazarMX 6 months ago

      I think we're really fscked, because even AI image detectors think the images are genuine. They look great in Photoshop forensics too. I hope the arms race between generators and detectors doesn't stop here.

      • gloosx 6 months ago

        We're not. This PNG image of a wine glass has JPEG compression artefacts which are leaking in from JPEG training data. Zoom into the image and you will see the boundaries of the 8x8 blocks used in JPEG compression, which just cannot occur in a PNG. This is a common method for detecting AI-generated images, and it works so far: no need for complex Photoshop forensics or AI detectors, just zoom in and check for compression artefacts. Current AI is incapable of getting this right; all the compression algorithms are mixed and mashed together in the training data, so on a generated image you can find artefacts from almost all of them if you're lucky, though JPEG is prevalent, obviously, since lossless images are rare online.
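
        For what it's worth, the zoom-in check can be mechanized. A rough sketch assuming Pillow and numpy; "image.png" is a placeholder path, and the ratio is illustrative, not a calibrated detector:

        ```python
        # JPEG encodes in 8x8 blocks, so JPEG heritage shows up as slightly
        # larger pixel jumps at columns/rows that are multiples of 8.
        import numpy as np
        from PIL import Image

        img = np.asarray(Image.open("image.png").convert("L"), dtype=float)

        # Mean absolute difference between adjacent columns...
        col_diff = np.abs(np.diff(img, axis=1)).mean(axis=0)
        # ...at 8-pixel block boundaries vs. everywhere else.
        boundary = col_diff[7::8].mean()                    # diffs straddling a block edge
        interior = np.delete(col_diff, np.s_[7::8]).mean()  # all the other diffs

        print(f"boundary/interior jump ratio: {boundary / interior:.3f}")
        # Ratios noticeably above 1.0 suggest 8x8 blocking despite the PNG container.
        ```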

        • ASalazarMX 6 months ago

          If JPEG compression is the only evident flaw, this kind of reinforces my point, as most of these images will end up shared as processed JPEG/WebP on social media.

          • gloosx 6 months ago

            You didn't get it. The image contains compression artifacts from ALL the different algorithms mashed up in a single picture; JPEG is just the prevalent one.

            • ASalazarMX 6 months ago

              Oh, I see. There's still room for reliable detection then.

        • londons_explore 6 months ago

          Plenty of real PNG images have JPEG artifacts because they were once JPEGs off someone's phone...

yusufozkan 6 months ago

Are you sure you are using the new 4o image generation?

https://imgur.com/a/wGkBa0v

  • minimaxir 6 months ago

    That is an unexpectedly literal definition of "full glass".

    • Loeffelmann 6 months ago

      That's the point. The old models all failed to produce a wine glass that is full completely to the brim, because you can't find much of that in the data they used for training.

      • colecut 6 months ago

        Imagine if they just actually trained the model on a bunch of photographs of a full glass of wine, knowing of this litmus test

        • gorkish 6 months ago

          I obviously have no idea if they added real or synthetic data to the training set specifically for the full-to-the-brim wineglass test, but I fully expect that this prompt is now compromised in the sense that, because it is being discussed in the public sphere, it has inherently become part of the test suite.

          Remember the old internet adage that the fastest way to get a correct answer online is to post an incorrect one? I'm not entirely convinced this type of iterative gap finding and filling is really much different than natural human learning behavior.

          • friendzis 6 months ago

            > I'm not entirely convinced this type of iterative gap finding and filling is really much different than natural human learning behavior.

            Take some artisan; I'll go with a barber. This person is not the best of the best, but still a capable barber who can implement several styles on any head you throw at them. A client comes and describes a certain style they want. The barber is not sure how to implement such a style, consults with the master barber beside them, who describes the technique required for that particular style; our barber then goes and implements that style. Probably not perfectly, as they need to train their mind-body coordination a bit, but the cut is good enough that the client is happy.

            There was no traditional training with "gap finding and filling" involved. The artisan already possessed the core skill and knowledge required, was filled in on the particulars of the task at hand, and successfully implemented it. There was no looking at examples of finished work, no looking at examples of the process, no iterative learning by redoing the task a bunch of times.

            So no, human learning, at least advanced human learning, is very much different from these techniques. Not that they are not impressive on their own, but let's be real here.

            • wegfawefgawefg 6 months ago

              Overfitting vs. generalizing.

              Also, we all know real people who fail to generalize and overfit: copycats, potentially even with great skill, but no creativity.

          • vlovich123 6 months ago

            Humans don't train on the entire contents of the Internet, so I'd wager that they do learn differently.

            • sayamqazi 6 months ago

              I think there is a critical aspect of human visual learning which machine learning can't replicate because it is prohibitively expensive. When we look at things as children we are not just looking at a single snapshot. When you stare at an object for a few seconds you have practically ingested hundreds of slightly varied images of that object. This gets even more interesting when you take into account that the real world is moving all the time, so you are seeing so many things from so many angles. This is simply not doable with compute.

              • vlovich123 6 months ago

                Then explain blind children? Or blind and deaf children? There's obviously some role senses play in development, but there are clearly capabilities at play here that are drastically more efficient and powerful than what we have with modern transformers. While humans learn through example, they clearly need a lot fewer examples to generalize off of and reason against.

                • sayamqazi 6 months ago

                  > Then explain blind children

                  I was only talking about vision tasks as an example. You can extend the idea to any sense.

                  > While humans learn through example, they clearly need a lot fewer examples to generalize off of and reason against.

                  The human brain has been developing over millennia; machines start from zero. What if this few-example learning is just an emergent capability of any "learning function", given enough compute and training?

                • wegfawefgawefg 6 months ago

                  They take in many samples of touch data.

                  • vlovich123 6 months ago

                    I think my point is that communication, more than anything, is the biggest contributor to brain development, and communication is what powers our learning. Effective learners learn to communicate more with themselves and to communicate virtually with past authors through literature. That isn’t how LLMs work. Not sure why that would be considered objectionable. LLMs are great, but we don’t have to pretend they’re actually how brains work. They’re a decent approximation of neurons on today’s silicon - useful, but nowhere near the efficiency and power of wetware.

                    Also as for touch, you’re going to have a hard time convincing me that the amount of data from touch rivals the amount of content on the internet or that you just learn about mistakes one example at a time.

                    • wegfawefgawefg 6 months ago

                      There are so many points to consider here, I'm not sure I can address them all.

                      - Airplanes don't have wings like birds but can fly, and in some ways are superior to birds (in some ways not)

                      - Human brains may be doing some analogue of sample augmentation, which gives you some multiple of equivalent training samples per real input state of the environment. This is done for ML too (a minimal sketch follows this list).

                      - Whether that input data is text or embodied is sort of irrelevant to cognition in general, but may be necessary for solving problems in a particular domain (text-only vs. sight vs. blind)
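
                      A minimal numpy sketch of that ML-side sample augmentation, with a hypothetical `augment` helper: one real image yields many slightly varied training samples, loosely analogous to the "hundreds of slightly varied images" point upthread.

                      ```python
                      # One image in, n jittered variants out, via random crops and flips.
                      import numpy as np

                      rng = np.random.default_rng(0)

                      def augment(image, n=8, crop=24):
                          """Produce n jittered variants of a single HxW image."""
                          h, w = image.shape
                          variants = []
                          for _ in range(n):
                              y = rng.integers(0, h - crop + 1)
                              x = rng.integers(0, w - crop + 1)
                              v = image[y:y + crop, x:x + crop]  # random crop
                              if rng.random() < 0.5:
                                  v = v[:, ::-1]                 # random horizontal flip
                              variants.append(v)
                          return variants

                      fake_photo = rng.random((32, 32))
                      print(len(augment(fake_photo)), augment(fake_photo)[0].shape)
                      ```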

                      • vlovich123 6 months ago

                        > Airplanes don't have wings like birds but can fly, and in some ways are superior to birds (in some ways not)

                        I think you're saying exactly what I'm saying. Human brains work differently from LLMs, and the OP comment that started this thread is claiming that they work very similarly. In some ways they do, but there are very clear differences, and while clarifying examples in the training set can improve human understanding and performance, it's pretty clear we're doing something beyond that: just from a power-efficiency perspective, humans consume far less energy for significantly more performance, and it's pretty likely we need less training data.

                        • wegfawefgawefg 6 months ago

                          Sure.

                          To be honest I don't really care if they work the same or not. I just like that they do work, and I find it interesting.

                          I don't even think people's brains work the same as each other's. Half of people can't even visually imagine an apple.

                          Neural networks seem to notice and remember very small details, as if they have access to signals from early layers. Humans often miss the minor details. There's probably a lot more signal normalization happening, which limits calorie usage and artifacts the features.

                          I don't think this is necessarily a property neural networks can't have. I think it could be engineered in. For now, though, it seems like we're making a lot of progress even without efficiency constraints, so nobody cares.

                          • sayamqazi 5 months ago

                            > Half of people can't even visually imagine an apple.

                            What is the evidence for this? Are we just taking people's word for it?

                            • wegfawefgawefg 5 months ago

                              You're one of today's lucky few, about to have your mind blown. Look this one up.

        • HelloImSteven 6 months ago

          Even if they did, I’d assume the association of “full” and this correct representation would benefit other areas of the model. I.e., there could (/should?) be general improvement for prompts where objects have unusual adjectives.

          So maybe training for litmus tests isn’t the worst strategy in the absence of another entire internet of training data…

        • orbital-decay 6 months ago

          A lot of other things are rare in datasets, let alone correctly labeled. Overturned cars (showing the underside), views from under the table, people walking on the ceiling with plausible upside down hair, clothes, and facial features etc etc

        • myaccountonhn 6 months ago

          They still can't generate a watch showing an arbitrary time, I believe, so it could be the case.

      • sejje 6 months ago

        I did coax the old models into doing it once (dall-e) but it was like a fun exercise in prompting. They definitely didn't want to.

      • jorvi 6 months ago

        The old models were doing it correctly, too.

        There is no one correct way to interpret 'full'. If you go to a wine bar and ask for a full glass of wine, they'll probably interpret that as a double. But you could also interpret it the way a friend would at home, which is about 2-3cm from the rim.

        Personally I would call a glass of wine filled to the brim 'overfilled', not 'full'.

        • kalleboo 6 months ago

          I think you're missing the context everyone else has - this video is where the "AI can't draw a full glass of wine" meme got traction https://www.youtube.com/watch?v=160F8F8mXlo

          The prompts (some generated by ChatGPT itself, since it's instructing DALL-E behind the scenes) include phrases like "full to the brim" and "almost spilling over" that are not up to interpretation at all.

        • drdeca 6 months ago

          People were telling the models explicitly to fill it to the brim, and the models were still producing images where it was filled to approximately the half-way point.

    • yusufozkan 6 months ago

      Generating an image of a completely full glass of wine has been one of the well-known limitations of image generators, the reason being that neural networks struggle to generalise outside of their training data (there are almost no pictures on the internet of a glass "full" of wine). It seems they implemented some reasoning over images to overcome that.

      • kube-system 6 months ago

        I wonder if that has changed recently since this has become a litmus test.

        Searching in my favorite search engine for "full glass of wine", without even scrolling, three of the images are of wine glasses filled to the brim.

    • numpad0 6 months ago

      Except this is correct in this context. None of the existing diffusion models could do it, apparently.

  • Imustaskforhelp 6 months ago

    Looks amazing. Can you please also create an unconventional image, like a clock at 2:35? I tried something like this with Gemini when some redditor asked for it, and it failed, so I'm wondering if 4o can do it.

    • CSMastermind 6 months ago

      I tried and it failed repeatedly (like actual error messages):

      > It looks like there was an error when trying to generate the updated image of the clock showing 5:03. I wasn’t able to create it. If you’d like, you can try again by rephrasing or repeating the request.

      A few times it did generate an image but it never showed the right time. It would frequently show 10:10 for instance.

      • coder543 6 months ago

        If it tried and failed repeatedly, then it was prompting DALL-E, looking at the results, then prompting DALL-E again, not doing direct image generation.

        • Imustaskforhelp 6 months ago

          So it's not doing what they are saying/advertising; I think you are onto something big then.

          • coder543 6 months ago

            No... OpenAI said it was "rolling out". Not that it was "already rolled out to all users and all servers". Some people have access already, some people don't. Even people who have access don't have it consistently, since it seems to depend on which server processes your request.

    • Workaccount2 6 months ago

      I tried and while the clock it generated was very well done and high quality, it showed the time as the analog clock default of 10:10.

      • lyu07282 6 months ago

        The problem now is we don't know if people mistake DALL-E output for the new multimodal GPT-4o output; they really should've made that clearer.

        • cmorgan31 6 months ago

          I’m using 4o and it gets the time wrong a decent chunk of the time but doesn’t get anything else in the prompt incorrect. I asked for the clock to be 4:30 but got 10:10. OpenAI Pro account.

          • Imustaskforhelp 6 months ago

            Shouldn't reasoning make the clock work, though?

            Why does it sound like this isn't reasoning on images directly but rather just DALL-E, as another comment said? I will type the name of the person here (coder543).

  • stevesearer 6 months ago

    Can you do this with the prompt of a cow jumping over the moon?

    I can’t ever seem to get it to make the cow appear to be above the moon. Always literally covering it or to the side etc.

jasonjmcghee 6 months ago

I don't buy the meme or whatever that they can't produce an image of a full glass of wine. It just takes a little prompt engineering.

Using Dall-e / old model without too much effort (I'd call this "full".)

https://imgur.com/a/J2bCwYh

  • ASalazarMX 6 months ago

    The true test was "full to the brim", as in almost overflowing.

sfjailbird 6 months ago

They're glass-half-full type models.

blixt 6 months ago

Yeah, it seems like somewhere in the semantic space (which then probably gets turned into a high-resolution image by a specialized model) there is not enough room to hold all of this kind of information. It becomes really obvious when you try to meaningfully modify a photo of yourself: it will lose your identity.

For Gemini, it seems to me there's some kind of "retain old pixels" support in these models, since simple image edits just look like a passthrough, in which case they do maintain your identity.
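
A speculative sketch of what such a passthrough could look like: composite the model's edit back over the original through a mask, so untouched pixels stay bit-identical. How Gemini actually implements this isn't public; the helper below is purely illustrative.

```python
# Masked passthrough edit: keep original pixels wherever the mask is 0.
import numpy as np

def passthrough_edit(original, edited, mask):
    """mask is 1.0 where the edit applies, 0.0 where original pixels are kept."""
    m = mask[..., None]  # broadcast the HxW mask over the RGB channels
    return m * edited + (1.0 - m) * original

h, w = 64, 64
original = np.random.rand(h, w, 3)
edited = np.random.rand(h, w, 3)
mask = np.zeros((h, w))
mask[20:40, 20:40] = 1.0  # edit only a small patch

out = passthrough_edit(original, edited, mask)
assert np.allclose(out[0, 0], original[0, 0])  # identity preserved outside the mask
```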

tobr 6 months ago

Also still seems to have a hard time consistently drawing pentagons. But at least it does some of the time, which is an improvement since last time I tried, when it would only ever draw hexagons.

HellDunkel 6 months ago

I think it is not the AI but you who is wrong here. A full glass of wine is filled only up to the point of maximum radius, so that the surface exposed to air is maximized and the wine can breathe. This is what we taught the AI to consider „a full glass of wine“ and it gets it perfectly right.

iagooar 6 months ago

The question remains: why would you generate a full glass of wine? Is that really something that common?

  • minimaxir 6 months ago

    It’s a type of QA question that can identify peculiarities in models (e.g. counting the “r”s in “strawberry”), which is the best we have given the black-box nature of LLMs.
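
    The strawberry example is trivially checkable in code, which is what makes it a useful probe of where token-level models stumble:

    ```python
    # A program gets this right instantly; many LLMs historically did not.
    print("strawberry".count("r"))  # 3
    ```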