yeldarb 3 months ago

I created this site using a language model trained on the Stack Overflow data dump.

Full writeup available here:

Interesting things I’ve noticed so far:

* It does a remarkably good job of context switching between programming languages based on the semantics of the question! If the question is about SQL it often includes SQL in < code > tags. If it’s about JavaScript it will include JavaScript! The syntax isn’t perfect due to the tokenizer mangling some things but it’s pretty close!

* The English grammar isn’t perfect but it’s pretty good.

* It doesn’t seem to lose track of closing tags and quotes.

* It's learned to sometimes pre-emptively thank people for their answers and to "edit" in "updates" at the end of the post.

If you find any interesting ones you can share them with the permalink! Use the "Fresh Question" button to load a new one.

  • duckerude 3 months ago

    I found this part of the about page interesting:

    > Originally, I wanted to predict the number of upvotes and views questions would get (which, intuitively, I thought would be a good proxy for their quality). Unfortunately, after working on this for about a week straight I've come to the conclusion that there is no correlation between question content and upvotes/views.

    > I tried several different models (including adapting an AWD_LSTM classifier, a random forest on a bag of words, and using Google's AutoML) and none of them produced anything better than random noise.

    > I also tried using myself as a "human classifier" and given two random questions from StackOverflow I can't predict which one will be more popular.

  • arendtio 3 months ago

    > Answering Questions: Right now the model only generates questions. In version 2 I want to train it to answer questions. If I could get this working it'd actually become a useful tool instead of a fun toy.

    Looking forward to that part :D

    I mean, those answers are probably not going to be correct, but I wonder how close they will be to something useful.

    • drilldrive 3 months ago

      Yes, many times the questioner doesn't actually need an answer to the question; he just needs to look a little more closely at his situation, which could potentially be automated. But one shouldn't disguise such automation as an 'answer': it's more like a query autocheck, just more tooled-up.

      • wpasc 3 months ago

        I wonder what percentage of questions just need a correct working example because the questioner is unsure of how to use a given API. I imagine automating this could actually be doable.

  • avip 3 months ago

    Thanks, brilliant work! Some questions are downright hilarious (see a suite of automated packaging techniques [0]), and the broken English just adds extra ESL credibility to the questions.

    [0] >I want to write an update statement with a sequence of values I can run through a database. i've written the below code to broken up my character string into the columns. All of the articles i've read seem to suggest that I 'll need a suite of automated packaging techniques for my environment to all update the database.

    >What s the best way to update the column ids?


    • naugtur 3 months ago

      I wouldn't doubt it was written by a human if I saw it on stackoverflow.

  • exolymph 3 months ago

    Thanks for including permalinks to questions, that's great for sharing!

jerf 3 months ago

"I have creating a PNG Image file where I am printing out the image's with different colors and Image Types. Now I am sure I am drawing properly, but what I'm seeing is that the image is not differently jpeg (ie FF or Chrome) and Safari (for Firefox) is different from the one in Firefox. "

As a bit of a connoisseur of babblebots over the decades, one of the interesting things about this generation is that it is producing text that has a very interesting effect in my mind. There is a part of the parsing process where the above text went down smooth; yup, that's what Stack Overflow questions from early developers tend to look like. That part of my brain issues no objection. But the next layer up screams bloody murder about how nonsensical that is. And it's not just "that's a bad question but I still see the order under it", but nonsense.

It's a combination I've not experienced before. Previous generation babblebots could often produce a lot of fun text, but every processing level above raw word processing has always been able to tell it's computer garbage, even when it blundered onto a particularly entertaining piece of garbage. We've actually successfully moved up a level here.

As I'm describing subjective experience, YMMV.

  • x1798DE 3 months ago

    The experience you are describing reminds me of the comparative illusion [0], which is a grammatical illusion where certain sentences seem grammatically correct when you read them, but upon further reflection actually make no sense, example:

    "More people have been to Berlin than I have."


    • Zanni 3 months ago

      Fascinating. There's a sentence I picked up from a friend in childhood, "Although the Moon is only an eighth the size of the Earth, it is much farther away," which seems to be similar, but not quite a CI, if I'm reading the Wikipedia article correctly. Thanks for the link.

  • veryworried 3 months ago

    This is like a mental tarpit: you waste time reading, trying to understand what the person is saying, only to realize they were a bot and all your effort was for nothing. That's time you'll never get back.

    This is a terrifying way to destroy an online community if a person floods it with nonsense content like this.

    • ryandrake 3 months ago

      Terrifying -> inevitable. Imagine a botnet full of fake AI users trained with a corpus of legit HN posts. Let them loose commenting on random articles, beginning slowly but ramping up until they’re 99% of all comments.

      In a few years, the standard Silicon Valley “Growth Hacking” job description will include using AI to deploy fake content to your competitors’ sites, destroying their user community.

      • Jeff_Brown 3 months ago

        Two potential solutions: Reputation and a new account fee.

        Nonsense flooding will make it more difficult for people to establish their identities on a network, but once they're established, they'll be in the clear. If someone has to pay to have their first thread or two reviewed, it will take serious money to flood a site to death.

        (A similar solution to email spam has been waiting to happen for decades -- charge a fraction of a penny per email, and nobody is harmed but spammers. Maybe allow exceptions for officially recognized organizations that have to send a lot of messages, like political campaigns.)

        • richk449 3 months ago

          I was with you until you said you wanted to exempt the politicians. I would charge them double.

          • Jeff_Brown 3 months ago

            I share your concern.

            This would be even better: Email recipients grant free access to whoever they want. A tiny price would be charged only when sending to someone who has not granted such access.

        • yeldarb 3 months ago

          I like the idea.

          Unfortunately, even though snail mail has associated costs I still get a ton of junk.

      • avip 3 months ago

        Is this a future? How much content on twitter/fb is autogenerated, auto-liked and auto-shared?

        • pjc50 3 months ago

          Some is deliberately auto-generated, like ; but yes, there is definitely an awful lot of auto-liked auto-shared fake engagement out there.

        • jachee 3 months ago

          If the answer is not "zero" then the answer is "too much."

      • droithomme 3 months ago

        I'm positively sure these tactics are already being deployed as a weapon in order to shut down debate of certain inconvenient topics and disrupt problematic communities.

      • AnIdiotOnTheNet 3 months ago

        Ugh. Like SV hasn't yet made the internet suck enough...

  • mortenjorck 3 months ago

    Indeed, there’s something almost unsettling about text that initially appears to follow a sort of internal logic, yet doesn’t. Some of the results read like a programming fever dream:

    “I set each thread * pointer, adding a new thread and in a loop inside this function. The thread would be immediately on the thread, but the thread resulted in the exception. If I return the thread to the first thread and finally the thread is left, the thread doesn't hang, and I couldn't kill thread # 1 - because the thread method made first thread calls the native thread. But, the thread is waiting for the thread blocking and all the other threads to be started. In other words, the thread is always destroyed.”

  • hyperpallium 3 months ago

    A few times, I've come across Stackoverflow questions on technical topics I'm not very familiar with, where the question makes no sense to me (there are clear spelling, grammatical, and consistency errors). But there's an answer, and a comment exchange that seemed to resolve the question. So I conclude it's just my unfamiliarity that prevents me from seeing through those errors.

    A related phenomenon is seeing fundamental errors in a newspaper article on a topic you're an expert in... but believing articles on topics you're not familiar with.

    This can operate as a partial Turing test: a gradient for iteration.

  • eponeponepon 3 months ago

    On the other hand, cherrypicked excerpts can be terrifyingly convincing: "What is the best way to login to my Ruby application in a browser via Perl?"

    We've all been asked a question like that and had a cold dread creep over us as we try to formulate a response...

  • rosser 3 months ago

    It's the Uncanny Valley of text synthesis.

aur09 3 months ago

This one is golden:

"What's the best way to indeed start a process on an OS x machine?

What is the best way to start a process on Mac OS x Snow Leopard?

There I just need to be able to run the OS x.exe from the command line and it's working fine (make it available in Windows). But I'm on an Mac and I haven't figured out how to do this for a Linux machine.

Another reason I ask is that I only have a Unix shell running with the Python process in it (it's my an Ubuntu machine, nothing didn't work in the shell).

Thank you in advance"

apo 3 months ago

I propose a new kind of Turing Test.

Gather equal numbers of the least intelligible questions from SO (possibly using a metric based on low views/upvotes/comments/answers over a long time) and a random selection from stackroboflow.

Present human judges with both sets of questions and ask them to tell the difference.

Having read numerous SO questions from newbie developers whose grasp of English was tenuous at best, I doubt I could tell the difference.

The next step up: the same test, but with mathematics or scientific papers judged by non-experts in the field.

We may actually be there already - I'm not sure.

All of which makes me wonder when we'll reach the point where the bar has been raised so high that the comparison will need to be against the best SO questions and scientific/mathematics papers judged by subject matter experts.

autechr3 3 months ago

"View from View to View, Need to open this new View in View"!/question/4781

A common problem for everyone, i'm sure.

avip 3 months ago

!/question/22733 I hate when fellow coders do that.

Another pearl: Creating a PDF from PDF. The situation is as follows: We have a video file hosted by Google Map.

It's like reading a doco-satire about my life.

Findus23 3 months ago

I can claim to have experience [0] with generating funny nonsense based on Stackoverflow data (what a weird thing to say :))

Seems like you beat me to my plan to make a neural-network-based variant, and I really like the results (especially that they stay on topic instead of totally drifting off into fun nonsense like my Markov chains).

Have you tried also using other Stackexchange sites as a source? In my experience they result in more fun questions as they have more "human" interactions (especially the more personal advice-based sites), which creates things like:

- Do Greeks driving affect the whaling industry?
- Essential windsurfing equipment to fish?
- Do mountaineers eat grass?
- Can I toast


  • yeldarb 3 months ago

    I haven't yet! It's on my list of things I'd like to try.

gudok 3 months ago

I reviewed 1600 edits at StackOverflow. And I can say that some of the automatically generated questions are more intelligible than the average SO question. For example, this one looks fine to me: !/question/11235

natch 3 months ago

Fascinating. I wonder if our current discussion boards on the interwebs can survive the coming influx of content like this and the next generations of it that follow.

There are a lot of SO questions posted by very weak non-native speakers of English and some of these are hard to distinguish from those. Kind of scary!

What possible positive outcomes do you see for this kind of (admittedly inevitable) capability?

  • pjc50 3 months ago

    AI will render political discussion between honest human strangers impossible on the open internet.

    At some point this technology will extend into what's left of print, then talk radio, then TV. An endless supply of Markov punditry.

    • carapace 3 months ago

      "Markov punditry" Nice coinage there.

      • pjc50 3 months ago

        :) I doubt it's original; it reminds me of the character Markov Chaney from Robert Anton Wilson's books.

  • yeldarb 3 months ago

    I am actually a bit worried that I’m already starting to see search engine traffic coming in...

    I hope that the good will outweigh the bad. I’d love to create an answer generator, for example.

    Once enough questions are generated I’m going to try creating a classifier to see if a neural net can differentiate between real questions and fake ones.

    • eitland 3 months ago

      > I am actually a bit worried that I’m already starting to see search engine traffic coming in...

      Brilliant, now you only need to come up with a way to use this for good and keep (at least slightly) ahead of the cost in the long run.

    • triplewipeass 3 months ago

      You could put up a robots.txt denying all search engines.
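      For reference, the blanket version that tells every well-behaved crawler to stay out of the whole site is just two lines:

```
User-agent: *
Disallow: /
```

      (Malicious bots ignore robots.txt, of course, but it keeps Google and friends from indexing the fake questions.)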

      • natch 3 months ago

        But the issue is not just what he could do, but what malicious content generation systems could do.

        • triplewipeass 3 months ago

          The issue is, as stated:

          > I am actually a bit worried that I’m already starting to see search engine traffic coming in...

          We can discuss hypothetical systems that could maliciously flood us with generated content. The creator of this particular service which is being discussed here and now could also begin taking steps to ensure that his creation does not inadvertently create a problem for some hapless Google user.

          • natch 3 months ago

            Well, no. You have to read further up the thread to see the issue I was referring to.

            >I wonder if our current discussion boards on the interwebs can survive the coming influx of content like this and the next generations of it that follow.

            Yes the robots.txt is a good and trivial step he could take to ensure well behaved robots do not pick up his content. So your comment suggesting robots.txt is a good comment in its narrow frame, but one that missed the larger picture. That minor problem is solved. The interesting problem is of a different nature.

  • Angostura 3 months ago

    I was thinking the same, as my immediate reaction was to attempt to understand the question and formulate a solution.

9dev 3 months ago

I was cycling through some answers when suddenly the following, completely unrelated text showed up in a random code block:

I can feel the admin is different

Are you sure you didn't just accidentally create a self-aware AI? Sadly, I forgot to grab the permalink.

motohagiography 3 months ago

There is immense value in training these to synthesize test data sets for sensitive information you can't safely put in a preprod environment.

Health information would be the main case I can think of now.

Having synthesized data for testing new services in govt would be a huge improvement.

De-identification is basically impossible and there are a bunch of companies who will lie to you if you pay them to, but synthesized data covers many use cases for de-identification and for homomorphic encryption.

kristianc 3 months ago

Awesome. Can you create a neural net that arbitrarily closes questions as off topic or non constructive? ;)

  • penagwin 3 months ago

    No need for a neural network, you can just use Math.random (or your respective language's RNG).

    Sidenote, would this technically count as a neural network with only one weight (which is randomly initialized)?
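    A sketch of that one-weight "network" in Python instead of JS (the function name and the 0.5 threshold are made up, obviously):

```python
import random

def should_close(question: str) -> bool:
    """Close questions 'arbitrarily': the question text is never consulted,
    so this behaves like a model with a single randomly drawn weight."""
    weight = random.random()  # the one "weight", freshly randomized per call
    return weight < 0.5       # arbitrary threshold
```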

  • yeldarb 3 months ago

    It’s something I’m interested in!

    Unfortunately I’ve come to the conclusion that upvotes on Stackoverflow aren’t correlated with question content (or I’m not skilled enough to be able to differentiate between “good” and “bad” questions). Check out the linked write up for more detailed info.

    • deckar01 3 months ago

      > arbitrarily

      I think the original comment is being sarcastic and suggesting that the actual humans that close discussions and mark them as "off topic" don't understand the question and perform these actions at random. This is a sentiment shared by many who don't "live" in those types of forums.

      • yeldarb 3 months ago

        Ah, missed the operative word there. I could definitely do that! ;D

    • joshvm 3 months ago

      Could you not also train a classifier that correlates question content with mod decisions? Questions like "What is the best X?" that are obviously subjective, for example.

      Maybe even some kind of crazy generative model that learns to post questions that aren't closed by the AI moderator!

jefb 3 months ago

I think we've all been here before:

"i've been asked to use Json to call a webservice. I don't modify a JSON object at all. However, when calling JSON returned by the Json object, it fails because the object life isn't array!"

!/question/24101

joshvm 3 months ago

Absolute gold: "Is there a animal out there that someone can apply to do the sort of thing I'm looking for?"

Not sure what happened to the title that time. !/question/11716

(perhaps op is a vim user)

  • kyle-rb 3 months ago

    Try a Python or a Pony, or maybe even an OCaml.

    • owl57 3 months ago

      Probably OCaml: that looks like an ML-style function signature.

silveroriole 3 months ago

This is great... “I'm getting errors with Line 1, Line 39, Column million” lol

  • skykooler 3 months ago

    When trying to pack your code into a one-liner goes too far...

sampleinajar 3 months ago

Nice! This is also what every question looked like when I was new to programming.

mannykannot 3 months ago

I'm voting to close as duplicate.

  • mormegil 3 months ago

    Right, after adding tags and answers, comments need to be added as well...

  • yeldarb 3 months ago

    [Sorry, this question has been protected by a moderator.]

    (^.^ this “so-SO” comment made me smile.)

ZoomZoomZoom 3 months ago

Oh, great, now I know how my clueless questions look to a knowledgeable person! Example:

>"I need to create an image from a imported wav file (for a user - friendly format find enough header for the cookie). I looked for a solution, but that didn't work either."

hiccuphippo 3 months ago

> ... Thinking I have to use the first two but it's not possible to use Jquery.

> So: Is it recommended to use a Perl function

This is just like the real thing.

jpatokal 3 months ago

I presume this is due to tokenization or something, but there's a lot of extra whitespace in the code samples that makes them look very unrealistic:

  def _ _ init__(self, default): 
  " " " 
  See if the default value for the field on a view is 
  " " " 

  < select > 
  < option > value < / option > 
  < option > value < / option > 
  < / select >
And indentation is also missing completely. Maybe you need to use another NN to guess which language the fake code is in and autoformat it accordingly!

hyperpallium 3 months ago

Congratulations, you have simulated a million monkeys at typewriters with a million monkeys at typewriters. Has anyone really been far even as decided to use even go want to do look more like?

code_duck 3 months ago

Final question didn’t end in a question mark - perfect!

“I want to do something like this

$ _ -1 = object();”

We all do....

Now I see that the virtual question is different every time. Great work. It reads better than most SO questions.

8bitsrule 3 months ago

This one is clearly written by a broken agent: !/question/1450

Reminds me of those online chatbots I used to torture 10 or 15 years back. I once started asking one for personal information about its creator; it was remarkably evasive, constantly attempting to switch the subject.

joshvm 3 months ago

Another gem:!/question/17035

"I have got a big "someone" who will be going to be using the site.

I have a black box and a background in firefox, where they have a width of 100%.

They will never know of a color.

They come from a background color."

Film starring Liam Neeson?

dugluak 3 months ago

Every good invention can be terrifying if it falls into the hands of bad guys (nuclear technology, for example). That's true for AI as well. I'm sure bad guys must already be training similar AI agents by feeding them only fake news, conspiracy theories, etc., and it's easy to build AI agents since there is so much open-source material about AI online.

  • chrisco255 3 months ago

    I'm trying to imagine a productive use case for this? Maybe in reverse for attempting to answer questions?

    • jdefr89 3 months ago

      Think things like election meddling. Propagating truly fake news to cater to the emotions of what people simply want to be true. Humans are weak against Confirmation Bias, ten minutes on Facebook will show you for sure.

    • dugluak 3 months ago

      The use case is to spread misconceptions in society (that's what bad guys want, right?) in an automated way, especially during elections.

aboutruby 3 months ago

I think it would still be considered a valid "Does Not Exist" website if the generated questions had some auto-formatter for the code. The main issue I see is extra spaces everywhere, often in a syntax-breaking way (and missing spaces where formatting needs them), not that all SO questions are free of those.

  • yeldarb 3 months ago

    Yeah this is a shortcoming of the tokenizer. It splits things up in ways that are not 1:1 mappable back to their source unfortunately.

    I did a bit of post-processing to get it formatted a bit better (re-combining the “would“ and “n’t” tokens and changing html tags to markdown for example) but there’s still room for improvement.

    Spacing specifically is different based on the context. Outside of code blocks you want a space after a period; inside you probably don't. But since the tokenizer emits the same token in both places, there's no opportunity for the neural net to learn this (it can't see any difference), and my naive formatter doesn't know the difference either. (If you're curious you can find it in the JS file.)
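    The gist of the formatter, sketched in Python rather than the actual JS (these rules are illustrative of the idea, not the real patterns):

```python
import re

def detokenize(tokens):
    # Re-join pieces the tokenizer split apart. A simplified illustration
    # of the post-processing, not the production script.
    text = " ".join(tokens)
    text = re.sub(r" (n't|'s|'re|'ll|'ve|'d|'m)\b", r"\1", text)  # "would n't" -> "wouldn't"
    text = re.sub(r" ([.,!?;:])", r"\1", text)                    # drop space before punctuation
    text = re.sub(r"< (/?)\s*(\w+) >", r"<\1\2>", text)           # "< code >" -> "<code>"
    return text
```

    The last rule is exactly the ambiguity I mean: it's correct inside code blocks but would also "fix" a question that legitimately talks about a less-than sign.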

    • yeldarb 3 months ago

      I updated my regexes to clean up some of the tokenizer noise last night, so the formatting in the code snippets should look a bit more natural now.

TheAsprngHacker 3 months ago

Huh, in this question, there are a lot of words that get repeated five consecutive times:!/question/13733

Is there a reason why? (I don't know anything about AI.)

  • yeldarb 3 months ago

    The way the language model is trained is by rewarding it for correctly predicting the next word in a sequence.

    The output of the model is a predicted probability distribution over the next word and a “state”; the next iteration takes the state output of the previous iteration and generates another word and state (and this process repeats many times).

    Since there’s a probabilistic dimension, what may have happened in this case is that it happened to repeat once by chance and the model had learned that if something repeats 2x it’s likely that it will repeat a third, fourth, and fifth time.

    Basically it’s just trying to game the loss function which rewarded it for predicting the next word in the sequence correctly.
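    The sampling loop itself is tiny. Here it is sketched in Python with a stand-in for the trained network (the real model is an AWD-LSTM; `fake_model` here is just toy math that returns a valid distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size

def fake_model(state, token):
    # Stand-in for the network: (state, token) -> (new state, probability
    # distribution over the next token). Deterministic toy math.
    new_state = (state * 31 + token + 1) % 1009
    logits = np.array([(new_state * (i + 3)) % 11 for i in range(VOCAB)], float)
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax -> sums to 1
    return new_state, probs

def generate(seed_token, length):
    state, token, words = 0, seed_token, []
    for _ in range(length):
        state, probs = fake_model(state, token)  # feed the state back in
        token = int(rng.choice(VOCAB, p=probs))  # sample, don't argmax
        words.append(token)
    return words
```

    Because each step samples from a distribution conditioned on the state, a token that repeats once can raise the predicted probability of repeating again, which is how you get those 5x runs.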

    • TheAsprngHacker 3 months ago

      Thanks for the explanation. Your description superficially reminds me of a Markov chain. Is this related, or is it totally different?

      • LeanderK 3 months ago

        I haven't read the paper the work is based on, but if the RNN outputs a probability distribution for the next letter/word, then it forms a Markov chain (since the output depends only on the current state, not on earlier states)!

        RNNs are just fancy parametric functions that take a (state, input)-pair and return a new (state', output)-pair.
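        Concretely, in numpy that signature looks something like this (toy sizes and random untrained weights, just to show the shape of the thing):

```python
import numpy as np

rng = np.random.default_rng(42)
H, V = 4, 6  # hidden state size, vocabulary size (toy values)
Wh = rng.normal(size=(H, H))  # state  -> state
Wx = rng.normal(size=(H, V))  # input  -> state
Wo = rng.normal(size=(V, H))  # state  -> output logits

def rnn_cell(state, x):
    # (state, input) -> (state', output): the new state depends only on the
    # current state and input, which is what makes the chain Markovian.
    new_state = np.tanh(Wh @ state + Wx @ x)
    logits = Wo @ new_state
    probs = np.exp(logits) / np.exp(logits).sum()  # distribution over next token
    return new_state, probs
```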

  • crooked-v 3 months ago

    From what I understand, repeated output is a common failure mode with neural net stuff in particular, though I don't know why.

  • dmurray 3 months ago

    Because it needs to determine if a string is a '????? '''''?????, of course.

drdaeman 3 months ago

This desperately needs some AI-generated expert answers!

deepsy 3 months ago

This is similar to

sachin18590 3 months ago

This looks absolutely amazing! I would be very curious to know how you went about conceptualizing the project and the AI beneath it. Do you have a blogpost on it, or are you planning to write one?

stabbles 3 months ago

This is very refreshing!

"I'm starting a new website using VB. I make a migration file and save it to a local Azure database"

drinane 3 months ago

This comment is worth about as much as this website.

mitchtbaum 3 months ago

The software that wrote this comment does not exist.

TomMckenny 3 months ago

It has better grammar than the real one anyway.

turtlebraile 3 months ago

I really would like a bot like this to produce ideas for things to create with programming in general.

Any ideas on the possible dataset?

booleandilemma 3 months ago

Can we please get a Jon Skeet neural network to provide answers?

droptablemain 3 months ago

Giggles Love it.

  • chrisco255 3 months ago

    Funny, yet terrifying at the same time. How can I be sure that HN isn't just a really well trained Neural Net?

    • chrisco256 3 months ago

      How can I be sure that i'm not just a really well trained Neural Net?

      • jdefr89 3 months ago

        You are a very well trained neural net... The concept is based off of actual Neurons in our brain. Can't tell if you're serious or trolling though lol.

      • chrisco255 3 months ago

        Isn't a brain just that?

    • yeldarb 3 months ago

      A friend of mine actually suggested trying to generate a hacker-news-comment language model next.. sounds like a fun project.

      I’ll have to look and see if there’s an archive of them available.

      • code_duck 3 months ago

        An archive of HN comments? This is it.

    • exolymph 3 months ago

      Wait, you're saying it's not?

    • code_duck 3 months ago

      How can I be sure that I’m not in a coma? Nothing external can be believed.

drinane 3 months ago

This is lame.

  • drinane 3 months ago

    I protest getting -4 points. They used github and stackoverflow. Wrote a function to connect the two based on tags and then randomly generate a question off of that. It's lame. Do something useful or cool.