Show HN: Using stylometry to find HN users with alternate accounts

676 points by costco 3 years ago

Author here. This site lets you put in a username and get the users with the most similar writing style to that user. It confirmed several users who I suspected were alts and, after I informally asked around, identified abandoned accounts of people I know from many years ago. I made this site mostly to show how easy this is and how it can erode online privacy. If some guy with a little bit of Python and $8 to rent a decent dedicated server for a day can make this, imagine what a company with millions of dollars and a couple dozen PhD linguists could do.

Here's Paul Graham:

https://stylometry.net/user?username=pg

Here are some frequent HN commenters: (EDIT: Removed due to privacy concerns)

sillysaurusx 3 years ago

Wow. This gives a lot of false positives, but it found all ~10 of my old accounts over the years.

The most interesting thing is that my writing style changed pretty drastically since a decade ago. Searching for my oldest account matches my earliest usernames, whereas searching this account matched the rest.

The details of the algorithm are fascinating: https://stylometry.net/about Mostly because of how simple it is. I assumed it would measure word embeddings against a trained ML model, but nothing so fancy.

echelon 3 years ago

It works like a charm for me too.
I put in my username and found my pre-echelon alt, possibilistic.
(Echelon was taken when I registered possibilistic, but it must have been unused and dropped.)
costco 3 years ago

Yeah top 20 is a little excessive because in my own tests I found that top 20 is only marginally more accurate than top 10. You can get a more academic explanation [here](https://www.tandfonline.com/doi/abs/10.1080/09296174.2011.53...). I was amazed too because it seemed too easy!
- sillysaurusx 3 years ago
  
  FWIW, top 20 was necessary for mine. The bolding was a brilliant move. Several of my accounts were ranked 10-20, but popped out due to the bolding.
  
  justusthane 3 years ago
  
  What does the bolding indicate?
  
  sillysaurusx 3 years ago
  
  The explanation is here: https://news.ycombinator.com/item?id=33755466
  As far as I’m concerned, it’s the killer feature of the app. The top 20 results may be noisy, but the bolded results have a signal to noise ratio close to infinity.
  
  costco 3 years ago
  
  The funny thing is that I thought of it while eating dinner last night :)
  
  jsnell 3 years ago
  
  The precision of the bolded results looks like maybe 30% to me. Significantly better than the non-bolded, but nowhere near perfect precision.
  
  costco 3 years ago
  
  False positives become an increasingly difficult problem the more potential authors you introduce. If I had written a fancier model it probably wouldn't be as much of a problem but what can you do.
  
  jsnell 3 years ago
  
  Yes, this wasn't a criticism of the tool. It is crazy good.
  But I don't think people should be making the assumption that bolded results are definite alts, which sillysaurus' comment reads like.
  
  sillysaurusx 3 years ago
  
  Hmm, that wasn’t my intent. I see this tool as a recommendation engine more than a doxxer. By “signal to noise ratio close to infinity,” I meant that if you visit one of the bolded accounts, they’ll probably sound a lot like you.
  It’s one of those ideas that makes the tool substantially more effective, yet never would’ve occurred to me. It’s like the simplicity of pg’s “a plan for spam” algorithm: deceptively simple, but (like scrubbing dishes with fingers) works really well.
  
  tekknik 3 years ago
  
  > I see this tool as a recommendation engine more than a doxxer.
  That is absolutely all this will be used for. This is a dangerous tool that serves no real world purpose.
  
  loeg 3 years ago
  
  I have 7 bolded names (0.53-0.62) in the top 20 list, and none are alts of mine.
  
  ghaff 3 years ago
  
  Pretty much the exact same. (I do have a throwaway account but I rarely use it and it probably hasn't been used enough to qualify.)
  
  morsch 3 years ago
  
  I'm one of them and I can confirm. But then again that's what I'd say if I was.
  
  loeg 3 years ago
  
  Hi style-adjacent friend :-). Just briefly looking at your recent comment history, we seem to find different kinds of articles interesting, but maybe have a similar writing style.
  
  dragonwriter 3 years ago
  
  Of my top 20, 19 are bold, all are above 0.6, and I have no alts.
  
  notahacker 3 years ago
  
  Vast majority of my top 20 were bold, except you funnily enough!
  None of them are me (and you were the only one I recognised and thought "yeah, I can see where it gets it from"...)
  
  dimmke 3 years ago
  
  My results have 5 bolded users in my top 20, and I have 0 alt accounts.
User23 3 years ago

I’d figured it would be some kind of n-gram frequency analysis. Would be interesting to code that up and compare.
- costco 3 years ago
  
  It is. The description on the about page is a little simplified, but basically I look at the most common word and character ngrams of size 1,2,3 (200 each), put all the frequencies in an array and then compare to all the other users with https://scikit-learn.org/stable/modules/generated/sklearn.me....
  
  User23 3 years ago
  
  Cool, I only skimmed the description; maybe I needed to read it more carefully.
  Have you considered doing rune rather than word ngrams? I can imagine that might be prohibitively expensive, but I really don’t know. I did something like that long long ago in C for automatic document language detection. It was quite accurate.
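
For anyone who wants to code that up, here is a minimal sketch of the n-gram comparison costco describes above: the most common word and character n-grams of size 1, 2, and 3, their frequencies in one array per user, and a pairwise comparison. It is not the site's code. The function names, the whitespace tokenizer, reading "200 each" as 200 per kind and size, and the choice of cosine similarity (the scikit-learn link is truncated, though a later comment refers to the cosine) are all assumptions.

```python
from collections import Counter
from math import sqrt


def ngram_counts(items, sizes=(1, 2, 3)):
    """Count n-grams of the given sizes over a sequence (words or characters)."""
    counts = Counter()
    for n in sizes:
        for i in range(len(items) - n + 1):
            counts[tuple(items[i:i + n])] += 1
    return counts


def features(text):
    """Word and character n-gram counts, kept apart by a 'word'/'char' tag."""
    text = text.lower()
    feats = Counter()
    for gram, c in ngram_counts(text.split()).items():
        feats[("word", gram)] = c
    for gram, c in ngram_counts(list(text)).items():
        feats[("char", gram)] = c
    return feats


def vocabulary(texts, top=200):
    """Pick the `top` most common n-grams per kind and size across all texts."""
    total = Counter()
    for t in texts:
        total.update(features(t))
    vocab = []
    for kind in ("word", "char"):
        for n in (1, 2, 3):
            bucket = Counter(
                {k: v for k, v in total.items() if k[0] == kind and len(k[1]) == n}
            )
            vocab.extend(k for k, _ in bucket.most_common(top))
    return vocab


def vector(text, vocab):
    """Relative frequency of each vocabulary n-gram in one user's text."""
    feats = features(text)
    total = sum(feats.values()) or 1
    return [feats[k] / total for k in vocab]


def cosine(a, b):
    """Cosine similarity of two equal-length frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_similar(user, comments, top=20):
    """Return (score, username) pairs for the users most similar to `user`."""
    vocab = vocabulary(comments.values())
    vecs = {u: vector(t, vocab) for u, t in comments.items()}
    scores = [(cosine(vecs[user], v), u) for u, v in vecs.items() if u != user]
    return sorted(scores, reverse=True)[:top]


# Made-up usernames and text, one concatenated string of comments per user.
comments = {
    "alice": "i think this is probably fine but i could be wrong",
    "alice_alt": "i think that is probably fine but i could be wrong about it",
    "bob": "kernel scheduler latency dropped after the upgrade",
}
ranked = rank_similar("alice", comments)
print(ranked)
```

With real data each string would hold a user's entire comment history, and the score is only a ranking signal: as others note in the thread, a high score does not by itself mean two accounts belong to the same person.
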
FormerBandmate 3 years ago

> sillysaurus3
> sillysaurus2
Tbf a human could have found a bunch of them relatively easily
lettergram 3 years ago

Frankly similar to how I was doing it back in 2018 (when you and I chatted about it on HN lol)
https://news.ycombinator.com/item?id=17944293
The approach I took was a bit different, but also no ML required.
The real trick is pruning and going cross platform. There are around 100k active HN accounts (meaning posts a few times a year), maybe 200k if you count at least one post a year. But <10k that post weekly.
It’s a very small space to try to compare so simple methods will work fine.
- costco 3 years ago
  
  Exactly. HN emphasizes long-form posts much more than other forums which makes the commenters here very susceptible to this kind of analysis. Plus you can fit every single HN comment in RAM on a mid tier gaming laptop so it's even easier. I was trying to think of applications of this kind of data and the only thing I could think of was moderation tools/detecting ban evaders but what you've done seems much more profitable lol.
hnburnerUixoHr5 3 years ago

Woof.
I create new accounts on a semi-regular basis because I think cliques are the most corrosive factor to social media. Any time my account gathers enough upvotes I destroy it for another.
I had four accounts. None are over 50% confidence, but when I look at any one account the others are consistently #2, #3, and #4.
Now I’m thinking very carefully about what words I use to avoid linking this as the 5th account.
- hailwren 3 years ago
  
  Exact same thing happened to me. Wild.
- butterNaN 3 years ago
  
  This makes me melancholic. One should be able to express themselves without the overhead of privacy concerns.
dimmke 3 years ago

On the other side of the coin, I have never had an alternate HN account (beyond maybe 1-2 throwaways with only one post or comment) so seeing the list of users that are most similar to me was interesting. I didn't see some stark similarities based on a quick peek at their comments, but it was interesting.
bb88 3 years ago

sillysaurus3 was in mine. :) Clearly we're not the same.

jll29 3 years ago

The method used, i.e. calculating the cosine of the two authors' word vectors, is poorly suited for stylometric analysis because it is based on a poster's lexicon and the frequency of each word, but ignores stylistically relevant factors like word order.

Also, the cosine of the vectors of word frequencies conflates author-specific vocabulary and topics; in other words, my account is grouped (with >51% similarity, according to the demo) with someone probably because we wrote about similar things. A strong stylometric matcher ought to be robust against topic shifts (our personal writing style is what stays constant when we move from writing about one topic to writing about another topic, just like our personality is what stays constant about our behavior over time - of course styles do change, but the premise then has to be that such changes happen very slowly).

Stylometrics/authorship identification is interesting and has led to some surprising findings, e.g. in forensic linguistics (Malcolm Coulthard wrote several good books about the topic).

This paper lists some other features that could be used and compares a bunch of techniques: https://research.ijcaonline.org/volume86/number12/pxc3893384...
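
A toy illustration of the topic-conflation point above, using three made-up sentences and plain word counts. The helper name `word_cosine` is hypothetical, and real profiles would be built from far more text. Here two different authors writing about the same subject score higher than one author writing about two subjects.

```python
from collections import Counter
from math import sqrt


def word_cosine(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Author A has a verbal tic ("honestly i reckon"); author B does not.
a_on_rust = "honestly i reckon the rust compiler is slow"
b_on_rust = "the rust compiler is slow according to benchmarks"
a_on_food = "honestly i reckon the pasta sauce needs salt"

same_topic = word_cosine(a_on_rust, b_on_rust)   # different authors, 5 of 8 words shared
same_author = word_cosine(a_on_rust, a_on_food)  # same author, 4 of 8 words shared
print(same_topic, same_author, same_topic > same_author)
```

The topic words outvote the author's habitual phrasing, which is the failure mode described: the match reflects what was written about, not how it was written.
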

MikePlacid 3 years ago

> based on a poster's lexicon and the word frequencies of each word, but ignoring stylistically relevant factors like word order.
Interesting. I was expecting to be grouped with other Russian speakers and I am (based on some nicknames). But I thought the most telling feature will be exactly word order - it’s absolutely relaxed in Russian. Word frequencies? Well, probably the absence of articles, lol (but I swear to God that I often spend some extra time trying to insert as many articles in my texts as I could).
implements 3 years ago

There’s https://en.wikipedia.org/wiki/Idiolect :
“Language consists of sentence constructs, choice of words, and expression of style. Accordingly, an idiolect is an individual's personal use of these facets. Every person has a unique idiolect influenced by their language, socioeconomic status, and geographical location.”
antirez 3 years ago

In practice a more complex approach will tend to require a greater amount of data per user, so in this specific case this simple approach is not too bad. Moreover, fake accounts are likely to talk about the same topics, so while this leads to false positives, it also makes it more likely that we find actual duplicates in the list.

sillysaurusx 3 years ago

Ha, gruseom shows up for pg, which is dang’s old account. A worthy successor.

This is a fascinating way to find similar HN users who aren’t the same person. It’s a surprisingly great recommendation engine. “If you like pg, you might also like…”

Sure, the privacy concerns are valid, but the cat’s out of the boot. Might as well enjoy the benefits.

montrose is almost definitely pg. Someone who talks about ancient history, Occam’s razor, VCs and startups, uses the phrase “YC cos” (relatively uncommon), etc. https://news.ycombinator.com/item?id=17112567

Nicely done. One of the best hacks I’ve seen in a long time.

costco 3 years ago

> montrose is almost definitely pg. Someone who talks about ancient history, Occam’s razor, VCs and startups, uses the phrase “YC cos” (relatively uncommon), etc. https://news.ycombinator.com/item?id=17112567
I had this hunch too. It's either pg or someone trying really hard to be pg.
- roughly 3 years ago
  
  I mean, this is HN -
  > someone trying really hard to be pg
  describes half the site.
asveikau 3 years ago

> Someone who talks about ancient history, Occam’s razor, VCs and startups,
I think these are all common topics among HN readers and commenters.
VyseofArcadia 3 years ago

> but the cat’s out of the boot
It's my first time hearing that variant. Usually it's "the cat's out of the bag" where I'm from.
Do you mean boot in the UK sense, what Americans would call the trunk of a car? Or do you mean a sturdy piece of footwear?
Obligatory xkcd https://xkcd.com/2390/
- sillysaurusx 3 years ago
  
  It’s a little writing trick I learned from (I think) Orwell. Any time you’re about to use a common metaphor, try to tweak it. You’ll catch readers off guard, which piques their curiosity.
  It’s a fun game, too. I wish I’d used “the cat’s out of the hat,” but I didn’t think of it till later.
  
  InGoodFaith 3 years ago
  
  What you are describing is also known as an eggcorn.
  https://en.wikipedia.org/wiki/Eggcorn
  
  sillysaurusx 3 years ago
  
  Thank you! I was trying to find the original essay I learned it from. I’m now pretty sure it was by Poe, but all I can remember is the main advice: avoid common metaphors.
  I vaguely remember one of the metaphors in the essay was about a chicken coop melting, or something like that. It was vivid enough to leave a big impression.
  
  ewilden 3 years ago
  
  I remember this being from Politics and the English Language (https://www.orwellfoundation.com/the-orwell-foundation/orwel...):
  “Dying metaphors. A newly invented metaphor assists thought by evoking a visual image, while on the other hand a metaphor which is technically ‘dead’ (e.g. iron resolution) has in effect reverted to being an ordinary word and can generally be used without loss of vividness. But in between these two classes there is a huge dump of worn-out metaphors which have lost all evocative power and are merely used because they save people the trouble of inventing phrases for themselves.”
  
  sillysaurusx 3 years ago
  
  Thank you so much! That’s the one.
  (It’s remarkable how often a vague description can yield an HN comment with an answer from a clever sleuth like yourself. Much appreciated.)
  
  operator-name 3 years ago
  
  That's neeto!
  The 2nd example also loosely falls under the classification of malaphor.
  https://en.m.wiktionary.org/wiki/malaphor
  
  rcarr 3 years ago
  
  This is my all time favourite one of these:
  https://thehabit.co/knowledge-is-power-france-is-bacon/
  > When I was young my father said to me: “Knowledge is power, Francis Bacon.” I understood it as “Knowledge is power, France is bacon.”
  > For more than a decade I wondered over the meaning of the second part and what was the surreal linkage between the two. If I said the quote to someone, “Knowledge is power, France is Bacon,” they nodded knowingly. Or someone might say, “Knowledge is power” and I’d finish the quote “France is bacon,” and they wouldn’t look at me like I’d said something very odd, but thoughtfully agree. I did ask a teacher what did “Knowledge is power, France is bacon” mean and got a full 10-minute explanation of the “knowledge is power” bit but nothing on “France is bacon.” When I prompted further explanation by saying “France is bacon?” in a questioning tone, I just got a “yes.” At 12 I didn’t have the confidence to press it further. I just accepted it as something I’d never understand.
  > It wasn’t until years later I saw it written down that the penny dropped.
  
  sambapa 3 years ago
  
  You left out the funniest thing - the guy/gal's nickname was "Lard_Baron"
  
  b800h 3 years ago
  
  An eggcorn is a soundalike though, isn't it? Deliberately altering idioms to catch people's attention isn't an eggcorn IMO.
  
  InGoodFaith 3 years ago
  
  > An eggcorn is a soundalike though, isn't it?
  Not necessarily, you might be thinking of malapropisms [1], but yes, probably a closer word would be the general term: protologism.
  Another commenter added some useful info on the evocative alteration of metaphors [2]
  1: https://en.wikipedia.org/wiki/Malapropism
  2: https://news.ycombinator.com/item?id=33757097
  
  sdwr 3 years ago
  
  I love doing this too, it's fun to write.
  
  UncleEntity 3 years ago
  
  Yeah, it’s like shooting ducks in a barrel it works so well.
  Easy to overuse then people just get annoyed though…kind of like commas, I suppose.
  
  PebblesRox 3 years ago
  
  That reminds me of a PETA campaign on social media trying to get people to replace violent idioms with alternatives like "feeding a fed horse" and "there's more than one way to pet a cat."
  
  esfandia 3 years ago
  
  I like mixing metaphors, in this case "the cat's out of the tube". ("the toothpaste's out of the bag" doesn't work as well though)
- martin82 3 years ago
  
  There's a popular movie called "Puss in Boots". That's what I had to think of first.
  
  pvg 3 years ago
  
  It's a bit older than the movie or movies in general.
  https://en.wikipedia.org/wiki/Puss_in_Boots
pyb 3 years ago

Why would montrose be pg ? The correlation is not that high. Looks like a few people have picked up pg's mannerisms.
- costco 3 years ago
  
  There are factors that make me think it is more likely than not (just scrolled through the comment history, don't feel like linking everything) that he is pg.
  - Is bolded on pg's page
  - Mentions yoga
  - Talks about Lisp often
  - Talks about YC often
  - Talks about kids
  - Links to Paul Graham's website
  - Says he uses vi
  - Writes exactly like you would expect pg to write
  
  ethmaxi 3 years ago
  
  I'm sophisticately sure they are not. They recommend a founder to ask users directly what they will pay for.
  Is that what PG would say?
  
  sillysaurusx 3 years ago
  
  Of course. Why wouldn’t he? That’s sound advice.
  
  ethmaxi 3 years ago
  
  YC startup videos recommend not asking users directly what they will pay for.
  Users freq. say they will pay for something but back down against other things.
  
  costco 3 years ago
  
  With all due respect:
  https://news.ycombinator.com/item?id=16785542
  https://twitter.com/paulg/status/1362369484036653058
  I think you are very likely to be wrong.
  
  astura 3 years ago
  
  Wow, what an odd thing to get so worked up about.
  
  pyb 3 years ago
  
  I agree that this person is trying very very hard to sound like pg ! You could be right actually. Could still be a "wannabe" though.
- seba_dos1 3 years ago
  
  Yeah, that score is only slightly higher than the highest one it shows for my account (which is also bold) - and unless my alter ego has been disguised so well it even managed to hide from myself, I'm pretty sure that isn't me :)
  
  kazinator 3 years ago
  
  The score for montrose vs pg is lower than the score for someone most similar to me, who is definitely not me.
  I think the similarity has to be in the high .80s to suspect that it's the same individual.

rcarr 3 years ago

This is somewhat similar to how they ended up catching the Unabomber. The FBI were literally at a dead end. They ended up posting one of his letters/manifestos in the paper; somebody recognised a turn of phrase the Unabomber used that was unusual and reported it as possibly being their brother; the FBI investigated the lead and it led them straight to him.

Excerpts from wiki:

> Before the publication of Industrial Society and Its Future, Kaczynski's brother, David, was encouraged by his wife to follow up on suspicions that Ted was the Unabomber.[91] David was dismissive at first, but he took the likelihood more seriously after reading the manifesto a week after it was published in September 1995. He searched through old family papers and found letters dating to the 1970s that Ted had sent to newspapers to protest the abuses of technology using phrasing similar to that in the manifesto.[92]

> In early 1996, an investigator working with Bisceglie contacted former FBI hostage negotiator and criminal profiler Clinton R. Van Zandt. Bisceglie asked him to compare the manifesto to typewritten copies of handwritten letters David had received from his brother. Van Zandt's initial analysis determined that there was better than a 60 percent chance that the same person had written the manifesto, which had been in public circulation for half a year. Van Zandt's second analytical team determined a higher likelihood. He recommended Bisceglie's client contact the FBI immediately.[96]

> In February 1996, Bisceglie gave a copy of the 1971 essay written by Ted Kaczynski to Molly Flynn at the FBI.[87] She forwarded the essay to the San Francisco-based task force. FBI profiler James R. Fitzgerald[98][99] recognized similarities in the writings using linguistic analysis and determined that the author of the essays and the manifesto was almost certainly the same person. Combined with facts gleaned from the bombings and Kaczynski's life, the analysis provided the basis for an affidavit signed by Terry Turchie, the head of the entire investigation, in support of the application for a search warrant.[87]

https://en.m.wikipedia.org/wiki/Ted_Kaczynski

googlryas 3 years ago

It was actually his brother.
fbdab103 3 years ago

So is the lesson you should have GPT rewrite your manifesto so as to obscure your personal idioms?
- CharlesW 3 years ago
  
  Or something purpose-built like Anonymouth (https://github.com/psal/anonymouth), although it seems to be both unique and dead.
  Also interesting:
  > Ross Ulbricht aka Dread Pirate Roberts, the mastermind behind the infamous Silk Road site which served as a black market for drugs, weapons and fake documents was also well aware of the potential danger of stylometry being used against him. At the time of his arrest in a San Francisco public library, the FBI captured images of his laptop screen as evidence. Guess what he had bookmarked — “Science of Stylometry.”
  https://medium.com/svilenk/the-case-for-anonymity-12db114f0c...
  
  rejectfinite 3 years ago
  
  I mean he used a forum account with an email that had his name in it.
  
  fbdab103 3 years ago
  
  That's the problem - it only takes a single slip and it is recorded forever. Perfect opsec is an impossibly high bar if you are maintaining an active online presence.
- astura 3 years ago
  
  Only if you have a history of sending crazed writings/manifestos to newspapers and family.
atestu 3 years ago

The show “Manhunt: Unabomber” (Netflix) shows this whole story very well.
ryangittins 3 years ago

As I recall, one of the clinchers was his use of the phrase, "you can’t eat your cake and have it too" as opposed to the now-predominant variant "you can’t have your cake and eat it too."
I often wonder if stylometry can be used to positively identify a person based not on general word frequency, but by a single phrase or two which are rare in general but commonly used by the individual. In theory this could be relatively easy to find given a large corpus. You'd pick out the top few n-grams for short phrases by an individual and identify the ones which are most overly-represented compared to the rest of the population.
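
The idea in the last paragraph can be sketched roughly as below: count short word n-grams for one person, compare their rate against everyone else's text, and surface the phrases that person uses far more often than the population. This is an illustration only. The function names, the phrase lengths (2 to 4 words), the `min_count` cutoff, the add-one smoothing, and the sample sentences are all assumptions.

```python
from collections import Counter


def phrase_counts(text, sizes=(2, 3, 4)):
    """Count word n-grams (as space-joined strings) of the given sizes."""
    words = text.lower().split()
    counts = Counter()
    for n in sizes:
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


def distinctive_phrases(user_text, population_texts, top=5, min_count=2):
    """Phrases the user repeats that are over-represented versus the population."""
    user = phrase_counts(user_text)
    pop = Counter()
    for t in population_texts:
        pop.update(phrase_counts(t))
    user_total = sum(user.values()) or 1
    pop_total = sum(pop.values()) or 1
    scored = []
    for phrase, c in user.items():
        if c < min_count:  # ignore phrases the user wrote only once
            continue
        user_rate = c / user_total
        pop_rate = (pop[phrase] + 1) / (pop_total + 1)  # add-one smoothing
        scored.append((user_rate / pop_rate, phrase))
    return [p for _, p in sorted(scored, reverse=True)[:top]]


# Made-up text: the user prefers the older "eat ... and have it" word order.
user_text = (
    "eat your cake and have it too is what i say "
    "and i say it often eat your cake and have it too"
)
population = [
    "have your cake and eat it too",
    "you want to have your cake and eat it too",
    "i say it often",
]
top = distinctive_phrases(user_text, population)
print(top)
```

Fragments that also occur in the common variant, such as "your cake and", score lower because the population uses them too, while fragments of the rarer ordering rise to the top.
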

drc500free 3 years ago

This is a super interesting tool for self reflection. Looking at the top 10 similar accounts to mine, it gives me an arms-length view of how other people probably interpret my tone.

I appear to be a well-educated, over-confident know-it-all.

bhaney 3 years ago

> I appear to be a well-educated, over-confident know-it-all.
Don't we all?
- sdwr 3 years ago
  
  I hate us insufferable nerds!
seydor 3 years ago

we must be a good match
- drc500free 3 years ago
  
  I'd love a version of this where you enter two usernames and get a match score.
pavlov 3 years ago

My #3 match is cstross, and now I’m convinced that my life-long secret dream of being a successful sci-fi novelist is basically a matter of typing. (Ideas? Character development? Ruthless editing? Developing an audience? Having a publisher? What do I need of those when the Computer told me I’m practically a genius…)
- shagie 3 years ago
  
  I'd suggest giving the back story to Agent to the Stars by John Scalzi a glance.
  http://www.scalzi.com/agent/
  > In the summer of 1997, I was 28 years old, and I decided that after years of thinking about writing a novel, I was simply going to go ahead and write one. There were two motivations for doing so. First, I was simply curious if I could; I'd had up to that time a reasonably successful life as a writer, but I'd never written anything longer than ten pages in my life outside of a classroom setting. Two, my ten-year high school reunion was coming up, and I wanted to be able to say I'd finished a novel just in case anyone asked (they didn't, the bastards).
  > In sitting down to write the novel, I decided to make it easy on myself. I decided first that I wasn't going to try to write something near and dear to my heart, just a fun story. That way, if I screwed it up (which was a real possibility), it wasn't like I was screwing up the One Story That Mattered To Me. I decided also that the goal of writing the novel was the actual writing of it -- not the selling of it, which is usually the goal of a novelist. I didn't want to worry about whether it was good enough to sell; I just wanted to have the experience of writing a story over the length of a novel, and see what I thought about it. Not every writer is a novelist; I wanted to see if I was.
closeparen 3 years ago

That's what we all come to HN for...
bee_rider 3 years ago

I also enjoyed reading one of my style-partner’s posts.
The most noticeable similarity is that we both clearly have strong opinions about some things, and like to share information, but also like to be clear about our unknowns or opinions. So, lots of “sounds likes,” “probably,” “could be” and so on.
The downside is, I guess, this could be seen as a bit weasel-word-y or indirect.
- reducesuffering 3 years ago
  
  > like to be clear about our unknowns or opinions. So, lots of “sounds likes,” “probably,” “could be” and so on.
  Commonly called just “hedging” like hedging your bets.
  
  bee_rider 3 years ago
  
  That’s a kinder description than I gave it in my next paragraph, so thanks I suppose.
  I do think it is an under-emphasized aspect of honesty, though, that we should be clear about our level of experience/understanding. Especially online — people like to discuss things, even (especially?) when we are just getting started. So if we’ve picked up opinions through osmosis and we start repeating them without testing them, we’re really just amplifying some possibly-incorrect viewpoint (and if we’ve picked it up, there’s a good chance it is already widespread in the community, which is bad if it is wrong).
  And I mean, more concretely a measurement is not complete without the error bars!
  Often this doesn’t really matter, because it is just chit-chat anyway. But it is nice to keep in mind.
  
  fancybouncy 3 years ago
  
  > we should be clear about our level of experience/understanding
  there are many languages that encode this info as mandatory grammatical affixes, it's called evidentiality.
  
  bee_rider 3 years ago
  
  I hadn’t heard of that. Neat!
  I find it interesting that the first example they use in the Wikipedia article is Turkish. I’ve only met a couple Turks, but they were all quite good engineers. I wonder to what extent embedding this kind of information in the language helps organize your thoughts.
highwaylights 3 years ago

Same. Looking through some of the handles on my list tells me that I come across like a not-particularly-well-educated McSmug that needs to take a good long look at myself. Wouldn’t be so bad if I wasn’t reading the posts thinking I definitely could see myself writing this.
This was certainly eye-opening.
Update: It’s actually a little strange that reading through some of the matches it’s not just style that overlaps but perspectives in quite a few cases too. I’m definitely not the unique little snowflake that some others are finding themselves to be.
reducesuffering 3 years ago

> over-confident know-it-all.
I’m pretty sure participation in HN is a 99% sure filter for being called this many times in one’s life.

jsnell 3 years ago

After a few tries on boring accounts, I thought to try the account of somebody who was notorious for an incident outside of HN, and had a (deservedly) bad time at HN for a couple of years before the account went dark.

And yeah, there's a bunch of high confidence (.6-.8) hits for that account, and from a quick browse of the comments of the recently active ones, they look really likely to be alts. Like, all three that I looked at had comments that made it very clear it was this person writing pseudonymously. (E.g. writing on their signature issue, and saying they couldn't go into more detail due to fear of self-doxxing; or somebody literally saying that the alt's claims reminded them of the public writings of the notorious guy years ago).

Obviously I'm not naming the account, but this functionality turned out way creepier than I thought the moment I tried it on the account of somebody who has a reason to disassociate from an existing public persona, but still wants to participate here.

kcarter80 3 years ago

Could you elaborate on why it's obvious why you won't name the account?
- notduncansmith 3 years ago
  
  Maybe to avoid attracting any extra attention to this user? Also, as someone who’s read HN for a few years, it only took me 2 guesses to find an account that the above comment describes (and not necessarily the same person).
  
  sillysaurusx 3 years ago
  
  It was a classy move by jsnell, too. Thank you.
  (I don’t know who the comment is talking about, which is how it should be. There’s no need to blow someone’s cover in a highly visible way. Even if they were satan, they’d still be welcome on HN as long as they’re writing substantive, interesting comments that follow the guidelines.)
  
  Normal_gaussian 3 years ago
  
  Such quality comments would track with most thorough Satan representations.
- Aachen 3 years ago
  
  They obviously don't want it to be known, seeing as they've got alts to post under and avoid going into too much detail. Being able to go out and do your own research is different than posting the information open for everyone to see at a glance.
  I would say it's obvious why one might respect that wish (do unto others...), but I'm also aware that my and my culture's sense of privacy goes further than many others'.
tbrownaw 3 years ago

> but this functionality turned out way creepier than I thought the moment I tried it
Hopefully this raised awareness means that people who actually need anonymity will be more likely to know to take precautions.
- kaba0 3 years ago
  
  Genuinely asking, what way is there to combat this? Is there a tool that takes out stylistic elements of your comment?
  
  paulgb 3 years ago
  
  The site mentions a service called Quillbot which apparently does just that. https://stylometry.net/avoid
  
  thedragonline 3 years ago
  
  I wonder if gpt3 has a use case here?
  
  marbu 3 years ago
  
  One way would be to run such a tool before posting and then, based on the results, tweak the post and repeat until the similarities are not statistically significant. Or instead of tweaking, start posting under a new throwaway account. But this won't save you when some new way to analyze style appears in the future. Moreover, there are other types of metadata that can be taken into account to narrow down the search space a bit, such as timestamps. And obviously the more you write, the harder it is to control these things.
  
  klabb3 3 years ago
  
  This is the million dollar question. I think the goal of "anonymity for most intents and purposes" is worthy, it's been how I've enjoyed HN and Reddit, but I also know that it was just a matter of time before stylometry and other meta-analysis of post history become 10 second tools for everyone. Now the cat is out of the box.
  I've been thinking about this a bit, and I've landed on the view that having a stable identifier across ALL comments & posts is a poor default. We still probably want some coherence, at minimum within a thread, e.g. to follow a back-and-forth. The site itself may also use a stable identifier for abuse prevention. But there's no reason one should have the same username externally traceable for posts about completely different topics.
  In practice, this could be done with low friction pseudonym creation, which all ties to the same account privately.
thesz 3 years ago

I keep no alternate accounts, but the best matches this tool reports for me appear to be Slavic or just Russian - and I am Russian. The best match score in my list is just above 0.5. There are some clearly alternate accounts on the list; their match scores with this tool are well above 0.7.
It is probable that people of the same cultural origin will have a similar writing style and vocabulary. It is also probable that people of the same cultural origin would relate to the world as a whole in the same way: they would like the same things and dislike the same other things.
So, in my opinion, it is possible that you have found not only alternate accounts (score above 0.7), but also accounts of people with the same cultural origin (ones that are around 0.6).
- vbezhenar 3 years ago
  
  There are 19 other accounts this tool finds similar to me. Those are not my accounts. The scores range from 0.46 to 0.56.
  
  costco 3 years ago
  
  I think people are sort of confused at what this tool is supposed to be which I will concede is partially my fault. The results of this tool are by themselves not indicative of having an alternative account. It generates the 20 most similar users for every single user on the site, regardless of whether they have an alt or not (there's obviously no way for me to know that for every single user). In your case further investigation would reveal that none of those accounts are yours.
  
  thesz 3 years ago
  
  It is a fun tool, I can assure you. It is just that people have found a use case you hadn't foreseen yourself.
  I think your tool presumably has an internal embedding for each user. Also, most probably your tool uses cosine similarity for the search.
  Thus, I would like to suggest a feature: recognize simple arithmetic operations over users' embeddings, such as "thesz - 2 * patio11". It will make things even more fun: this way we can find users who are like me and very much unlike patio11. Even simple additions and subtractions would suffice.
  (the idea is taken from the properties of word2vec embeddings)
  Your tool is thought-provoking. What I discovered with it made me think about my use of language and what other languages (body, imagery, etc) I use differently because of who I am. Which made me think about my favorite underrated superhero Cypher [1] - would his innate ability to understand languages make him the best detective ever?
  [1] https://en.wikipedia.org/wiki/Cypher_(Marvel_Comics)
  Thank you!
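
A minimal Python sketch of that kind of embedding arithmetic, assuming every user already has a style vector held in memory. The usernames and two-dimensional vectors below are made up for illustration; nothing here is taken from the site's code:

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def combine(terms, vectors):
    """Weighted sum of user vectors, e.g. [(1, "user_a"), (-2, "user_b")]."""
    dim = len(next(iter(vectors.values())))
    out = [0.0] * dim
    for weight, name in terms:
        for i, x in enumerate(vectors[name]):
            out[i] += weight * x
    return out

def most_similar(query, vectors, exclude=(), k=3):
    """Usernames ranked by cosine similarity to the query vector."""
    scored = [(cosine(query, vec), name)
              for name, vec in vectors.items() if name not in exclude]
    return [name for _, name in sorted(scored, reverse=True)[:k]]

# Hypothetical style vectors; real ones would have many more dimensions.
VECTORS = {
    "user_a": [1.0, 0.0],
    "user_b": [0.0, 1.0],
    "user_c": [1.0, -1.0],
    "user_d": [1.0, 1.0],
}

# "user_a - 2 * user_b": like user_a, strongly unlike user_b.
query = combine([(1, "user_a"), (-2, "user_b")], VECTORS)
best = most_similar(query, VECTORS, exclude={"user_a", "user_b"}, k=1)
```

Here `best` comes out as `["user_c"]`, the toy user whose vector points away from `user_b`.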
  
  costco 3 years ago
  
  Really cool idea. I'd need to upgrade the VPS though so all the vectors would fit in memory but it probably wouldn't be too hard (right now I'm just storing a map of username string -> array of 20 username strings because my VPS only has 512mb RAM). I'll think about if I can do this in a way that is more resource conservative.
  
  csa 3 years ago
  
  Fwiw, and as gp mentioned, > 0.7 seems more likely to be alt territory.
  
  b112 3 years ago
  
  You are fools, one and all! This tool's only purpose, is to tag people who use it!
  Now they know just who cares about which alternate accounts. They know!
  They freaking know, man!
  You have all fallen for their ploy. Fools!
  
  thesz 3 years ago
  
  I have no alternate accounts and visited the site out of curiosity, because I used to work in a domain like this.
  What I found was worth visiting the site. Somehow, notably many accounts with (relatively) high similarity to mine share at least one of my personal traits.
  Which is fascinating to me.
  And I think it is worth noticing by others - what and how you write can disclose who you are.
  
  TheOtherHobbes 3 years ago
  
  It knows my IP now.
  (Or does it?)
  
  neodypsis 3 years ago
  
  It offers no privacy policy, so can't tell.
- ricardobayes 3 years ago
  
  My highest was 0.41 and the person writes nothing like me. I guess I'm a unique snowflake after all.
  
  gilleain 3 years ago
  
  my second highest hit (ie, third in the list) is gwern at 0.45 who i'm fairly sure is not me.
  
  scarmig 3 years ago
  
  I was actually just looking at near hits for gwern and found what's almost definitely a defunct alt for him.
  
  gilleain 3 years ago
  
  Well, it's certainly NOT me, that's for sure.
  On an unrelated topic, I'm starting a service to write comments in the style of others to provide plausible deniability for other alt accounts. Rates negotiable.
  
  jrumbut 3 years ago
  
  I have a few in the low 0.5's and, honestly, they seem cool and I want to meet them.
  
  Litost 3 years ago
  
  I was curious about this, my highest match was 0.47 and I have no alts, maybe I'm also a unique snowflake, or haven't said anything noteworthy enough to have been deepfaked yet ;).
- weaksauce 3 years ago
  
  I don't have any alternate accounts here either and my writing style is apparently nearly the same as a high profile account that I recognize and has many points. I wouldn't say this is a highly accurate thing.
Animats 3 years ago

0.6 isn't much. I have 3 matches above 0.6, and they're not me. 20 or so over 0.5.
- jsnell 3 years ago
  
  That's why you manually evaluate the matches. And like I wrote in that comment, I did that manual eval, and these clearly are alts of that main account, not spurious. Narrowing down the pool of accounts you'd need to do this kind of manual evals for by a factor of 100000 is a pretty significant change in capabilities.
- input_sh 3 years ago
  
  I get one 0.68 match, which... fair enough. It is an account I've abandoned some years ago, no secrets there.
  No other hits above 0.5, so I guess that either makes me pretty unique as a commentator or my English is broken in a unique way.
phreeza 3 years ago

MD5 of the username is 9abc27e93b7e3c04b7c599017c1cfe5f ? The top one seems an odd one out in that case?
- Aachen 3 years ago
  
  Usernames aren't random enough to be safe as a simple MD5. Perhaps with a strong bcrypt, but similar to PIN codes, it might be better to give partial information like "is the second character an ...", assuming nobody else made similar statements. Or give the first ~two hex characters of the hash, so that it would match 1/16² (1/256) of the usernames. I'm sure there's also a clever way to do a zero-knowledge proof here, probably something with Diffie-Hellman using the name as your random integer or something, but I'm too sick to think about this stuff right now. Privately sharing data publicly is hard.
  
  lzooz 3 years ago
  
  Good point - I've been running john on that md5 for a couple minutes :)
  
  wizzwizz4 3 years ago
  
  Why use John? Just run down the list of Hacker News usernames; it'll take less time. (Or, better still, don't; just because the privacy's theoretically compromised doesn't mean we have to exploit that.)
  
  lzooz 3 years ago
  
  I don't think there's a public list of all HN usernames is there?
  Found this, it includes 250k usernames, but it's not there. https://www.kaggle.com/datasets/hacker-news/hacker-news-corp...
  
  meta2023 3 years ago
  
  The username in question isn't in this dataset but maybe it was created in the past 10 days, as the max(timestamp) is Nov 16th, 2022.
  https://console.cloud.google.com/marketplace/details/y-combi...
  
  lzooz 3 years ago
  
  It isn't there, and given the "story" it happened years ago so it should be there, so I guess we've been played.
  
  phreeza 3 years ago
  
  Unintentionally played I might add... But I will leave it at that.
  
  ahmedalsudani 3 years ago
  
  Another problem is that it's a small set. If you had a list of all HN users, you could compute md5 for all of them in seconds.
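
To illustrate in Python why a small candidate set defeats an unsalted hash: the sketch below just hashes every candidate and compares. The candidate list is made up; a real run would need an actual list of HN usernames, which this thread did not manage to find in complete form.

```python
import hashlib

def find_username(target_md5, usernames):
    """Return the first username whose MD5 hex digest equals target_md5, else None."""
    target = target_md5.lower()
    for name in usernames:
        if hashlib.md5(name.encode("utf-8")).hexdigest() == target:
            return name
    return None

# Hypothetical candidates standing in for a full username dump.
candidates = ["alice", "bob", "carol"]
target = hashlib.md5(b"bob").hexdigest()
match = find_username(target, candidates)
```

One MD5 per candidate is cheap, which is why the cost is dominated by getting the list rather than by the hashing.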
  
  phreeza 3 years ago
  
  I think the intention of the post not mentioning the handle was just to prevent old discussions from flaring up or so? The post doesn't really contain any new information on the person that would be worth obscuring. So I just thought I'd hash it to prevent that. But it seems I actually screwed up the hashing so I will leave it at that.
irrational 3 years ago

.6 is high confidence? I did my own username, wondering what it would return, since I know I don’t have any alt accounts. The top results are in the .6-.7 range. If they aren’t alt accounts, is it just coincidence that we have similar writing styles?
- bee_rider 3 years ago
  
  I think so.
  A funny thought — my “matches” cap out at around .56. Having false positives* in a tool like this might feel like a “bad result” but actually I think it just means that if someone were running this sort of tool across the whole internet, I’d be relatively easy to correlate, while your identity would be intermingled with your .6-.7 partners.
  *actually they aren’t really even false positives because the tool doesn’t promise to detect alts in the first place, just find similar styles.
tqi 3 years ago

> quick browse of the comments of the recently active ones, they look really likely to be alts.
Hmm isn't a spot check of comments somewhat tautological, since that is how the tool identifies alts (rather than something like IP address or time of day)? If this had been promoted as "find accounts with similar writing style to yours" would people immediately assume alts?
- margalabargala 3 years ago
  
  I would presume that OP is referring to the actual content of the comments. This just does stylometric analysis, which looks at word choice, but not what the arrangement of the words means.
  If some accounts are found to be stylometrically similar, and then a visual inspection also shows them all stating similar opinions, that latter piece of data is a strong signal.

gus_massa 3 years ago

It would be nice to make the names clickable.

I don't think the list of pg's alternate accounts is accurate. I checked a few. They have many one-liners, which is typical of pg, but the topics and style don't look similar.

I searched a few more and got better results. :)

I searched myself (I know that I have no alternate accounts). I recognize a few users that are interested in similar topics, and I discuss/upvote them many times. But I didn't recognize most of the users on the list.

costco 3 years ago

> I searched myself (I know that I have no alternate accounts). I recognize a few users that are interested in similar topics, and I discuss/upvote them many times. But I didn't recognize most of the users on the list.
It's based purely off frequency of the 200 most common English 1 word phrases, 2 word phrases, 3 word phrases, 1 character sequences, 2 character sequences, and 3 character sequences. Topic does not really have anything to do with it. If I had more time I probably would've done a smarter model that accounted for things like that.
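
A minimal Python sketch of that kind of feature vector. The feature list is a tiny made-up stand-in for the real lists of common n-grams, and turning counts into relative frequencies is an assumption rather than a description of what the site actually does. The comparison uses cosine similarity, which is mentioned elsewhere in this thread as the scoring method:

```python
import math
import re
from collections import Counter

# Hypothetical feature list: (kind, n, gram). The real lists would hold the
# most common English word and character n-grams, which are not given here.
FEATURES = [
    ("word", 1, "the"),
    ("word", 2, "of the"),
    ("word", 3, "a lot of"),
    ("char", 1, "e"),
    ("char", 2, "th"),
    ("char", 3, "ing"),
]

def word_ngrams(text, n):
    """Lowercased word n-grams, joined with single spaces."""
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_ngrams(text, n):
    """Lowercased character n-grams."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def profile(text, features=FEATURES):
    """Relative frequency of each feature among all n-grams of its kind and length."""
    tables = {}
    for kind, n, _ in features:
        if (kind, n) not in tables:
            grams = word_ngrams(text, n) if kind == "word" else char_ngrams(text, n)
            tables[(kind, n)] = (Counter(grams), max(len(grams), 1))
    return [tables[(kind, n)][0][gram] / tables[(kind, n)][1]
            for kind, n, gram in features]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

With this, `cosine(profile(all_comments_of_a), profile(all_comments_of_b))` gives one similarity score per pair of users, where the two arguments are each user's comments concatenated into one string.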
- gus_massa 3 years ago
  
  One is also a mathematician. It's trivial that we overuse some technical words even when it's unnecessary.
  Another is from Argentina, so I guess the native language leaks, for example by using words derived from Latin that are not idiomatic.
  And there are a few more that it is an honor to be "confused" with, but I have no clue why.

Fnoord 3 years ago

Cool stuff, thank you for sharing your findings!

I don't do throwaways. I either post or STFU. I also STFU on darknet. It's why I found it fun to read/lurk on things like I2P back when it was new. And I know that on a pseudonymous account it is only a matter of time until it can be linked to another pseudonymous account. It would not surprise me if stylometry was used on Dread Pirate Roberts or the people behind The Pirate Bay or the people behind Wikileaks (Assange's sockpuppet accounts). It could also have been used to verify afterwards instead of beforehand. Though with TPB, since it was on clearweb, an advanced adversary could have used a correlation/timing attack to figure out who wrote what.

I'm having fun times recognizing other Dutch people through their usage of the English language. For example, a distinctive word I see Dutch people use a lot is 'oke' instead of 'OK' or 'okay'. It's a red flag that the person is native Dutch. I wonder if there are stylometry tools available for figuring out whether someone used a physical vs touchscreen keyboard (I used Glider to write this post, spellchecker unavailable).

And yes, organizations like secret service and police should use such tools as well. It is a known tool, why not use it for good? As with any tool, it can be used for good and evil. On HN this could be useful for the mod team (AFAIK nowadays only dang) to find banned people's sockpuppets. Cross-community could also be a fun project: find a HN user's Twitter or Reddit account. And I hope this method is also used to find Russian trolls on social media.

ghaff 3 years ago

Most people greatly underestimate the power of linkage attacks on anonymity. And it doesn't even take fancy ML. In the context of healthcare records, I like to trot out this 25 year old example of an MIT grad student and the then-governor of MA.
https://ischoolonline.berkeley.edu/blog/anonymous-data/

dlkf 3 years ago

The top hit on my list looked familiar. I looked at their recent comments and saw a discussion between that user and me. We were quoting each other directly throughout.

I wonder if this explains our similarity. And if so, could we tweak the algo by e.g. removing text that is prefixed with ">"?
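
A minimal Python sketch of that tweak, assuming each comment is already plain text with quoted passages on their own lines starting with ">" (raw comments from an API may arrive HTML-escaped and would need decoding first):

```python
def strip_quotes(comment):
    """Drop lines that start with '>' so quoted text is not counted as the author's own."""
    kept = [line for line in comment.splitlines()
            if not line.lstrip().startswith(">")]
    return "\n".join(kept).strip()

example = "> The top hit on my list looked familiar.\n\nSame here, oddly enough."
cleaned = strip_quotes(example)
```

This only catches the common ">" convention; quotes marked with quotation marks or italics would still leak the other person's style into the profile.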

bscphil 3 years ago

The scary thing is that once you have this data, finding HN matches for individual targeted users on other sites becomes trivial, even if those sites are harder to scrape. I bet most people here have an anonymous Reddit account, for example. If you wanted to know who was behind a particular Reddit account, you could feed it into something like this and compare the results with HN, where accounts are less likely to be anonymous. Or build a database based on blogs, Github comments, etc.

Also, since this uses only word frequency, there are probably relatively easy improvements to make that would make it even more powerful, like looking at particular runs of words that are unique. Some expressions or figurative language only show up in combinations of words, and tend to be highly style specific.

faeriechangling 3 years ago

Thus proving the only actually anonymous community in practice is 4chan, and that’s why it’s so toxic.
- sbierwagen 3 years ago
  
  If you define “toxic” as “people disagreeing with you”, sure. That was what the entire internet was like until maybe 2005.
  
  ben_w 3 years ago
  
  I'm old enough to remember when 4chan was self identifying as the Internet's hate machine, before xkcd referenced it as such: https://xkcd.com/591/
  Sometimes people insist that's all role-play and irony; others insist that if it ever was, it certainly isn't now.
  But regardless, I remember pre-2005, and it wasn't all like what I saw the two times I looked at 4chan. Bits were. Bits were much worse. But mostly, mostly, people were kinder… at least, unless political tribalism came up.
  
  philosopher1234 3 years ago
  
  “People disagreeing with you” describes almost none of the conversation on 4chan
costco 3 years ago

I could have used a part of speech tagger, looked at time of day a user posts, capitalization, spelling errors, etc. From what I understand the state of the art is lightyears ahead of this, there are even companies with actual linguists who will act as expert witnesses in court to say stuff like "we can say with 95% certainty that xyz authored this email." Honestly it's kind of scary. There are papers that talk about cross platform authorship attribution, one I think did it with Twitter, Blogspot, G+ and had pretty good results.

setr 3 years ago

Forget the alternate accounts — if two users are close in style, there’s a decent chance they should be friends. This is an HN friendship machine.

saurik 3 years ago

It would be convenient if the usernames linked to the comment pages on Hacker News (to avoid having to copy/paste and URL hack, which is made even slightly more annoying because for some reason when I tap and hold the usernames to copy them your markup--I haven't looked at why yet--is causing an extra space character to get copied on the left).

dsr_ 3 years ago

This is interesting.

I'm 0.566 correlated with logfromblammo -- and while we are definitely not the same person, I could easily imagine writing a sentence such as:

"For some bizarre reason, management has not yet assigned a task to their programmer underlings to automated themselves out of existence. I can't imagine why."

which is theirs, not mine, from about a year ago. I like that.

On the other hand, I'm nearly as correlated with peterwwillis: 0.5485 -- who has no comments and no submissions.

costco 3 years ago

> On the other hand, I'm nearly as correlated with peterwwillis: 0.5485 -- who has no comments and no submissions.
This is due to the Firebase API not updating when users ask the admins to move their comments to another account.
- matsemann 3 years ago
  
  Yeah, I got a good match with my previous nick here. Which to me proves the tool works well.
lifeisstillgood 3 years ago

I had a similar experience finding my most likely alt (.50, suggesting I am a unique snowflake as I have always thought :-). My most likely alt certainly writes in a style I appreciate and on subjects I often mention.

DenisM 3 years ago

How about this for countermeasure:

As you're typing out a comment the software gives you a list of accounts you're becoming similar to. That way you can adjust your writing as you type.

bornfreddy 3 years ago

Sounds great, except there are many different similarity measures. Which one does the algorithm use?
- wizzwizz4 3 years ago
  
  Why not all of them? Which metrics are closer would tell you which aspects of your writing you need to focus on.
kaba0 3 years ago

Someone linked it in the thread: https://github.com/psal/anonymouth
pessimizer 3 years ago

Forget countermeasures, go covert. Write a comment, have the comment be rewritten before submission in order to resemble a targeted account.

davebillyhock 3 years ago

This found an alt that I created specifically to see if I could write artificially to defeat this kind of analysis. I have seen other tools like it posted to HN, but none before had found that account. I guess I need to up my game.

CharlesW 3 years ago

If you don't mind sharing, are you "writing artificially" purely in your head, or are you using techniques like intermediate translations?
- davebillyhock 3 years ago
  
  No mechanical means, but I have referred to a thesaurus occasionally. Mostly I tried to change my sentence structure, not just words. It requires actually thinking differently, in a way. Which makes it difficult to know how well I'm communicating.
  
  crtified 3 years ago
  
  I imagine this would be quite difficult in practise, due to all the subliminal factors behind a person's writing choices.
  For example, as somewhat illustrated here, your personal vocabulary is a kind of fingerprint. As you mention, using a thesaurus can somewhat alleviate that, but if a thesaurus is only changing a small % of your words, then it will only have a suitably small % effect upon analysis.
  To go yet further might (I suspect!) entail methods such as directly lifting and using other people's sentences to convey your own thoughts. But even then, "your own thought patterns" are still informing the manner of the post, to some extent, so over time increasingly robust analysis may still find patterns to hook into.
  
  neodypsis 3 years ago
  
  I wonder if someone will come up with a Grammarly-like tool which you can feed with sample writings to help you increase/lower the similarity score of a new text you are writing.

serhack_ 3 years ago

costco 3 years ago

That post was actually what motivated me to make this. I'm on your email list :)
- crecker 3 years ago
  
  WOW! It's such a pleasure for me

super256 3 years ago

Ahhh, does anyone remember the hacking crew who leaked BLUEETERNAL and other NSA tools and exploits? Shadowbrokers.

They were always communicating in some kind of meme-russian, and their texts were funny to read. [1]

I believe their writing mostly defeated this kind of analysis, at the cost of looking like idiots (which was probably the reason no one sent them crypto-dollars to buy that stuff exclusively).

Here's an excerpt:

"Attention government sponsors of cyber warfare and those who profit from it !!!!

How much you pay for enemies cyber weapons? Not malware you find in networks. Both sides, RAT + LP, full state sponsor tool set? We find cyber weapons made by creators of stuxnet, duqu, flame. Kaspersky calls Equation Group. We follow Equation Group traffic. We find Equation Group source range. We hack Equation Group. We find many many Equation Group cyber weapons. You see pictures. We give you some Equation Group files free, you see. This is good proof no? You enjoy!!! You break many things. You find many intrusions. You write many words. But not all, we are auction the best files."

[1] https://archive.ph/20160815133924/http://pastebin.com/NDTU5k...

super256 3 years ago

*EternalBlue

spdustin 3 years ago

Have you tried including parts of speech (for example, as bigrams and trigrams) as part of the features considered in your model? I’ve had great success with stylometry that goes beyond TF-IDF with bags of words; including grammar patterns was shockingly good.

(FWIW, it didn’t find my throwaways; my own model didn’t, either, because I knew that word choice wasn’t enough to avoid being outed by stylometry)

Edit: by bigrams and trigrams, I mean reducing words to their part-of-speech labels and using THOSE as word tokens. You’ll find that native English speakers have higher weights on some phrase construction patterns than, say, folks from Romania. TF-IDF is useful for these POS-grams (just made that word up) as well.
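
A minimal Python sketch of counting such POS-grams. It assumes the text has already been tagged; the tag sequence below is hand-written for "the cat sat on the mat", and in practice a tagger such as NLTK's `pos_tag` would supply it. This shows plain counts only, not the TF-IDF weighting:

```python
from collections import Counter

def pos_ngrams(tags, n):
    """Count n-grams over a sequence of part-of-speech tags."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

# Hand-written tags for "the cat sat on the mat" (illustrative only).
tags = ["DT", "NN", "VBD", "IN", "DT", "NN"]
bigrams = pos_ngrams(tags, 2)
trigrams = pos_ngrams(tags, 3)
```

Because the tokens are tags rather than words, two authors writing about unrelated topics can still end up with similar counts if they build sentences the same way.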

costco 3 years ago

> Edit: by bigrams and trigrams, I mean reducing words to their part-of-speech labels and using THOSE as word tokens. You’ll find that native English speakers have higher weights on some phrase construction patterns than, say, folks from Romania. TF-IDF is useful for these POS-grams (just made that word up) as well.
That is a very good idea and when I update the site that will almost certainly be included :) Any other tips? Been reading papers for ideas and I think I may have to ditch the cosine similarity and go for something fancier soon. Thank you

zxcvbn4038 3 years ago

How long until this becomes the algorithm for a dating site?

“Find hot single women who write just like you”

forgotpwd16 3 years ago

Wouldn't be surprised if dating sites already used similar algorithms.
- bornfreddy 3 years ago
  
  Wouldn't be surprised if most of the women on a specific dating site had very high similarity scores.
- dysoco 3 years ago
  
  Do dating sites really use clever algorithms to match up people together? I was under the impression that, the less likely you are to meet your perfect match, the more you're going to use the app.
  In my experience I don't see a relevant list of potential matches aside from gender and age preference, it's all completely random, even frequently I see people outside the settings I've specified (i.e. men or older women).
nrp 3 years ago

This seems like a great way to hire freelance copywriters/ghost writers too. I would absolutely hire someone I knew could match my tone well for writing generic unattributed copy.

interroboink 3 years ago

This is one reason why I like legal doctrines such as "beyond a reasonable doubt." Even a 0.9 match in a tool like this could be a coincidence, if there are millions of users. But that won't stop people from casually believing "aha it must be an alt account", based on some anecdata.

It's so easy for something like this to be turned into a tool for a witch hunt, targeting innocents.

costco 3 years ago

But a 0.8 or 0.9 match and something like Tor usage could be enough to justify a warrant. That's why I'm not sure I want to open source the code because I don't want to normalize this.
- yyt554 3 years ago
  
  Keep in mind the potential to create false accusations by fabricating similar looking accounts.

psychphysic 3 years ago

Hmmm, doesn't seem to work. But you have convinced me (and many others?) to search for our alts consecutively, so now you do know who has alts?

ufmace 3 years ago

I wonder what's a reasonable threshold for "probably the same person". I've never had an alt on HN, and when I searched myself, it found 3 other users above 0.6, none of whom I've ever heard of before.

costco 3 years ago

If it's >0.9 you can almost guarantee it's an alt, but I've seen certain matches at 0.6. The problem is writing styles change over time. Another idea I had was converting the scores, which are just cosine similarity scores, into percentiles (so 0.99 would be 99th percentile of certainty) to make them more human interpretable.
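
A minimal Python sketch of the percentile idea: rank a score against a reference set of scores and report the share of the reference at or below it. The reference scores below are made up; a real version would presumably use the site's own pairwise scores.

```python
from bisect import bisect_right

def make_percentile(reference_scores):
    """Build a function mapping a raw score to its percentile within reference_scores."""
    ordered = sorted(reference_scores)

    def percentile(score):
        # Share of reference scores that are <= score, as a percentage.
        return 100.0 * bisect_right(ordered, score) / len(ordered)

    return percentile

# Hypothetical reference distribution of cosine similarity scores.
to_percentile = make_percentile([0.1, 0.2, 0.3, 0.4, 0.5])
```

With this toy reference, `to_percentile(0.3)` is 60.0 because three of the five reference scores are at or below 0.3.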
- bonzini 3 years ago
  
  The people at 0.4-0.6 with me do share some interests. That's cool on its own.
- throwup 3 years ago
  
  I make new accounts every so often and the accounts of mine that it found have a score of around 0.3. I'm not actively trying to defeat stylometry but it's possible I just have a particularly unremarkable writing style.
  
  xwolfi 3 years ago
  
  Well I must be stereotypical myself because it found me at 0.8 !
- forgotpwd16 3 years ago
  
  >The problem is writing styles change over time.
  It would be interesting if we could plot the writing style divergence over time.
- throwdbaaway 3 years ago
  
  I got matched with my old account with a score of only 0.45
dotancohen 3 years ago

Interesting. The highest non-me account is under 0.4 on my page. I do not believe that I have such a unique writing style - especially since half my posting is on mobile and therefore possibly slightly different than my desktop posts.
- dwringer 3 years ago
  
  My closest is 0.4879. I know I tend to be wordy but I thought I had a pretty generic style as well. This is definitely a fascinating demonstration.
  
  drdec 3 years ago
  
  Feeling better about my high of 0.49 now
MBCook 3 years ago

I have no alts. The highest match for me is about 0.66.
pyb 3 years ago

0.6 is not high enough to indicate an alt

stavros 3 years ago

Oh wow, it's really sure that I'm stavrosk, which I am:

https://stylometry.net/user?username=stavros

The next person is 30% less certain, that's huge! This would basically identify any alt I might have with near certainty.

jvolkman 3 years ago

stavrosk doesn't have any posts/comments? What's it using to match?
- stavros 3 years ago
  
  It's my old username.
  
  costco 3 years ago
  
  Huh... seems there are some inconsistencies between what's presented on news.ycombinator.com and the Firebase API. Glad it matches for you though :)
  
  stavros 3 years ago
  
  I guess they just didn't go back and reparse, not a big problem. I don't think people change their username frequently :P
rogual 3 years ago

Funny thing is, it thinks I'm you, but it doesn't think you're me!
https://stylometry.net/user?username=rogual
I'd have thought this stylometry thing would be commutative.
- stavros 3 years ago
  
  I guess it's a multidimensional space, so you can have someone closer to you than me, but they aren't also closer to me than you. Basically, they're close to you, but on the "other side" of me, I guess?
  
  yyt554 3 years ago
  
  Don't need multiple dimensions for that.
  0.1, 0.2, 0.3, 1.0, 2.0
  To 2.0, 1.0 is closest.
  To 1.0, 0.3, 0.2 and 0.1 are closer.
  
  rogual 3 years ago
  
  Thanks, seems obvious when you put it like that.
- yyt554 3 years ago
  
  The word you are looking for is "symmetric".

4qz 3 years ago

This is an evil website. We won’t have any anonymity soon. The highest match is my years old banned account that I forgot about. Where did you get the data from?

woodruffw 3 years ago

HN has an Algolia-based API. It’s also very easy to crawl.
I wouldn’t call this evil, however: it’s merely demonstrating a technique that you should be aware of, if you’re a privacy-conscious person. It looks like they also provide some resources for avoiding stylometric detection.
- nanidin 3 years ago
  
  I would bet my bottom dollar that the likes of Reddit and Google already have models to turn a corpus of text into probable demographic data and models to measure the similarity of users.
JadeNB 3 years ago

> This is an evil website. We won’t have any anonymity soon. The highest match is my years old banned account that I forgot about. Where did you get the data from?
I'd way rather have someone tell me "look at all the things I can find out about you" so that I can act accordingly (whatever that means!) rather than what we've mostly actually got, which is companies silently exploiting my data and doing everything they can to mumble reassuring but legally ineffective formulas assuring me that they deeply respect my privacy.
costco 3 years ago

HN Firebase API. I just wrote a program in C++ with libcurl to get https://hacker-news.firebaseio.com/v0/item/1.json, https://hacker-news.firebaseio.com/v0/item/2.json, https://hacker-news.firebaseio.com/v0/item/3.json, ...
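
The same walk over item IDs, sketched in Python with the standard library instead of C++ and libcurl. Only the URL pattern comes from the comment above; the function names are made up, and the `None` case assumes the API answers with JSON `null` for some IDs.

```python
import json
import urllib.request

BASE = "https://hacker-news.firebaseio.com/v0"

def item_url(item_id):
    """URL of a single item in the HN Firebase API."""
    return f"{BASE}/item/{item_id}.json"

def fetch_item(item_id, timeout=10):
    """Download one item and return the parsed JSON (None if the body is JSON null)."""
    with urllib.request.urlopen(item_url(item_id), timeout=timeout) as resp:
        return json.load(resp)
```

Calling `fetch_item(i)` for i = 1, 2, 3, ... reproduces the crawl described above; a real crawl would also want retries and some rate limiting.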
- jonas-w 3 years ago
  
  Why didn't you use the google bigquery?
  https://news.ycombinator.com/item?id=10440502
  
  costco 3 years ago
  
  I was aware there was a HN dataset on BigQuery but I had never used a library to work with it before and when I played around on the website the posts I got were all from 2015 at the latest. It probably would've made my work easier but there's not really anything I can do about it now.
ufmace 3 years ago

I don't know that I'd call this evil. We have no idea who else is using this kind of technology but not making the results public. Better to know what's possible and take measures to make it less effective.
faeriechangling 3 years ago

It’s just statistics. I recall that during his whistleblowing, Snowden intentionally took anti-stylometry measures.
weinzierl 3 years ago

Please don't shoot the messenger. costco shared this voluntarily and I can see no bad intention.
We should see it as an opportunity to learn how easy it is to associate different pseudonymous accounts. Nothing drives this point home better than a practical demo.
We can be pretty sure stylometry is used widely by bad actors already and we should not punish people who help to spread the word about these technical possibilities.
- ghaff 3 years ago
  
  And this is actually quite a simple approach--which is interesting in and of itself. While there would be diminishing returns, there are a ton of other techniques you could use to make stronger inferences about similarity.
vfinn 3 years ago

Imagine using this across different platforms :/, let alone using different techniques in addition...
edit: maybe you'd catch some criminals if you tried to match reddit against dark web for example

schappim 3 years ago

Interesting that the Op doesn't come up in the search: https://stylometry.net/user?username=costco

Aachen 3 years ago

Their first comment and submission were 4 hours ago. Text on the page is accurate it seems.
Beltalowda 3 years ago

Not surprising considering the account had no activity before today.

macintux 3 years ago

My nearest match is only at 0.406. It'd be interesting to see who the most unique commenters are, but it's also quite possible it wouldn't be flattering.

pubby 3 years ago

0.35 is my nearest. In hopes of lowering it even further, here are some nonsensical opinions never expressed on HN before: 1) Programming peaked with COBOL 2) Paul Graham is responsible for 90% of SIDS cases 3) There's no reason to use car when cdr exists.
joisig 3 years ago

0.2506 is my nearest match
- costco 3 years ago
  
  That's the lowest I've seen yet. You must write uniquely :)

chriskanan 3 years ago

I have no alternative accounts besides making a single throwaway account to post one "Ask HN" five years ago, but I have a decent number of matches above 0.5. I think this is due to the relatively uniform style of "who is hiring" posts, since my matches wrote those in a similar way for other companies. I made many of those over about two years when I was at a start-up.

Yeahsureok 3 years ago

On the how to avoid section: Isn't running comments through a randomised translator a few times then back considered a countermeasure also?

Also think it's probably poor form to list users as examples without their permission.

costco 3 years ago

> On the how to avoid section: Isn't running comments through a randomised translator a few times then back considered a countermeasure also?
Yes.
> This may be out of line but isn't pg on here with a different username, Levenschtein distance of one that's not included? Or is that just a very motivated 13yo account who writes a lot of admin-esque comments.
What other pg account are you referring to? I want to see it so I can see what my algorithm missed.
> Also think it's probably poor form to list users as examples without their permission.
You're right. I'll remove that - I just wanted some examples especially for people on phones who don't feel like typing. Thanks for the feedback.
jacooper 3 years ago

> However, using automated methods like machine translation services do not appear to be a viable method of circumvention.
https://www.whonix.org/wiki/Stylometry

Arathorn 3 years ago

It found my old account (ara4n; i lost the password) at 0.63. More amusingly it found my cofounder too, who hardly ever posts here (at 0.48)

SevenNation 3 years ago

> ... This site works primarily by analyzing for each user the frequencies of the most common words and phrases in the English language. Accordingly, the easiest way to avoid being identified is to simply use different words than you ordinarily would when writing. More sophisticated models than the one I made can use punctuation, comma usage, and capitalization to identify you so try alternating those as well. Services like Quillbot can help you with this but depending on your circumstances you may not want to send your writings to a third party service.

HN offers many other threads which could be tied together, including:

- time of posting

- ratio of replies to top-level comments

- comments being mainly upvoted or downvoted

- sentiment (mostly angry, dismissive, questioning, etc.)

- most common topics (keyword analysis of post being replied to)

- ratio of new posting to post replies

- first-to-comment on a post

- lone comment on a post

- etc...

It seems very likely that sooner or later every pseudonym for posting content will get discovered and linked. The lesson here is don't post anything that would cause you undue shame or harm if linked directly to your legal name.
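
A few of the signals listed above are cheap to compute. Here is a minimal sketch, assuming each comment is available as a dict with the hypothetical keys `time` (Unix seconds), `parent_is_story` (true for a top-level comment) and `score`. None of these field names come from the site or from any particular API; they are only there to make the example concrete.

```python
from collections import Counter
from datetime import datetime, timezone


def behaviour_features(comments):
    """Summarise non-textual habits of one account.

    `comments` is a list of dicts with the hypothetical keys
    'time', 'parent_is_story' and 'score' (assumed, not a real schema).
    """
    n = len(comments)
    if n == 0:
        raise ValueError("need at least one comment")

    # Share of comments posted in each UTC hour (time of posting).
    hours = Counter(
        datetime.fromtimestamp(c["time"], tz=timezone.utc).hour for c in comments
    )
    hour_hist = [hours.get(h, 0) / n for h in range(24)]

    # Ratio of top-level comments to all comments.
    top_level = sum(1 for c in comments if c["parent_is_story"])

    # Share of comments that ended up below a score of 1.
    downvoted = sum(1 for c in comments if c["score"] < 1)

    return {
        "hour_hist": hour_hist,
        "top_level_ratio": top_level / n,
        "downvoted_ratio": downvoted / n,
    }
```

Two accounts could then be compared on these features in addition to word frequencies, which is part of why combining signals is worrying.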

bhaney 3 years ago

Well now I'm self conscious about my closest match being an 0.34 when so many other people are reporting much closer matches with accounts that aren't alts. Do I write weirdly?

spapas82 3 years ago

Same for me, the closest match is 0.36. But I expected that because I don't speak english very well so the pool of candidates is small.
klohto 3 years ago

0.36 here! Out of curiosity, are you a native speaker?
- bhaney 3 years ago
  
  I am, yes.
  
  quink 3 years ago
  
  0.39 for myself, I’m a non-native speaker.
CobaltFire 3 years ago

My closest is 0.40, so I’m right there with you.
Native English speaker as well.
nephanth 3 years ago

.31 here! I'm a non-native speaker tho, so it wouldn't surprise me if I had weird speaking habits

operator-name 3 years ago

What does the bold signify? For example when I search for dang (https://stylometry.net/user?username=dang) the 4th most likely user is not bold whereas the 16th is?

costco 3 years ago

Say you see user2 listed in bold on user1's page. That means that user1 is also in user2's top 20 users. In my experience it is often an indicator of a good match (but not always).
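
The bolding rule described above can be sketched as follows. This is only an illustration: the `scores` structure (a dict of dicts holding similarity values) and the function names are assumptions, not the site's actual code.

```python
def top_k(scores, user, k=20):
    """Return the k users most similar to `user`, best match first.

    `scores` maps each user to a dict of {other_user: similarity}.
    """
    others = scores[user]
    return sorted(others, key=others.get, reverse=True)[:k]


def bolded(scores, user, k=20):
    """Users in `user`'s top k who also have `user` in their own top k."""
    return [u for u in top_k(scores, user, k) if user in top_k(scores, u, k)]


# Tiny example with k=1: b is a's best match, but b's best match is d,
# so b would be listed on a's page without being bold.
scores = {
    "a": {"b": 0.6, "c": 0.5},
    "b": {"a": 0.6, "d": 0.9},
    "c": {"a": 0.5},
    "d": {"b": 0.9},
}
print(bolded(scores, "a", k=1))  # []
print(bolded(scores, "b", k=1))  # ['d']
```

Even when the similarity score itself is symmetric, membership in a top-k list is not, which is why a 4th-ranked user can be plain while a 16th-ranked one is bold.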
- operator-name 3 years ago
  
  Huh, that's a somewhat non intuitive property.
  
  silasdavis 3 years ago
  
  It is a bit, but if stylometric equality was a thing you'd expect it to be symmetric, so if stylometric similarity is a thing....

mygentys 3 years ago

And this is why I’m a reader and not a poster on HN :)

The second that I found out that requesting deletion of an account and its posts needed a MANUAL request to a single user (dang) I noped out so fast

But happy that the rest of you are still happy to contribute :)

ggerganov 3 years ago

I really liked the informative and straight-to-the-point about page - describing how the algorithm works in a way that is easy to understand. All the important details are summarised there. Well done!

Edit: From the "How to avoid .." page, there is the following sentence:

> Also, most authorship identification algorithms have poor accuracy when working with small amounts of words. This means the optimal strategy would be discarding an account either after every comment or after a small number of comments. Unfortunately, this is against HN rules and may result in a ban.

Can you clarify what this means and why it would result in a ban?

costco 3 years ago

> Can you clarify what this means and why it would result in a ban?
I have seen dang respond to users multiple times asking them to stop making new accounts especially but not always if it's to avoid rate limiting. I don't know if there's an official policy but it's definitely something I recall.
krisoft 3 years ago

> Can you clarify what this means
Imagine that for every new comment you want to post you would create a brand new account which you would use precisely once and never again. Then the stylometry would have just a few words and wouldn’t have enough corpus to get a reliable signature. If a lot of people do this it would be hard to figure out which account belongs with which human. (Of course if you alone do this, your messages will stick out like a sore thumb. See xkcd 1105.)
> why it would result in a ban?
Because this practice is especially discouraged in the guidelines: “please don't create accounts routinely. HN is a community—users should have an identity that others can relate to.”
- stupendous_luck 3 years ago
  
  At the same time, HN doesn't let you delete comments.
  Maybe with some GDPR magic.
  
  krisoft 3 years ago
  
  Not sure what is your point, or how does that connect with my comment. Care to elaborate?
  
  stupendous_luck 3 years ago
  
  Your comment quotes an HN guideline, and my point relates to it. Some users may feel the need to create throwaway accounts in order to post comments that in an alternative reality they could post under their primary account and later delete if desired. It may not stop a scrupulous collector of data, but such a scenario may not be the object of their worry.
  Drawing this into the logical conclusion, a user may opt to always post under a throwaway account, to avoid any possible tainting associated with a primary account.

Bhurn00985 3 years ago

Just a heads up that for everyone who doesn't like to link their alt accounts, maybe not use this tool to see if it works.

Unless the author would run this against all HN user accounts, no need to flag the ones "of interest".

StrangeDoctor 3 years ago

Have you done any data analysis on distributions of similarity? How similar you'd expect any 2 people to be given English focused around tech? Or any other interesting stats you'd like to share?

Very nice clean site, great work.

wizofaus 3 years ago

What match level would you expect to see between two randomly chosen individuals?

bumble_bee900 3 years ago

It's accurate enough that I had to create a new account now :)

I guess it's difficult to evade it as the word frequency certainly catches everything about the countries I frequently refer to, programming languages, interests etc.

culi 3 years ago

Similar to how they make adversarial fashion[0][1] in order to not be tracked by face id AI, I wonder if we can make adversarial stylometry tools to run your comments through in order to anonymize it

.. [0] https://hackaday.com/2022/10/20/render-yourself-invisible-to...

.. [1] https://adversarialfashion.com/

carewell 3 years ago

OP links to a paraphrasing tool on their website.

nostylometry 3 years ago

This is absolutely bonkers. I tried it with my alt and it got my original correct! So I'm writing this comment with a fresh account which hopefully will not get correctly linked too lol

lettergram 3 years ago

Did something similar in 2018 (still running locally) which could unmask anyone

https://twitter.com/austingwalters/status/104189476543920128...

Made both Metacortex.me and insideropinion.com

The idea being you don’t actually need an active directory. It would drop in, figure out all the users (provided one account was on the AD) and would monitor everyone’s skill sets, morale, schedule, etc. Worked super well for what it was / is.

woodruffw 3 years ago

Neat work!

Out of curiosity: do you filter sentences that begin with ‘>’, indicating a block quote from another user? That might improve the accuracy a little here, if you don’t already.

costco 3 years ago

Yep!
- robocat 3 years ago
  
  Perhaps explain in the about what you filter out? Along with what the bolding means?! Do you filter out anything else (like spaced/indented/monospace text/code, or even quoted text, which is often not written by the user?). Super thanks for this - interesting!
  
  costco 3 years ago
  
  Turns out that there may have been some glitches in the way I was filtering lines beginning with >. For explanation of bolding see https://news.ycombinator.com/item?id=33755466. I didn't attempt to filter anything else out though filtering out code would probably help a little bit.
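
For reference, the kind of filtering discussed here could look like the sketch below. It is not the site's code: the function name is made up, and dropping indented lines is the suggestion from the parent comment rather than something the site does. It relies on HN rendering lines indented by two or more spaces as code.

```python
def strip_quotes_and_code(comment):
    """Keep only the lines the commenter most likely wrote themselves."""
    kept = []
    for line in comment.splitlines():
        # Quoted text: '>' at the start, possibly preceded by whitespace.
        if line.lstrip().startswith(">"):
            continue
        # Code or verbatim text: non-empty line indented by two or more spaces.
        if line.startswith("  ") and line.strip():
            continue
        kept.append(line)
    return "\n".join(kept)


sample = "> quoted\nmy reply\n  code()\n >also quoted\nmore"
print(strip_quotes_and_code(sample))
```

Calling `lstrip()` before the `>` check also catches quote markers that are preceded by a stray space, which a plain `startswith(">")` on the raw line would miss.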

pkos98 3 years ago

Sorry dang, aka sctb: https://stylometry.net/user?username=dang

Macha 3 years ago

In this particular case, it seems to be picking up the stock moderation responses as it looks like sctb was a moderator account until 2019.

nwiswell 3 years ago

I don't have an alt but it would be cool to meet my stylometry-neighbors. I'm curious whether the writing similarity translates to oral communication too

DrStrangeLoop 3 years ago

I tried dang's old account (gruseom) expecting to see his dang account listed. Nothing. Tried dang, sctb (a previous admin) was listed as closest match.

I wouldn't rely on these results

https://stylometry.net/user?username=gruseom

https://stylometry.net/user?username=dang

pvg 3 years ago

> I wouldn't rely on these results
You picked a user who posts a massive volume of repeat, template-y comments and found their former colleague who also posted piles of repeat, template-y comments, that being part of both of their jobs.
- DrStrangeLoop 3 years ago
  
  There are a few close matches to dang's style of template-y comments in the results. Afaik none of the listed accounts are Daniel.
  I picked dang as he is the figurehead of hn, and didn't want to inadvertently reveal some other user's identity.
  
  dragonwriter 3 years ago
  
  > There are a few close matches to dang's style of template-y comments in the results.
  At least the #1 close match (sctb) was a comoderator with dang, so they were kind of alts as the official voice of HN.

antirez 3 years ago

Writing "antirez" shows accounts with Spanish names (none is mine). I guess Italian and Spanish speakers write English very similarly, but on HN there are a lot more Spanish speakers than Italian ones so that's what I get.

costco 3 years ago

It seems the accuracy for nonnative speakers is not nearly as good as it is for native speakers. The algorithm could definitely use some work.

jefftk 3 years ago

Tried my account thinking "I don't have any alts" but it turns out I do! In 2018 I changed my username from "cbr" to "jefftk" and it pulled that right up: https://stylometry.net/user?username=jefftk

jl6 3 years ago

Rebrand it as a soulmate-finder?

chronogram 3 years ago

Well done, it found my ancient old account.

nvr219 3 years ago

I only got 0.9999999999999992 for myself :(

noncoml 3 years ago

Naturally Born Imposter

nr2x 3 years ago

Honeypot to see what accounts are tested in sequence?

;-)

costco 3 years ago

I turned off nginx logging if that makes you feel any better. Of course there's no way for you to verify that because I'm just a random guy on the internet but I will tell you that I am a civic minded citizen who is concerned about privacy and the Internet.
- nr2x 3 years ago
  
  Only half kidding, but if I were state intel it’s what I’d be doing. :D

CrypticShift 3 years ago

Ingenious idea. At the very least, this is just about finding people who write like us, the same way we seek those with similar tastes (music...)

How long before large commercial indexers start offering an efficient (AI based ?) stylometry to agencies and states ?

wait... do you think the NSA is already doing this?

A4ET8a8uTh0 3 years ago

They would be silly not to ( apart from creepish profiling of an entire globe population you also get to potentially identify bots ). We all have mannerisms that can easily 'betray us' online. I honestly thought my writing style is more unique, but as it turns out it is somewhat common.
- CrypticShift 3 years ago
  
  > I honestly thought my writing style is more unique
  You just showed another possible use case for this kind of tools: "How unique is my writing style ?"
- sitkack 3 years ago
  
  It isn't writing style, but more of phrase selection. If you lean on the same phrases (n-grams), then you will be very very close in a high dimensional space. Colloquialisms are the biggest tell, you should eschew them.
woodruffw 3 years ago

Stylometry is an old hat technique; you can assume that intelligence services around the globe regularly apply it.
(Statistical stylometry is a little newer and more rigorous than manual stylometry, which essentially involved a human being's judgement call around the similarity of documents.)
- CrypticShift 3 years ago
  
  What about "deep learning" stylometry ?
  
  woodruffw 3 years ago
  
  I don't know, but it wouldn't surprise me if someone has tried to apply ML to stylometry. Statistical stylometry is already pretty effective, as demonstrated by this site.
  
  nephanth 3 years ago
  
  https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=deep...
  Yields some results
  This one seems pretty interesting
  https://www.mdpi.com/2227-7390/10/5/838

garbagetime 3 years ago

Site down? I'm keen to see if it catches my alts.

tinodb 3 years ago

Same here, 502 consistently.
- costco 3 years ago
  
  Apologies for the downtime. Something crashed while I was asleep, should be working now. Not really sure how because the log indicates that uwsgi "gracefully exited," but I'm looking into it.

dibt 3 years ago

Since it looks for similar word usage, false positives seem to appear more often when specific topics are talked about, like stocks or crypto.

Does this ignore stop words? Or do all words have the same weighting? I wonder if only focusing on stop words would give a more accurate measure. Maybe we are more comfortable with certain stop words more than others?

https://en.wikipedia.org/wiki/Stop_words

"Stop words are the words in a stop list (or stoplist or negative dictionary) which are filtered out (i.e. stopped) before or after processing of natural language data (text) because they are insignificant."

costco 3 years ago

All words have the same weighting. I don't ignore stop words, in fact most of the ngrams I use are composed almost entirely of stop words. Maybe it'd be more effective if I ignored them.
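
To make the idea concrete, here is a minimal sketch of an unweighted n-gram frequency comparison that keeps stop words. The thread only confirms that the site uses frequencies of common words and phrases with equal weighting. Everything else here is an assumption for illustration: the tokenizer, the vocabulary size, the use of 1-grams and 2-grams, and cosine similarity as the score.

```python
import math
import re
from collections import Counter


def ngrams(text, n_max=2):
    """Lower-cased word 1-grams up to n_max-grams, stop words included."""
    words = re.findall(r"[a-z']+", text.lower())
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return grams


def build_vocabulary(texts, size=100):
    """The `size` most common n-grams across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text))
    return [gram for gram, _ in counts.most_common(size)]


def profile(text, vocabulary):
    """Relative frequency of each vocabulary n-gram in `text`."""
    counts = Counter(ngrams(text))
    total = sum(counts.values()) or 1
    return [counts[gram] / total for gram in vocabulary]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


texts = {
    "user_a": "the cat is on the mat and it is happy",
    "user_b": "the dog is on the rug and it is sad",
    "user_c": "quantum flux capacitors rarely malfunction",
}
vocab = build_vocabulary(texts.values())
profiles = {user: profile(text, vocab) for user, text in texts.items()}
print(cosine(profiles["user_a"], profiles["user_b"]))
print(cosine(profiles["user_a"], profiles["user_c"]))
```

In this toy example user_a and user_b overlap almost entirely through stop words and stop-word phrases ("the", "is on", "it is"), which is the kind of signal described above. Restricting the vocabulary to a fixed stop-word list would be a one-line change in `build_vocabulary`.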

MikePlacid 3 years ago

1. Interesting. I was kinda expecting to be grouped with other Russian speakers, and I am (based on some nicknames). Probably the frequencies of “the” and “a” are telling. But I swear to God that I sometimes spend some extra time trying to insert as much “the” and “a” in my texts as I could.

2. There is a Russian mnemonic verse, which can’t be properly translated to English, at least it’s beyond my humble capabilities. It goes:

“Это я знаю и помню прекрасно:

Пи многие знаки мне лишни, напрасны”

The number of letters in the words give you the pi number: 3,1415… The meaning is: “I know and remember perfectly: too many signs (positions) of pi are useless and impractical”. Sometimes it’s nice to remember both things.
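
The letter-count trick is easy to check mechanically. A small sketch that counts the letters of each word in the verse quoted above:

```python
import re

verse = "Это я знаю и помню прекрасно: Пи многие знаки мне лишни, напрасны"

# \w+ matches runs of letters (including Cyrillic) and skips punctuation.
digits = "".join(str(len(word)) for word in re.findall(r"\w+", verse))
print(digits[0] + "." + digits[1:])
```

This prints 3.14159265358, which matches the first eleven decimal places of pi.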

Trouble_007 3 years ago

Nice work! Thank you, of course I plugged in the obvious HN usernames

Edit to add;

Would be nice to have the https://news.ycombinator.com/user?id=username links included.

Trouble_007 3 years ago

And perhaps rounding to 3 or 4 decimal places?

jimhi 3 years ago

Amazing and I thought my doxxing tool was terrifying - https://news.ycombinator.com/item?id=32278871

I am afraid to combine all these methods

lijogdfljk 3 years ago

Yea.. i guess it's time to stop bothering with alt accounts/etc. I'll just make one account, maybe differently named on different services (makes scraping just a _pinch_ easier) but aside from that all i can do is modify/remove old posts.
Bit of a shame for useful posts/discussions.. but the internet is getting really.. finger print laden.

elteto 3 years ago

Incredible! There was a very active throwaway account here a while back that I always enjoyed interacting with. I suspected the person had more than one account and this found one that is incredibly close, down to the topics.

WaitWaitWha 3 years ago

I checked a few random user names and I am confused.

- Why is the author costco[0] not in this lookup?

[0]: https://stylometry.net/user?username=costco

Aachen 3 years ago

- Their first comment and submission were 4 hours ago.
- The text on that page is accurate it seems.

weinzierl 3 years ago

I played a little bit with it and it is baffling how well it finds accounts of people that know each other in real life. So it's not only good for finding alternate accounts but could be used to find peer groups.

sitkack 3 years ago

Interesting, they are trading phrase-grams (just made that up) or lingo. That is really cool.

dibt 3 years ago

This doesn't seem to include text from submissions.

I ran it on Brian Armstrong's temp account from here, and it said it didn't write 10,000 characters:

https://news.ycombinator.com/item?id=3754664

EDIT: Or maybe it's something else because Brian only wrote less than 6k characters. But then why can my account be looked up?

Also, I would guess quoted replies are included, which muddies the analysis. Seems to be a very naive implementation. Much more can be done, but this was probably just a quick project.

costco 3 years ago

Quoted replies shouldn't be included unless there's a bug on my end. Submission text is not included though I probably should have.

lifeisstillgood 3 years ago

How much should we fear de-anonymisation ?

A lot of the discussion in the thread is about "how can we prevent this". I would like to know why we should not embrace this and similar technologies?

The benefits in my view are large - online behaviour tracks back to real life - and, epidemiologically speaking, the value of millions of test subjects across every question is enormous - from traditional medicine to "mass psychology recommendations"

I can guess some downsides (hiding from abusive exes) but am interested in studies, surveys, reports etc - any HN thoughts welcome

rejectfinite 3 years ago

>online behaviour tracks back to real life
This is good to you?
Okay, let's just make it like China or SK where your login is your citizen ID and if you write bad things the bad word police will take you away.
Also, no, I have no alts.
- lifeisstillgood 3 years ago
  
  So I am asking because my views are only challenged inside my own head, hence the need for external thoughts.
  But firstly the "governments will come and do bad things" argument - yes this is clearly and obviously a major problem - but not one solvable by technology in any way. Fixing violent dictatorships is an IRL problem - one that requires enormous effort and sacrifices (see Ukraine for an obvious example). We cannot pretend that a browser extension or a ground up rewrite of Twitter will defeat Putin or would have stopped Hitler.
  As for "free" countries (something like 120+ have open free elections), we still have online abuse for voicing opinions that some people don't like (anything from pro/anti Trump to LGBT and bitcoin etc). Those are real consequences but rarely government inspired and honestly I suspect we need better support for police in prosecuting such things - I mean a death threat is a death threat.
  In general my view seems to be we should have the same protections online as we do offline - and if those protections are "in theory only" that requires us to use our voting and other political power to change it - not to obfuscate IP addresses or so on.
  The upside of tech is so great it is worth spending IRL to defend against the downsides
  
  rejectfinite 3 years ago
  
  I am of the generation and mindset that online abuse is not real. Straight up. Log out, turn off the screen and watch Netflix, take a walk and calm down, block the offending user. It's not real.
  >I suspect we need better support for police in prosecuting such things
  We do see that! But mostly people on Facebook. Here we have had judgements of people who posted threats on Facebook because it is tied to your real name.
  And yes, abuse is part of the "fun". Under your system, my 10-year-old League and CoD chats would have me locked up.
  >I mean a death threat is a death threat.
  Is it? I would find it more concerning if someone on the street tells me he is going to kill me than a kid on xbox live.
  NOW there is a difference in systematic stalking and harassment online if I would get bombarded with DMs and messages to kys. I don't know how to solve. But a one-off comment is NOT equivalent. Then it feels like I'm just old? At 31? Is it really so serious?
  
  lifeisstillgood 3 years ago
  
  This is almost certainly going to be decided by the "reasonable person" test - and if you were on the jury it's going to have to be a higher bar than I, but I suspect there will be some offences we will both agree on.
  My main point is not that we need to lock up everyone who makes a threat, but that we as a society will have to adjust our standards to the new normal.
  Once upon a time every conversation was fleeting, every discussion in a pub or bar was ephemeral. Even Einstein and Dirac would walk home chatting without fear of being overheard. Then someone imagined it would be wonderful for the whole world to hear the erudite wisdom of those two geniuses of our age - and Facebook and Twitter and social media made it possible for every conversation in every bar to be captured and recorded and published - and we found out that Dirac and Einstein were just sledging each other and most other conversations globally were worse.
  The new normal is that, like speeding, most evenings, conversations in most bars actually broke quite a lot of laws, from hate speech to sexual threats and basic politeness. And now the police can hear them as can everyone else - and discretion does not work on this scale - we either enforce the laws or change them.
  That's a conversation for each judiciary - and likely to be either a balkanisation of the social media world, or a race to the top (we can all have Twitter as long as we all behave to the standards of the highest / politest society). I am not sure where I stand on that.
  Is it serious - hell yes. We are looking at a global technology with global benefits for all humankind - and if we want to communicate globally we need to agree what the standards for behaviour are on this virtual stage - from contract law to human rights and freedom of speech. We are inevitably going to build closer contacts - Brexit is a salutary lesson - and how we deal with freedom of speech online is just part of the jigsaw - but a telling part.
  
  msla 3 years ago
  
  > I am of the generation and mindset that online abuse is not real. Straight up. Log out, turn off the screen and watch Netflix, take a walk and calm down, block the offending user. It's not real.
  Until people can pierce the veil of your pseudonymity (which isn't all that hard depending on the platform and the person) and it isn't just online abuse and harassment anymore. "Tied to your real name" includes "tied to enough information about you that someone with plenty of free time can sift through various databases and piece it together" and most people have absolutely no idea how many such databases there are, and how much piecing someone can do.
  I'll say something tangential: Even if we both agree that one-off assholes are largely inconsequential, and I think we do, such assholery has a broken window effect on a platform, where people see all the assholes running free and decide that it's either a place for them to be assholes or a place they should stay away from to avoid assholes.
headhasthoughts 3 years ago

What could possibly be the harm in allowing people to harass others based on posts they made decades ago? What could possibly be harmful in making a person who for whatever reason has changed their online identity easier to track? What could be remotely harmful about allowing Marlboro to find the accounts of ex-smokers? What could be the harm in tracking underaged users site by site?
I'm sure this is completely harmless and will not harm society.
- lifeisstillgood 3 years ago
  
  I think this might be old age creeping up on me but I find it harder and harder to work backwards through "argument by sarcasm" to arrive at what you meant. I think clearly you are heartfelt in your views that having your identity online be a real one is bad - but I am not sure if that is because of posts you made years ago being linked back to you or nefarious advertising ?
  The old posts issue is interesting - do you mean that there are posts from years ago you would find upsetting to be linked to you? Is this because you have changed your mind (a normal process society needs to understand) or because you said things thinking you were anonymous that you would not have said under your real name? Far less of a social issue I think.
  It does make for some interesting thoughts if we made everyone post under their real name.
  
  headhasthoughts 3 years ago
  
  My view isn't that accounts tied to real people are bad. It's that your lack of ability to think of cases where what you propose could be harmful points to a total lack of critical thought on your part.
  The point that I am making is that it's incredibly easy to decipher why "track everyone under every identity they choose" can go wrong and lower the quality of discussion, and specifically, that it's so easy that the fact you can't think of a single reason why it's a bad idea to completely eradicate privacy is telling.
  If I can find an alt of yours saying that you've quit smoking and then push tantalizing ads to you, you're going to bring me a better return than blind-firing into the American public.
  If someone is looking for people who are easy to manipulate in borderline-illegal fashion (let's say, sex crimes), it's a cheat code if they see some throwaway account on HN comment on a post about the treatment of youth, "As a present high school student, I disagree with your statement because..." and track it back to a minor.
  
  lifeisstillgood 3 years ago
  
  I disagree that tracking leads to lower quality discussion - for example I know my name and identity is tied to this account and instead of responding to "lack of ...thinking" with an insult I am forced to come up with intelligent responses (now you try ... it's really annoying isn't it :-)
  I also explicitly ask for real life examples and studies of harm - I can imagine and create examples but I much prefer the real world to my imagination as a guide. We learnt that as basis of science.
  I also think there is a difference between privacy and secrecy. You seem to conflate the two - if your actions online were secret then advertisers would not send you smoking ads. Secrecy is probably impossible - privacy is merely the politeness of our neighbours. And at scale politeness is enforced - by social norms and sometimes legal measures. We are seeing this come in (GDPR) but it's hard to have legal enforcement before the social norms have arrived.
  On the smoking ad front, Gabriel Weinberg's main argument is that searching for "red men's trainers" should be enough to serve ads without having to know if I am a 20 something graduate in Wisconsin or a middle aged bloke in London. And I suspect he is right within a few percentage points.
  As for online grooming - yeah this is a huge danger. Every parent's nightmare. And still absolutely something that needs to be enforced in the real world. And may need extra police and social resources. But if we want to stop predators reaching out to vulnerable children then it requires co-ordination among many groups online and offline - funding, political will, training, education over many years.
  There will be no quick fixes for the problems tech is bringing - but I remain optimistic that the cost benefit ratio is worth it and that we can vote for and require change to defend against the dangers
  which takes me back to my point - what are the real world examples of dangers so we can make sensible policy
femto113 3 years ago

Fear it happening or fear its consequences? Doxxing already happens all the time, but the main tools are things like account names or image search; this sort of tool could take it to a new level. A simple experiment would be to run this same algorithm against another site (say Twitter or Reddit) and see if it can reliably pick out the same people's accounts there. Once anyone on the internet can quickly/easily draw that sort of connection it would require incredible diligence to avoid de-anonymization while still maintaining any sort of "real self" presence on the internet. How much we should fear the consequences probably depends a lot on how marginalized you are within your society, but since just revealing your gender is enough to invite harassment in many forums I'm not optimistic.

jaredsohn 3 years ago

Amusingly can't run it on the author since not enough comments

Wistar 3 years ago

I have only ever had a single account but it returned 19 possibles with no confidence above .54 but 11 bolded. My own account was listed at the top with a confidence of .9999.

Macha 3 years ago

Yeah, I have a bunch of bolded mutuals but none above 0.45. I think I have had one or two alts in the past, but probably they didn't make the 10000 word threshold for inclusion (nor can I remember their names to check if they work in inverse).

saurik 3 years ago

Why are some users bold?

srean 3 years ago

The non-bold are dead accounts I think
- saurik 3 years ago
  
  It isn't due to a mere property of the user, as, for example, cushman is not bold as the #2 result for tptacek but is bold as the #2 result for icambron.
  
  srean 3 years ago
  
  Good point.
  
  stavros 3 years ago
  
  FYI, the GP said above that bold usernames are those for which symmetry holds (ie they're both in each other's top 20).
costco 3 years ago

Say you see user2 listed in bold on user1's page. That means that user1 is also in user2's top 20 users. In my experience it is often an indicator of a good match (but not always). I should probably explain that on the site.
- layer8 3 years ago
  
  Instead of making it binary, you could use a gradient indicating the strength of the mutual correlation (like how HN colors downvoted comments).
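
One way such a gradient could be computed, as a rough sketch: score each pair by the worse of the two ranks, so a pair that is #1 on both pages gets full strength and a pair that only just makes both top-20 lists gets very little. The function name and the linear scaling are made up for illustration and are not anything the site does.

```python
def mutual_strength(rank_ab, rank_ba, k=20):
    """Strength in [0, 1] of a mutual match between users A and B.

    rank_ab: 1-based position of B on A's page, or None if B is not listed.
    rank_ba: 1-based position of A on B's page, or None if A is not listed.
    """
    if rank_ab is None or rank_ba is None:
        return 0.0
    worst = max(rank_ab, rank_ba)
    if worst > k:
        return 0.0
    # Linear falloff: worst rank 1 -> 1.0, worst rank k -> 1/k.
    return (k - worst + 1) / k


print(mutual_strength(1, 1))     # 1.0
print(mutual_strength(4, 16))    # 0.25
print(mutual_strength(1, None))  # 0.0
```

The result could then drive font weight or text opacity instead of a plain bold/non-bold switch.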

nickstinemates 3 years ago

The bias is interesting here.

https://stylometry.net/user?username=nickstinemates

Number 2 for me is someone I worked closely with for a few years, and then putting his name into this results in all of the people we worked with for a few years. So it seems content>style, or, we are all more alike than we thought.

lostmyacctoops 3 years ago

I'd be very curious to know if these algorithms can link very different types of text. I'm not surprised that my style is "derivable" on HN, but what if you included my slash-fic pieces, my research papers, etc, would it still "catch" me?

Also, talk about a chilling effect. I was already vaguely aware of this, and now I'm overthinking every word I'm thinking/typing.

jonnycomputer 3 years ago

I'm gathering that they just took a bag-of-words approach to this; basically comparing word frequencies. Writing across content types (fiction vs technical writing for example) will probably show different word frequencies, especially technical jargon, and so on. More sophisticated approaches are possible.
And yes, potentially very chilling. If you want to post truly anonymously, you might want to run your words through some kind of filter first.
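
For anyone curious, a minimal bag-of-words comparison could look like the sketch below. It is an assumption about the general idea (raw word counts plus cosine similarity), not the site's exact method, and the helper names are hypothetical.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count word occurrences (a plain bag of words)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


tech = word_counts("the kernel scheduler preempts the thread")
tech2 = word_counts("the scheduler preempts the kernel thread")
fiction = word_counts("she walked slowly along the moonlit shore")

# Word order is ignored, so the two technical sentences look alike,
# while the fiction sentence shares only "the" with them.
print(cosine_similarity(tech, tech2) > cosine_similarity(tech, fiction))
```

Restricting the vocabulary to very common function words would reduce the pull of jargon, which is the content-versus-style issue raised elsewhere in the thread.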

agumonkey 3 years ago

Oh god, that thing starts with direct focus on the search field, opening it showed a bunch of old nicknames, I thought it was the result of some study.

silasdavis 3 years ago

The top hit for me, though not a very high correlation (0.3 ish), is to my surprise someone I have met. I don't appear on their top 20 though.

godisdad 3 years ago

Can we find Satoshi with this?

drpancake 3 years ago

A few people have tried that e.g. https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1184...

Ros2 3 years ago

I interviewed years ago with someone who let me know that they use a pseudonym as an employee and their chosen name even got posted as the author for articles they wrote for the company. They were very concerned about their privacy.

I know their blog, which is their HN username, and this tool found their other account.

Perhaps ironically, this person stood out a lot because of this and I didn't forget them.

p4bl0 3 years ago

It's funny that I only match at 0.9999999999999982 with myself while all the other usernames I tried matched with themselves at 1.0 ^^.

srean 3 years ago

https://theuijunkie.com/myth-or-fact-did-charlie-chaplin-los...
- p4bl0 3 years ago
  
  Huhu

samwillis 3 years ago

Sticking myself in (I haven't ever had another account) my closest match (at 0.43) is the maintainer of an Open Source project which I have occasionally commented about. They are also British, as am I.

My guess is that as they commonly mention the project and I have on a number of occasions, that has formed the link. Plus maybe usage of common British terms, but that seems far less significant.

It's super interesting!

It would be good if there were more controls to filter the type of words and language that are used for the matching algorithm. So you could, say, exclude words not in the dictionary. I wonder how that would affect my link with this other person.

throwboi123 3 years ago

That’s why I always use throwaway :) everywhere. Reddit. HN. Twitter. Everywhere. I’ll spam every site with my throwaways.

Long live throwaways.

kaba0 3 years ago

That’s the point of this post, that you are not safe by throwaways at all, because all of your throwaways can be linked together purely by your textual style.
- SpelingBeeChamp 3 years ago
  
  No they can’t. If you only have a small amount of text to work with, stylometry is unreliable.

hobobaggins 3 years ago

> This means the optimal strategy would be discarding an account either after every comment or after a small number of comments. Unfortunately, this is against HN rules and may result in a ban.

Is it? I thought that it was ok to have throwaway accounts, as long as they're not specifically to avoid a ban or something like that.

ThrowSet 3 years ago

I find this tool to be disturbing. It is reality so I accept it. But I'm going to make an effort to change my style between accounts.

A question for the author (costco): You created that account in 2019 but you didn't post or submit a single thing until 4 hours ago. Why did you create an account almost 3 years ago for no purpose?

chucksmash 3 years ago

Alone out here in the 0.30s. The three times I've used a throwaway account, they've been for a single post on a single topic, so no surprise they did not get picked up by this analysis I guess.

Does a low correlation with other users imply higher susceptibility to de-anonymization if I were using alts regularly?

reducesuffering 3 years ago

Probably. It means your writing is more unique, so an alt would be another "very unique" account that is similar only to yours.

kevmo314 3 years ago

There's someone (michaelmior if you're around!) with a false positive 0.46 match to me.

Maybe we could be friends :)

bagels 3 years ago

Not sure if that is a false positive. It just lists the top 20 accounts ranked by similarity score. Under 0.8 or so is unlikely to be a 'positive'.

anpat 3 years ago

This needs to exclude Who's Hiring posts because it confuses me with a few of my wonderful former colleagues!

seydor 3 years ago

Well the only solution is to have so many alts that nobody can believe you can possibly have that many

throwaway5434q 3 years ago

Wow. This is insane, it found my old accounts. So throwaway obviously (because I'm a bit of an asshole) but this really is amazing. It also highlighted another account that's not me, but looking through their comments I don't see any resemblance to me either.

gavinray 3 years ago

I've complained a lot about Haskell and now it thinks I like Haskell =(

Needs sentiment analysis IMO, otherwise you'll get "Here's a bunch of people who are JUST LIKE YOU", except they use a similar grammar style but hold opposite opinions on the same nouns.

ahmedalsudani 3 years ago

Serves you right for disparaging The One True Language!
Ok, fine, we'll present Idris with a fig leaf.
layer8 3 years ago

It just thinks you engage a lot with Haskell. These are people with whom you have something to talk about. :)

soneca 3 years ago

I have two accounts. This one, “soneca”, that is my first one and most active by far, and another one that I use sometimes mostly for Show HN and few comments.

When I searched the other one, “soneca” was the first guess, with 0.4.

But when I searched “soneca”, the other one was not in the top 20.

laurex 3 years ago

Those interested in the implications of this kind of analysis might enjoy the book The Secret Life of Pronouns http://secretlifeofpronouns.com/

iHateStylo 3 years ago

Thank you for this.. I thought I was being careful but evidently it's not enough. It found 13 of my previous accounts with the topmost being 0.4937 and lowest one being 0.3616 bold. All the bold ones were right, some correct matches weren’t bold.

notacoward 3 years ago

Seems pretty spot-on to me. I tried it with two accounts I was already certain were alts - based on other factors like favorite topics and common enemies as well as style/tone - and the top hits for both were the ones I would have expected.

joshstrange 3 years ago

Very interesting, .59 is my lowest, .64 is my highest match, none of these accounts are one of my alts. Though to be fair the handful of times I've used a throwaway I used it for a single comment so I didn't give it much to go off.

SnowHill9902 3 years ago

Anything like this for Reddit?

Would translating to another language and back defend against this algorithm?

costco 3 years ago

> Anything like this for Reddit?
No but it would be easily adaptable especially given that Pushshift is archiving every Reddit comment. Based on some of the feedback I'm getting here I don't know if I should open source this even though it really wasn't that hard to make.
> Would translating to other language and back defend against this algorithm?
Yes. But then you have to send your original comment to a translation company so there are privacy concerns there too.
- EMIRELADERO 3 years ago
  
  > Based on some of the feedback I'm getting here I don't know if I should open source this even though it really wasn't that hard to make.
  I'd say you should. I'd rather see this as being publicly and freely available to everyone rather than some shady "Big Tech" analytics company.
  If the "weapons" exist, I would feel more comfortable knowing everyone can access them, not just an elite that can use it for their own (selfish) purposes.
  
  A4ET8a8uTh0 3 years ago
  
  I am genuinely torn, because my initial reaction was almost the exact opposite, but the comparison to a weapon does ring true. And there is indeed an argument to be made for level playing field. At the very least, maybe counter-measures can be developed.
  
  Terretta 3 years ago
  
  People don't usually understand privacy risks till their own curtains fall down.
- operator-name 3 years ago
  
  I wouldn't worry about that too much as someone's already done something similar for reddit (https://towardsdatascience.com/using-nlp-to-identify-reddito...), and has released their code publicly (https://github.com/jabraunlin/reddit-user-id)
  Given the technique used, I don't see why something simple and local wouldn't defeat it? The "easiest" technique would be to use this weighting as a negative metric in rewriting.
- hcs 3 years ago
  
  > But then you have to send your original comment to a translation company so there are privacy concerns there too.
  There are modern offline translation systems available such as Bergamot https://browser.mt/

andreareina 3 years ago

Trailing (and probably leading, didn't check) spaces confuse the user lookup.

Lichtso 3 years ago

I wonder how much this can be improved if metadata is taken into account as well. Especially the distribution of common post dates and times modulo a week, which also exposes in which timezone somebody probably lives.
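
A sketch of what such a feature could look like, assuming each comment comes with a Unix timestamp. The function names are hypothetical and the overlap measure (histogram intersection) is just one simple choice.

```python
from collections import Counter
from datetime import datetime, timezone


def hour_of_week_profile(timestamps):
    """Map Unix timestamps to a normalized histogram over the 168 hours of a week (UTC)."""
    counts = Counter()
    for ts in timestamps:
        dt = datetime.fromtimestamp(ts, tz=timezone.utc)
        counts[dt.weekday() * 24 + dt.hour] += 1  # Monday 00:00 UTC is bucket 0
    total = sum(counts.values())
    return [counts[h] / total for h in range(168)] if total else [0.0] * 168


def profile_overlap(p, q):
    """Histogram intersection: 1.0 for identical profiles, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(p, q))


# Two posts at the same hour of the week, one an hour later.
p = hour_of_week_profile([0, 3600, 7 * 24 * 3600])
# A single post twelve hours into the same day.
q = hour_of_week_profile([12 * 3600])
print(profile_overlap(p, p), profile_overlap(p, q))
```

The busiest buckets of such a profile hint at waking hours, and hence at a likely timezone, even before any text is compared.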

rand_user_100 3 years ago

On one hand, thank you for showing us all how easy it is to make something like this. No doubt organizations with more resources already have more sophisticated systems in the same vein.

On the other hand, can we agree that this product is unethical?

In many cases, when a person uses an alt, it is a direct and strong signal that they do not wish their other posts to be associated.

So this product is circumventing the explicit will of the person, and making it available to anyone with zero effort i.e. there is no barrier to getting this info.

I met someone about 10 years ago who said they built this at a university. And their argument also was "actually this enhances privacy because it lets you know something something something". And yet their research grants were coming from one source only.

It can be used for good, but most often it won't.

A4ET8a8uTh0 3 years ago

<< On the other hand, can we agree that this product is unethical?
It does create a high level of discomfort, because it illustrates well what privacy advocates try talking about to the population at large, but all that said.. how is it any different from regular scraping and analyzing it any other way?
This is a real question.
- rand_user_100 3 years ago
  
  It's different because you're removing all barriers to access and making it easy and convenient to stalk/dox people.
  Imagine you get the urge to track someone, but in order to do that you have to spend a week writing some new software. That's a barrier. And because of it you may change your mind because it's a lot of work with little payoff.
  But if that info is just one click away, it's a whole different ballgame.
dragonwriter 3 years ago

> On the other hand, can we agree that this product is unethical?
No.

yyt554 3 years ago

Fun exercise would be to find all accounts that suddenly stopped posting around today and correlate them with new accounts created around today.

All those scared folks who naively think that it's not too late yet. Busted.

abhaynayar 3 years ago

502 Bad Gateway

costco 3 years ago

Apologies for the downtime. It is up again and I'm looking into why uwsgi crashed.

JKCalhoun 3 years ago

The asymmetry is interesting. I have no alts but of course it nonetheless reported accounts similar to mine.

Then running it on the account most similar to mine did not put me in their top 20.

sitkack 3 years ago

I believe this is the https://en.wikipedia.org/wiki/Friendship_paradox
- JKCalhoun 3 years ago
  
  Very cool.

hamburglar 3 years ago

I’m guessing that a small corpus for a given account doesn’t produce a very good score? I’ve done throwaways a couple times in the past and this has not “outed” them.

oblib 3 years ago

I've only had one account here. The highest match has a 0.624 score and the lowest a 0.572. I'm not sure if that means I'm unique or common but I'd like to know.

rcarr 3 years ago

One way to get around this legitimately would be by posting a lot of quotes/lyrics/excerpts and the like thus fooling the algorithm unless it had a way to filter them out

notduncansmith 3 years ago

This has been a great way to find people whose commentary I enjoy!

mancerayder 3 years ago

We knew this was possible and was coming, and it has probably been around for a few years. Fascinating from a technology perspective, terrifying from a long-term privacy perspective.

rglover 3 years ago

It's moments like this I'm proud to have my insanity on full display without obscurity. Was surprised to see a bunch of ~30% matches despite not having any alts.

harryvederci 3 years ago

My runner-up has a rating of 0.42378790667730715

C'mon guys, work harder. That's not even close! :-D

Btw, I myself am only at 0.9999999999999999 so I guess I need to work harder at being myself.

srean 3 years ago

I tried it on a few user-ids that I strongly suspected were owned by the same person. My hunches stand corroborated. Not sure who is corroborating whom though, me or the script.

Good job.

SkyMarshal 3 years ago

Oddly, I am not an exact match to myself.

> Most likely candidates:

skymarshal: 0.9999999999999997

The other few usernames I tested (pg, dang, some random ones from this thread) all matched themselves at 1.0.

timeon 3 years ago

I had a hard time understanding some comments made by my closest match. I guess this is a good reality check. I need to learn how to write more legible posts now.

FartyMcFarter 3 years ago

Sorry, what did you mean? :P

Retr0id 3 years ago

It didn't find my alt, but the second match is one of my twitter mutuals - I wonder if we've inadvertently borrowed style quirks from each other.

account-5 3 years ago

I wasn't aware this was even a thing! Scary stuff. 2 alts are listed but not with any great accuracy, so easy to dismiss. What an interesting topic.

kfichter 3 years ago

Does anyone here have a reasonably wide variety of similarity ratings? I'd love to see the difference between a 0.2 and a 0.8 for the same account.

CobaltFire 3 years ago

Interesting; I must have a fairly unique style as there are no matches over 0.40 for me.

I’m a native English speaker as well, so I’m unsure how to feel about that.

musicale 3 years ago

> I made this site mostly to show how easy this is and how it can erode online privacy

looks like it can indeed

> Here are some frequent HN commenters: (EDIT: Removed due to privacy concerns)

How surprising that someone might object to being included in a demonstration of the erosion of privacy!

Is the site opt-in or opt-out?

Aachen 3 years ago

I doubt they asked 78k users for permission when there's no standardized way of reaching out if you're not a site admin. It's opt out if anything.
- bee_rider 3 years ago
  
  You opt into making your writing publicly available when making posts on this site. I’m not sure what Ycombinator’s user agreement* says about this, but it is pretty obvious that they haven’t done anything to prevent it (and it isn’t clear what they could do).
  * and I mean the author of the tool is here making posts, so I guess they have agreed to the TOS, but clearly someone who hasn’t agreed to it could also make this tool and scrape out publicly available posts without agreeing to anything.

tomrod 3 years ago

Is it weird that my rating is very low compared to alternative options? I have no alts, but I'm curious how similar others might write to me.

stephc_int13 3 years ago

What is the threshold to be reasonably confident that two accounts are from the same individual?

I have only ever had one account here and the closest match is at 0.47.

00F_ 3 years ago

ive had maybe a hundred throwaway accounts on HN over the past ten years. generally, i make an account, say something that is apparently wildly offensive to someone else, get flagged and down-voted and then muted or hell-banned. then i make another account because i never did anything wrong and start the process over again. ive emailed the admins, tried to reason with the admins, it never does any good. the power is held by power-users who flag people -- most of the power of an admin at the end of the day but without any of the accountability. as long as they are following the mainstream dogma, its all good.

anyway, this app was able to identify a lot of my accounts. but a lot of the matches werent me. bold matches were almost all me. but i know there are many more matches than those that were listed. it mainly showed my most recent accounts.

i think most people would get a sick feeling in their stomach if they tried this app. i dont think people are prepared for a world where you can type someones name into an app like this and produce everything ever recorded online that was created by that person. not only this but everything highlighted and summarized to answer any question about that person. this is what advanced ai will bring us. an information implosion where the planet-sized ocean of data that is just floating all around us suddenly and violently coalesces into the objects of our new societal calculus. violent is a good word. and this is just the change that one can see coming with ai.

costco 3 years ago

You are definitely right. Part of the reason I chose the 10,000 character minimum was so that people using throwaways in the true sense would be entirely excluded. I don't plan on keeping this up forever and I too would not feel comfortable if this was deployed at scale.
- ayewo 3 years ago
  
  Would you be open to open sourcing the code when you decide to shut down the service?
stupendous_luck 3 years ago

You really don't need advanced AI to do it. Just a bunch of scrapers and some run of the mill statistics. And guess what, it's been done by many companies already. They just don't care to create such a site.
- 00F_ 3 years ago
  
  you have no idea what im talking about. you dont realize how much data is out there. you dont comprehend how much smarter than you something can be.

the_cat_kittles 3 years ago

pretty cool- i think there should be a term for two accounts that have each other as the top most similar account. kinda sad i dont have one :(

layer8 3 years ago

Stylotwins?
philosopher1234 3 years ago

We’re pretty close me and you — closer than my actual alts
- the_cat_kittles 3 years ago
  
  hello friend! but... id never use an em dash
  
  philosopher1234 3 years ago
  
  Well… I would never use a lowercase word after an exclamation point!
  …Because I’m on mobile

dvh 3 years ago

Make a fundraiser and start doing it for other sites.

costco 3 years ago

It would be possible for Reddit because Pushshift.io archives all the comments there and Reddit is still pretty small. I'd probably need to make things a lot faster. Doing it on a specific subreddit would be very feasible. I'll think about it but I don't actually know if I really want to do that because for instance I've been banned from subreddits before but I don't want a ban from when I was 12 years old to follow me around forever because my writing style hasn't changed. Moderation is the most obvious application of this kind of software.
- rand_user_100 3 years ago
  
  > I'll think about it but I don't actually know if I really want to do that because for instance I've been banned from subreddits before but I don't want a ban from when I was 12 years old to follow me around forever
  Insightful that your personal experience and impact on you personally affects your decision. I invite you to think about the impact of the products you build in your CS career by putting yourself in the shoes of other people as well.
  Some products should not be built, even though it's easy to build them.
  
  SpelingBeeChamp 3 years ago
  
  What other easily-built products do you think should not exist?

robertlagrant 3 years ago

Clicking on my top match (0.61) - I can see the similarity. I also note they quote the same way, with a > symbol. I wonder if that helps!

paulpauper 3 years ago

Inserting random Unicode blank, 1/4, 1/2, or zero space characters into your writing may help thwart it too, if you are paranoid
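
To see why invisible characters can throw off a word-frequency approach, here is a small sketch. The tokenizer is a guess at the simplest possible one, not the site's, and `strip_invisible` is a hypothetical counter-measure that a tool could add just as easily.

```python
# Zero-width space, zero-width non-joiner, zero-width joiner, word joiner, BOM.
# Mapping each code point to None makes str.translate delete it.
INVISIBLE = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def tokens(text):
    """Naive tokenizer: lowercase, then split on whitespace."""
    return text.lower().split()


def strip_invisible(text):
    """Remove the invisible characters listed above before tokenizing."""
    return text.translate(INVISIBLE)


poisoned = "my wri\u200bting style"

# The zero-width space is not whitespace, so it hides inside one "new" word.
print(tokens(poisoned) == tokens("my writing style"))
# After stripping, the original words are back.
print(tokens(strip_invisible(poisoned)) == tokens("my writing style"))
```

So the trick only helps against tools that skip this normalization step.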

UncleEntity 3 years ago

Huh, that’s how I signal my KGB handler…
zimpenfish 3 years ago

Would thwart this tool, presumably, but not anything which considered spacing ("do they use double space after a sentence?") and punctuation, etc., as markers.
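
A sketch of the kind of spacing and punctuation markers meant here. The feature list is hypothetical and far from complete; it only shows that such habits are cheap to count.

```python
import re

# Hypothetical feature set; each pattern is one habit a writer may not notice.
PATTERNS = {
    "double_space_after_sentence": r"[.!?]  (?=\S)",
    "double_hyphen": r"(?<!-)--(?!-)",
    "em_dash": "\u2014",
    "semicolon": ";",
    "accented_letter": "[\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u00ff]",
}


def punctuation_features(text):
    """Count occurrences of each pattern; divide by len(text) to compare users."""
    return {name: len(re.findall(pattern, text)) for name, pattern in PATTERNS.items()}


sample = "Wait.  I know; really -- it is a caf\u00e9."
print(punctuation_features(sample))
```

Features like these survive invisible-character tricks because they never depend on how words are split.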

ed25519FUUU 3 years ago

Very cool! And really a shame that you’re not allowed to delete an old alt account or comments on HN! It follows you forever apparently.

sdsd 3 years ago

All false positives for me - I want to reach out to the accounts that talk similar to me and see if we make good friends

thot_experiment 3 years ago

Maybe this is a good tool to find new friends. :P

neodypsis 3 years ago

How do you protect yourself from impersonators?

aryc19 3 years ago

So what are some good tools to obfuscate style?

scarface74 3 years ago

It found my “alternate” account. If someone puts my username in, it’s not hard to figure out which alternate is mine.

balls187 3 years ago

No alt, and the highest match is 0.36

And that account's last several comments were flagged as dead.

I'm a native speaker, but my english succcccks.

scotty79 3 years ago

A funny thing would be to find the most stylistically unique user account.

Which user has the lowest best match?

Mine is 0.58 so I'm really not that unique.

zimpenfish 3 years ago

Fractionally more unique with a best match of 0.547.

a-dub 3 years ago

would probably work better with case and punctuation preserving n-grams, sentence length, paragraph length and use of whitespace stats.

also maybe a tf-idf vector of top n words per user.

also could maybe do a same phrase analysis across the corpus to find some hand picked features.

timestamps could be interesting.

or, of course, let the machine do it with comment2vec.
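
a minimal sketch of the first idea, assuming plain comment text as input. the function names are made up, and the sentence splitter is deliberately crude.

```python
import re
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams, keeping case, punctuation and spaces."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def length_stats(text):
    """Average sentence length (in words) and paragraph length (in sentences)."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = text.split()
    return {
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "sentences_per_paragraph": len(sentences) / len(paragraphs) if paragraphs else 0.0,
    }


grams = char_ngrams("Hi, hi!", n=2)   # "Hi" and "hi" stay distinct
stats = length_stats("One two. Three four five!\n\nSix.")
print(grams["Hi"], grams["hi"], stats)
```

the n-gram counts can feed the same similarity measure as word counts, or a tf-idf weighting if a library for that is available.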

mysterydip 3 years ago

I was curious to use this on myself to see if anyone writes like me. Closest was a .51 confidence, so I guess not?

iambateman 3 years ago

This is cool!

If an account returns a high score for many accounts, does that also mean they’re relatively less original in style?

msla 3 years ago

It puts almost all of my old accounts decently near the top, but my original account is almost comically low.

oliwary 3 years ago

Cool! I wonder if it could be run backwards, to identify the users on hackernews with the most unique voices.

peacelilly 3 years ago

This is creepy.

noncoml 3 years ago

I think the word you are looking for is uncanny

Semaphor 3 years ago

My alt accounts (not really, all below 0.5) seem to also be European or German Firefox users. Good for us ;)

jonnycomputer 3 years ago

Obviously the next thing to do is make this a popup on someone's account name when you hover over it.

el_dev_hell 3 years ago

This is super impressive!

Is there a common open source library (Python, JS, whatever) that implements something like this?

silasdavis 3 years ago

> imagine what a company with millions of dollars and a couple dozen PhD linguists could do.

Could they do much better?

medellin 3 years ago

How much writing do you need to analyze results? Would changing account every X sentences eliminate this?

costco 3 years ago

The current minimum is 10000 characters. In my own tests accuracy was still pretty good at 3000-5000, but I instituted the 10000 minimum to reduce false positives. And yes, it would: that is one of the things I recommend on the advice page on avoiding detection. Unfortunately HN moderators do not really like that.

serf 3 years ago

I have no alts, but to those of you compared to me by this engine : "Hey, good lookin'!"

vxNsr 3 years ago

wow, this is way off on me, didn't find my alts and the bolded accounts on my list are from different countries, use language I'd never use (cusses) and I see I've downvoted some of them...

I'd love to have the experience and/or apparent wealth my "alts" have

SanjayMehta 3 years ago

This is great.

One funny thing though, while your example says 1.0, for my own account it says 0.99lotsof9s4

dsr_ 3 years ago

I like the way some usernames are only 0.9999999 correlated with themselves.

Perhaps 6 or 7 digits is enough?

2OEH8eoCRo0 3 years ago

This found an old account that I forgot I even had but with a lot of false positives. Neat!

sedatk 3 years ago

I have no alternate accounts, and all my matches are below 0.4 for whatever it’s worth.

ChrisMarshallNY 3 years ago

Interesting, but it gave me 20 accounts, and I know that I only have this one.

costco 3 years ago

Sorry for any misunderstanding, read https://news.ycombinator.com/item?id=33756725

elorant 3 years ago

Sounds like a nice tool to find friends. You locate people who might think like you.

pugworthy 3 years ago

Strip leading/trailing white space from the name if it says no match.

uberduper 3 years ago

I would have expected to be a closer match to myself.

> uberduper: 0.9999999999999991

jonnycomputer 3 years ago

Well, one of the closest on my list is my twin, so there's that.

kiernanmcgowan 3 years ago

Love a little NLP project on a public dataset - thanks for sharing!

McDyver 3 years ago

Would this work for Fernando Pessoa and all his heteronyms? :)

throwawayhghcj 3 years ago

I’d like to request the author takes this offline please until the implications can be thought through.

This is breaking anonymity that people incorrectly thought would not be revealed.

For some it might be awkward, others it might be quite problematic.

s3000 3 years ago

This is nothing new, e.g:
Analyzing stylistic similarity amongst authors
https://news.ycombinator.com/item?id=10050603
http://markallenthornton.com/blog/stylistic-similarity/
37 points by lingben on Aug 12, 2015
kaba0 3 years ago

I would agree with you but the genie is out of the bottle already. Nigh everyone can and could have reproduced these results, especially given that archive.org and similar things exist.
So, I don’t think it causes any new harm, if anything it gives you future risk aversion.
IAmGraydon 3 years ago

This is not complex and is a well known method that state actors have been using for quite a long time. Governments have FAR more advanced ways to track you than this, but it's good for people to realize it exists.

F_r_k 3 years ago

Found my phone account; I'm quite impressed, really !

ThrowawayTestr 3 years ago

Haha, you got me and my main account. That's spooky.

afarviral 3 years ago

I'm tempted to use it to find likeminded friends :)

theGnuMe 3 years ago

This could be a good idea for identifying bots.

costco 3 years ago

Not sure whether GPT3, at least if prompted right, would have a clearly identifiable style. It could probably detect converted call centers in Russia or Cambodia where 50 employees post on 10000 accounts, though.

atum47 3 years ago

at what threshold does it consider an account an alt?

costco 3 years ago

There is no threshold. This site does not make any call as to whether a user is an alt or not. It just gives the users with the most similar word choice and from there it is up to you to decide (is there a very specific detail that both accounts mention, do they post at similar times, etc). I will say bolded accounts are substantially more likely to be alts though. But obviously it is not guaranteed that every user has an alt.

hk1337 3 years ago

Joke's on you, this is my one and only account.

Ikatza 3 years ago

Are short sentences better for anonymity?

b800h 3 years ago

Well, interesting. This is one of the reasons we have the GDPR. @costco, if I were to make a GDPR erasure request, would you service it?

And I'm no lawyer, but it seems like there's also an outside chance of a breach of section 171 here as well, which is a criminal offence committed by a person who reidentifies de-identified data.

Plus - the laws have extraterritoriality. Vanishingly unlikely that you'd actually be pursued for it, but it's worth bearing in mind when you munge people's personal data.

eps 3 years ago

It's an EU law.
- b800h 3 years ago
  
  With extraterritoriality. And if identifying people in this way is covered (I'm not a lawyer, I'm not claiming it definitely is), then it's also possible that EU citizens using the tool are committing a criminal offence.
  The law seems to only apply where the deidentification has been made by the data controller, but HN admins changing someone's username, for example, if they ever do, would count. A person then using the tool to match another non-anonymous username to that account would seem to be caught.
  Important to stress how much of a technicality this is, but that sort of thing can be interesting sometimes.

canadiantim 3 years ago

Wow... that's shockingly effective

kuramitropolis 3 years ago

Welp, so much commenting for me then.

costco 3 years ago

Site seems to have been down when you commented this. If you want to try again it is up again :)

AtlasBarfed 3 years ago

What's a high correlation number?

karol 3 years ago

Are you going to try it on Twitter?

AviationAtom 3 years ago

Now I can find my HN doppelganger

zem 3 years ago

heh, I looked up the top bold hit for my name and they really do sound a bit like me (:

thr0v_awway 3 years ago

writing from throwaway:

Holy shit, it works really, really well. It found all of my older accounts.

moneywoes 3 years ago

What algorithm is being used?

interroboink 3 years ago

It's described here: https://stylometry.net/about

spaniard89277 3 years ago

I changed my nickname so my employer can't find me here. I'm not amused by this.

googlryas 3 years ago

New account, then translate your comments to Spanish and then back to English using Google translate.
bee_rider 3 years ago

If this basic implementation can catch you, I’d consider it a friendly reminder that changing your account name is not a very effective means of adding privacy.

cbracketdash 3 years ago

The website is down...

costco 3 years ago

Apologies for the downtime. It is up again and I'm looking into why uwsgi crashed.

user- 3 years ago

Now do one for reddit

julienreszka 3 years ago

why is my username not exactly equal to 1? https://stylometry.net/user?username=julienreszka

costco 3 years ago

Python/floating point rounding error. It doesn't mean anything.
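
A quick illustration of where that comes from, assuming a cosine-style score (the function names are hypothetical). The dot product and the product of the two norms are rounded separately, so their ratio can land a hair away from 1.0 even when both vectors are identical.

```python
import math


def cosine_similarity(u, v):
    """Plain cosine similarity between two equal-length lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def display_score(score, digits=5):
    """Round for display so tiny rounding errors do not show up."""
    return f"{score:.{digits}f}"


v = [0.1, 0.2, 0.3]
self_score = cosine_similarity(v, v)

# Compare with a tolerance instead of testing for exact equality with 1.0.
print(math.isclose(self_score, 1.0, rel_tol=1e-12))
print(display_score(0.9999999999999982))
```

Whether a given user lands on exactly 1.0 depends on their particular numbers, which is why only some accounts show the long tail of nines.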

seydor 3 years ago

does it use the most used words or least used?

t0bia_s 3 years ago

There should be an option to hide a user's comments in their profile.

ruined 3 years ago

didn't find a single one of my alts. nice

costco 3 years ago

I obviously don't expect you to help me but do they have at least 10000 characters written and are you varying your writing style in any way?

ALittleLight 3 years ago

Of the top ten accounts listed for my name two of them are me.

rmelhem 3 years ago

nice one. are you using gpt3 under the hood?

costco 3 years ago

I'm not that smart - my site is basically just doing some calculations on word frequencies. You can read https://academic.oup.com/dsh/article-abstract/17/3/267/92927... and https://www.tandfonline.com/doi/abs/10.1080/09296174.2011.53... and https://news.ycombinator.com/item?id=33755898 for more information.
- sillysaurusx 3 years ago
  
  Don’t sell yourself short. Simplicity is smart. It’s astonishing how often the simplest thing turns out to be exponentially more effective than the so-called smart thing.
  I can’t get over how phenomenal this is. Please put every one of your side project ideas into production!
- isoprophlex 3 years ago
  
  Simplicity is the greatest form of sophistication! Great work!
  One small nit from a user experience point of view: it'd be easier on the eyes if you just truncated those cosine similarity scores (or whatever score you're using) after, say, the 5th digit. Showing the entire float is kinda messy to my eyes.
- rmelhem 3 years ago
  
  cool and thanks for the clarification. i ask that mainly because of the request limit of openai, which is something that makes many scalable ideas unfeasible
- Dma54rhs 3 years ago
  
  It's easy to write complicated systems; it takes a genius to make them simple.
- dunham 3 years ago
  
  I am curious whether it could pick GPT3 out of the crowd.
- ghaff 3 years ago
  
  As you mention on the site, you don't do punctuation. But I'm guessing there are some pretty good fingerprints like:
  two spaces after a period
  Whether someone uses an em-dash/single hyphen/double hyphens (which may correspond to house style they're used to)
  Whether they use semi-colons
  (Presumably harder) but consistent substitutions like loose for lose, break for brake, etc.
  Use of accents
  
  tfsh 3 years ago
  
  I manually determined there was an individual posing as two people (playing both the antagonist and the adversary) because they consistently misspelt certain words such as "definitely" as "defiantly".
  Fingerprinting certain linguistic traits, mapping them to time zones, and confirming that the two accounts' posting times overlapped partially but never exactly worked exceedingly well. Someone can't easily maintain a fluent conversation with themselves across two accounts. They can only get close, either through unnatural delays between sentences or by never interacting with the "other" party at the same time.
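
The punctuation habits listed above could be counted with a few regular expressions. This is a minimal sketch under stated assumptions: the feature names and the per-1,000-character scaling are hypothetical choices, not something the site does, and word substitutions such as loose/lose are not covered.

```python
import re

def punctuation_features(text):
    """Rate of each punctuation habit per 1,000 characters of text."""
    counts = {
        # A period followed by two or more spaces before the next word.
        "double_space_after_period": len(re.findall(r"\.  +(?=\S)", text)),
        # A literal em-dash character (U+2014).
        "em_dash": text.count("\u2014"),
        # Exactly two hyphens in a row, used in place of a dash.
        "double_hyphen": len(re.findall(r"(?<!-)--(?!-)", text)),
        # A single hyphen with a space on each side, used as a dash.
        "spaced_single_hyphen": text.count(" - "),
        "semicolon": text.count(";"),
        # Accented Latin-1 letters such as the e in "cafe" with an accent.
        "accented_letter": len(
            re.findall("[\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u00ff]", text)
        ),
    }
    scale = 1000 / max(len(text), 1)
    return {name: count * scale for name, count in counts.items()}

# Made-up sample text, for illustration only.
sample = "I agree.  It works; mostly -- except for the caf\u00e9 case."
features = punctuation_features(sample)
for name, rate in features.items():
    print(f"{name}: {rate:.1f}")
```

These rates could be appended to a word-frequency vector before comparing users, though how to weight them against the word features is an open design choice.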

andsoitis 3 years ago

we leave fingerprints everywhere

joxel 3 years ago

ColinWright is Dang?

Woah

franze 3 years ago

totally spot on

my current and my old account

ecec 3 years ago

xwolfi 3 years ago

Wow... how !

WalterBright 3 years ago

Over in the D language forums, we welcome people who post under a pseudonym, and our policy is we won't allow attempts to unmask them.

This is to protect high profile users who are secretly enjoying programming in D rather than the language they are supposed to use.

And, of course, to protect users who feel they might be discriminated against if their background was known.

bo1024 3 years ago

It's very important for those people to be aware of these style analysis attacks! Glad this post is raising awareness.

RepAgent 3 years ago

What's up with the cluster of users like:

j_s, password4321, carolinew, colinwright, kuharich, etc.

https://stylometry.net/user?username=j_s https://stylometry.net/user?username=carolinew https://stylometry.net/user?username=colinwright https://stylometry.net/user?username=password4321

Lowest match for j_s is 0.80 and all but one is black.

notduncansmith 3 years ago

At a cursory glance, it looks like a cluster of users that post links, especially with italicized quoted excerpts.

jallasprit 3 years ago

Most likely candidates:

    pg: 1.0
    montrose: 0.604073065373204
    mattmaroon: 0.5900372458160795
    natsu: 0.5519832271289953
    rauljara: 0.5418566694533273
    waterlesscloud: 0.5378996309342633
    damoncali: 0.5292014150349463
    gruseom: 0.5290151637991445
    kemiller2002: 0.5254174524920762
    jfengel: 0.5231938496089998
    jamesaguilar: 0.5229081613163672
    houseabsolute: 0.5219738531025365
    danssig: 0.5195368367601849
    austenallred: 0.519343009683366
    loewenskind: 0.5177030083877397
    baguasquirrel: 0.5153841099708854
    asdfasgasdgasdg: 0.5146704002447524
    aptwebapps: 0.5144149629369845
    allenbrunson: 0.512802806408646
    danielweber: 0.5123620795710832

honkler 3 years ago

Not today.

You fail, I win.

costco 3 years ago

Nice. Just out of curiosity are you taking any countermeasures or varying your writing style across accounts in any way?
- psychphysic 3 years ago
  
  My second-closest match was 0.35, but searching for other people turns up matches in the 0.5-0.75 range. I suspect that's mostly down to the number of posts leading to better statistics.
- honkler 3 years ago
  
  yeah I vary my writing styles. Much of the stuff I post through this account is controversial, to say the least. So I have to take "measures".