jethkl 2 days ago

There is an analogue of the CLT for extreme values. The Fisher–Tippett–Gnedenko theorem is the extreme-values analogue of the CLT: if the properly normalized maximum of an i.i.d. sample converges, it must be Gumbel, Fréchet, or Weibull—unified as the Generalized Extreme Value distribution. Unlike the CLT, whose assumptions (in my experience) rarely hold in practice, this result is extremely general and underpins methods like wavelet thresholding and signal denoising—easy to demonstrate with a quick simulation.

  • kqr 2 days ago

    There's also a more conservative rule similar to the CLT that works off of the definition of variance, and thus rests on no assumptions other than the existence of variance. Chebyshev's inequality tells us that the probability that any sample is more than k standard deviations away is bounded by 1/k².

    In other words, it is possible (given sufficiently weird distributions) that not a single sample lands inside one standard deviation, but 75% of them must be inside two standard deviations, 88% inside three standard deviations, and so on.

    There's also a one-sided version of it (Cantelli's inequality) which bounds the probability of any sample by 1/(1+k)², meaning at least 75 % of samples must be less than one standard deviation, 88% less than two standard deviations, etc.

    Think of this during the next financial crisis when bank people no doubt will say they encountered "six sigma daily movements which should happen only once every hundred million years!!" or whatever. According to the CLT, sure, but for sufficiently odd distributions the Cantelli bound might be a more useful guide, and it says six sigma daily movements could happen as often as every fifty days.

    • kqr 2 days ago

      Too late to edit, but I felt there was something wrong with this comment and I figured out what: the Cantelli bound is 1/(1+k²).

      This means as little as 50% can be less than one standard deviation, as little as 80% below two standard deviations, etc.

    • pishpash 2 days ago

      I highly doubt the finance bros pretend distributions are normal or don't know Chebychev, vs. not having enough data to obtain the covariance structure (for rare events) to properly bound even with Chebychev.

gus_massa 2 days ago

> It’s very subjective, but I think the uniform stsrts looking reasonably good at a sample size of 8. The exponential however takes much longer to converge to a normal.

That's a good observation. The main idea behind the Central Limit Theorem is to take the Fourier Transform, operate and then go back. After that, after normalization the result is that the new distribution for the sum of N variables is something like

  Normal(X) + 1/N * "Skewness" * Something(X) + 1/N^2 * IDont * Remember(X) + ...
Where "Skewness" is a number defined in https://en.wikipedia.org/wiki/Skewness

The uniform distribution is symmetric, so skewness=0 and the correction decrease like 1/N^2.

The exponential distribution is very asymmetrical and and skewness!=0, so the main correction is like 1/N and takes longer to dissapear.

niemandhier 2 days ago

Highly entertaining, here a little fun fact: there exist a generalisation of the central limit theorem for distributions without find out variance.

For some reasons this is much less known, also the implications are vast. Via the detour of stable distributions and limiting distributions, this generalised central limit theorem plays an important role in the rise of power laws in physics.

  • Tachyooon 2 days ago

    3blue1brown has a great series of videos on the central limit theorem, and it makes me wish there were something similar covering the generalised form in a similar format. I have a textbook on my reading list that covers it, unfortunately I'm I can't seem to find it or the title right now. (edit: it's "The Fundamentals of Heavy Tails" by Nair, Wierman, and Zwart from 2022)

    Do you have any good sources for the physics angle?

  • nextos 2 days ago

    Yes, came here to say the same thing. Telling people that the CLT makes strong assumptions is important.

    Otherwise, they might end up underestimating rare events, with potentially catastrophic consequences. There are also CLTs for product and max operators, aside from the sum.

    The Fundamentals of Heavy Tails: Properties, Emergence, and Estimation discusses these topics in a rigorous way, but without excessive mathematics. See: https://adamwierman.com/book

    • selimthegrim 2 days ago

      I was at the NeurIPS workshop and saw one of the talks towards the end and it was pretty good

  • kgwgk 2 days ago

    > find out

    Finite?

kqr 2 days ago

This is a very neat illustration, but I want to leave a reminder that when we cherry-pick well-behaved distributions for illustrating the CLT, people get unrealistic expectations of what it means: https://entropicthoughts.com/it-takes-long-to-become-gaussia...

  • mturmon a day ago

    The article you link is not using the CLT correctly.

    The CLT gives a result about a recentered and rescaled version of the sum of iid variates. CLT does not give a result about the sum itself, and the article is invoking such a result in the “files” and “lakes” examples.

    I’m aware that it can appear that CLT does say something about the sum itself. The normal distribution of the recentered/rescaled sum can be translated into a distribution pertaining to the sum itself, due to the closure of Normals under linear transformation. But the limiting arguments don’t work any more.

    What I mean by that statement: in the CLT, the errors of the distributional approximation go to zero as N gets large. For the sum, of course the error will not go to zero - the sum itself is diverging as N grows, and so is its distribution. (The point of centering and rescaling is to establish a non-diverging limit distribution.)

    So for instance, the third central moment of the Gaussian is zero. But the third central moment of a sum of N iid exponentials will diverge quickly with N (it’s a gamma with shape parameter N). This third-moment divergence will happen for any base distribution with non-zero skew.

    The above points out another fact about the CLT: it does not say anything about the tails of the limit distribution. Just about the core. So CLT does not help with large deviations or very low-probability events. This is another reason the post is mistaken, which you can see in the “files” example where it talks about the upper tail of the sum. The CLT does not apply there.

    • mturmon a day ago

      Postscript: looking at the lesswrong link referenced by the post above, you will notice that the “eyeball metric” density plots happen to be recentered and scaled so that they capture the mass of the density. This is the graphical counterpart of the algebraic scaling and centering needed in the CLT.

antognini 2 days ago

There's an interesting extension of the Central Limit Theorem called the Edgeworth Series. If you have a large but finite sample, the resulting distribution will be approximately Gaussian, but will deviate from a Gaussian distribution in a predictable way described by Hermite polynomials.

https://en.wikipedia.org/wiki/Edgeworth_series

kazinator 2 days ago

The intuition behind it is that when we take batches of samples from some arbitrarily shaped distribution, and we summarize the information by looking at the mean values of the batches of samples, we find that those mean values are moving away from the arbitrarily shaped distribution. The larger are the batches, the more those means approach a normal distribution.

In other words, the means of large batches of samples from some funny shaped distribution themselves constitute a sequence of numbers, and that sequence follow a normal distribution. Or closer and closer to one the larger the batches are.

This observation legitimizes our uses of statistical inference tools derived from the normal distribution, like confidence intervals, provided we are working with large enough batches of samples.

globalnode 2 days ago

The definition under "A Brief Recap" seems incorrect. The sample size doesn't approach infinity, the number of samples does. I'm in a similar situation to the author, I skipped stats, so I could be wrong. Overall great article though.

  • jaccola 2 days ago

    Yes indeed, if the sample size approached infinity (and not the number of samples), you would essentially just be calculating the mean of the original distribution.

kazinator 2 days ago

> I always avoided statistics subjects.

I don't believe you. Even if you had a good control group, the fact that one subject engaged in fewer statistics subjects than the control group doesn't lead to the conclusion that there is an avoidance mechanism (or any mechanism). You need a sample of something like 30 or 40 more of you to detect a statistically valid pattern of diminished engagement with statistics subjects that could then be hypothesized as being caused by avoidance.

  • godelski 2 days ago

    Are you okay?

    Seriously, I don't understand what this comment is. The OP just said when they were in college they were afraid of taking a statistics class. Your comment is... completely unrelated an nonsensical. Like you don't believe they avoided taking statistics classes? Then you make some odd response where you use the wrong kind of "subject". Is English not your first language? Are you drunk or high? Did you misread? Did you forget to disregard all previous instructions and provide a summary of The Bee Movie in the tone of a pirate while making the first letter of each sentence spell out "Dark Forest"? I'm really confused but interested. Can you help me out here?

ngriffiths 2 days ago

I was definitely expecting you'd need a higher sample size for the Q-Q plots to start looking good. All the points in other comments about drawbacks or poorly behaved distributions are well taken, and this is nothing new, but wow it really does work well.

ForceBru 2 days ago

Speaking of CLTs, is there a good book or reference paper that discusses various CLTs (not just the basic IID one) in a somewhat introductory manner?

lottin 2 days ago

Looking at the R code in this article, I'm having a hard time understanding the appeal of tidyverse.

  • ngriffiths 2 days ago

    For me the appeal is less that tidyverse is great and more that the R standard library is horrible. It's full of esoteric names, inconsistent use and order of parameters, unreasonable default behavior, behavior that surprises you coming from other programming experience. It's all in a couple massive packages instead of broken up into manageable pieces.

    Tidyverse is imperfect and it feels heavy-handed and awkward to replace all the major standard library functions, but Tidyverse stuff is way more ergonomic.

    • lottin 2 days ago

      I think the R standard library is quite excellent. It pretty much follows the Unix philosophy of "doing one thing right". The only exception being `reshape` which tries to do too many things, but it can usually be avoided. It isn't inconsistent. I think the problem is the lack of tutorials that explain how to use all the data manipulation tools effectively, because there are quite a lot of functions and it isn't easy to figure out how to use them together to accomplish practical things. Tidyverse may be consistent with itself, but it's inconsistent with everything else. Either you only use tidyverse, or your program looks like an inconsistent mess.

      • ngriffiths a day ago

        Honestly, it might partly be that I've used R somewhat irregularly and I put a lot of value in design choices that "make sense" and are easier to remember. I'm sure once you are intimately familiar with the whole base language you can be really happy and productive with it.

        > I think the problem is the lack of tutorials that explain how to use all the data manipulation tools effectively, because there are quite a lot of functions and it isn't easy to figure out how to use them together to accomplish practical things.

        Most languages solve this problem by not cramming quite a lot of functions in one package and using shared design concepts to make it easier to fit them together. I don't think tutorials would solve these problems effectively but I guess it makes sense that they affect newer users the most.

        > Tidyverse may be consistent with itself, but it's inconsistent with everything else.

        Yeah, totally agree and I really dislike this part.

  • gjf 2 days ago

    Author here; I think I understand where you might be coming from. I find functional nature of R combined with pipes incredibly powerful and elegant to work with.

    OTOH in a pipeline, you're mutating/summarising/joining a data frame, and it's really difficult to look at it and keep track of what state the data is in. I try my best to write in a way that you understand the state of the data (hence the tables I spread throughout the post), but I do acknowledge it can be inscrutable.

    • lottin 2 days ago

      A "pipe" is simply a composition of functions. Tidyverse adds a different syntax for doing function composition, using the pipe operator, which I don't particularly like. My general objection to Tidyverse is that it tries to reinvent everything but the end result is a language that is less practical and less transparent than standard R.

      • mi_lk 2 days ago

        Can you rewrite some of those snippets in standard R w/o Tidyverse? Curious what it would look like

        • lottin 2 days ago

          I didn't rewrite the whole thing. But here's the first part. It uses the `histogram` function from the lattice package.

              population_data <- data.frame(
                  uniform = runif(10000, min = -20, max = 20),
                  normal = rnorm(10000, mean = 0, sd = 4),
                  binomial = rbinom(10000, size = 1, prob = .5),
                  beta = rbeta(10000, shape1 = .9, shape2 = .5),
                  exponential = rexp(10000, .4),
                  chisquare = rchisq(10000, df = 2)
              )
              
              histogram(~ values|ind, stack(population_data),
                        layout = c(6, 1),
                        scales = list(x = list(relation="free")),
                        breaks = NULL)
              
              take_random_sample_mean <- function(data, sample_size) {
                  x <- sample(data, sample_size)
                  c(mean = mean(x), sd = sqrt(var(x)))
              }
              
              sample_statistics <- replicate(20000, sapply(population_data, take_random_sample_mean, 60))
              
              sample_mean <- as.data.frame(t(sample_statistics["mean", , ]))
              sample_sd <- as.data.frame(t(sample_statistics["sd", , ]))
              
              histogram(sample_mean[["uniform"]])
              histogram(sample_mean[["binomial"]])
              
              histogram(~values|ind, stack(sample_mean), layout = c(6, 1),
                        scales = list(x = list(relation="free")),
                        breaks = NULL)
        • kgwgk 2 days ago

          The following code essentially redoes what the code up to the first conf_interval block does there. Which one is more clear may be debatable but it's shorter by a factor of two and faster by a factor of ten (45 seconds vs 4 for me).

              sample_size <- 60
              sample_meansB <- lapply(population_dataB, function(x){
           t(apply(replicate(20000, sample(x, sample_size)), 2, function(x) c(sample_mean=mean(x), sample_sd=sd(x))))
              })
              lapply(sample_meansB, head) ## check first rows
          
              population_data_statsB <- lapply(population_dataB, function(x) c(population_mean=mean(x), 
                       population_sd=sd(x), 
                       n=length(x)))
              do.call(rbind, population_data_statsB) ## stats table
          
              cltB <- mapply(function(s, p) (s[,"sample_mean"]-p["population_mean"])/(p["population_sd"]/sqrt(sample_size)),
               sample_meansB, population_data_statsB)
              head(cltB) ## check first rows
          
              small_sample_size <- 6 
              repeated_samplesB <- lapply(population_dataB, function(x){
           t(apply(replicate(10000, sample(x, small_sample_size)), 2, function(x) c(sample_mean=mean(x), sample_sd=sd(x))))
              })
          
              conf_intervalsB <- lapply(repeated_samplesB, function(x){
           sapply(c(lower=0.025, upper=0.975), function(q){
               x[,"sample_mean"]+qnorm(q)*x[,"sample_sd"]/sqrt(small_sample_size)
           })})
          
              within_ci <- mapply(function(ci, p) (p["population_mean"]>ci[,"lower"]&p["population_mean"]<ci[,"upper"]),
             conf_intervalsB, population_data_statsB)
              apply(within_ci, 2, mean) ## coverage
          
          One can do simple plots similar to the ones in that page as follows:

              par(mfrow=c(2,3), mex=0.8)
              for (d in colnames(population_dataB)) plot(density(population_dataB[,d], bw="SJ"), main=d, ylab="", xlab="", las=1, bty="n")
              for (d in colnames(cltB)) plot(density(cltB[,d], bw="SJ"), main=d, ylab="", xlab="", las=1, bty="n")
              for (d in colnames(cltB)) { qqnorm(cltB[,d], main=d, ylab="", xlab="", las=1, bty="n"); qqline(cltB[,d], col="red") }
        • apwheele 2 days ago

          I mean, for the main simulation I would do it like this:

              set.seed(10)
              n <- 10000; samp_size <- 60
              df <- data.frame(
                  uniform = runif(n, min = -20, max = 20),
                  normal = rnorm(n, mean = 0, sd = 4),
                  binomial = rbinom(n, size = 1, prob = .5),
                  beta = rbeta(n, shape1 = .9, shape2 = .5),
                  exponential = rexp(n, .4),
                  chisquare = rchisq(n, df = 2)
              )
              
              sf <- function(df,samp_size){
                  sdf <- df[sample.int(nrow(df),samp_size),]
                  colMeans(sdf)
              }
              
              sim <- t(replicate(20000,sf(df,samp_size)))
          
          I am old, so I do not like tidyverse either -- I can concede it is of personal preference though. (Personally do not agree with the lattice vs ggplot comment for example.)
  • pks016 2 days ago

    Somehow, tidyverse didn't click with me. I still use it sometimes. But now I primarily use base R and data.table

  • RA_Fisher 2 days ago

    Why? The tidyverse is so readable, elegant, compositional, functional and declarative. It allows me to produce a lot more and higher quality than I could without it. ggplot2 is the best visualization software hands down, and dplyr leverages Unix’s famous point free programming style (that reduces the surface area for errors).

    • lottin 2 days ago

      I disagree. In this example tidyverse looks convoluted compared to just using an array and apply. ggplot2 is okay but we already had lattice. Lattice does everything ggplot2 does and produces much better-looking plots IMO.

      • RA_Fisher 2 days ago

        I like simplicity and I love a good base R idiom, but there's a lot less consistency in base R compared to the tidyverse (and that comes with a productivity penalty).

        Lattice is really low-level. It's like doing vis with matplotlib (requires a lot of time and hair-pulling). Higher level interfaces boost productivity.

  • ekianjo 2 days ago

    the equivalent in any other language would be an ugly, unreadable, inconsistent mess.

firesteelrain 2 days ago

“ You’re also likely not going to have the resources to take twenty-thousand different samples.”

There are methods to calculate how many estimated samples you need. It’s not in the 20k unless your population is extremely high

  • jdhwosnhw 2 days ago

    I’m not sure what you mean by “higher population” but fyi what determines the required number of samples is a function of the full shape of the underlying distribution. For instance the Berry Esseen inequality puts bounds on the convergence rate as a function of the first two central moments of the underlying distribution. But the point is that the convergence rate to Gaussian can be arbitrarily slow!

    https://en.m.wikipedia.org/wiki/Berry%E2%80%93Esseen_theorem

  • kqr 2 days ago

    > It’s not in the 20k unless your population is extremely high

    Common misconception. Population size has almost nothing to do with the necessary sample size. (It does enter into the finite population correction factor, but that's only really relevant if you have a small population, not a large one.)

    ...actually, come to think of it, you meant to write "unless your population variance is extremely high", right?

    • firesteelrain 2 days ago

      Right and the methods I had in mind were the usual ones: proportion-based, mean-based, and power analysis, depending on what’s being measured. Thanks for catching that

godelski 2 days ago

  > Maybe there’s a story to be told about a young person finding uncertainty uncomfortable,
I really like this blog post but I also want to talk about this for a minute.

Us data oriented STEM loving types love being right, right? So I find it weird that this makes many of us dislike statistics. I find this especially considering how many people love to talk about quantum mechanics. But I think one of the issues here is that people have the wrong view of statistics and misunderstand what probability is really about. OP is exactly right, it is about uncertainty.

So if we're concerned with being right, you have to use probability and statistics. In your physics and/or engineering classes you probably had a teacher or TA who was really picky with things like sigfigs[0] or including your errors/uncertainty (like ±). The reason is because these subtle details are actually incredibly powerful. I'm biased because I came over from physics and moved into CS, but I found these concepts translated quite naturally and were still very important over here. Everything we work with is discrete and much of it is approximating continuous functions. Probabilities give us this really powerful tool to be more right!

Think about any measurement you make. Go grab a ruler. Which is more accurate? Writing 10cm or 10cm ± 1cm? It's clearly the latter, right? But this isn't so different than writing something like U(9cm,11cm) or N(10cm,0.6cm). In fact, you'd be even more correct if you wrote down your answer distributionally![1] It gives us much more information!

So honestly I'd love to see a cultural shift in our nerd world. For more appreciation of probabilities and randomness. While motivated by being more right it opens the door to a lot of new and powerful ways of thinking. You have to constantly be guessing your confidence levels and challenging yourself. You no longer can read data as absolute and instead read it as existing with noise. You no longer take measurements with absolute confidence because you will be forced to understand that every measurement is a proxy for what you want to measure. These concepts are paradigm shifting in how one thinks about the world. They will help you be more right, they will help you solve greater challenges, and at the end of the day, when people are on the same page it makes it easier to communicate. Because it no longer is about being right or wrong, it is about being more or less right. You're always wrong to some degree, so it never really hurts when someone points out something you hadn't considered. There's no ego to protect, just updating your priors. Okay, maybe that last one is a little too far lol. But I absolutely love this space and I just want to share that with others. There's just a lot of mind opening stuff to be learned from this (and other) math field, especially as you get into metric theory. Even if you never run the numbers or write the equations, there are still really powerful lessons to learn that can be used in your day to day life. Math, at the end of the day, is about abstraction and representation. As programmers, I think we've all experienced how powerful these tools are.

[0] https://en.wikipedia.org/wiki/Significant_figures

[1] Technically 10cm ± 1cm is going to be Uniform(9cm,11cm) but realistically that variance isn't going to be uniformly distributed and much more likely to be normal-like. You definitely have a bias towards the actual mark, right?! (Usually we understand ± through context. I'm not trying to be super precise here and instead focusing on the big picture. Please dig in more if you're interested and please leave more nuance if you want to expand on this, but let's also make sure big picture is understood before we add complexity :)

jpcompartir 2 days ago

Edit: OP confirms there's no AI-generated code, so do ignore me.

The code style - and in particular the *comments - indicate most of the code was written by AI. My apologies if you are not trying to hide this fact, but it seems like common decency to label that you're heavily using AI?

*Comments like this: "# Anonymous function"

  • robluxus 2 days ago

    Interesting comment. Why is it common decency to call out how much ai was used for generating an artifact?

    Is there a threshold? I assume spell checkers, linters and formatters are fair game. The other extreme is full-on ai slop. Where do we as a society should start to feel the need to police this (better)?

    • Sharlin 2 days ago

      The threshold should be exactly the same as when using another human's original text (or code) in your article. AI cannot have copyright, but for full disclosure one should act as if they did. Anything that's merely something that a human editor (or code reviewer) would do is fair game IMO.

      • robluxus 2 days ago

        Maybe OP just used an ai editor to add their silly comments, so that would be fair game I guess? Or some humans just add silly comments. The article didn't stand out to me as emberrassingly ai-written. Not an em dash in sight :)

        Edit: just found this disclaimer in the article:

        > I’ll show the generating R code, with a liberal sprinking of comments so it’s hopefully not too inscrutable.

        Doesn't come out the gate and say who wrote the comments but ostensibly OP is a new grad / junior, the commenting style is on-brand.

        • gjf 2 days ago

          Op here, no AI generated code, I'm wondering what gives the impression that it is?

          I use Rmarkdown, so the code that's presented is also the same code that 'generates' the data/tables/graphs (source: https://github.com/gregfoletta/articles.foletta.org/blob/pro...).

          • jpcompartir 2 days ago

            If you say there's no AI-generated code then I retract the original comment, nice work.

        • jpcompartir 2 days ago

          That is not a disclaimer for generated code, it's referring to the code that generated the simulations/plots.

          I had read that line before I commented, it was partly what sparked me to comment as it was a clear place for a disclaimer.

      • jpcompartir 2 days ago

        Agree here - in a nutshell it strikes me as intellectually dishonest to intentionally pass off some other entity's work as one's own.

        • coderatlarge 2 days ago

          i personally have no problem with people including AI gen’d code without attribution so long as they stand by it and own the consequences of what they submit. after all, we all know by now how much cajoling and insisting it takes to get any AI gen’d code to do what it’s actually requested and intended to do.

          the only exception being contexts that explicitly prohibit it.