Ask HN: Ways to automatically make inferences from data?

3 points by akudha 5 years ago

Suppose I have sales data. It is easy to run a few queries to find the top-selling items, top regions, etc.

But is it possible to do this on any set of data, without knowing anything meta about the data at all? In other words, can this be generalized? Are there any models or theories I can learn to achieve this?

JPLeRouzic 5 years ago

I am not a specialist in this field, but here is how I would approach this problem:

My understanding is that it is possible to find the main components (items, regions, etc.) in a dataset, for example with PCA [0]. It will not name those components, but it might be quite easy to infer a name for each one. Once you know which data belong to a component, you can find their min/max. I guess there are several similar mathematical techniques that do the same job, and there are also several user-friendly Business Intelligence tools.

[0] https://en.wikipedia.org/wiki/Principal_component_analysis
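To make the PCA suggestion concrete, here is a minimal sketch in plain Python, under my own assumptions: the data is purely numeric, only the first component is wanted, and power iteration stands in for a full eigendecomposition. The function name and the sample rows are hypothetical. Note that the output is a direction of greatest variance across columns, not a named group, so the naming step still falls to a person.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, explained_variance_ratio) for numeric rows."""
    n, d = len(rows), len(rows[0])
    # Center every column on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalize.
    # Assumption: the start vector is not orthogonal to the top eigenvector.
    v = [float(j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient gives the variance captured along v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    total = sum(cov[i][i] for i in range(d))
    return v, (eigenvalue / total if total else 0.0)

# Hypothetical data: the second column is exactly twice the first.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, ratio = first_principal_component(rows)
print(direction, ratio)
```

In practice a numerical library would be the more usual choice; the hand-rolled version is only meant to show what the technique computes.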

  • akudha 5 years ago

    If I had a CSV, I could import it into a database and simply run queries systematically, couldn't I?
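A rough sketch of what "import the CSV and run queries systematically" could look like, assuming Python's built-in `csv` and `sqlite3` modules, an in-memory database, and a header row that supplies the column names. The helper names (`load_csv`, `top_values`), the table name, and the sample rows are all made up for illustration.

```python
import csv
import io
import sqlite3

def load_csv(conn, table, csv_text):
    """Create `table` from CSV text (header row = column names)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    cols = ", ".join(f'"{c}"' for c in header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
    return header

def top_values(conn, table, columns, limit=3):
    """Run the same 'most frequent value' query against every column."""
    result = {}
    for col in columns:
        query = (f'SELECT "{col}", COUNT(*) AS n FROM "{table}" '
                 f'GROUP BY "{col}" ORDER BY n DESC, "{col}" LIMIT ?')
        result[col] = conn.execute(query, (limit,)).fetchall()
    return result

# Hypothetical sample data.
SAMPLE = """item,region
widget,north
widget,south
gadget,north
widget,north
"""

conn = sqlite3.connect(":memory:")
columns = load_csv(conn, "data", SAMPLE)
summary = top_values(conn, "data", columns)
print(summary)
```

The query runs on any column of any table without knowing what the column means, which is exactly why it cannot say whether the answer is interesting. Column names are interpolated into the SQL here, so this is only suitable for trusted input.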

    • JPLeRouzic 5 years ago

      You can always do that; the question is (as the other comments said) "will your users get meaningful answers?"

      You could also simply use Excel to test your ideas (PCA is available there, typically through an add-in). There are many resources on the Internet.

thedevindevops 5 years ago

Yes - but not in the way you think.

A program can be written to read in a dataset and correlate each column with every other column in that dataset (think of producing a set of graphs comparing every combination of two properties of that dataset).

Somewhere in that mess of graphs there will be useful ones, but the program won't be able to tell the difference.

You still need a human for that.
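The "compare every pair of properties" idea can be sketched as below, assuming numeric columns of equal length and using the Pearson correlation coefficient as the comparison. The column names and values are hypothetical, chosen so that one pair is strongly related and one column (`row_id`) is meaningless.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists, or None if undefined."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # a constant column has no defined correlation
    return sxy / math.sqrt(sxx * syy)

def rank_pairs(columns):
    """Correlate every pair of columns; strongest |r| first."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if r is not None:
            pairs.append((a, b, r))
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)

# Hypothetical dataset.
data = {
    "units_sold": [10, 20, 30, 40],
    "revenue": [105, 195, 310, 390],
    "row_id": [3, 1, 4, 2],
}
ranked = rank_pairs(data)
for a, b, r in ranked:
    print(f"{a} vs {b}: r = {r:.3f}")
```

The ranking surfaces the strong pair, but it would rank a spurious correlation just as highly. Deciding which pairs matter is the part left to the human.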

dhkxh 5 years ago

Your title is about inferences, but your text describes a summary, a descriptive statistic or an aggregation - they are very different problems. It's quite straightforward to "find a top item" regardless of the dataset but I don't know why you would want this automated at all.

natalyarostova 5 years ago

As a general answer to a general question: not really.

Domain knowledge and prior knowledge of how the data relates to reality exist in the brain of the human, who expresses them using code and science for the problem at hand.

Otherwise the computer doesn't know the difference between sales data and any other data.