This seems like a really interesting approach and I think the numbers are promising. Given that it's an entirely different type of model-building than traditional methods, I think it's fine for it to just be up to par with a basic shallow model. If the constructive approach turns out to be comparable to current state-of-the-art models with sufficient refinement, it could be really valuable for low-compute applications like IoT devices, etc.

To be honest, I can't say I know enough about the math here to do anything more than vaguely follow their explanations despite my ML/NLP background. I'm curious - other ML researchers out there, how much of this are you able to understand? My impression is that this math is pretty far beyond what ML folks typically know, although I'm on the lower end of the spectrum as far as math knowledge goes, so I may be totally wrong (and need to spend more time reading textbooks haha). I wonder if the complexity may slow down progress if it does turn out that this kind of geometric construction can compete with iterative training. It sounds like this approach could potentially support more complex networks by working more on the geometric representation, so I certainly hope this paper serves its purpose of motivating people with the right skillsets to do further exploration.

In this case the fundamental idea is not especially complicated, but the math may make it seem more complex than it is to somebody unfamiliar with the notation/vocabulary.

Imagine the case where an input dataset has three dimensions: each piece of input data could be represented as a point in 3D space. We want to classify each point with a category. Imagine a cube containing all such points, and imagine the cube was subdivided into many tiny cubes of the same size. If these tiny cubes were sufficiently small, then each cube would only contain points belonging to a single category (sufficiently small cubes would only contain a single point, making this trivially true). We could then assign to each cube a category based on the category of the points in it, then if we wanted to classify a new point, we'd just check what cube it's inside and look up the category of that cube.

The above process doesn't scale well (would be too slow in practice). An alternative is, rather than subdividing the cube into tiny evenly-sized cubes, we use an iterative process, starting with the whole cube, finding an ideal dimension along which to split it, splitting it (subject to the constraint that splitting it wouldn't make the children too small), then repeating this process with the two 3D rectangles that resulted from splitting the original shape. This could produce extremely long, thin 3D rectangles however, which might not generalise well to new data, so we cap the aspect ratio of the child rectangles so they can't get too thin. This approach may leave some areas of the original cube unclassified, so for each such area we find the category of the 3D rectangle that the unclassified areas touches most, and assign its category to the unclassified area, then repeat until there are no unclassified areas remaining. The end result approximates the result of subdividing the cube into tiny cubes, but is significantly faster to produce.

The above approach is roughly what's described in the paper, except they describe an n-dimensional hypercube rather than just a 3-dimensional one, and provide more detail on how to determine the ideal axis along which to split a hyperrectangle into child hyperrectangles.

It seems fair to me, in the sense that they are effectively constructing a kind of decision tree. There may be some subtle-yet-significant differences, but they don't contrast their model with other decision trees in the paper.

Not an ML researcher, but none of non-ML math here is beyond someone who has taken introductory classes to topology and linear algebra. Rather than being too advanced for you this may just be somewhat orthogonal to the aspects of the problem space you're used to thinking about.

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
X, y = fetch_openml('mnist_784', version=1, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y,
stratify=y,
random_state=1729,
test_size=X.shape[0] - (10 * 20))
model = MLPClassifier(random_state=1729)
model.fit(X_train, y_train)
p = model.predict(X_test)
print(accuracy_score(y_test, p))
X_train, X_test, y_train, y_test = train_test_split(X, y,
stratify=y,
random_state=1729,
test_size=X.shape[0] - (10 * 200))
model = MLPClassifier(random_state=1729)
model.fit(X_train, y_train)
p = model.predict(X_test)
print(accuracy_score(y_test, p))

This gets you 0.645 and 0.838 accuracy score respectively (versus 62% and 76% in the paper). Sure, different validation (I validate on all the remaining data, they do 20x 70% 30% splits on the 200 and 2000 samples, which needlessly lowers the number of training samples, fairer comparison is 0.819 with 1400 samples), but the scores seem at least comparable. Cool method though, I can dig this and look beyond benchmarks (Though Iris and Wine are really toy datasets by now.)

Perhaps more importantly than the limit on 200/2000 samples, the main difference from your code is that in the paper they only use the 10 first PCA (i.e. X is 10-dimensional instead of 784-dimensional). The fact that they do tends to contradict their claim that the proposed algorithm scales "weakly" with dimension, but for using just 10 dimensions (chosen linearly from the input ones) 76% is not that bad.

Personally, my doubts are more in line with the other ones that were raised here, namely tdj's comment on not building on existing literature (the feeling of "this just looks too much like a classification tree" is very strong while reading), and scalability issues. I also tried something vaguely similar in the past (but using iterated linear discriminant analysis instead of hypercubes / classification trees, and convolutional networks instead of fully connected), but never even finished the prototype because it was so terribly slow even on MNIST that nobody in their sane mind would have used it (I could only load into RAM a few samples at a time, which made it messy).

In any case, it's a pity they didn't try to use these weights as initialization for the "usual" backpropagation, it might have lead to interesting results (especially if in doing so they extended to the whole of the input and not just the first 10 PCA - or at least on ore than 10).

Scores become more equal when you make the first hidden layer size 10 instead of 100 (both methods use an X of 784 dimensions).

Instead of PCA on all features, they could subsample 10 random features to partition, and bag the results of multiple runs. Basically, Totally Random Trees with an arbitrarily handicapped Random Subspaces method. Scales well, can beat Logistic Regression, but not any of the more developed tree methods.

Another difference with established literature is that this algorithm does not use any kind of knowledge transfer from previously learned classes. In most one-shot methods, including those used by humans, the model is already trained on other classes, and uses this information to adapt to unseen classes. Instead, the authors interpret it solely as deep learning under a small training data setting (which -- my code shows -- does not require jumping through hoops).

If an established approach that has been made into a library matches a novel approach, then it's not something to "look beyond" -- it's validation that the novel approach is probably worth investigating more fully.

Their hypercube covering formalism can be seen as decision tree induction with a specific partitioning rule, and terminating branching only at uniformly labeled leaves. But try are using the tree nodes as kind of an embedding to apply a softmax on. I like the connection between relus and the geometrical representation, makes it easier to think about in spatial terms.

Reading this I got several dejavus to my grad school classes on classical ML stuff. I like the direction but it feels like it could be better if it admitted that it's a variant of decision tree embedding, and built on some of the massive amount of research work in that area. At least in terms of understanding.

I suspect doing a random forest version of this would actually help. Perhaps we will see this as a legit pre-training step.

This seems like a really interesting approach and I think the numbers are promising. Given that it's an entirely different type of model-building than traditional methods, I think it's fine for it to just be up to par with a basic shallow model. If the constructive approach turns out to be comparable to current state-of-the-art models with sufficient refinement, it could be really valuable for low-compute applications like IoT devices, etc.

To be honest, I can't say I know enough about the math here to do anything more than vaguely follow their explanations despite my ML/NLP background. I'm curious - other ML researchers out there, how much of this are you able to understand? My impression is that this math is pretty far beyond what ML folks typically know, although I'm on the lower end of the spectrum as far as math knowledge goes, so I may be totally wrong (and need to spend more time reading textbooks haha). I wonder if the complexity may slow down progress if it does turn out that this kind of geometric construction can compete with iterative training. It sounds like this approach could potentially support more complex networks by working more on the geometric representation, so I certainly hope this paper serves its purpose of motivating people with the right skillsets to do further exploration.

In this case the fundamental idea is not especially complicated, but the math may make it seem more complex than it is to somebody unfamiliar with the notation/vocabulary.

Imagine the case where an input dataset has three dimensions: each piece of input data could be represented as a point in 3D space. We want to classify each point with a category. Imagine a cube containing all such points, and imagine the cube was subdivided into many tiny cubes of the same size. If these tiny cubes were sufficiently small, then each cube would only contain points belonging to a single category (sufficiently small cubes would only contain a single point, making this trivially true). We could then assign to each cube a category based on the category of the points in it, then if we wanted to classify a new point, we'd just check what cube it's inside and look up the category of that cube.

The above process doesn't scale well (would be too slow in practice). An alternative is, rather than subdividing the cube into tiny evenly-sized cubes, we use an iterative process, starting with the whole cube, finding an ideal dimension along which to split it, splitting it (subject to the constraint that splitting it wouldn't make the children too small), then repeating this process with the two 3D rectangles that resulted from splitting the original shape. This could produce extremely long, thin 3D rectangles however, which might not generalise well to new data, so we cap the aspect ratio of the child rectangles so they can't get too thin. This approach may leave some areas of the original cube unclassified, so for each such area we find the category of the 3D rectangle that the unclassified areas touches most, and assign its category to the unclassified area, then repeat until there are no unclassified areas remaining. The end result approximates the result of subdividing the cube into tiny cubes, but is significantly faster to produce.

The above approach is roughly what's described in the paper, except they describe an n-dimensional hypercube rather than just a 3-dimensional one, and provide more detail on how to determine the ideal axis along which to split a hyperrectangle into child hyperrectangles.

This sounds like a K-d tree. Am I understanding it right or have I missed something?

Yep, pretty much, except it's constructed differently to how we'd normally construct a K-d tree.

Cool, thanks for the translation! I'm gonna go read the paper now...

Thanks for the tldr. This sounds rather a lot like a tree based method. How fair is that analogy?

It seems fair to me, in the sense that they are effectively constructing a kind of decision tree. There may be some subtle-yet-significant differences, but they don't contrast their model with other decision trees in the paper.

Not an ML researcher, but none of non-ML math here is beyond someone who has taken introductory classes to topology and linear algebra. Rather than being too advanced for you this may just be somewhat orthogonal to the aspects of the problem space you're used to thinking about.

Perhaps more importantly than the limit on 200/2000 samples, the main difference from your code is that in the paper they only use the 10 first PCA (i.e. X is 10-dimensional instead of 784-dimensional). The fact that they do tends to contradict their claim that the proposed algorithm scales "weakly" with dimension, but for using just 10 dimensions (chosen linearly from the input ones) 76% is not that bad.

Personally, my doubts are more in line with the other ones that were raised here, namely tdj's comment on not building on existing literature (the feeling of "this just looks too much like a classification tree" is very strong while reading), and scalability issues. I also tried something vaguely similar in the past (but using iterated linear discriminant analysis instead of hypercubes / classification trees, and convolutional networks instead of fully connected), but never even finished the prototype because it was so terribly slow even on MNIST that nobody in their sane mind would have used it (I could only load into RAM a few samples at a time, which made it messy).

In any case, it's a pity they didn't try to use these weights as initialization for the "usual" backpropagation, it might have lead to interesting results (especially if in doing so they extended to the whole of the input and not just the first 10 PCA - or at least on ore than 10).

Scores become more equal when you make the first hidden layer size 10 instead of 100 (both methods use an X of 784 dimensions).

Instead of PCA on all features, they could subsample 10 random features to partition, and bag the results of multiple runs. Basically, Totally Random Trees with an arbitrarily handicapped Random Subspaces method. Scales well, can beat Logistic Regression, but not any of the more developed tree methods.

Another difference with established literature is that this algorithm does not use any kind of knowledge transfer from previously learned classes. In most one-shot methods, including those used by humans, the model is already trained on other classes, and uses this information to adapt to unseen classes. Instead, the authors interpret it solely as deep learning under a small training data setting (which -- my code shows -- does not require jumping through hoops).

If an established approach that has been made into a library matches a novel approach, then it's not something to "look beyond" -- it's validation that the novel approach is probably worth investigating more fully.

Their hypercube covering formalism can be seen as decision tree induction with a specific partitioning rule, and terminating branching only at uniformly labeled leaves. But try are using the tree nodes as kind of an embedding to apply a softmax on. I like the connection between relus and the geometrical representation, makes it easier to think about in spatial terms.

Reading this I got several dejavus to my grad school classes on classical ML stuff. I like the direction but it feels like it could be better if it admitted that it's a variant of decision tree embedding, and built on some of the massive amount of research work in that area. At least in terms of understanding.

I suspect doing a random forest version of this would actually help. Perhaps we will see this as a legit pre-training step.

> terminating branching only at uniformly labeled leaves

Also called Perfect Decision Tree.

Wonder how this compares to single shot with an SVM or nearest neighbor. 76% on MNIST is frankly embarrassing

Interesting, we spent the last few days digging into the old Time Warp OS

https://lasr.cs.ucla.edu/reiher/Time_Warp.html

::applaud::

MNIST is joke. I can train linear regression and acheive 90% accuracy.

CIFAR-10 is the new MNIST today.