The post lists Auto Differentiation as one of the techniques that can overcome non-differentiable loss functions. Can someone explain how this is even possible? After all, automatic differentiation[1] is a way to compute gradients/derivatives that would otherwise be costly (e.g. via symbolic differentiation[2]). The function, or the operations defined in the function (in the case of source-to-source differentiation[3]), need to be differentiable.

[1] - https://www.youtube.com/watch?v=z8GyNneq5D4

[2] - https://stackoverflow.com/a/45134000

[3] - https://github.com/google/tangent

EDIT: 1. Added reference. 2. Formatting

I think it's easier to visualize by considering how PyTorch (or TensorFlow eager) does it. PyTorch supports loops, conditionals and other native Python non-differentiable operators, but for each forward/backward pass pair you run the graph-building function again. Only the functions that are overloaded by the library with execution-graph ("tape") creation steps affect what's in each final tape (and they are all differentiable), while the non-differentiable native parts (like loops) only define the combination of those differentiable pieces. You can even have a function that asks for user input every time it builds the tape (a non-pure function); as long as the resulting graph is differentiable at each run, auto-differentiation still works. Also, unlike symbolic differentiation, you can have a surrogate gradient that approximates the behavior of the function well enough where it's not differentiable (for example, max pooling in convolutional networks).
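The tape idea can be sketched in a few lines of plain Python. This is a toy, micrograd-style illustration of the mechanism, not PyTorch's actual implementation, and all names here are made up:

```python
class Value:
    """Minimal reverse-mode AD value: overloaded ops record how to
    propagate gradients; native Python control flow never does."""

    def __init__(self, data, parents=(), backward=lambda: None):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = backward

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward():
            self.grad += out.grad * other.data
            other.grad += out.grad * self.data
        out._backward = backward
        return out

    def backprop(self):
        # Topologically order the recorded ops, then replay the tape in reverse.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

def forward(x, n):
    # Native Python loop: not differentiable itself, it only decides how many
    # differentiable multiplies land on the tape (this computes x**n).
    out = Value(1.0)
    for _ in range(n):
        out = out * x
    return out

x = Value(3.0)
y = forward(x, 4)   # y.data = 3**4 = 81
y.backprop()        # x.grad = d(x**4)/dx at 3 = 4 * 27 = 108
```

The loop itself never appears on the tape; changing `n` simply records a different, always-differentiable chain of multiplies, which is exactly why non-differentiable host-language control flow is harmless here.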

The original TensorFlow works the same way, but instead of re-running the graph-building code every time, it embeds the non-differentiable control mechanisms in the graph itself, which can more efficiently (without needing the host language to build a new graph every time) create the correct differentiable tape for each run based on its input. And source-to-source differentiation works exactly the same way, except that instead of having to use a DSL (like the TensorFlow graph API) and compile it, it uses the host language and compiler directly (so you don't effectively need two languages). That is the case for Julia's Zygote and Swift for TensorFlow.

The only alternative to this piecewise differentiation that I know of would be creating a soft version of the discrete operators, such as replacing step functions with sigmoids and case/switch/elsif operators with softmax selectors, which is not what any of those libraries do (it would not be easy to make it converge, as the graph would be much more complex at each backward pass). In that case, though, you could have one single graph that includes every branch.
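The soft-relaxation idea can be sketched like this. It is a toy illustration, not anything these libraries actually implement, and `soft_select` is a made-up name:

```python
import math

def sigmoid(z, temperature=1.0):
    # Soft step: approaches the Heaviside step function as temperature -> 0.
    return 1.0 / (1.0 + math.exp(-z / temperature))

def soft_select(scores, branch_values, temperature=1.0):
    # Soft case/switch: a softmax over branch scores blends the branch
    # outputs instead of picking exactly one, keeping the whole thing smooth.
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return sum(w * v for w, v in zip(weights, branch_values))

# Hard branch: "if score0 > score1 then 10 else 20".
# The soft version blends 10 and 20; a low temperature sharpens it
# toward the winning branch while staying differentiable everywhere.
approx = soft_select([2.0, -2.0], [10.0, 20.0], temperature=0.1)
```

Note the convergence cost hinted at above: every branch is evaluated on every pass, so the soft graph contains all paths, not just the one taken.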

In my opinion, the point about automatic differentiation isn't quite correct. Automatic differentiation will help you compute gradients for complicated functions involving multiple operations, as long as each of these operations is itself differentiable (recursively; at some point, there are basic 'building blocks' with hard-coded gradients). But not all operations are inherently differentiable (in fact, most probably aren't). Now, there are other ideas such as neural Turing machines and differentiable programming, but these are not really what you'd refer to when talking about autodifferentiation.

Non-differentiable here doesn't mean actually non-differentiable in the mathematical sense, it just means that the function does not expose a derivative that is accessible to you.

I'm not sure. According to Wikipedia (https://en.wikipedia.org/wiki/Differentiable_function),

> a differentiable function of one real variable is a function whose derivative exists at each point in its domain

Most programs aren't differentiable functions because conditionals aren't in general differentiable. For example, "if x > 0 { 1 } else { -1 }" doesn't have a derivative at 0, and so, by the definition above, isn't differentiable.

...or have I missed something?
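The non-differentiability of that conditional is easy to see numerically: the symmetric difference quotient at 0 has no finite limit. A quick sketch:

```python
def f(x):
    # The conditional from above: "if x > 0 { 1 } else { -1 }"
    return 1.0 if x > 0 else -1.0

# Symmetric difference quotients at 0: (f(h) - f(-h)) / (2h) = 1/h,
# which grows without bound as h shrinks, so no derivative exists at 0.
quotients = [(f(h) - f(-h)) / (2 * h) for h in (1e-1, 1e-3, 1e-6)]
```

Each halving of `h` doubles the quotient, which is exactly the divergence that rules out a derivative at the jump.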

In deep learning, you generally don't require differentiability on the entire domain, only on most points you're likely to encounter. So a finite number of non-differentiable points is fine: You just trust that you're never going to hit them by chance (the probability that you do is 0), and if by some miracle you do, you just use a subgradient.

Case in point, the currently most used activation function in neural nets, the rectified linear unit

ReLU(x) = max(x, 0),

is clearly not differentiable everywhere either.
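A sketch of the usual handling at the kink. The choice of 0 at x == 0 is just a common convention (any value in [0, 1] is a valid subgradient there), not a mathematical necessity:

```python
def relu(x):
    # ReLU(x) = max(x, 0): piecewise linear, with a kink at x = 0.
    return max(x, 0.0)

def relu_grad(x):
    # Away from 0 the derivative is exact. At the kink, any value in
    # [0, 1] is a valid subgradient; picking 0 is a typical convention.
    if x > 0:
        return 1.0
    return 0.0
```

In practice the kink is hit with probability 0 under floating-point inputs, so the convention almost never even matters for training.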

> by some miracle you do, you just use a subgradient

This is the most succinct comment I have encountered on how people think about non-differentiability in deep learning.

This helped me reconcile my experiences with the deep learning paradigm. Thank you.

You see, in the numerical optimization of general mathematical models (e.g. where the model is a general nonlinear -- often nonconvex -- system of equations and constraints), you often do hit non-differentiable points by chance. This is why in mathematical modeling one is taught various techniques to promote model convergence. For instance, a formulation like x/y = k is reformulated as x = k * y to avoid division by zero in y during iteration (even if the final value of y is nonzero), and nonsmooth functions like max(), min() and abs() are replaced with "smooth" approximations. In a general nonlinear/nonconvex model, when you encounter non-differentiability, you are liable to lose your descent direction and often end up losing your way (sometimes ending up with an infeasible solution).

However, it seems to me that the deep learning problem is an unconstrained optimization problem with chained basis functions (ReLU), so the chances of this happening are slimmer, and subgradients provide a recovery method so the algorithm can gracefully continue.

This is often not the experience for general nonlinear models, but I guess deep learning problems have a special form that lets you get away with it. This is very interesting.
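The reformulation tricks described above can be sketched as follows. These are toy illustrations with illustrative smoothing constants, not a recipe from any particular modeling system:

```python
import math

# Residual for the constraint x / y = k, which blows up near y = 0 ...
def residual_division(x, y, k):
    return x / y - k

# ... versus the reformulated x = k * y, which stays well-behaved even
# when an intermediate iterate passes near y = 0.
def residual_reformulated(x, y, k):
    return x - k * y

# Common smooth stand-ins for the nonsmooth abs() and max().
# eps is a small smoothing parameter; eps -> 0 recovers the exact function.
def smooth_abs(x, eps=1e-6):
    return math.sqrt(x * x + eps)

def smooth_max(a, b, eps=1e-6):
    return 0.5 * (a + b + math.sqrt((a - b) ** 2 + eps))
```

Both smooth versions have derivatives everywhere, which is what preserves a usable descent direction when an iterate lands on what would otherwise be a kink.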

I don't know why you think the subgradient is that important. It's just a shorthand for anything reasonable. DNNs are overwhelmingly underdetermined and have many, many minimizers. It's not so important to find the best one (an impossible task for SGD) as to find one that is good enough.

> I don't know why you think subgradient is that important.

I underquoted. It's more the approach to handling of nondifferentiability in deep learning problems that is of interest to me, whether it involves subgradients or some other recovery approach.

These approaches typically do not work well in general nonlinear systems, but they seem to be ok in deep learning problems. I haven't read any attempts to explain this until I read parent comment.

> It's just a shorthand for anything reasonable. DNNs are overwhelmingly underdetermined and have many many minimizers.

This is not true for general nonlinear systems, hence my interest.

Agreed. I was just reacting to the parent's comment that

> Non-differentiable here doesn't mean actually non-differentiable in the mathematical sense, it just means that the function does not expose a derivative that is accessible to you.

I read that as meaning that the loss functions being considered were differentiable in the mathematical sense, it was just hard to calculate the derivative.

My point is that the parent's comment is mostly right: a frequent challenge in ML is black box computational units which don't expose the necessary information to run autograd. Even if the underlying mathematical function is not differentiable everywhere, but only on most points, having autograd available is valuable for use in training.

Hence you get works like this [1] which reimplement existing systems in a way that is amenable to autograd. [1] https://arxiv.org/abs/1910.00935

In practice the functions just need to be piecewise differentiable. The ReLU is the canonical example for deep learning. At kinks a subderivative is used.

It’s a little trickier than that, but you are generally right. The relaxation can involve many different things though, not just piecewise differentiability.

For example, when processing images that themselves may be intensity functions over a spatial domain, and the intensity function can have cusps, edges, occlusions, etc., then you need different weak-sense differentiability conditions, such as from Sobolev spaces, to guarantee that numerical gradient operators will succeed / converge where needed.

https://en.m.wikipedia.org/wiki/Sobolev_space

This is beyond my reach but I’m giving it a stab:

Whatever your computer actually computes is more or less a bunch of pluses, minuses, multiplications, divisions. The function you want to compute may not be differentiable, but the sequence of arithmetic operations you actually compute to arrive at your output should basically be differentiable by the chain rule. Then an AD tool should let you differentiate that.
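That chain-rule view can be made concrete with forward-mode AD via dual numbers: each primitive arithmetic operation carries a (value, derivative) pair, and the chain/product/quotient rules are applied one primitive at a time. A minimal sketch:

```python
class Dual:
    """Forward-mode AD: propagate (value, derivative) through each
    primitive arithmetic op."""

    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def _coerce(self, other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = self._coerce(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    def __sub__(self, other):
        other = self._coerce(other)
        return Dual(self.val - other.val, self.dot - other.dot)

    def __mul__(self, other):
        other = self._coerce(other)  # product rule
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

    def __truediv__(self, other):
        other = self._coerce(other)  # quotient rule
        return Dual(self.val / other.val,
                    (self.dot * other.val - self.val * other.dot) / other.val ** 2)

# f(x) = (x*x - 3) / x, differentiated by running the arithmetic on duals.
x = Dual(2.0, 1.0)       # seed dx/dx = 1
y = (x * x - 3.0) / x    # y.val = f(2) = 0.5, y.dot = f'(2) = 1 + 3/4 = 1.75
```

The program that computes f is just adds, multiplies and a divide, so its derivative falls out mechanically, regardless of whether you'd want to call the surrounding program "differentiable".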

It seems likely to me that there's nothing extraordinarily good about this technique, absent further discussion of the specific problem application, given that the field of derivative-free optimization exists and is actively researched. I don't really know why those researchers would bother if AD (not a new thing) had supplanted their field.

In numerical computing in general, I've quite often seen the idea that you can either differentiate the thing you're trying to optimize and then use a numerical method to evaluate the derivative, or use a numerical method to evaluate the thing you want to optimize and then differentiate the numerical method. I'm not expert enough to provide any real analysis of when to use each.

In theory if you had a source-to-source automatic differentiator (a programme which takes the source code for another programme f and outputs the source code for its derivative f'), then you could run backpropagation through that programme and compute the error at all computation steps.

This raises the question: what is the derivative of a programme? At its simplest (and I guess this is what the OP was trying to say), any computer programme can be reduced to arithmetic plus logical control-flow operations, and thus the derivative of a programme involves calculating the derivatives of each computation along all computation paths.

I'm no expert but this talk by Simon Peyton-Jones should make the concept much clearer:

https://2019.ecoop.org/details/ecoop-2019-papers/11/Automati...

This is a bit of an irritation of mine. To be clear, if a function is nondifferentiable, automatic differentiation will not magically fix this. Ever.

There are plenty of nondifferentiable elementary functions. For example, `sqrt` is nondifferentiable at 0. Depending on whether you want to count it as elementary, `abs` is another. Now, what lets people get away with this is that really pathological functions such as the Dirichlet function aren't readily representable on a computer. Most of the time, points of discontinuity or nondifferentiability occur at a small finite number of locations. As such, we pretend we can just ignore these points.

The unfortunate part is that these points of nondifferentiability tend to come up quite often, because a lot of the interesting action in a function occurs near them. Take `abs`: if we were doing something like `min abs(x)`, we'd be cruising along moving toward the origin until we hit the kink, which turns out to be the interesting part of this function. At this point, most algorithms will have trouble.

Now, we could also say, "Well, a subderivative/subgradient exists and the routine can just return that and it will be fine." That's all well and good if you're using an optimization algorithm that correctly deals with subderivatives and is notified of them. Most optimization algorithms do not handle subderivatives. The ones that do generally need to know that there are many subderivatives at a particular location and get to choose the one they want. A typical AD routine will not return this information.

If you want an example from the deep learning field, look at the old sigmoid activation functions. If we set aside the pretense that this is a simulation of the brain, sigmoids are a differentiable representation of the step (Heaviside) function. Recall how these algorithms had trouble with vanishing gradients? Well, the derivative of the step function is zero everywhere except at the origin where it doesn't exist. How are we supposed to optimize with that? We can't. That's why we used sigmoid functions, at the time.

Ok, does that mean we can't use nondifferentiable functions when optimizing? No. However, I would encourage everyone to clearly document the points of nondifferentiability, because it makes debugging these issues vastly easier. Beyond that, the way we fix them is decided case by case. For example, if we wanted to solve the problem `min abs(f(x))`, we could instead solve the problem `min y s.t. y >= f(x), y >= -f(x)`. Assuming that `f` is differentiable, the new optimization problem is as well, but we now have to handle inequality constraints. For a soft variety, I like `sqrt(1+f(x)^2)-1`. In the deep learning field, step functions are replaced by sigmoids and linear rectifiers are replaced by soft rectifiers. And, yes, there are plenty of examples where soft rectifiers weren't used and things worked great. That's great when it works. I'll contend that such techniques don't have to break every time for it to be a problem.

Anyway, the bottom line is that nondifferentiability is a problem because virtually all gradient-based optimization solvers require differentiability. Subgradients are only good enough if the underlying algorithm explicitly works with this information, and most solvers don't. None of this means that everything has to break every time. Certainly, if we find an optimal solution away from points of nondifferentiability, we're good. However, it does create some really difficult-to-debug situations. Finally, in case it wasn't clear: yes, the article is talking about deep learning, but the kind of deep learning techniques under discussion use optimization solvers to fit a particular model to the data, so the general discussion holds.
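The `sqrt(1+f(x)^2)-1` surrogate mentioned above can be sketched directly. Unlike `abs`, it has a derivative everywhere, so plain gradient descent walks through the former kink without trouble (the step size and iteration count here are arbitrary choices for illustration):

```python
import math

def soft_abs(t):
    # Smooth surrogate for abs(t): shares its minimizer (t = 0) with
    # abs(t) but has a well-defined derivative at the former kink.
    return math.sqrt(1.0 + t * t) - 1.0

def soft_abs_grad(t):
    # Exists everywhere, including t = 0 (where it is exactly 0).
    return t / math.sqrt(1.0 + t * t)

# Tiny gradient descent on soft_abs: no trouble at the former kink,
# whereas `min abs(t)` would leave most solvers stuck oscillating there.
t = 3.0
for _ in range(200):
    t -= 0.5 * soft_abs_grad(t)
# t has converged very close to the minimizer 0.
```

For large |t| the surrogate behaves like |t| - 1, so it barely distorts the objective away from the origin; all the smoothing happens near the kink, which is exactly where the trouble was.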

Reading about how Julia does automatic differentiation helped me to understand this a lot:

http://www.juliadiff.org/

The course notes for "Learning Discrete Latent Structure" should be helpful if you're interested in this topic: https://duvenaud.github.io/learn-discrete/

Related: "In general a machine learning system is built and trained to optimize a specified target objective: classification accuracy in a spam filter or tumor diagnostic, efficiency in route planning or Amazon box packing. Unlike these precise performance metrics, the criteria of safety, trust, and nondiscrimination often cannot be completely quantified. How then can models be trained towards these auxiliary objectives?"

From "Interpreting AI Is More Than Black And White": https://www.forbes.com/sites/alexanderlavin/2019/06/17/beyon...

I thought this was going to be about differential inclusions and deep learning which sounds pretty rad! Is anyone out there working in this direction ?