opinion
|
May 27, 2026
Why Learning from Data Gets Harder in the Tail
Lessons for Autonomous Driving
Prof. Shai Shalev-Shwartz and Prof. Amnon Shashua

Knowing the obstacle exists means we can navigate around it.
The scaling paradox
There's a puzzling gap in how machine learning systems improve. Classical learning theory tells us that as we collect more data, our models should get better at a steady, predictable rate, with error shrinking roughly in inverse proportion to the number of training examples. But empirical evidence paints a different picture. Language models and end-to-end learning systems in the real world often improve much more slowly than theory predicts, requiring orders of magnitude more data for only marginal improvements.
This isn't a failure of implementation—it's a fundamental consequence of how Stochastic Gradient Descent (SGD) interacts with certain problem structures. Understanding why matters deeply for autonomous driving, especially if one relies on end-to-end imitation learning to handle the full complexity of real-world scenarios.
Grounding in empirical reality: the Chinchilla Law
The severity of this problem becomes concrete when we look at empirical scaling laws. Recent research on large language models (the Chinchilla law) has shown how error decreases with the number of training examples \(T\), finding that error scales as \(T^{-\alpha}\) where \(\alpha < 0.1\). An exponent of \(0.1\) means convergence is orders of magnitude slower than classical learning theory predicts.
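To appreciate the implications: suppose we want to reduce error from \(10^{-2}\) to \(10^{-4}\), a hundred-fold improvement. With \(\alpha = 0.1\), this requires:
\[\frac{T_{\text{new}}}{T_{\text{old}}} = \left(\frac{10^{-2}}{10^{-4}}\right)^{1/\alpha} = 100^{10} = 10^{20}\]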
Twenty orders of magnitude more data. For systems requiring very high accuracy—like autonomous driving, where safety demands error rates far below \(10^{-4}\)—this scaling becomes prohibitively expensive. The theoretical construction below explains why this slow convergence arises from the interaction of weak supervision and heavy-tailed data when trained with SGD.
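As a quick check of this arithmetic, the sketch below computes the data multiplier implied by a power law of the form error \(\propto T^{-\alpha}\). The function name `data_multiplier` is ours for illustration, and the calculation assumes the power law holds exactly over the whole range, which real systems need not satisfy.

```python
import math

def data_multiplier(err_from: float, err_to: float, alpha: float) -> float:
    """Factor by which T must grow so that error ~ T**(-alpha)
    drops from err_from to err_to (assumes an exact power law)."""
    if not (0 < err_to <= err_from) or alpha <= 0:
        raise ValueError("need 0 < err_to <= err_from and alpha > 0")
    return (err_from / err_to) ** (1.0 / alpha)

# The example from the text: a hundred-fold error reduction with alpha = 0.1.
slow = data_multiplier(1e-2, 1e-4, alpha=0.1)
# For comparison, the same reduction under a classical T**-1 rate.
classical = data_multiplier(1e-2, 1e-4, alpha=1.0)

print(f"alpha=0.1: about 10^{math.log10(slow):.0f} times more data")
print(f"alpha=1.0: about 10^{math.log10(classical):.0f} times more data")
```

Under these assumptions the same hundred-fold improvement costs a factor of \(10^{2}\) in data at the classical rate and \(10^{20}\) at \(\alpha = 0.1\).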
The two properties that slow everything down
Real-world driving data has two distinctive characteristics:
- Multiple correct answers: At any given moment, a driver facing a particular road scene could legitimately make several different safe choices—merge left or right, slow down or maintain speed, take the main road or the alternate route. Yet in our training data, we observe only one actual choice. This ambiguity is fundamental to driving: the correct action cannot be uniquely determined from observation alone.
- Heavy-tailed distributions: Driving is mostly routine. Highways, clear weather, predictable traffic. The exciting edge cases—recovering from a skid, navigating a flooded intersection, dealing with an unexpected obstacle—happen rarely. The data distribution has a long tail of rare scenarios where each individual case appears only a handful of times.
Together, these properties create a signal-to-noise problem that translates to slow scaling.
A synthetic model that captures reality
To understand this problem rigorously, consider a simple construction: a neural network learning to classify inputs into one of many categories, where:
- The input data follows a Zipfian distribution (like language or many natural phenomena)
- For a constant mass of the examples, the true labels are generated probabilistically—several outputs could be correct given the input
- The network needs to learn from single observations of these inherently ambiguous labels
This construction is simple enough to analyze mathematically, yet it exhibits the same pathological scaling behavior we see in real systems.
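To make the three ingredients concrete, here is a minimal sketch of such a data source: Zipfian input frequencies, a few inputs with more than one valid label, and a single observed label per draw. Everything specific in it (the names `zipf_probs` and `sample_example`, the number of inputs, and the choice that the three most frequent inputs are the ambiguous ones) is a hypothetical choice for illustration, not the construction analyzed in the appendix.

```python
import random
from itertools import accumulate

def zipf_probs(n: int, beta: float) -> list:
    """Probabilities p_i proportional to i**(-beta) for i = 1..n."""
    weights = [i ** -beta for i in range(1, n + 1)]
    z = sum(weights)
    return [w / z for w in weights]

def sample_example(rng, cum_probs, valid_labels):
    """Draw one (input_id, observed_label) pair.
    Only ONE of the valid labels is ever observed per draw."""
    input_id = rng.choices(range(len(cum_probs)), cum_weights=cum_probs)[0]
    observed = rng.choice(valid_labels[input_id])
    return input_id, observed

n, beta, num_labels = 1000, 2.0, 10          # hypothetical sizes
probs = zipf_probs(n, beta)
cum_probs = list(accumulate(probs))

# Hypothetical labeling: the 3 most frequent inputs have two valid labels,
# every other input has exactly one.
valid_labels = [
    [i % num_labels, (i + 1) % num_labels] if i < 3 else [i % num_labels]
    for i in range(n)
]

rng = random.Random(0)
samples = [sample_example(rng, cum_probs, valid_labels) for _ in range(10_000)]
seen = {input_id for input_id, _ in samples}
print(f"distinct inputs seen in 10,000 draws: {len(seen)} of {n}")
```

Even with this small sketch, a large share of the tail inputs is never observed or is observed only a few times, while the ambiguous head inputs dominate the stream.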
Why the tail matters: a signal-to-noise analysis
Here's where it gets interesting. The core issue stems from the stochasticity in labels combined with the rarity of certain inputs. Consider a rare input class—one that appears with probability \(p_i\).
- Signal: When this rare input appears, we observe one of its correct labels and the gradient points in the right direction. Over \(T\) training iterations, we see roughly \(p_i \cdot T\) such examples, providing signal for learning this rare class.
- Noise: The problem comes from the \((1-p_i)\) fraction of iterations where we see other (non-rare) inputs. The model may already be perfectly correct on these common examples, in the sense that it captures the true underlying probability over the labels given each of the inputs. But because there are multiple correct answers and we observe only one sampled label, each such example still produces a nonzero, high-variance gradient. These noisy updates not only fail to help us learn rare inputs; they actively work against it. The noise accumulates because we see these conflicting updates far more often than we see the rare class itself.
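The signal-to-noise ratio (SnR) scales as:
\[\mathrm{SnR} \asymp \frac{p_i\, T}{\sqrt{T/d}} = p_i \sqrt{dT}\]
where \(d\) is the dimension of the problem. To get adequate signal, meaning an SnR of at least a constant, we need \(T = \Omega(1/(d \cdot p_i^2))\) iterations.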
This is already discouraging—we need quadratically more iterations for rarer classes. But the true impact comes from connecting this to the tail distribution.
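The quadratic dependence on rarity is easy to see numerically. The sketch below assumes the SnR takes the form \(p_i \sqrt{dT}\), which is the form consistent with the bound \(T = \Omega(1/(d \cdot p_i^2))\), and ignores all constants; the function names are ours.

```python
import math

def snr(p_i: float, d: int, T: float) -> float:
    """Assumed scaling of the signal-to-noise ratio: p_i * sqrt(d * T)."""
    return p_i * math.sqrt(d * T)

def iterations_needed(p_i: float, d: int, target_snr: float = 1.0) -> float:
    """Smallest T with snr(p_i, d, T) >= target_snr, constants ignored."""
    return target_snr ** 2 / (d * p_i ** 2)

d = 100  # hypothetical problem dimension
for p_i in (1e-2, 1e-3, 1e-4):
    T = iterations_needed(p_i, d)
    print(f"p_i = {p_i:g}: T ~ {T:.0e} iterations (SnR at that T = {snr(p_i, d, T):.2f})")
```

Each ten-fold drop in \(p_i\) raises the required number of iterations a hundred-fold.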
Connecting to tail distributions
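For a Zipfian distribution \(p_i = \frac{i^{-\beta}}{Z_\beta}\) with \(\beta > 1\), the tail probability is:
\[\epsilon := \sum_{j > i} p_j \asymp i^{-(\beta-1)}\]
Inverting: \(i \asymp \epsilon^{-1/(\beta-1)}\), which gives:
\[p_i \asymp i^{-\beta} \asymp \epsilon^{\beta/(\beta-1)}\]
Substituting into the SnR-based bound:
\[T = \Omega\!\left(\frac{1}{d\, p_i^2}\right) = \Omega\!\left(\frac{\epsilon^{-2\beta/(\beta-1)}}{d}\right)\]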
This reveals how the exponent depends on tail heaviness: the convergence rate is \(T^{-(\beta-1)/(2\beta)}\), which decreases as \(\beta\) decreases (heavier tail). The heavier the tail (smaller \(\beta\)), the slower the convergence.
With \(\beta = 2\) (a moderate tail like natural language):
\[T = \Omega\!\left(\frac{\epsilon^{-4}}{d}\right)\]
giving convergence rate \(\approx T^{-1/4}\)—already far slower than the classical \(T^{-1}\) rate.
For \(\beta = 1.25\) (a heavier tail):
\[T = \Omega\!\left(\frac{\epsilon^{-10}}{d}\right)\]
with convergence rate \(\approx T^{-0.1}\).
As the tail gets heavier (\(\beta \to 1^+\)), the exponent \(2\beta/(\beta-1)\) diverges—convergence becomes arbitrarily slow. This is the crux of the problem: real data often exhibits very heavy tails, making slow convergence rates unavoidable.
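The dependence of both exponents on \(\beta\) can be tabulated directly. This sketch only evaluates the two expressions from the text, \(2\beta/(\beta-1)\) for the sample requirement and \((\beta-1)/(2\beta)\) for the convergence rate; the function names are ours.

```python
def sample_exponent(beta: float) -> float:
    """Exponent k in T = Omega(eps**(-k) / d), i.e. 2*beta / (beta - 1)."""
    if beta <= 1:
        raise ValueError("the analysis assumes beta > 1")
    return 2 * beta / (beta - 1)

def rate_exponent(beta: float) -> float:
    """Exponent r in error ~ T**(-r), i.e. (beta - 1) / (2*beta)."""
    return 1.0 / sample_exponent(beta)

for beta in (3.0, 2.0, 1.5, 1.25, 1.1, 1.01):
    print(f"beta = {beta:<5} T ~ eps^-{sample_exponent(beta):.1f}   "
          f"error ~ T^-{rate_exponent(beta):.3f}")
```

The rows for \(\beta = 2\) and \(\beta = 1.25\) reproduce the two cases worked out above, and the last rows show how quickly the rate exponent collapses toward zero as \(\beta\) approaches 1.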
The details of the construction are given in the appendix.
The construction intentionally uses a simple one-hidden-layer network for analytical tractability. Real-world architectures—particularly transformers with many layers and complex interdependencies—are substantially more intricate. The signal-to-noise problem identified here affects gradient flow through each layer, and the coupling between layers means errors and noise compound through the network. In fact, for deep architectures, we conjecture that the compounding of these effects across layers would make scaling even worse than this simplified analysis suggests. That is, the lower bound derived here may be optimistic relative to practical systems.
What this means for self-driving
Autonomous driving shares both properties with our model:
- Stochasticity in decisions: Many driving situations admit multiple valid responses. A human driver and an autonomous vehicle might both be "correct" taking different paths through a complex intersection.
- Heavy-tailed scenarios: Most driving is on familiar routes with predictable conditions. The truly challenging cases—rare weather conditions, unusual traffic patterns, edge cases requiring rapid reaction—represent a long tail of scenarios that appear infrequently in data.
Our analysis suggests that scaling an end-to-end learned driving system by simply collecting more data may hit efficiency walls that classical learning theory doesn't predict. And this problem is acute even for simplified models—real transformer-based architectures used in practice are likely to face even steeper scaling challenges due to the compounding effects of weak supervision across many layers. The tail isn't just a problem because rare events are hard to handle—it's fundamentally limiting how fast we can improve the system.
The bottom line
The mathematics of this construction is simple, but the lesson is profound: slow scaling in real-world systems isn't a bug—it's often a feature of the problem structure itself. Heavy-tailed data combined with inherent ambiguity creates fundamental information-theoretic limits that can't be overcome by simple scaling.
This analysis does not mean end-to-end learning can't work. Rather, it highlights why a naive "collect more data" strategy has diminishing returns. For autonomous driving, this is both a warning and a guide. It tells us that building robust self-driving systems requires not just more data, but smarter approaches to data—understanding the tail, addressing ambiguity, and potentially rethinking what architectures and learning strategies can overcome these fundamental limits.
The good news? Knowing the obstacle exists means we can navigate around it.
See our Meteor agent blog post for how Mobileye navigates around the SnR problem.
---
Appendix
Technical details of the construction
For readers seeking a deeper understanding, here's the precise mathematical setup:
The model
We consider a one-hidden-layer neural network with parameters \(\theta = (U, b_h, V, b_o)\):
- \(U \in \mathbb{R}^{m \times d}, b_h \in \mathbb{R}^m\) (input-to-hidden weights)
- \(V \in \mathbb{R}^{k \times m}, b_o \in \mathbb{R}^k\) (hidden-to-output weights)
The model computes:
\[h(x) = [Ux - b_h]_+, \qquad f_\theta(x) = V\, h(x) + b_o\]
where \([\cdot]_+\) denotes the ReLU activation.
Data distribution
The data distribution consists of one common type of example and many rare examples. Common examples are generated as \(x = x_0 + \nu\), where \(x_0\) is a fixed, arbitrary, unit-norm vector and \(\nu \sim N(0, \tfrac{1}{d} I_d)\).
We build an additional \(N = (m-1)(d-1)\) rare examples indexed by \((i,j) \in [m-1] \times [d-1]\). For every \(i \in [m-1]\) we randomly choose an orthonormal set
\[\{z_{i,0}, z_{i,1}, \ldots, z_{i,d-1}\} \subset \mathbb{R}^d\]
and set the \((i,j)\) rare example to be
Let \(p = (p_0,p_1,p_2,\ldots,p_N)\) be a probability vector, where \(p_0 = 0.9\), and for \(r \ge 1\),
\[p_r = 0.1 \cdot \frac{r^{-\beta}}{\sum_{s=1}^{N} s^{-\beta}}\]
That is, the tail follows a Zipfian distribution with parameter \(\beta\). The generation of examples is according to:
- Pick \(r \sim p\)
- If \(r = 0\), generate a common example \(x = x_0 + \nu\)
- Else, decompose \(r\) into a pair \((i,j)\) and set \(x = x_{i,j}\)
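The index-sampling part of this procedure can be sketched as follows. The probability vector puts mass 0.9 on the common type and spreads the remaining 0.1 over the rare examples with Zipfian weights. The row-major mapping from \(r\) to \((i,j)\) is our assumption, since the text does not fix a particular decomposition, and the vectors \(x_0\), \(\nu\), and \(x_{i,j}\) themselves are not generated here.

```python
import random
from itertools import accumulate

def build_probs(m: int, d: int, beta: float, p_common: float = 0.9) -> list:
    """p_0 = p_common; p_r proportional to r**(-beta) for r = 1..N, N = (m-1)(d-1)."""
    n_rare = (m - 1) * (d - 1)
    weights = [r ** -beta for r in range(1, n_rare + 1)]
    z = sum(weights)
    return [p_common] + [(1 - p_common) * w / z for w in weights]

def decompose(r: int, d: int) -> tuple:
    """Map r in 1..N to (i, j) in [m-1] x [d-1].
    Row-major order is an assumption made for this sketch."""
    i, j = divmod(r - 1, d - 1)
    return i + 1, j + 1

def sample_index(rng, cum_probs) -> int:
    """Pick r ~ p; r == 0 means 'generate a common example'."""
    return rng.choices(range(len(cum_probs)), cum_weights=cum_probs)[0]

m, d, beta = 4, 5, 1.25                 # hypothetical small sizes
probs = build_probs(m, d, beta)
cum_probs = list(accumulate(probs))
rng = random.Random(0)

for _ in range(5):
    r = sample_index(rng, cum_probs)
    kind = "common" if r == 0 else f"rare {decompose(r, d)}"
    print(r, kind)
```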
Labels
The label of the first type of example has uncertainty: there are two correct labels and the probability to pick each of the two is equal. For the rest of the examples, the label is deterministic. For simplicity, the construction can assume only two labels, which we will denote \(\pm 1\). So there is a fixed \(y_{i,j} \in \{\pm 1\}\) for every \(x_{i,j}\). The labels \(y_{i,j}\) may be chosen arbitrarily; for our shatter-style construction, one can choose them independently at random.
Ground truth
We next show that for any choice of the \(y_{i,j}\) there are weights, \(\theta^*\), that realize the underlying distribution.
The first layer
For \(i \ge 1\), we define the \(i\)-th row of the target first-layer matrix \(U^*\) as
where \(A \gg 1\) is a constant, for example, \(A=10\).
Now evaluate \(u_i^*\) on an example from its own group \(x_{i,j}\):
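Using orthonormality of the set \(\{z_{i,0},\ldots,z_{i,d-1}\}\), we get the exact identity
\[\langle u_i^*, x_{i,j} \rangle = A + 0.5\, y_{i,j}.\]
Set the ReLU threshold to
\[b_{h,i} = A - 1.\]
Then
\[\left[\langle u_i^*, x_{i,j} \rangle - (A-1)\right]_+ = 1 + 0.5\, y_{i,j}.\]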
For \(i'\neq i\), \(i \ge 1\), the random frames are chosen incoherently, so with high probability
This will be smaller than \(A-1\) with high probability if \(A \gg 1\).
And for a common example we have
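Hence
\[\left[\langle u_{i'}^*, x \rangle - (A-1)\right]_+ = 0\]
for every neuron \(i' \in [m-1]\) and every example \(x\) outside group \(i'\).
Thus the full hidden representation, written \(h(x)\) for the vector of hidden activations, satisfies
\[\left(h(x_{i,j})\right)_{i'} = \begin{cases} 1 + 0.5\, y_{i,j} & \text{if } i' = i \\ 0 & \text{if } i' \ne i \end{cases} \qquad \text{for } i' \in [m-1].\]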
In words, for \(i \in [m-1]\), the \(i\)-th hidden neuron turns on for those examples in its group and remains off for all other examples. For examples in its group, its value is \(1+0.5 y_{i,j}\), which indicates the label.
Finally, the last row of \(U^*\) is set to \(A x_0\), and its bias is \(A-1\). Using similar arguments to the above, the corresponding neuron is activated only for examples of this type.
Output layer
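For \(i \in [m-1]\), the \(i\)'th column of \(V^*\) is set to \([1,-1]^\top\) and the output bias is set to the vector \([-1, 1]^\top\). Therefore,
\[(1 + 0.5\, y_{i,j}) \begin{bmatrix} 1 \\ -1 \end{bmatrix} + \begin{bmatrix} -1 \\ 1 \end{bmatrix} = \begin{bmatrix} 0.5\, y_{i,j} \\ -0.5\, y_{i,j} \end{bmatrix}\]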
Hence, the logits of the output represent the label as required.
The last column of \(V^*\) is set to be \(-b_o\), so that when we get a common example the logits become \([0,0]^\top\), representing an equal probability to select each of the labels.
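The output-layer arithmetic can be verified with a few lines of code. The sketch below takes the hidden activations as given (a single active neuron with value \(1 + 0.5\,y_{i,j}\) for a rare example) and assumes the last hidden neuron outputs exactly 1 on a common example, which ignores the small perturbation from \(\nu\). The hidden width `m = 4` and the helper names are hypothetical.

```python
def output_logits(hidden, v_columns, b_o):
    """Compute V h + b_o, with V given as a list of its columns."""
    logits = list(b_o)
    for h_i, column in zip(hidden, v_columns):
        for c, v in enumerate(column):
            logits[c] += h_i * v
    return logits

m = 4                                    # hypothetical hidden width
b_o = [-1.0, 1.0]
# First m-1 columns are [1, -1]; the last column is -b_o.
v_columns = [[1.0, -1.0] for _ in range(m - 1)] + [[-b for b in b_o]]

def rare_hidden(i, y):
    """Hidden vector for a rare example in group i with label y in {+1, -1}."""
    h = [0.0] * m
    h[i - 1] = 1.0 + 0.5 * y
    return h

# Assumption: the last neuron fires with value exactly 1 on a common example.
common_hidden = [0.0] * (m - 1) + [1.0]

print(output_logits(rare_hidden(2, +1), v_columns, b_o))   # [0.5, -0.5]
print(output_logits(rare_hidden(2, -1), v_columns, b_o))   # [-0.5, 0.5]
print(output_logits(common_hidden, v_columns, b_o))        # [0.0, 0.0]
```

The sign of the first logit matches the label for rare examples, and the tied logits for the common example correspond to equal probability for the two labels.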
SGD dynamics and signal-to-noise analysis
Consider SGD with minibatch size 1 (similar analysis holds for a larger, finite, mini-batch size).
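Fix a rare example
\[x_{i,j}, \qquad (i,j) \in [m-1] \times [d-1].\]
Its probability is
\[p_{i,j} := p_r,\]
where \(r\) is the index that decomposes into the pair \((i,j)\).
The relevant learnable coordinate is the component of the \(i\)-th hidden row in the direction \(z_{i,j}\). Let
\[a_{i,j}^{(t)} = \langle u_i^{(t)}, z_{i,j} \rangle,\]
where \(u_i^{(t)}\) is the \(i\)-th row of \(U\) after \(t\) SGD steps,
denote this coordinate during training.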
When \(x_{i,j}\) appears, the SGD update gives a coherent signal to \(a_{i,j}^{(t)}\), because the label of \(x_{i,j}\) is deterministic and correlated with the direction \(z_{i,j}\). Over \(T\) SGD steps, this coherent signal has magnitude
\[\text{signal} \asymp p_{i,j}\, T,\]
where \(p_{i,j}\) is the probability of drawing \(x_{i,j}\).
A common example \(x_0 + \nu\), however, appears with constant probability \(0.9\), and its label is sampled uniformly at random. These updates have mean zero but nonzero variance. When projected onto a specific rare direction \(z_{i,j}\), their cumulative effect behaves like a random walk.
Because the directions \(z_{i,j}\) live in \(d\)-dimensional space, the projection of a generic noisy update onto a fixed direction has size of order \(1/\sqrt d\). Therefore, after \(T\) steps, the accumulated noise in the coordinate \(a_{i,j}^{(t)}\) has magnitude
\[\text{noise} \asymp \sqrt{T/d}.\]
Thus the signal-to-noise ratio for learning rare example \((i,j)\) is
\[\mathrm{SnR} \asymp \frac{p_{i,j}\, T}{\sqrt{T/d}} = p_{i,j} \sqrt{dT}.\]
To learn \(x_{i,j}\) reliably, this signal-to-noise ratio must be at least a constant:
\[p_{i,j} \sqrt{dT} \gtrsim 1.\]
Equivalently,
\[T \gtrsim \frac{1}{d\, p_{i,j}^2}.\]
Thus rare examples require quadratically many samples in their inverse probability, up to the factor \(d\).
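The drift-versus-random-walk picture can be illustrated with a toy simulation of a single coordinate. This is a sketch under strong simplifying assumptions, not a simulation of SGD on the network: each rare draw adds a unit push, each common draw adds a zero-mean push of size \(1/\sqrt d\), and all other draws leave the coordinate unchanged. The function name and all parameter values are hypothetical.

```python
import math
import random

def simulate_snr(p_rare, d, T, trials=100, p_common=0.9, seed=0):
    """Empirical mean/std of a toy coordinate after T steps.
    Toy model (assumed): unit drift on rare draws, +-1/sqrt(d) noise on common draws."""
    rng = random.Random(seed)
    noise_step = 1.0 / math.sqrt(d)   # size of a generic update projected on one direction
    finals = []
    for _ in range(trials):
        a = 0.0
        for _ in range(T):
            u = rng.random()
            if u < p_rare:
                a += 1.0              # rare example: coherent push
            elif u < p_rare + p_common:
                # common example: zero-mean push from the randomly sampled label
                a += noise_step if rng.random() < 0.5 else -noise_step
        finals.append(a)
    mean = sum(finals) / trials
    std = math.sqrt(sum((x - mean) ** 2 for x in finals) / trials)
    return mean / std

p_rare, d, T = 0.002, 25, 10_000
predicted = p_rare * math.sqrt(d * T)      # = 1.0 for these values
empirical = simulate_snr(p_rare, d, T)
print(f"predicted SnR ~ {predicted:.2f}, empirical SnR ~ {empirical:.2f}")
```

The empirical ratio should land near the predicted order of magnitude rather than match it exactly, since the toy model also includes the randomness in how often the rare example arrives and the factor 0.9 on the common draws.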
Finally, this lower bound is converted to
\[T = \Omega\!\left(\frac{\epsilon^{-2\beta/(\beta-1)}}{d}\right)\]
by connecting to the properties of the Zipfian distribution.