The interpretation of nu in nu-SVM

A commonly used variant of SVM is the nu-SVM, where nu is commonly interpreted as "an upper bound on the fraction of margin errors and a lower bound of the fraction of support vectors relative to the total number of training examples." The reason people commonly think of nu as an upper-bound on margin-errors and a lower-bound on fraction of support vectors is probably because of Proposition 5 in this paper, "New Support Vector Algorithms", by Scholkopf, Smola, Williamson, and Bartlett which states asymptotically nu equals both the fraction of SVs and the fraction of errors. 

But if think a little harder, there seems to be a catch. If the nu equals both the fraction of SVs and the fraction of errors, then it seems that by reducing nu and setting it closer and closer to zero, we can keep decreasing the number of support-vectors and decrease our classifier's runtime and decrease its complexity, and at the same time keep decreasing its training errors. Obviously this is wrong , because we can't just keep making our function simpler and simpler and expect the errors to keep decreasing! And in fact we can test this trivially, just train nu-SVM with varying nu on any sample dataset and we get a plot like the following where the training error has a U-shaped-curve w.r.t nu, and the number of support-vectors keeps increasing as we increase nu. So how to reconcile proposition 5, which is a mathematical truth, with our observation reality ? The catch lies in the definition of margin-error !!

Margin-error is the fraction of points that are either errors, or points that lie within the margin. i.e. fraction of points for which \(\xi_i\) are strictly greater than zero in the following problem formulation

\(\displaystyle \begin{align}&\arg\min_{w, \xi} \frac{||w||_2^2}{2} - \nu \rho + \frac{1}{l} \sum_i \xi_i \\ \color{green}{\text{ subject to }} &y_i(x_i ^T w + b) \ge \rho - \xi_i, \text{ and } \xi_i \ge 0, \text{ and } \rho \ge 0\end{align}\)

So margin-error happens when \(\xi > 0\) but actual error happens when \(\xi > \rho\)

\(\rho\) can be understood as follows, basically if our data was linearly separable so that \(\xi_i = 0 \forall i\) then the two classes are separated by the margin \(2\rho/||w||\) . Also note that nu is the weight put on rho, so if nu becomes large, then there is bigger incentive to make rho larger because that influences the overall objective, and so the margin will increase, and more points will become support-vectors, because they will lie inside the margin.

There is another theorem (proposition 10) within the paper that states that the margin-error and the true-error have a gap that's bounded by 1/margin. This means that the smaller the nu, the smaller margin can become and the larger the gap between the true error and the margin error. Also if we look closely at the proof for Proposition 5.iii we can see the requirement that the fraction of points that lie exactly on the margin needs to tend to zero for the equality to hold. But that relationship breaks down for really small values of nu.

All of this put together basically means that at really small values of nu we can't say that the true-error will keep going down, because of course that's absurd.

Code to reproduce the above plot

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

xx, yy = np.meshgrid(np.linspace(-3, 3, 500), np.linspace(-3, 3, 500))
np.random.seed(0)
X = np.random.randn(300, 2)
Y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0)

X2 = np.random.randn(300, 2)
Y2 = np.logical_xor(X2[:, 0] > 0, X2[:, 1] > 0)

num = 7
res = np.empty((num, 7))
for idx, nu in enumerate(np.logspace(-(num - 1), 0, num=num)):
    if nu == 1:
        nu = 0.9
    clf = svm.NuSVC(gamma="auto", nu=nu).fit(X, Y)
    trerr = 1 - clf.score(X, Y)
    nsv = clf.n_support_
    frac_sv = nsv.sum() / X.shape[0]
    # Use -clf.intercept_ because rho is negative of intercept according to this comment
    # https://github.com/scikit-learn/scikit-learn/blob/95119c13af77c76e150b753485c662b7c52a41a2/sklearn/svm/_libsvm.pyx#L209
    marg_err = (Y * clf.decision_function(X) < -clf.intercept_).mean()
    marg_eq = (Y * clf.decision_function(X) == -clf.intercept_).mean()
    marg_le = (Y * clf.decision_function(X) <= -clf.intercept_).mean()
    marg_gt = (Y * clf.decision_function(X) > -clf.intercept_).mean()
    res[idx, :] = np.array([nu, trerr, frac_sv, marg_err, marg_eq, marg_le, marg_gt])
    # print(f'{nu:6g}, {trerr:.2f}, {frac_sv:4f}')

from matplotlib import pyplot as plt
label = np.array(['', 'Error', '#SV / Len(traindata)', 'Marg-Err', 'Marg-Eq', 'Mark_Le', 'Marg-Gt'])
pick = [1, 2, 3, 4]
plt.semilogx(res[:,0], res[:, pick],
             label=label[pick])
plt.legend()
plt.xlabel('nu')