
I'm studying section $5.11$ on support vector machines from Duda and Hart's Pattern Classification. The authors write:

With an appropriate nonlinear mapping $\phi$ to a sufficiently high dimension, data from two categories can always be separated by a hyperplane (Problem 27)

Suppose we have $n$ points of dimension $d$, with $n_1$ and $n_2$ points in classes $1$ and $2$ respectively. There are many posts (for example https://stackoverflow.com/questions/3997454/help-me-understand-linear-separability-in-a-binary-svm) that show visualizations of kernels transforming non-linearly-separable data into linearly separable data, but very little describing how these kernel functions are chosen for examples that are not hand-crafted, and not much in the way of proof of linear separability in the transformed space.

I'm looking for a hint as to why such a kernel function (or mapping $\phi$) always exists, assuming no constraints on the dimension of the transformed space.

  • Not sure I understand the question, but do you know the Gaussian kernel? It is universal and can hence approximate any decision boundary. The dimension of its feature space is infinite. Commented Jan 2, 2023 at 21:27
  • Can you show that the embedding of the data given by a Gaussian kernel is linearly separable? My first, rather simple, attempts to do so were not successful. Commented Jan 3, 2023 at 16:16
  • @ClaudioMoneo I'm assuming you mean $$K(x,y) = e^{-\frac{\|x-y\|^2}{2 \sigma^2}},$$ but I'm not sure what you mean by an infinite feature space? Commented Jan 4, 2023 at 7:06

3 Answers

In the exercise it is not required that the non-linear map $\phi$ is determined by a kernel function. If $x_1,\ldots ,x_{n_1}\in\mathbb{R}^d$ are the samples of class $1$ and $y_1,\ldots ,y_{n_2}$ are those of class $2$, one could map the samples $x_i$ to $(x_i,0)\in\mathbb{R}^{d+1}$ and $y_j$ to $(y_j,1)$, and (assuming the two classes are disjoint subsets of $\mathbb{R}^d$) extend the resulting map to the whole of $\mathbb{R}^d$ if one wishes. The hyperplane $x_{d+1}=\frac{1}{2}$ in $\mathbb{R}^{d+1}$ separates the two classes.
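
For concreteness, here is a minimal NumPy sketch of this construction; the dimension, class sizes and data below are made up for illustration. Each sample gets one extra coordinate recording its class, and the hyperplane $x_{d+1}=\frac{1}{2}$ separates the lifted points no matter how the original points are arranged in $\mathbb{R}^d$.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n1, n2 = 2, 5, 7                      # illustrative dimension and class sizes
X1 = rng.normal(size=(n1, d))            # samples of class 1
X2 = rng.normal(size=(n2, d))            # samples of class 2 (need not be separable in R^d)

# Lift to R^(d+1): append last coordinate 0 for class 1 and 1 for class 2.
Z1 = np.hstack([X1, np.zeros((n1, 1))])
Z2 = np.hstack([X2, np.ones((n2, 1))])

# The hyperplane x_{d+1} = 1/2, i.e. w . z = 1/2 with w = (0, ..., 0, 1), separates them.
w = np.zeros(d + 1)
w[-1] = 1.0
assert np.all(Z1 @ w < 0.5) and np.all(2 * np.ones(1) is not None)  # placeholder removed below
```

(The last line should simply read `assert np.all(Z1 @ w < 0.5) and np.all(Z2 @ w > 0.5)`.)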

  • I'm thinking about this a bit. So if I label the samples in class $1$ with last component $-1$ rather than $0$, and then consider the normal vector $w = (0, \ldots, 0, 1)$, then for all elements of class $1$, $w \cdot (x_i,-1) = -1 < 0$, and for all elements of class $2$, $w \cdot (y_j,1) = 1 > 0$. Is this the idea? Commented Jan 4, 2023 at 7:15
  • Yes, that's a way to look at it. Commented Jan 4, 2023 at 8:57

Claudio Moneo is right: using the Gaussian kernel yields linear separability. One considers the map

$$\phi : X \rightarrow V,\qquad x \mapsto K(\cdot,x)$$

where $K$ is the Gaussian kernel of bandwidth $1$, say, $X=\{x_1,\ldots ,x_{n_1},y_1,\ldots ,y_{n_2}\}$ and $V$ is the vector space spanned by the functions $K(\cdot ,x)$, $x\in X$. It is known that these functions form a basis of $V$ (they are linearly independent whenever the points are distinct). So we can define a linear functional $f:V\rightarrow\mathbb{R}$ by $f(\phi(x_i)):=-1$ for all $i$ and $f(\phi(y_j)):=1$ for all $j$. The hyperplane $f^{-1}(0)$ then separates the classes.
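
As a sanity check, here is a small NumPy sketch of this construction; the sample points and the bandwidth are made up for illustration. Because the functions $K(\cdot, z)$ at distinct points are linearly independent, the Gram matrix is invertible, so the required values of $f$ can be obtained by solving a linear system, and $f = \sum_j c_j K(\cdot, z_j)$ then separates the two classes.

```python
import numpy as np

def gaussian_gram(Z, sigma=1.0):
    """Gram matrix G[i, j] = K(z_i, z_j) = exp(-||z_i - z_j||^2 / (2 sigma^2))."""
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))                    # class 1 samples (illustrative)
Y = rng.normal(size=(7, 2))                    # class 2 samples (illustrative)
Z = np.vstack([X, Y])                          # all points distinct (almost surely)
t = np.r_[-np.ones(len(X)), np.ones(len(Y))]   # desired values of f on phi(z_i)

G = gaussian_gram(Z)                           # invertible for distinct points
c = np.linalg.solve(G, t)                      # coefficients of f in the K(., z_j) basis

# f(phi(z_i)) = sum_j c_j K(z_j, z_i) = (G c)_i hits the targets exactly,
# so the hyperplane f = 0 separates the embedded classes.
assert np.allclose(G @ c, t)
```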


I am going to give an answer based on the Gaussian kernel $$K(x,y) = \exp(-\gamma \|x-y\|^2).$$ First, note that any positive semidefinite (PSD) kernel is associated with a Hilbert space $H$ into which it implicitly maps points through its feature map $\phi$. This feature map satisfies $$K(x,y) = \langle \phi(x), \phi(y) \rangle$$ for all $x,y \in \mathbb{R}^d$. These Hilbert spaces are in fact reproducing kernel Hilbert spaces (RKHS), on which there is a vast literature; you could take a look at Wikipedia if you want.

The Gaussian kernel has the nice property of being universal, which means that for any two compact, disjoint sets $A,B \subset \mathbb{R}^d$ there exists some $w \in H$ such that $$\operatorname{sgn}(\langle w , \phi(x) \rangle) = 1_A(x) - 1_B(x) \quad \text{for all } x \in A \cup B.$$ Note that this implies that $A$ and $B$ are linearly separated in the space $H$, which is what you wanted (just choose $A$ and $B$ to be the finite, hence compact, subsets of your dataset corresponding to the two labels).

For the Gaussian kernel, one can show that $H$ is infinite-dimensional. A feature map is given by $$\phi(x) = \exp(-\gamma \|x- \cdot \|^2) = K(x,\cdot).$$ Note that we map a point to a function (a Gaussian centered at that very point, to be specific)!
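
To make the "a point is mapped to a function" picture concrete, here is a toy Python sketch (the value of $\gamma$ and the points are arbitrary): $\phi(x)$ is returned as an actual function of the free argument, and evaluating it at $y$ gives back $K(x,y)$, which by the reproducing property equals $\langle \phi(x), \phi(y)\rangle$.

```python
import numpy as np

gamma = 0.5   # arbitrary bandwidth, for illustration only

def K(x, y):
    """Gaussian kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def phi(x):
    """Feature map: phi(x) = K(x, .), i.e. a Gaussian bump centered at x."""
    return lambda z: K(x, z)

x, y = np.array([1.0, 2.0]), np.array([0.0, -1.0])
f = phi(x)                           # an element of H, represented as a function
assert np.isclose(f(y), K(x, y))     # evaluating phi(x) at y recovers K(x, y)
```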

So why is the Gaussian kernel universal? There are more elegant proofs of this (using, e.g., the Stone–Weierstrass theorem), but if you trust me that for $n$ distinct points the vectors $v_i = \phi(x_i)$ are linearly independent in the feature space $H$ (which makes sense because no Gaussian can be written as a weighted sum of Gaussians centered at other points), then there is a quick way to see the separation property we need:
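
If you do not want to take that on trust, a quick numerical check (a sketch with made-up points, not a proof) is that the Gaussian Gram matrix of distinct points has a strictly positive smallest eigenvalue; a set of vectors is linearly independent exactly when its Gram matrix is nonsingular.

```python
import numpy as np

rng = np.random.default_rng(2)
gamma = 1.0
Z = rng.normal(size=(10, 3))          # 10 distinct points in R^3 (illustrative)

sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
G = np.exp(-gamma * sq)               # G[i, j] = <phi(z_i), phi(z_j)> = K(z_i, z_j)

# The v_i = phi(z_i) are linearly independent iff G is nonsingular,
# i.e. its smallest eigenvalue is strictly positive.
lam_min = np.linalg.eigvalsh(G).min()
print(lam_min)                        # positive (well above round-off) for distinct points
assert lam_min > 1e-8
```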

For $n$ linearly independent vectors $v_1,\dots,v_n$ we may simply define $$f(v_i) = 1_A(x_i) - 1_B(x_i), \qquad i=1,\dots,n,$$ and extend this to a linear functional on the linear span of $v_1,\dots,v_n$ in the feature space (a linear functional on a finite-dimensional subspace is automatically bounded, and it can be extended to all of $H$, e.g. by setting it to zero on the orthogonal complement). By the Riesz representation theorem, there must hence exist $w \in H$ such that $$f(v_i) = \langle w, \phi(x_i) \rangle \quad \text{for all } i.$$
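
Concretely (this last step is only sketched above, but it is the same Gram-matrix argument as in the previous answer): writing $G_{ij} = \langle v_i, v_j\rangle = K(x_i,x_j)$ for the (invertible) Gram matrix and $t_i = 1_A(x_i)-1_B(x_i)$ for the targets, one explicit choice is $$w = \sum_{j=1}^n c_j\, \phi(x_j), \qquad c = G^{-1} t,$$ since then $$\langle w, \phi(x_i)\rangle = \sum_{j=1}^n c_j K(x_j, x_i) = (Gc)_i = t_i,$$ so $\operatorname{sgn}(\langle w, \phi(x)\rangle)$ takes the value $+1$ on the points of $A$ and $-1$ on the points of $B$, as required.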
