
Suppose we have a generic (i.e. $\mathbf{x}_i = \mathbf{x}_j \implies y_i = y_j$) dataset $(\mathbf{x}_i, y_i)_{i=1}^{n}$ where $\mathbf{x}_i\in \{0,1\}^n$ and $y_i\in\{0,1\}$. Define the kernel $K$ to be the following coordinate-wise product: $$K(\mathbf{x}, \mathbf{t})=\prod_{j=1}^{n}(1 + x_jt_j + (1- x_j)(1 - t_j))$$

How can we show that for a generic dataset, there exists some $\alpha\in\mathbb{R}^n$ such that $$\sum_{j=1}^{n}\left(\sum_{i=1}^{n}\alpha_i K(\mathbf{x}_i, \mathbf{x}_j) - y_j\right)^2 = 0$$

Follow-up question: consider a dataset in $\mathbb{R}^n\times \mathbb{R}$ and ask the same question.

My thoughts:

  1. This seems to be related to the representer theorem - I think the question is essentially asking whether the kernel $K$ is expressive enough to learn any binary classification function perfectly (i.e. achieve zero mean-squared error by fitting every datapoint exactly). For the follow-up question on $\mathbb{R}^n$, my intuition says the answer is no, the kernel doesn't look expressive enough to learn an arbitrary function, but I'm not sure how to construct a good counterexample.
  2. In 1D, this is true because the Gram matrix is $\begin{bmatrix}2&1\\1&2\end{bmatrix}$, which is positive definite, so $\alpha = K^{-1}y$ works (a quick numerical check of this is sketched after this list). However, I'm not sure how I'd prove the Gram matrix for $K$ is strictly positive definite for $n\geq 2$. Perhaps the matrix can be shown to be diagonally dominant, but Gershgorin's theorem isn't assumed knowledge. I have a feeling there has to be a simpler way.
  3. The feature map per element seems to be $\phi(x) = (1,x, x-1)$, but I'm not sure how this helps us.
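For concreteness, here is a quick NumPy check of point 2 (a small sketch of my own; the function and variable names are not part of the problem):

```python
import numpy as np

def kernel(x, t):
    # K(x, t) = prod_j (1 + x_j t_j + (1 - x_j)(1 - t_j))
    return np.prod(1 + x * t + (1 - x) * (1 - t))

# 1D binary dataset: the two points x = 0 and x = 1 with arbitrary labels
X = np.array([[0.0], [1.0]])
y = np.array([1.0, 0.0])

K = np.array([[kernel(a, b) for b in X] for a in X])
print(K)                       # [[2. 1.] [1. 2.]], positive definite
alpha = np.linalg.solve(K, y)  # alpha = K^{-1} y interpolates exactly
print(K @ alpha - y)           # ~ [0. 0.]
```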
  • I probably misunderstand your question but you seem to be asking how a sum of nonnegative terms can be equal to zero. Isn't the answer that each of the terms must be zero? This gives rise to a system of equations $\sum_i\alpha_iK_{ij}=y_j$. Commented May 16 at 12:37
  • Yes, we seek a solution to $K\alpha =y$. Since $K$ is a Gram matrix of a kernel, it is positive semi-definite. However, I don't know that it is positive definite, so I can't be sure $K$ is invertible. Any thoughts? Commented May 16 at 15:08
  • Then you can only solve the minimization problem that leads to the use of the pseudoinverse. Commented May 16 at 16:09

1 Answer


I wrote my earlier answer on mobile; it didn't look nice and even included a wrong statement. My apologies if this caused any confusion.

The binary case

Before we start, fix some notation. We use the superscript to enumerate elements $x^i$ in a dataset, and use the subscript $x_i$ to refer to coordinates of points.

Firstly, note that it is sufficient to prove the existence of an $\alpha$ that ensures $\| K \alpha - y \|_2^2 = 0$ for datasets of distinct points $X = \{x^i\}_{i=1}^m$. Indeed, genericity means $x^i = x^j \implies y^i = y^j$, so duplicated points contribute identical rows and columns of $K$ and identical entries of $y$; they can be dropped (with their coefficients set to zero) without changing the problem.

If the kernel matrix $K$ is positive definite, such an $\alpha$ certainly exists: simply choose $\alpha = K^{-1}y$.

Thus, it is sufficient to prove that the kernel matrix $K$ is positive definite for all datasets of distinct points $\{x^i\}_{i=1}^m$. To prove this, we use a result on the feature map $\phi$ of a kernel.

Lemma: If the feature map $\phi$ maps any set of distinct points $\{x^i\}_{i=1}^m$ to a set of linearly independent points $\{\phi(x^i)\}_{i=1}^m$, then the kernel matrix is positive definite for any such set.

Proof: Suppose instead that the kernel matrix is not positive definite for some choice of distinct points $X = \{x^i\}_{i=1}^m$. Writing $\Phi$ for the feature matrix of $X$ (its $i$-th row is $\phi(x^i)$), the matrix $K = \Phi \Phi^\top$ is always positive semi-definite, so there must exist a nonzero $v \in \mathbb{R}^m$ with $v^\top K v = v^\top \Phi \Phi^\top v = 0$. But this implies that $\| \Phi^\top v \|^2 = \| \sum_{i=1}^m v_i \phi(x^i) \|^2 = 0$, so the set $\{ \phi(x^i) \}_{i=1}^m$ cannot be linearly independent, a contradiction.

But what is the feature map of the kernel $k$? We use the product structure of $k$ to obtain the feature map $$ \phi(x) = \bigotimes_{i=1}^n \phi_i(x)$$ where each $\phi_i(x) := (1,x_i,1-x_i)$ is a valid feature map for the kernel restricted to the $i$-th coordinate.
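For concreteness, here is a small numerical sketch of this construction (my own code, not part of the original argument): build $\phi(x)$ as a Kronecker product of the per-coordinate maps and check that $\langle \phi(x), \phi(t)\rangle = k(x,t)$.

```python
import numpy as np
from functools import reduce

def kernel(x, t):
    # k(x, t) = prod_i (1 + x_i t_i + (1 - x_i)(1 - t_i))
    return np.prod(1 + x * t + (1 - x) * (1 - t))

def feature_map(x):
    # phi(x) = phi_1(x) (x) ... (x) phi_n(x), with phi_i(x) = (1, x_i, 1 - x_i)
    factors = [np.array([1.0, xi, 1.0 - xi]) for xi in x]
    return reduce(np.kron, factors)

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=4).astype(float)
t = rng.integers(0, 2, size=4).astype(float)
assert np.isclose(feature_map(x) @ feature_map(t), kernel(x, t))
```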

Note that each $\phi_i$ does map the two distinct inputs to linearly independent vectors: $\phi_i(0) = (1,0,1)$ and $\phi_i(1) = (1,1,0)$ are not scalar multiples of each other. This can be used to say something about the tensor product $\phi$ as well.

Lemma: For any set of distinct points $\{x^i\}_{i=1}^m \subset \{0,1\}^n$ the set $\{ \phi(x^i) \}_{i=1}^m$ is linearly independent.

Proof: Recall that if for each $j \in [n]$ the set $B_j$ is a basis of a vector space $V_j$, then the set of tensor products of basis vectors, $B' = \{ \otimes_{j=1}^n b_j : b_j \in B_j \}$, is a basis of the tensor product space $V' = \otimes_{j=1}^n V_j$. In our setting, take $V_j = \operatorname{span}\{ \phi_j(0), \phi_j(1) \}$ with basis $B_j = \{ \phi_j(0), \phi_j(1) \}$ (a basis by the linear independence observed above). This implies that $B' = \{ \phi(x) : x \in \{0,1\}^n \}$ is a set of linearly independent vectors. In particular, for any set of distinct points $\{x^i\}_{i=1}^m$ the set $\{ \phi(x^i) \}_{i=1}^m$ is linearly independent, since it is a subset of $B'$.

This shows that the kernel matrix is indeed positive definite for any set of distinct points.
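As a numerical sanity check of this conclusion (again my own sketch, reusing the kernel above): for small $n$, the Gram matrix over all $2^n$ binary points has a strictly positive smallest eigenvalue, so every subset of distinct points also yields a positive definite kernel matrix and $\alpha = K^{-1}y$ exists.

```python
import itertools
import numpy as np

def kernel(x, t):
    return np.prod(1 + x * t + (1 - x) * (1 - t))

n = 3
X = np.array(list(itertools.product([0.0, 1.0], repeat=n)))  # all 2^n binary points
K = np.array([[kernel(a, b) for b in X] for a in X])
print(np.linalg.eigvalsh(K).min())  # strictly positive, so K is positive definite
```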

The general case

To see that this is not true for $\mathcal{X} = \mathbb{R}^n$, even when $n=1$, it is sufficient to find a set of distinct points whose kernel matrix is singular. Then simply choose some nonzero $y$ from the orthogonal complement of its image; this makes it impossible to solve $K \alpha = y$.

Since $\phi(x) = (1,x,1-x)$, the feature space of the kernel is (at most) $3$-dimensional. Consequently, for any $4$ distinct points in $\mathbb{R}$ the feature matrix $\Phi \in \mathbb{R}^{4 \times 3}$ has rank at most $3 < 4$. The same bound holds for the kernel matrix $K = \Phi \Phi^\top$, so $K$ is singular and a suitable $y$ certainly exists.
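A minimal numerical illustration of this counterexample (my own construction): for four distinct real points, $K = \Phi\Phi^\top$ is singular, and a nonzero $y$ from the orthogonal complement of its column space cannot be matched by any $\alpha$.

```python
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0])                     # four distinct points in R
Phi = np.stack([np.ones_like(X), X, 1.0 - X], axis=1)  # rows are phi(x) = (1, x, 1 - x)
K = Phi @ Phi.T                                        # kernel matrix, rank <= 3 < 4
print(np.linalg.matrix_rank(K))                        # singular

# choose y orthogonal to the column space of K (a null-space direction)
U, S, _ = np.linalg.svd(K)
y = U[:, -1]
alpha, *_ = np.linalg.lstsq(K, y, rcond=None)
print(np.linalg.norm(K @ alpha - y))                   # stays near 1, never reaches 0
```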

  • Clear and intuitive, thanks! Commented May 20 at 11:43
