Geometry Without Coordinates
How kernels and similarity functions define structure when coordinates aren’t enough
The previous three posts built a geometric vocabulary around coordinates — vectors, norms, projections and linear maps. That vocabulary is powerful. But it rests on an assumption that is not always natural: that the objects of interest live in a space where the coordinates themselves are the right place to define geometric structure.
That assumption breaks down often.
Consider a collection of documents. Each can be represented as a word-count vector, but those raw coordinates are sparse, high-dimensional and poor summaries of what the documents are about. What matters for retrieval is whether two documents are semantically related — a relation that raw coordinates may obscure entirely.
Consider a recommender system: users and items may have no natural coordinates at all, yet the system must still decide which items are near a user’s preferences.
Consider a molecular dataset: the relevant notion of similarity may depend on shared substructures or biological activity rather than coordinate-wise proximity.
Consider a social network: nodes are defined primarily by their connections, not by positions in any ambient space.
In each case, the structure that matters is relational — it lives in the connections between objects, not in their individual positions. This post introduces the geometric tools for that setting: kernels, affinities and graphs.
Distance and Similarity Are Not the Same Thing
Before introducing kernels, it is worth being precise about a distinction that is often blurred.
A distance function d(x, z) is small when objects are close and grows as they move apart. It is often unbounded. A similarity function s(x, z) works in the opposite direction: large when objects are closely related, small or zero when they are not. The two ideas are related but not interchangeable.
Distance asks how much effort it takes to get from one object to another. Similarity asks how much two objects have in common.
Two objects may be near in Euclidean distance but dissimilar under a task-relevant similarity function — they happen to sit close in raw coordinates but share no meaningful structure. Conversely, two objects may be far apart in coordinate space but highly similar after normalization or under a learned metric. Choosing between them is not a technical detail. It is a modeling commitment about what kind of sameness matters.
Several similarity functions appear throughout machine learning. Cosine similarity measures directional alignment, ignoring magnitude — the natural choice when direction matters more than scale. Gaussian affinity converts Euclidean distance into a smooth similarity that decays with separation, controlled by a bandwidth parameter. Forest proximity measures co-occurrence in the same tree leaves — similarity induced by partition structure rather than metric distance.
Each choice shapes the geometry of comparison, just as norm choice did in the previous post.
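The contrast between distance and similarity is easy to see numerically. A minimal sketch in plain PyTorch (no library calls assumed): two vectors can be close in Euclidean distance yet orthogonal in direction, while two others are far apart yet perfectly aligned.

```python
import torch

# Two short vectors: close in Euclidean distance,
# but pointing in orthogonal directions.
a = torch.tensor([0.1, 0.0])
b = torch.tensor([0.0, 0.1])

# Two long vectors: far apart in Euclidean distance,
# but pointing the same way.
c = torch.tensor([1.0, 1.0])
d = torch.tensor([10.0, 10.0])

def cosine(x, z):
    return torch.dot(x, z) / (x.norm() * z.norm())

print(torch.dist(a, b))  # small Euclidean distance
print(cosine(a, b))      # cosine similarity 0: no directional overlap
print(torch.dist(c, d))  # large Euclidean distance
print(cosine(c, d))      # cosine similarity 1: identical directions
```

Which pair counts as "the same" depends entirely on which function you ask — that is the modeling commitment.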
Kernels Are Generalized Inner Products
The previous post introduced inner products as the language of alignment between vectors. A kernel extends that idea to settings where the relevant inner product lives in a hidden space.
A kernel is a function k(x, z) that takes two inputs and returns a real number, with the property that it behaves like an inner product in some feature space. Formally, there exists a mapping φ from the input space into a (possibly very high-dimensional) inner product space such that:
k(x, z) = ⟨φ(x), φ(z)⟩
The function φ is the feature map. It lifts each input into a richer space, and the kernel computes the inner product in that space without requiring us to write down φ explicitly. This is the kernel trick: the inner product in the feature space can be computed directly from the inputs, without ever constructing the feature map.
The crucial condition is positive semidefiniteness. For any collection of points and any real coefficients, the kernel must satisfy ∑ᵢⱼ cᵢcⱼk(xᵢ, xⱼ) ≥ 0. This guarantees that the kernel matrix K, defined by Kᵢⱼ = k(xᵢ, xⱼ), behaves like a matrix of inner products — symmetric and positive semidefinite, just as a Gram matrix would be. Mercer’s theorem guarantees that a continuous positive semidefinite kernel on a compact domain corresponds to an inner product in some feature space.
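The positive semidefiniteness condition is directly checkable. A small sketch in plain PyTorch, building a Gaussian kernel matrix by hand and inspecting its eigenvalues:

```python
import torch

torch.manual_seed(0)
X = torch.randn(20, 3)

# Gaussian kernel matrix: K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
sigma = 1.0
sq_dists = torch.cdist(X, X).pow(2)
K = torch.exp(-sq_dists / (2 * sigma**2))

# A valid kernel matrix must be symmetric with nonnegative eigenvalues.
eigvals = torch.linalg.eigvalsh(K)
print(torch.allclose(K, K.T, atol=1e-6))  # symmetric
print(eigvals.min().item() >= -1e-4)      # PSD up to numerical tolerance
```

Any function that fails this check for some set of points is not a kernel, however reasonable it looks as a similarity score.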
Three kernels illustrate the range:
Linear kernel: k(x, z) = xᵀz. The feature map is the identity. This recovers the standard inner product geometry from the previous post.
Polynomial kernel: k(x, z) = (xᵀz + c)ᵈ. The implicit feature map includes all monomials up to degree d. The geometry operates in a space of polynomial features without explicitly constructing them.
Gaussian (RBF) kernel: k(x, z) = exp(−‖x − z‖² / (2σ²)). The implicit feature space is infinite-dimensional. The kernel measures localized similarity that decays smoothly with distance. The bandwidth σ controls the scale at which similarity is measured.
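Evaluated on a single pair of points, all three kernels are one-liners. A sketch in plain PyTorch with illustrative parameter choices (c = 1, d = 2, σ = 1):

```python
import torch

x = torch.tensor([1.0, 2.0])
z = torch.tensor([2.0, 1.0])

linear = torch.dot(x, z)                                 # x^T z
poly   = (torch.dot(x, z) + 1.0) ** 2                    # (x^T z + c)^d with c=1, d=2
rbf    = torch.exp(-torch.dist(x, z)**2 / (2 * 1.0**2))  # sigma = 1

print(linear.item())  # 4.0
print(poly.item())    # 25.0
print(rbf.item())     # exp(-1), about 0.368
```

Same pair of inputs, three different answers — because each kernel is measuring alignment in a different implicit space.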
The key insight: a kernel is not a computational shortcut. It is a geometric commitment — a choice of how alignment should be measured, even when the coordinates in which that alignment lives are never explicitly observed. The linear kernel says alignment is measured by the standard dot product. The polynomial kernel says it should also reflect higher-order interactions. The Gaussian kernel says it should decay smoothly with distance at a rate controlled by bandwidth. Each kernel defines a different geometry.
Feature Maps and Hidden Geometry
Many learning problems are difficult because the raw geometry is wrong for the task. Two classes that overlap in the original feature space may become cleanly separated after a nonlinear transformation. A regression relationship that is curved in original coordinates may become linear in a richer set of features.
Kernels allow us to work as though the data had been transformed into a richer feature space, without ever constructing the transformation explicitly.
The concentric rings example from Post 2 illustrates this directly. No linear classifier can separate the two classes in the original (x₁, x₂) coordinates. Adding the feature r² = x₁² + x₂² lifts the data into a space where a linear threshold separates them perfectly. The polynomial kernel achieves the same effect implicitly: it computes inner products in a feature space that includes the necessary quadratic terms, so a kernel-based classifier can find the separating surface without the feature map ever being written down.
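The equivalence can be verified directly for the degree-2 polynomial kernel in R². The sketch below writes out one explicit feature map φ (with c = 1; note that it contains the squared coordinates the rings example needs) and checks that its inner product matches the kernel value:

```python
import torch

def phi(x, c=1.0):
    # Explicit feature map for the degree-2 polynomial kernel in R^2:
    # (x^T z + c)^2 = <phi(x), phi(z)>
    x1, x2 = x
    s  = torch.sqrt(torch.tensor(2.0))
    sc = torch.sqrt(torch.tensor(2.0 * c))
    return torch.stack([x1 * x1, x2 * x2, s * x1 * x2,
                        sc * x1, sc * x2, torch.tensor(c)])

x = torch.tensor([0.5, -1.0])
z = torch.tensor([2.0, 3.0])

kernel   = (torch.dot(x, z) + 1.0) ** 2   # computed directly from the inputs
explicit = torch.dot(phi(x), phi(z))      # computed in the lifted feature space
print(kernel.item(), explicit.item())     # the two agree
```

The kernel side never builds the six-dimensional vectors; it gets the same number from the two-dimensional inputs alone. That is the kernel trick in miniature.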
The Gaussian kernel goes further. Its implicit feature space is infinite-dimensional, and it can represent very complex similarity structures. But that flexibility is also a risk — it is easier to overfit, and the choice of bandwidth σ becomes a critical geometric decision.
Kernels also extend beyond numerical vectors entirely. String kernels compare sequences by counting shared subsequences. Graph kernels compare networks by shared substructures. In each case, the kernel defines feature-space geometry for objects that have no natural coordinates. This is the sense in which kernels provide geometry without coordinates.
Self-Attention as a Learned Kernel
The Gaussian and polynomial kernels are fixed — their form is specified before training, and only their parameters are tuned. But a similarity function can also be learned end-to-end. When it is, we get a learned kernel: one whose geometry is optimized directly for the task.
The clearest example in modern machine learning is self-attention.
In a transformer, the attention score between token i and token j is computed as:
Aᵢⱼ = softmaxⱼ(qᵢᵀ kⱼ / √d)
where qᵢ = W_Q xᵢ is the query vector for token i and kⱼ = W_K xⱼ is the key vector for token j — both learned linear projections of the input — and the softmax normalizes over j, so that each row of A sums to one. The dot product qᵢᵀ kⱼ measures the alignment between query and key in the learned query-key space. This is precisely what a kernel does: it takes two inputs and returns a scalar that measures their geometric relationship.
The attention matrix A over the full sequence is therefore a kernel matrix — but one that differs from everything in the previous sections in three ways.
It is learned. The weight matrices W_Q and W_K are parameters trained by gradient descent. The geometry they define is not prescribed by the practitioner — it is discovered from data to make the prediction task tractable. This is the cleanest instance of the book’s central thesis: the model finds the geometry that makes the task intelligible.
It is input-dependent. The Gaussian kernel between two points depends only on their coordinates and the bandwidth. The attention kernel between two tokens depends on the entire sequence being processed — a different sequence produces a different kernel matrix from the same weights. The geometry is dynamic, not static.
It is asymmetric. In general Aᵢⱼ ≠ Aⱼᵢ, because the query and key projections are different. Token i attending to token j does not mean token j attends equally to token i. This asymmetry is precisely the directional structure captured by the antisymmetric part of the K = S + A decomposition introduced in the first post. The symmetric part of the attention matrix encodes mutual proximity; the antisymmetric part encodes directed influence — who attends to whom.
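All three properties can be checked in a few lines. A minimal sketch in plain PyTorch, with random untrained matrices standing in for the learned projections W_Q and W_K:

```python
import torch

torch.manual_seed(0)
n, d = 4, 8                        # 4 tokens, embedding dimension 8
X = torch.randn(n, d)
W_Q = torch.randn(d, d) / d**0.5   # learned in practice; random here for illustration
W_K = torch.randn(d, d) / d**0.5

Q, K = X @ W_Q, X @ W_K
scores = Q @ K.T / d**0.5
A = torch.softmax(scores, dim=-1)      # normalize over j: each row sums to 1

print(torch.allclose(A, A.T))          # False: the kernel is asymmetric
print(A.sum(dim=-1))                   # rows are probability distributions

# Symmetric / antisymmetric split of the raw scores
S  = 0.5 * (scores + scores.T)         # mutual proximity
As = 0.5 * (scores - scores.T)         # directed influence
print(torch.allclose(S + As, scores))  # True: the decomposition is exact
```

Changing X changes the kernel matrix even with the weights held fixed — that is the input dependence in action.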
Self-attention will be developed in full in a later post. The point here is to name it for what it is: a learned, context-dependent, asymmetric kernel. It belongs to the same geometric family as the Gaussian and polynomial kernels — it defines pairwise similarity structure over a set of objects — but the similarity is constructed by the model rather than prescribed by the practitioner. The progression from fixed kernels to learned kernels is a progression from assumed geometry to discovered geometry.
Bandwidth Is Part of the Geometry
The Gaussian kernel has a single free parameter: the bandwidth σ. It is tempting to treat this as a tuning parameter. It is more useful to treat it as a geometric choice.
A small σ produces a highly local geometry. Only the nearest neighbors have appreciable affinity. Everything beyond a narrow radius is effectively invisible to the kernel. The kernel matrix is nearly an identity — each point is similar only to itself.
A large σ makes the geometry global. Even moderately distant points retain significant similarity, and fine-grained local structure is washed out. The kernel matrix is nearly all-ones — every point looks similar to every other.
Between these extremes, the bandwidth defines the characteristic scale at which the geometry operates. The median heuristic — setting σ to the median pairwise distance among the data — provides a robust data-adaptive default that puts the kernel on scale with the data. It is a starting point, not a final answer, but it avoids both failure modes.
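The median heuristic itself takes only a few lines of plain PyTorch (a sketch, independent of any library implementation):

```python
import torch

torch.manual_seed(0)
X = torch.randn(100, 2)

# Median heuristic: set sigma to the median off-diagonal pairwise distance.
dists = torch.cdist(X, X)
off_diag = dists[~torch.eye(100, dtype=torch.bool)]
sigma = off_diag.median()

K = torch.exp(-dists.pow(2) / (2 * sigma**2))
print(sigma.item())
# An on-scale kernel: the mean affinity sits well away from both
# 0 (near-identity regime) and 1 (near-all-ones regime).
print(K.mean().item())
```

Because σ is tied to the data's own typical scale, the same code behaves sensibly whether the features are measured in millimeters or kilometers.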
The same logic applies to the number of neighbors k in a k-NN graph, or the threshold ε in an ε-neighborhood graph. These are not tuning parameters. They are geometric decisions about the resolution at which the model sees the world.
Affinity Matrices and Neighborhood Graphs
Once a similarity function is defined pairwise, the entire dataset can be represented by an affinity matrix W, where Wᵢⱼ records how strongly observation i is connected to observation j. This matrix is the geometry of the sample expressed as a relational object.
From an affinity matrix, several graph constructions are natural. A k-nearest-neighbor graph connects each observation to its k closest neighbors, producing a sparse graph that preserves local structure while discarding long-range noise. An ε-neighborhood graph connects all pairs whose similarity exceeds a threshold. A fully weighted graph keeps all pairwise affinities, with the weights encoding the relevant structure.
Each construction is a modeling decision. A k-NN graph with k = 5 and one with k = 50 represent different geometric hypotheses about local structure. These choices shape the geometry that downstream methods will use.
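A k-NN graph construction from a full affinity matrix can be sketched in plain PyTorch; knn_graph below is a hypothetical helper written for illustration, not a geomlearn function:

```python
import torch

torch.manual_seed(0)
X = torch.randn(30, 2)
W = torch.exp(-torch.cdist(X, X).pow(2) / 2.0)  # full affinity matrix, sigma = 1
W.fill_diagonal_(0.0)                           # no self-loops

def knn_graph(W, k):
    # Keep only each node's k strongest affinities, then symmetrize
    # with the "or" rule: an edge survives if either endpoint selects it.
    topk = torch.topk(W, k, dim=1).indices
    A = torch.zeros_like(W)
    A.scatter_(1, topk, W.gather(1, topk))
    return torch.maximum(A, A.T)

A5  = knn_graph(W, 5)
A20 = knn_graph(W, 20)
print((A5 > 0).sum().item(), (A20 > 0).sum().item())  # sparser vs denser geometry
```

The two graphs encode different hypotheses about how far "local" extends — the k = 5 graph sees only tight neighborhoods, the k = 20 graph blurs them together.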
The dataset now has two complementary geometric descriptions. In coordinate geometry, each observation is a point in a vector space, and structure comes from positions, distances and projections. In affinity geometry, each observation is a node in a weighted graph, and structure comes from connectivity, neighborhoods and paths. Both are legitimate. Both encode real geometric information.
Bridge points — observations that connect otherwise separate subpopulations — are visible in affinity geometry but invisible in coordinate summaries. A point may look entirely ordinary in its individual feature values while playing a structurally critical role in the relational graph. This is observation geometry from Post 3, now made precise through affinities.
Spectral Methods: Reading Global Structure from Local Relations
An affinity graph encodes local relations. Spectral methods convert those local relations into global geometric understanding.
The key object is the graph Laplacian L = D − W, where D is the diagonal degree matrix with Dᵢᵢ = ∑ⱼ Wᵢⱼ. The Laplacian has a natural geometric interpretation through its quadratic form:
fᵀLf = ½ ∑ᵢⱼ Wᵢⱼ(fᵢ − fⱼ)²
This measures the roughness of a function f on the graph. It is small when f varies little across strongly connected nodes. Functions that vary slowly across well-connected regions — smooth functions on the graph — have small Laplacian quadratic form.
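The identity between the quadratic form and the weighted sum of squared differences can be verified numerically. A sketch in plain PyTorch on a random symmetric affinity matrix:

```python
import torch

torch.manual_seed(0)
n = 10
W = torch.rand(n, n)
W = 0.5 * (W + W.T)           # symmetric affinities
W.fill_diagonal_(0.0)         # no self-loops

D = torch.diag(W.sum(dim=1))  # degree matrix
L = D - W                     # graph Laplacian
f = torch.randn(n)            # an arbitrary function on the nodes

quad   = f @ L @ f
direct = 0.5 * ((f[:, None] - f[None, :]).pow(2) * W).sum()
print(quad.item(), direct.item())  # the two agree
```

Every term in the sum is nonnegative, which is also a one-line proof that the Laplacian is positive semidefinite.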
The eigenvectors of the Laplacian corresponding to its smallest eigenvalues are the smoothest non-trivial functions on the graph. If the graph has k well-separated clusters, there will be k eigenvalues near zero, and the corresponding eigenvectors will be approximately constant within each cluster and change sharply at the cluster boundaries.
Spectral clustering works by computing these eigenvectors, using them as new coordinates for each observation, and then applying a simple clustering algorithm in the eigenvector space. The geometric insight: the Laplacian eigenvectors define a spectral embedding in which cluster structure that was nonlinear and intertwined in the original coordinates becomes compact and separable. This is why spectral clustering can find clusters that k-means cannot — it reads global structure from local similarity rather than from centroid proximity.
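A minimal spectral-clustering sketch in plain PyTorch: two Gaussian clusters, a Gaussian affinity matrix, and a simple threshold on the second Laplacian eigenvector (the Fiedler vector) standing in for the final k-means step:

```python
import torch

torch.manual_seed(0)
# Two clusters, close enough that the affinity graph stays connected
X = torch.cat([torch.randn(20, 2) * 0.3 - 1.5,
               torch.randn(20, 2) * 0.3 + 1.5]).double()

W = torch.exp(-torch.cdist(X, X).pow(2) / 2.0)  # Gaussian affinity, sigma = 1
L = torch.diag(W.sum(dim=1)) - W                # graph Laplacian

eigvals, eigvecs = torch.linalg.eigh(L)         # ascending eigenvalues
fiedler = eigvecs[:, 1]                         # second-smallest eigenvector

# The Fiedler vector is nearly constant within each cluster and jumps
# between them, so thresholding it recovers the cluster labels.
labels = (fiedler > fiedler.median()).long()
print(labels)
```

The eigenvector supplies the one coordinate in which the clusters become linearly separable — a coordinate that existed nowhere in the original (x₁, x₂) description.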
Diffusion maps extend this further. They embed observations using the eigenvectors of the diffusion operator — the transition matrix of a random walk on the affinity graph — weighted by their eigenvalues. The resulting coordinates reflect the intrinsic geometry of the data as seen through the lens of local connectivity. When data lie near a curved manifold, diffusion coordinates can unfold that manifold into a flat representation that preserves local structure.
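The diffusion operator can be sketched in a few lines of plain PyTorch. The version below assumes the standard symmetric-normalization trick — P = D⁻¹W shares its spectrum with D^(−1/2) W D^(−1/2) — so a symmetric eigensolver can be used:

```python
import torch

torch.manual_seed(0)
X = torch.randn(30, 2).double()
W = torch.exp(-torch.cdist(X, X).pow(2) / 2.0)  # Gaussian affinities

# Random-walk transition matrix of the affinity graph
deg = W.sum(dim=1)
P = W / deg[:, None]
print(torch.allclose(P.sum(dim=1), torch.ones(30, dtype=torch.double)))  # row-stochastic

# P has the same eigenvalues as the symmetric matrix D^{-1/2} W D^{-1/2};
# its right eigenvectors are D^{-1/2} times the symmetric eigenvectors.
S = W / torch.sqrt(deg[:, None] * deg[None, :])
eigvals, eigvecs = torch.linalg.eigh(S)         # ascending; top eigenvalue is 1
psi = eigvecs / torch.sqrt(deg[:, None])        # right eigenvectors of P

# Diffusion coordinates: eigenvalue-weighted top nontrivial eigenvectors
coords = eigvals[-3:-1] * psi[:, -3:-1]
print(coords.shape)                             # one 2-D embedding per observation
```

The eigenvalue weighting damps directions in which the random walk mixes quickly, so the embedding emphasizes the slow, large-scale structure of the data.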
Two Frameworks, One Vocabulary
This post introduces a second way to think about geometry, and it helps to see how it extends rather than replaces the coordinate-based approach from the previous post:
Inner product → Kernel: alignment in a possibly hidden space
Euclidean distance → Kernel-induced distance: distance defined through similarity
Point cloud → Affinity graph: from positions to relations
Projection onto subspace → Spectral embedding: from explicit to eigenvector coordinates
Explicit coordinates → Implicit feature space: geometry without writing down φ
Fixed representation → Learned embedding: geometry optimized for the task
Fixed kernel → Self-attention: learned, input-dependent, asymmetric similarity
Each row shows a coordinate-based idea and its relational counterpart. The underlying geometric operations are recognizably the same — measuring alignment, finding preferred directions, reducing dimension, identifying structure. What changes is how the geometry is specified.
Coordinates are one way to define geometry. Similarity is another. Strong learning systems often use both.
Lab: Kernels and Spectral Analysis in Code
The kernel toolkit — Gaussian kernel, bandwidth selection, spectral analysis and diagnostics — is implemented in geomlearn.ch04_kernels:
pip install geomlearn
Gaussian kernel: bandwidth changes the geometry
import torch
import geomlearn as gl
torch.manual_seed(42)
# 100 points from 2 well-separated clusters
cluster1 = torch.randn(50, 2) * 0.5 + torch.tensor([-2.0, 0.0])
cluster2 = torch.randn(50, 2) * 0.5 + torch.tensor([2.0, 0.0])
X = torch.cat([cluster1, cluster2], dim=0)
# Three bandwidths — three different geometries
for sigma in [0.3, 1.0, 5.0]:
    K = gl.ch04_kernels.gaussian_kernel(X, sigma=sigma)
    analysis = gl.ch04_kernels.analyze(K)
    print(f"sigma = {sigma}:")
    print(f"  Effective rank: {analysis['effective_rank']:.2f} / {K.shape[0]}")
    print(f"  Spectral decay: {analysis['spectral_decay_rate']:.4f}")
    print()
Small σ: nearly an identity matrix, effective rank collapses. Large σ: nearly all-ones, structure washed out. Medium σ: captures the cluster geometry cleanly.
Bandwidth selection: the median heuristic
bw = gl.ch04_kernels.bandwidth_selection(X, method="median")
print(f"Median pairwise distance: {bw['dist_median']:.4f}")
print(f"Recommended sigma: {bw['sigma']:.4f}")
The median heuristic anchors σ to the typical inter-point scale of the data — a robust default that avoids both failure modes.
Nyström approximation: scalable kernel geometry
sigma = bw['sigma']
kernel_fn = lambda A, B: gl.ch04_kernels.gaussian_kernel(A, B, sigma=sigma)
for m in [5, 10, 20, 50]:
    result = gl.ch04_kernels.nystrom_approx(X, kernel_fn, m)
    print(f"m = {m:3d} landmarks: "
          f"relative error = {result['relative_error']:.6f}")
With fast spectral decay, a small number of landmarks approximates the full n×n kernel matrix with very low error — making kernel methods tractable at scale.
Kernel diagnostics
K_good = gl.ch04_kernels.gaussian_kernel(X, sigma=bw['sigma'])
K_tiny = gl.ch04_kernels.gaussian_kernel(X, sigma=0.05)
for label, K in [("Good (median sigma)", K_good), ("Degenerate (sigma=0.05)", K_tiny)]:
    diag = gl.ch04_kernels.diagnose(K)
    print(f"\n{label}:")
    print(f"  Effective rank: {diag['effective_rank']:.2f}")
    print(f"  Condition number: {diag['condition_number']:.2e}")
    for f in diag['findings']:
        print(f"  • {f}")
The well-tuned kernel has a meaningful effective rank and moderate condition number. The degenerate kernel — bandwidth far too small — approaches an identity matrix, effective rank collapses to near 1, and the diagnostic flags it immediately.
What Comes Next
Kernels and affinity graphs define geometry through relations. The next chapter introduces a third way: geometry through partition. Decision trees and random forests do not measure distance or compute similarities. They divide the space into regions where different local rules apply, and similarity is defined by co-occurrence — which observations fall into the same regions. That partition-induced geometry has its own logic and its own mathematical structure.
The progression continues: from coordinates, to similarities, to partitions, and eventually to the richer structures — manifolds, directional geometry and asymmetric relations — that modern learning demands.
Geometry does not require coordinates as its starting point. A kernel, a similarity function, or an affinity graph can define a geometry that is just as real and just as useful as one specified by vectors and norms. The choice between coordinate-based and similarity-based geometry is a modeling decision, and the best learning systems often combine both.
This post is part of a series accompanying the book Learning as Geometry Discovery: A Geometric View of Machine Learning, Data Science and Reasoning.
Previous posts: Learning as Geometry Discovery · What “Learning Is Geometry Discovery” Actually Means · You Have Been Doing Geometry All Along · The Geometry Underneath the Algebra
Library: github.com/asudjianto-xml/geomlearn · pip install geomlearn

