|
Sheaf Neural Network (SNN) extends the concept of a Graph Neural Network to a sheaves over a cell complexes. Given a cell complex $X$, a cellular sheaf $\mathcal{F}$ over a vector space, (or abelian group, or module), $V$, is a functor from the partially ordered set of faces $(X, \trianglelefteq)$ to $V$:
$$\mathcal{F} : (X, \trianglelefteq) \to V$$
- Sites are a cells $\sigma \in X$.
- Stalks are vector spaces (or abelian groups, or modules) at each site $\mathcal{F}(\sigma)$ and correspond to *data*.
- Inclusion maps ordered cell inclusions: $\sigma \trianglelefteq \tau$, for cells $\sigma, \tau \in X$, where $\dim \sigma < \dim \tau$.
- Compatibility constraints are linear maps (or morphisms) between stalks $\mathcal{F}_{\sigma \trianglelefteq \tau} : \mathcal{F}(\sigma) \to \mathcal{F}(\tau)$ corresponding to inclusion maps between respective sites.
For a sheaf $\mathcal{F}$ on $X$ we define a cochain complex $C^\bullet(X; \mathcal{F})$, as sequence of vector spaces that record data of $\mathcal{F}$ together with the coboundary maps $\delta$ comparing incident data. Training process of a sheaf neural network constitutes learning of a sheaf *constraint maps* from data by approximating attachment maps. The resulting sheaf diffusion models have many desirable properties that address the limitations of classical graph diffusion equations.
Sheaf Cohomology and Hodge Laplacian
See also: Spectral sheaf theory
Recall that for a sheaf $\mathcal{F}$ on $X$ the cohomology of the sheaf $H^\bullet(X; \mathcal{F}) = \operatorname{ker}\delta / \operatorname{im}\delta$, is the sequence of algebraic objects (in this case, of quotient vector spaces that captures obstructions to gluing local data to global data in the sheaf $\mathcal{F}$ on $X$. In particular, the 0-th cohomology $H^0(X; \mathcal{F})$ records discrete global sections of the sheaf, so dimension of vector space $H^0(X; \mathcal{F})$ is then the number of independent consistent solutions to the constrained system. Therefore, sheaf cohomology classifies consistent solutions and obstructions to such.
To this end it may help the reader to consider what cohomology means in the computational setting of vector spaces. Let $n_{k-1}, n_k, n_{k+1} \in \mathbb{N}$ be dimensionalities of vectors at stlaks in dimensions $k-1$, $k$, $k+1$ respectively. Then $(k-1),k,(k+1)$ chains are $C_{k-1} = \mathbb{R}^{n_{k-1}}$, $C_{k} = \mathbb{R}^{n_{k}}$, $C_{k+1} = \mathbb{R}^{n_{k+1}}$. Coboundary maps $\delta$ are then linear maps $\delta{k} = M^{n_{k+1} \times n_k}$, and $\delta{k-1} = M^{n_{k} \times n_{k+1}}$. The cochain complex $C_\bullet$ is then:
$$\cdots \xrightarrow[]{} \mathbb{R}^{n_{k-1}} \xrightarrow[]{\mathbf{M}_{k-1}}\mathbb{R}^{n_{k}} \xrightarrow[]{\mathbf{M}_{k}}\mathbb{R}^{n_{k+1}} \xrightarrow[]{}\cdots$$
Cocyles correspond to the elements of $\mathrm{ker}(\mathbf{M}_{k})$, coboundaries correspond to the elements of $\mathrm{im}(\mathbf{M}_{k-1})$, cohomology classes:
$$H^{k} = \mathrm{ker}(\mathbf{M}_{k}) / \mathrm{im}(\mathbf{M}_{k-1})$$
Hodge Laplacians are:
$$\Delta^k = \left(\mathbf{M}_{k}^\ast \mathbf{M}_{k} + \mathbf{M}_{k-1}^\ast \mathbf{M}_{k-1}\right) \in \mathbb{R}^{n_k \times n_k}$$
Given a vector $\mathbf{x} \in \mathbb{R}^{n_k}$:
- Vector $\mathbf{x}$ is closed if $\mathbf{M}_k \mathbf{x} = 0$, and $\mathbf{x}$ is exact if $\mathbf{x} = \mathbf{M}_{k-1} \mathbf{y}$ for some $\mathbf{y} \in \mathbb{R}^{n_{k-1}}$.
- Similarly, $\mathbf{x}$ is coclosed if $\mathbf{M}_{k-1}^\ast \mathbf{x} = 0$, and coexact if $\mathbf{x} = \mathbf{M}_{k}^\ast \mathbf{y}$ for some $\mathbf{y} \in \mathbb{R}^{n_{k+1}}$.
- Finally, $\mathbf{x}$ is *harmonic* if $\mathbf{x} \in \mathrm{ker}\Delta$, that is $\Delta^k \mathbf{x} = 0$.
In summary, there are three ways of defining cohomology. If linear maps $\mathbf{A}\mathbf{B}$ satisfy $\mathbf{A}\mathbf{B} = 0$, then the cohomology group with respect to $\mathbf{A}$ and $\mathbf{B}$ may be taken as any of the following:
$$H^\bullet = \frac{\mathrm{ker}(\mathbf{A})}{\mathrm{im}(\mathbf{B})} = \mathrm{ker}(\mathbf{A}) \cap \mathrm{ker}(\mathbf{B}^\ast) = \mathrm{ker}(\mathbf{A}^\ast \mathbf{A} + \mathbf{B} \mathbf{B}^\ast)$$
Similarly, homology can be defined as:
$$H_\bullet = \frac{\mathrm{ker}(\mathbf{B}^\ast)}{\mathrm{im}(\mathbf{A}^\ast)} = \mathrm{ker}(\mathbf{B}^\ast) \cap \mathrm{ker}(\mathbf{A}) = \mathrm{ker}(\mathbf{B} \mathbf{B}^\ast + \mathbf{A}^\ast \mathbf{A})$$
Note that $H_\bullet \cong H^\bullet$ in this context, (but only in this particular context, this is not true in general). This is because we are working over linear spaces which does not capture the torsion. This is because there is no difference between cohomology and homology within this purview.
Restricted Sheaf Neural Network
Particular choices of a standard stalk space may be more advantageous than others, for example by using a hypersphere autoencoder to prepare latents, we may restrict latent space representations to unit vectors of a hypersphere with compatibility constraint linear maps between them restricted to hypersphere rotations, that is the Orthonormal group $\mathrm{O}(n)$.
This choice of restriction may be particularly computationally advantageous because orthogonal matrices have fewer free parameters, making them more efficient to work with. Lie group $\mathrm{O}(n)$ has a $\frac{n(n-1)}{2}$-dimensional manifold structure (compared to the $n^2$-dimensional manifold of $\mathrm{GL}(n)$).
Furthermore, by restricting $\mathcal{F}_{\sigma \hookrightarrow \tau} \in O(n)$, the sheaf is equivalent to a discrete $\mathrm{O}(n)$ bundle on a hypersphere $S^n$, as orthogonal restrictions describe how vectors are rotated when transported between stalks. Thus, it is equivalent to a discretized version of a tangent bundle on a manifold. The sheaf Laplacian of the $\mathrm{O}(n)$-bundle is equivalent to the connection Laplacian.
Sheaf diffusion
Cellular sheaves equip graphs with a general "geometrical data-structure" by assigning vector spaces and linear maps to nodes and edges. Solving any node classification task can be reduced to performing diffusion with the right sheaf.
Consider a simplest case of a graph $G=(V,E)$ where each node $v$ has an $n$-dimensional feature vector $\mathbf{x}_v \in \mathcal{F}(v)$. We construct a $kn$-dimensional vector in the 0-dimensional vector space of the sheaf's cochain complex $\mathbf{x} \in C^0(G,\mathcal{F})$ by column-stacking individual vectors $\mathbf{x}_v$. By allowing for $f$ feature channels, we produce the feature matrix $\mathbf{X} \in \mathbb{R}^{kn \times f}$. The columns in $\mathbf{X}$ are vectors $C^0(G,\mathcal{F})$, one for each of the $f$ channels.
The [[sheaf diffusion]] process works on $(G,\mathcal{F})$ and is governed by the following differential equation:
$$\mathbf{X}(0) = \mathbf{X},\; \dot{\mathbf{X}}(t) = -\Delta_\mathcal{F}\mathbf{X}(t)$$
Discretized as:
$$\mathcal{X}(t+1) = \mathbf{X}(t) - \Delta_\mathcal{F}\mathbf{X}(t) = (\mathbf{I}_{kn} - \Delta_\mathcal{F})\mathbf{X}(t)$$
With validation:
$$\dot{\mathbf{X}} = -\sigma\left(\Delta_{\mathcal{F}(t)}(\mathbf{I}_{n} \otimes \mathbf{W}_1) \mathbf{X}(t) \mathbf{W}_2\right)$$
In turn, discretized as:
$$\mathbf{X}(t+1) = \mathbf{X}(t) - \sigma \left( \Delta_{\mathcal{F}(t)}(\mathbf{I}_n \otimes \mathbf{W}_1(t)) \mathbf{X}(t) \mathbf{W}_2^t\right)$$
Here $\mathbf{W}_1$ and $\mathbf{W}_2$ are weight matrices in $O(n)$, the restriction maps defining $\Delta_{\mathcal{F}(t)}$ are computed by a learnable parametric matrix-valued function:
$$\mathcal{F}_{v \hookrightarrow e := (v,u)} = \mathbf{\Phi}(\mathbf{x}_v,\mathbf{x}_u)$$
Sheaf $\mathcal{F}(t)$ and weights $\mathbf{W}_1(t)$ and $\mathbf{W}_2(t)$ are time-dependent, permitting the underlying geometry to evolve over time. While heat diffusion struggles separating classes on heterophilic graphs, by considering a hierarchy of increasingly general sheaves, we can study how the ability of the sheaf diffusion process to achieve linear separation of the classes in the infinite time limit expands as can be observed from continuous formulation of sheaf diffusion process:
$$\dot{\mathbf{X}}(t) = -\sigma\left(\Delta_{\mathcal{F}(t)}(\mathbf{I}_n \otimes \mathbf{W}_1)\mathbf{X}(t)\mathbf{W}_2\right)$$
Training process of a sheaf neural network constitutes learning of a sheaf *constraint maps* from spatiotemporal data by approximating the attachment map $\sigma_v^k \to \sigma_u^{k+1}$, (where stalks on $\sigma_v^k$, $\sigma_u^{k+1}$ are $\mathbf{x}_v, \mathbf{x}_u$ respectively):
$$\mathcal{F}_{\sigma^{k}_v \to \sigma^{k+1}_u} = \Phi(\mathbf{x}_v,\mathbf{x}_u)$$
Sheaf Attention Networks
Attention mechanism has become a central inductive bias for deep learning models irrespective of domain. Graph attention networks, first proposed by [(Veličković, Cucurull, Lio, et al., 2018)](https://arxiv.org/abs/1710.10903) can similarly be extended to sheaves as above, [(Barbero, Bodnar, et al., 2023)](https://openreview.net/pdf?id=LIDvgVjpkZr).
Besides the usual attention coefficient $\alpha_{ij}$, the sheaf-based attention mechanism also learns a $n \times n$ orthogonal transport matrix $\mathbf{P}_{ij}$ describing how vectors in the stalk of $j$ should be moved into the stalk of $i$. Sheaf Attention Network updates the features at node $i$ by transporting all the vectors from neighboring stalks and aggregating them proportionally to the attention coefficients.
$$\Lambda_{ij} = a(\mathbf{x}_i,\mathbf{x}_j) = \frac{\exp(\mathrm{LeakyReLU}(\mathbf{a}[\mathbf{W}\mathbf{x}_i\|\mathbf{W}\mathbf{x}_j]))}{\sum_{k \in \mathcal{N}_i}\exp(\mathrm{LeakyReLU}(\mathbf{a}[\mathbf{W}\mathbf{x}_i\|\mathbf{W}\mathbf{x}_j]))}$$
Where $\mathrm{LeakyReLU}(x) = \max(0,x) - 0.01\min(0,x)$ provided as torch functionality `torch.nn.LeakyReLU`, and $\mathbf{a}$, $\mathbf{W}$ are learnable weights of appropriate dimension, and $[\cdot\|\cdot]$ denotes operation of vector concatenation. Generalize $\hat{\Lambda}$ to $n$-dimensional sheaves by taking a Kronecker product with $\mathbf{1}_n$, a $n \times n$ dimensional matrix with every element $1$:
$$\hat{\Lambda} = \Lambda \otimes \mathbf{1}_n$$
$\hat{\mathbf{A}}_\mathcal{F}$ is a sheaf adjacency matrix with added self-loops:
$$\hat{\mathbf{A}}_\mathcal{F} = \mathcal{F}_{\sigma_i \to \tau}^\mathrm{T} \mathcal{F}_{\sigma_j \to \tau} = \mathbf{P}_{ij}$$
We get:
$$\Delta_\mathcal{F} = \left( \hat{\Lambda}(\mathbf{X}) \odot \hat{\mathbf{A}}_\mathcal{F} - \mathbf{I} \right)$$
With:
$$\mathbf{X}(t+1) = \mathbf{X}(t) - \sigma \left( \Delta_{\mathcal{F}(t)}(\mathbf{I}_n \otimes \mathbf{W}_1(t)) \mathbf{X}(t) \mathbf{W}_2(t)\right)$$
This allows us to generalize attention mechanisms onto sheaf-based neural network models.
See also
|
|