Multi-modal Satellite Image Time Series (SITS) analysis faces significant computational challenges for live land monitoring applications. While Transformer architectures excel at capturing temporal dependencies and fusing multi-modal data, their quadratic computational complexity and the need to reprocess entire sequences for each new acquisition limit their deployment for regular, large-area monitoring.
This paper studies dual-form attention mechanisms for efficient multi-modal SITS analysis: mechanisms that enable parallel training while supporting recurrent inference for incremental processing. We compare linear attention and retention mechanisms within a multi-modal spectro-temporal encoder. To address the SITS-specific challenges of temporal irregularity and misalignment, we develop temporal adaptations of dual-form mechanisms that compute token distances from actual acquisition dates rather than sequence indices.
Our approach is evaluated on two tasks using Sentinel-1 and Sentinel-2 data: multi-modal SITS forecasting as a proxy task, and real-world solar panel construction monitoring. Experimental results demonstrate that dual-form mechanisms achieve performance comparable to standard Transformers while enabling efficient recurrent inference. The multi-modal framework consistently outperforms mono-modal approaches across both tasks, demonstrating the effectiveness of dual mechanisms for sensor fusion.
Let $X$ be a sequence of $N$ tokens $(\mathbf{x}_j)_{1 \leq j \leq N}$, with
$$\mathbf{q}_i = \mathbf{x}_i \cdot W_Q, \quad \mathbf{k}_j = \mathbf{x}_j \cdot W_K, \quad \mathbf{v}_j = \mathbf{x}_j \cdot W_V.$$
Causal attention computes
$$\mathbf{o}_i = \sum_{j=1}^i a_{i,j} \mathbf{v}_j, \qquad a_{i,j} = \frac{\text{s}(\mathbf{q}_i, \mathbf{k}_j)}{\sum_{l=1}^i \text{s}(\mathbf{q}_i, \mathbf{k}_l)}.$$
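The causal attention above can be sketched directly in numpy (names and dimensions are illustrative, not from the paper's code); here the similarity is the standard scaled-dot-product exponential:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, d_k = 6, 8, 4                      # sequence length, token dim, head dim
X = rng.standard_normal((N, d))          # tokens x_1..x_N
W_Q, W_K, W_V = (rng.standard_normal((d, d_k)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # queries, keys, values
O = np.zeros((N, d_k))
for i in range(N):
    # s(q_i, k_j) = exp(q_i . k_j / sqrt(d_k)), restricted to j <= i (causal)
    s = np.exp(Q[i] @ K[: i + 1].T / np.sqrt(d_k))
    a = s / s.sum()                      # normalized weights a_{i,j}
    O[i] = a @ V[: i + 1]                # o_i = sum_j a_{i,j} v_j
```

Note the quadratic cost: each output $\mathbf{o}_i$ revisits all previous keys and values, which is exactly what the dual-form mechanisms below avoid.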
Kernel-based attention mechanisms:
$$\text{s}(\mathbf{q}_i, \mathbf{k}_j) = \phi(\mathbf{q}_i) \cdot \phi(\mathbf{k}_j)^T$$
Retention additionally weights token pairs by an exponential decay:
$$a_{i,j} = \gamma^{i-j}\, \phi^{\text{ret}}(\mathbf{q}_i) \cdot \phi^{\text{ret}}(\mathbf{k}_j)^T$$
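The factored kernel similarity is what enables the dual form: the same outputs can be computed either in parallel over the whole sequence (for training) or via an $O(1)$-per-token recurrence (for incremental inference). A minimal numerical check, using the decayed form above with a $1+\text{elu}$ feature map (illustrative code, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d_k, gamma = 8, 4, 0.9
phi = lambda x: 1.0 + np.where(x > 0, x, np.exp(x) - 1.0)   # 1 + elu feature map
Q = phi(rng.standard_normal((N, d_k)))
K = phi(rng.standard_normal((N, d_k)))
V = rng.standard_normal((N, d_k))

# Parallel form: o_i = sum_{j<=i} gamma^(i-j) (q_i . k_j) v_j
O_par = np.array([
    sum(gamma ** (i - j) * (Q[i] @ K[j]) * V[j] for j in range(i + 1))
    for i in range(N)
])

# Recurrent form: S_i = gamma * S_{i-1} + k_i^T v_i ;  o_i = q_i S_i
S = np.zeros((d_k, d_k))
O_rec = np.zeros((N, d_k))
for i in range(N):
    S = gamma * S + np.outer(K[i], V[i])   # fold the new acquisition into the state
    O_rec[i] = Q[i] @ S

assert np.allclose(O_par, O_rec)           # both forms agree
```

With the recurrence, a new satellite acquisition only updates the $d_K \times d_K$ state $S$ instead of reprocessing the entire time series.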
| Attention mechanism | Similarity function $\text{s}(\mathbf{q}_i,\mathbf{k}_j)$ |
|---|---|
| Transformer [Vaswani et al., 2017] | $\exp\!\left(\dfrac{\mathbf{q}_i \cdot \mathbf{k}_j}{\sqrt{d_K}}\right)$ |
| Linear Transformer [Katharopoulos et al., 2020] | $\left(1+\text{elu}(\mathbf{q}_i)\right)\cdot \left(1+\text{elu}(\mathbf{k}_j)\right)^T$ |
| CosFormer [Qin et al., 2022] | $\cos\!\left(\dfrac{\pi}{2} \cdot \dfrac{i-j}{M}\right)\psi(\mathbf{q}_i)\cdot\psi(\mathbf{k}_j)^T$ |
| RoPE Linear Transformer [Su et al., 2024] | $(\psi(\mathbf{q}_i) R_{i\boldsymbol{\theta}}) \cdot (\psi(\mathbf{k}_j) R_{j\boldsymbol{\theta}})^T$ |
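CosFormer's position-dependent weight may look like it breaks the kernel factorization, but the angle-difference identity splits $\cos\!\left(\frac{\pi}{2}\frac{i-j}{M}\right)$ into terms depending on $i$ and $j$ separately, so it can be folded into the feature maps and the linear/recurrent form is preserved. A quick numerical check of that identity (illustrative, not the paper's code):

```python
import numpy as np

M = 32
i, j = np.meshgrid(np.arange(M), np.arange(M), indexing="ij")
# Direct reweighting, as written in the table
direct = np.cos(np.pi / 2 * (i - j) / M)
# Split form: cos(a-b) = cos(a)cos(b) + sin(a)sin(b), separable in i and j
split = (np.cos(np.pi * i / (2 * M)) * np.cos(np.pi * j / (2 * M))
         + np.sin(np.pi * i / (2 * M)) * np.sin(np.pi * j / (2 * M)))
assert np.allclose(direct, split)
```

RoPE achieves the analogous separation with rotation matrices: $R_{i\boldsymbol{\theta}} R_{j\boldsymbol{\theta}}^T = R_{(i-j)\boldsymbol{\theta}}$.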
The proposed Multi-Modal Spectro-Spatio-Temporal Encoder (MMSSTE) integrates dual-form attention mechanisms for efficient processing of Sentinel-1 and Sentinel-2 satellite time series.
We compare the dual-form attention mechanisms listed above within this encoder.
Additionally, we propose temporal variants (Time CosFormer, Time LinRoFormer, Time Retention) that compute distances based on actual acquisition dates rather than sequence indices, addressing SITS temporal irregularity.
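The idea behind these temporal variants can be sketched with a date-aware retention decay: the discount $\gamma^{t_i - t_j}$ depends on the gap between acquisition dates rather than the sequence offset $i - j$, and the recurrence still holds by scaling the state by $\gamma^{\Delta t}$ between acquisitions. The dates and the per-day decay scale below are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.array([0, 5, 6, 18, 30, 31], dtype=float)  # irregular acquisition dates (days)
N, d_k = len(t), 4
gamma = 0.98                                       # assumed per-day decay
Q = rng.standard_normal((N, d_k))
K = rng.standard_normal((N, d_k))
V = rng.standard_normal((N, d_k))

# Parallel form with date-based decay: weight gamma^(t_i - t_j) instead of gamma^(i - j)
O_par = np.array([
    sum(gamma ** (t[i] - t[j]) * (Q[i] @ K[j]) * V[j] for j in range(i + 1))
    for i in range(N)
])

# Equivalent recurrence: decay the state by gamma^(Δt) between acquisitions
S = np.zeros((d_k, d_k))
O_rec = np.zeros((N, d_k))
for i in range(N):
    dt = t[i] - t[i - 1] if i > 0 else 0.0
    S = gamma ** dt * S + np.outer(K[i], V[i])
    O_rec[i] = Q[i] @ S
```

A one-day and a twelve-day revisit gap thus receive different discounts even though both are a single step apart in the sequence.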
We evaluate dual-form mechanisms on a forecasting task where the model predicts the next satellite acquisition given past observations.
Multi-modal approaches consistently outperform mono-modal baselines on real-world construction monitoring.
Dual-form mechanisms achieve comparable performance to standard Transformers while enabling efficient recurrent inference.
Models generalize well to longer sequences than seen during training, with temporal variants showing better adaptation.
We compared several dual-form attention mechanisms for multi-modal satellite image time series analysis. On both the supervised forecasting task and the multi-temporal segmentation task, these mechanisms successfully fuse Sentinel-1 and Sentinel-2 data, often matching the performance of standard causal attention while offering a recurrent formulation better suited for efficient inference. This is promising for regular and up-to-date land monitoring applications.
Our study of the look-back length shows that, in the mono-modal forecasting setting, simple temporal reweighting schemes such as CosFormer and our proposed Time CosFormer perform on par with, or better than, more complex mechanisms like LinRoFormer or Retention. In the multi-modal setting, however, plain linear attention outperforms these reweighted dual-form variants.
@article{dumeur2026dualform,
title={Comparative analysis of dual-form networks for live land monitoring
using multi-modal satellite image time series},
author={Dumeur, Iris and Anger, J{\'e}r{\'e}my and Facciolo, Gabriele},
journal={ISPRS},
year={2026}
}
This work was financed by the Agence Innovation Défense (AID), within the framework of the Dual Innovation Support Scheme (RAPID - Régime d'APpui à l'Innovation Duale), for the project 'DETEVENT' (Agreement No. 2024 29 0970).
This work was granted access to the HPC resources of IDRIS under the allocations 2025-AD011016513 and 2025-AD011012453R4 made by GENCI.
Sentinel-1 and Sentinel-2 data are provided by the Copernicus programme. This paper contains modified Copernicus Sentinel data.