Multi-modal Satellite Image Time Series (SITS) analysis faces significant computational challenges for live land monitoring applications. While Transformer architectures excel at capturing temporal dependencies and fusing multi-modal data, their quadratic computational complexity and the need to reprocess entire sequences for each new acquisition limit their deployment for regular, large-area monitoring.
This paper studies dual-form attention mechanisms for efficient multi-modal SITS analysis: mechanisms that enable parallel training while supporting recurrent inference for incremental processing. We compare linear attention and retention mechanisms within a multi-modal spectro-temporal encoder. To address the SITS-specific challenges of temporal irregularity and misalignment across sensors, we develop temporal adaptations of dual-form mechanisms that compute token distances based on actual acquisition dates rather than sequence indices.
Our approach is evaluated on two tasks using Sentinel-1 and Sentinel-2 data: multi-modal SITS forecasting as a proxy task, and real-world solar panel construction monitoring. Experimental results demonstrate that dual-form mechanisms achieve performance comparable to standard Transformers while enabling efficient recurrent inference. The multi-modal framework consistently outperforms mono-modal approaches across both tasks, demonstrating the effectiveness of dual mechanisms for sensor fusion.
$X$ is a sequence composed of $N$ tokens $(\mathbf{x}_j)_{1 \leq j \leq N}$,
$\q{i} = \mathbf{x}_i \cdot W_Q$, $\k{j} = \mathbf{x}_j \cdot W_K$, $\mathbf{v}_j = \mathbf{x}_j \cdot W_V$
$$\mathbf{o}_i = \sum_{j=1}^i a_{i,j} \mathbf{v}_j$$
$$a_{i,j} = \frac{\text{s}(\q{i}, \k{j})}{\sum_{l=1}^i \text{s}(\q{i}, \k{l})}$$
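The causal attention above can be sketched numerically as follows, assuming the standard scaled dot-product similarity $\text{s}(\q{i}, \k{j}) = \exp(\q{i} \cdot \k{j}^T / \sqrt{d})$; the dimensions and random inputs are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 8                       # sequence length, token/head dimension
X = rng.normal(size=(N, d))       # input tokens x_1 .. x_N
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
O = np.zeros((N, d))
for i in range(N):
    s = np.exp(Q[i] @ K[: i + 1].T / np.sqrt(d))  # s(q_i, k_j) for j <= i
    a = s / s.sum()                               # normalized weights a_{i,j}
    O[i] = a @ V[: i + 1]                         # o_i = sum_j a_{i,j} v_j
```

Note the causal restriction $j \leq i$: token $i$ attends only to past acquisitions, which is what makes a recurrent reformulation possible.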
Kernel-based attention mechanisms replace the softmax similarity with a feature-map dot product:
$$\text{s}(\q{i}, \k{j}) = \phi(\q{i}) \cdot \phi(\k{j})^T$$
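Because the similarity factorizes through $\phi$, the causal sum can be rewritten as a running state, which is the "dual form": a parallel computation for training and an equivalent recurrence for inference. A minimal sketch, assuming $\phi(x) = \text{elu}(x) + 1$ (a common choice; the paper's feature map may differ):

```python
import numpy as np

def phi(x):
    # elu(x) + 1, elementwise: a positive feature map
    return np.where(x > 0, x + 1.0, np.exp(x))

rng = np.random.default_rng(1)
N, d = 6, 4
Q, K, V = rng.normal(size=(3, N, d))
fQ, fK = phi(Q), phi(K)

# Parallel form: o_i = phi(q_i) sum_{j<=i} phi(k_j)^T v_j, normalized.
O_par = np.zeros((N, d))
for i in range(N):
    num = fQ[i] @ (fK[: i + 1].T @ V[: i + 1])
    den = fQ[i] @ fK[: i + 1].sum(axis=0)
    O_par[i] = num / den

# Recurrent form: carry a (d x d) state S and normalizer z; each new token
# costs O(d^2) and past tokens are never reprocessed.
S, z = np.zeros((d, d)), np.zeros(d)
O_rec = np.zeros((N, d))
for i in range(N):
    S += np.outer(fK[i], V[i])
    z += fK[i]
    O_rec[i] = (fQ[i] @ S) / (fQ[i] @ z)

assert np.allclose(O_par, O_rec)  # the two forms agree exactly
```

The recurrent form is what makes incremental processing of new acquisitions cheap: only the state $(S, z)$ must be kept between satellite passes.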
$$a_{i,j} = \gamma^{i-j} \phi^{\text{ret}}(\q{i}) \cdot \phi^{\text{ret}}(\k{j})^T$$
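Retention adds an explicit exponential decay $\gamma^{i-j}$, which also folds into a recurrence: the state is simply scaled by $\gamma$ before each update. A sketch with the identity as $\phi^{\text{ret}}$ and $\gamma = 0.9$ as illustrative placeholders for the paper's parameterization:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, gamma = 6, 4, 0.9
Q, K, V = rng.normal(size=(3, N, d))

# Parallel form: o_i = sum_{j<=i} gamma^(i-j) (q_i . k_j) v_j  (unnormalized)
O_par = np.zeros((N, d))
for i in range(N):
    w = (gamma ** (i - np.arange(i + 1))) * (K[: i + 1] @ Q[i])
    O_par[i] = w @ V[: i + 1]

# Recurrent form: S_i = gamma * S_{i-1} + k_i^T v_i, then o_i = q_i S_i
S = np.zeros((d, d))
O_rec = np.zeros((N, d))
for i in range(N):
    S = gamma * S + np.outer(K[i], V[i])
    O_rec[i] = Q[i] @ S

assert np.allclose(O_par, O_rec)
```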
The proposed Multi-Modal Spectro-Spatio-Temporal Encoder (MMSSTE) integrates dual-form attention mechanisms for efficient processing of Sentinel-1 and Sentinel-2 satellite time series.
We compare several dual-form attention mechanisms:
Additionally, we propose temporal variants (Time CosFormer, Time LinRoFormer, Time Retention) that compute distances based on actual acquisition dates rather than sequence indices, addressing SITS temporal irregularity.
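For retention-style decay, replacing the index distance $i - j$ with the elapsed time $t_i - t_j$ preserves the dual form: the recurrent state is scaled by $\gamma^{\Delta t}$ for the time elapsed since the previous acquisition. A sketch under assumed values (per-day $\gamma$ and the date vector are illustrative, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 5, 4
gamma = 0.98                              # decay per day (assumed)
t = np.array([0, 6, 12, 13, 25], float)   # irregular acquisition dates, in days
Q, K, V = rng.normal(size=(3, N, d))

# Parallel form with date-based decay gamma^(t_i - t_j)
O_par = np.zeros((N, d))
for i in range(N):
    w = (gamma ** (t[i] - t[: i + 1])) * (K[: i + 1] @ Q[i])
    O_par[i] = w @ V[: i + 1]

# Recurrent form: scale the state by gamma^dt for the elapsed time dt
S = np.zeros((d, d))
O_rec = np.zeros((N, d))
prev_t = t[0]
for i in range(N):
    S = gamma ** (t[i] - prev_t) * S + np.outer(K[i], V[i])
    prev_t = t[i]
    O_rec[i] = Q[i] @ S

assert np.allclose(O_par, O_rec)
```

Two acquisitions one day apart thus interact far more strongly than two a month apart, regardless of how many acquisitions fall in between.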
We evaluate dual-form mechanisms on a forecasting task where the model predicts the next satellite acquisition given past observations.
Multi-modal approaches consistently outperform mono-modal baselines on real-world construction monitoring.
Dual-form mechanisms achieve comparable performance to standard Transformers while enabling efficient recurrent inference.
Models generalize well to longer sequences than seen during training, with temporal variants showing better adaptation.
@article{dumeur2026dualform,
title={Comparative analysis of dual-form networks for live land monitoring
using multi-modal satellite image time series},
author={Dumeur, Iris and Anger, J{\'e}r{\'e}my and Facciolo, Gabriele},
journal={ISPRS},
year={2026}
}
This work was financed by the Agence Innovation Défense (AID), within the framework of the Dual Innovation Support Scheme (RAPID - Régime d'APpui à l'Innovation Duale), for the project 'DETEVENT' (Agreement No. 2024 29 0970).
This work was granted access to the HPC resources of IDRIS under the allocations 2025-AD011016513 and 2025-AD011012453R4 made by GENCI.
Sentinel-1 and Sentinel-2 data are provided by the Copernicus programme. This paper contains modified Copernicus Sentinel data.