Comparative Analysis of Dual-Form Networks for Live Land Monitoring Using Multi-Modal Satellite Image Time Series

1 Kayrros SAS
2 Université Paris-Saclay, ENS Paris-Saclay, CNRS, Centre Borelli, 91190, Gif-sur-Yvette, France
3 Institut Universitaire de France

Abstract

Multi-modal Satellite Image Time Series (SITS) analysis faces significant computational challenges for live land monitoring applications. While Transformer architectures excel at capturing temporal dependencies and fusing multi-modal data, their quadratic computational complexity and the need to reprocess entire sequences for each new acquisition limit their deployment for regular, large-area monitoring.

This paper studies dual-form attention mechanisms for efficient multi-modal SITS analysis. These mechanisms enable parallel training while supporting recurrent inference for incremental processing. We compare linear attention and retention mechanisms within a multi-modal spectro-temporal encoder. To address the SITS-specific challenges of temporal irregularity and misalignment, we develop temporal adaptations of dual-form mechanisms that compute token distances based on actual acquisition dates rather than sequence indices.

Our approach is evaluated on two tasks using Sentinel-1 and Sentinel-2 data: multi-modal SITS forecasting as a proxy task, and real-world solar panel construction monitoring. Experimental results demonstrate that dual-form mechanisms achieve performance comparable to standard Transformers while enabling efficient recurrent inference. The multi-modal framework consistently outperforms mono-modal approaches across both tasks, demonstrating the effectiveness of dual-form mechanisms for sensor fusion.

Key Contributions

  • A unified formulation of dual-form mechanisms for SITS analysis
  • A novel multi-modal spectro-spatio-temporal architecture that enables parallel training and efficient inference for land monitoring applications
  • Comprehensive experimental results demonstrating that dual-form mechanisms are effective for multi-modal SITS analysis across both forecasting and multi-temporal segmentation tasks

Dual-Form Neural Network

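Let $X$ be a sequence of $N$ tokens $(\mathbf{x}_j)_{1 \leq j \leq N}$. Each token is projected to a query, a key, and a value:
$\mathbf{q}_i = \mathbf{x}_i \cdot W_Q$, $\mathbf{k}_j = \mathbf{x}_j \cdot W_K$, $\mathbf{v}_i = \mathbf{x}_i \cdot W_V$.
The output at position $i$ is a weighted sum of the values up to that position: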

$$\mathbf{o}_i = \sum_{j=1}^i a_{i,j} \mathbf{v}_j$$

Causal Attention

$$a_{i,j} = \frac{\text{s}(\mathbf{q}_i, \mathbf{k}_j)}{\sum_{l=1}^i \text{s}(\mathbf{q}_i, \mathbf{k}_l)}$$

Kernel-based attention mechanisms:

$$\text{s}(\mathbf{q}_i, \mathbf{k}_j) = \phi(\mathbf{q}_i) \cdot \phi(\mathbf{k}_j)^T$$

  • Similarity function is separable
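
Because the similarity is separable, the causal sums over $j$ can be carried in a fixed-size state, which is what gives the dual form. The following is a minimal sketch, not the authors' implementation: it assumes the feature map $\phi(x) = \mathrm{elu}(x) + 1$ (a common choice that is not stated above), and all function names are hypothetical. It computes the same outputs once in the parallel form used for training and once in the recurrent form used for incremental inference.

```python
import math


def phi(x):
    # Assumed feature map: elu(x) + 1, which keeps every feature positive.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def parallel_linear_attention(Q, K, V):
    """o_i = sum_{j<=i} a_ij v_j, with a_ij = s(q_i, k_j) / sum_{l<=i} s(q_i, k_l)."""
    fQ, fK = [phi(q) for q in Q], [phi(k) for k in K]
    d_v = len(V[0])
    outputs = []
    for i in range(len(Q)):
        scores = [dot(fQ[i], fK[j]) for j in range(i + 1)]
        norm = sum(scores)
        outputs.append(
            [sum(scores[j] * V[j][c] for j in range(i + 1)) / norm for c in range(d_v)]
        )
    return outputs


def recurrent_linear_attention(Q, K, V):
    """Same outputs, one token at a time, using a constant-size state."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k_j)^T v_j
    z = [0.0] * d_k                        # running sum of phi(k_j)
    outputs = []
    for q, k, v in zip(Q, K, V):
        fk = phi(k)
        for r in range(d_k):
            z[r] += fk[r]
            for c in range(d_v):
                S[r][c] += fk[r] * v[c]
        fq = phi(q)
        norm = dot(fq, z)
        outputs.append(
            [sum(fq[r] * S[r][c] for r in range(d_k)) / norm for c in range(d_v)]
        )
    return outputs
```

In the recurrent form, processing a new acquisition only needs the state `(S, z)`, so the cost per new token does not grow with the length of the past sequence.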

Retention

$$a_{i,j} = \gamma^{i-j} \phi^{\text{ret}}(\mathbf{q}_i) \cdot \phi^{\text{ret}}(\mathbf{k}_j)^T$$

  • $\gamma$ is a decay term
  • $a_{i,j}$ are not normalized
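
The decay term also admits a recurrent form: the state is multiplied by $\gamma$ at each step before the new key-value product is added. The sketch below illustrates this under a simplifying assumption, namely that $\phi^{\text{ret}}$ is the identity (the actual map is not specified above); the function names are hypothetical.

```python
def retention_parallel(Q, K, V, gamma):
    """o_i = sum_{j<=i} gamma^(i-j) (q_i . k_j) v_j, without normalization."""
    d_v = len(V[0])
    outputs = []
    for i in range(len(Q)):
        o = [0.0] * d_v
        for j in range(i + 1):
            # phi_ret is taken to be the identity in this sketch (assumption).
            a = gamma ** (i - j) * sum(q * k for q, k in zip(Q[i], K[j]))
            for c in range(d_v):
                o[c] += a * V[j][c]
        outputs.append(o)
    return outputs


def retention_recurrent(Q, K, V, gamma):
    """Same outputs via the state update S_i = gamma * S_{i-1} + k_i^T v_i."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(Q, K, V):
        for r in range(d_k):
            for c in range(d_v):
                S[r][c] = gamma * S[r][c] + k[r] * v[c]
        outputs.append([sum(q[r] * S[r][c] for r in range(d_k)) for c in range(d_v)])
    return outputs
```

Since the weights are not normalized, no running denominator is needed; the state `S` alone is enough to process the next token.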

Architecture Overview

The proposed Multi-Modal Spectro-Spatio-Temporal Encoder (MMSSTE) integrates dual-form attention mechanisms for efficient processing of Sentinel-1 and Sentinel-2 satellite time series.

Figure 1: Overview of the proposed multi-modal spectro-spatio-temporal architecture, composed of modality-specific spectro-spatial encoders (denoted gS1 and gS2 for Sentinel-1 and Sentinel-2, respectively), a temporal fusion encoder, an upsampling operation with pixel shuffle (P.S.), and a task-specific decoder layer.
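
For readers unfamiliar with pixel shuffle, the sketch below shows the rearrangement it performs. It assumes the standard channel-first definition, in which a tensor of shape (C·r², H, W) becomes (C, H·r, W·r); the upscaling factor and tensor layout used in the actual architecture are not stated here, and the function name is hypothetical.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into a (C, H*r, W*r) nested list."""
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    if channels % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    out_channels = channels // (r * r)
    out = [
        [[0.0] * (width * r) for _ in range(height * r)]
        for _ in range(out_channels)
    ]
    for c in range(out_channels):
        for h in range(height):
            for w in range(width):
                # Each group of r*r input channels fills one r x r output patch.
                for i in range(r):
                    for j in range(r):
                        out[c][h * r + i][w * r + j] = x[c * r * r + i * r + j][h][w]
    return out
```

The operation trades channel depth for spatial resolution without interpolation, which is why it can serve as an upsampling step before a decoder.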

Dual-Form Mechanisms

We compare several dual-form attention mechanisms from two families: linear attention and retention.

Additionally, we propose temporal variants (Time CosFormer, Time LinRoFormer, Time Retention) that compute distances based on actual acquisition dates rather than sequence indices, addressing SITS temporal irregularity.
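
The exact formulation of the temporal variants is not given here. One plausible reading, assumed for the sketch below, is that Time Retention replaces the index gap $i-j$ in the decay exponent with the gap $t_i - t_j$ between acquisition dates. The time scale `tau`, the identity feature map, and all function names are hypothetical choices made for illustration.

```python
def time_decay(gamma, t_i, t_j, tau=1.0):
    # Decay driven by the elapsed time (e.g. in days) instead of the index gap.
    return gamma ** ((t_i - t_j) / tau)


def time_retention_parallel(Q, K, V, dates, gamma, tau=1.0):
    """o_i = sum_{j<=i} gamma^((t_i - t_j)/tau) (q_i . k_j) v_j."""
    d_v = len(V[0])
    outputs = []
    for i in range(len(Q)):
        o = [0.0] * d_v
        for j in range(i + 1):
            weight = time_decay(gamma, dates[i], dates[j], tau)
            a = weight * sum(q * k for q, k in zip(Q[i], K[j]))
            for c in range(d_v):
                o[c] += a * V[j][c]
        outputs.append(o)
    return outputs


def time_retention_recurrent(Q, K, V, dates, gamma, tau=1.0):
    """Same outputs; the state decays by the time elapsed since the last token."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    previous = dates[0]
    outputs = []
    for q, k, v, t in zip(Q, K, V, dates):
        decay = time_decay(gamma, t, previous, tau)
        for r in range(d_k):
            for c in range(d_v):
                S[r][c] = decay * S[r][c] + k[r] * v[c]
        outputs.append([sum(q[r] * S[r][c] for r in range(d_k)) for c in range(d_v)])
        previous = t
    return outputs
```

With regularly spaced dates and `tau` equal to the sampling interval, this reduces to index-based retention; with irregular dates, an acquisition that follows a long gap relies less on the older state. The sketch expects `dates` sorted in increasing order.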

Supervised Training Task

Multi-Modal Forecasting Task

We evaluate dual-form mechanisms on a forecasting task where the model predicts the next satellite acquisition given past observations.

Figure 2: Multi-modal forecasting framework
Figure 3: Visualization of the reconstruction in the multi-modal forecasting task using the Time CosFormer mechanism

Solar Panel Detection

Multi-modal approaches consistently outperform mono-modal baselines on real-world construction monitoring.

Figure 4: Solar panel detection results

Additional Results

Forecasting Performance

Dual-form mechanisms achieve comparable performance to standard Transformers while enabling efficient recurrent inference.


Look-back Length Effect

Models generalize well to longer sequences than seen during training, with temporal variants showing better adaptation.


Citation

@article{dumeur2026dualform,
  title={Comparative analysis of dual-form networks for live land monitoring 
         using multi-modal satellite image time series},
  author={Dumeur, Iris and Anger, J{\'e}r{\'e}my and Facciolo, Gabriele},
  journal={ISPRS},
  year={2026}
}

Acknowledgments

This work was financed by the Agence Innovation Défense (AID), within the framework of the Dual Innovation Support Scheme (RAPID - Régime d'APpui à l'Innovation Duale), for the project 'DETEVENT' (Agreement No. 2024 29 0970).

This work was granted access to the HPC resources of IDRIS under the allocations 2025-AD011016513 and 2025-AD011012453R4 made by GENCI.

Sentinel-1 and Sentinel-2 data is from Copernicus. This paper contains modified Copernicus Sentinel data.