
GoldSSM: Selective State Space Models for Gold Price Forecasting

Architecture & Models · Rahul S. P.

Abstract

We present GoldSSM, a selective state space model for intraday gold price direction forecasting. The architecture combines a Variable Selection Network, a stack of Mamba blocks with selective scan, and temporal attention pooling. At 2.0M parameters, GoldSSM serves as a drop-in replacement for Transformer-based models with identical forward signatures, while offering linear-time sequence processing and improved handling of long-range dependencies in financial time series.

1. Introduction

Transformer architectures have become the dominant paradigm in sequence modeling, achieving state-of-the-art results across natural language processing, computer vision, and increasingly, financial time series forecasting. However, the self-attention mechanism at the core of Transformers incurs O(T²) time and memory complexity with respect to sequence length T, which becomes prohibitive for financial applications requiring long context windows.

In intraday gold trading, our production system processes four temporal scales simultaneously: SHORT (30 bars), MID (60 bars), LONG (120 bars), and SLOW (240 bars). At the SLOW scale, a Transformer with standard self-attention over 240 tokens at 128-dimensional embeddings requires substantial attention score computations per head per layer — a significant computational burden for real-time M1/M2 bar inference. The problem compounds when we consider that each of the four scales requires its own encoder, and inference must complete within the bar interval (60–120 seconds) on commodity GPU hardware.

The quadratic scaling of self-attention creates a fundamental tension in financial time series modeling. Longer context windows capture more regime information — a 240-bar M2 window spans 8 hours of trading, enough to observe full session transitions and intraday trend cycles. But the computational cost grows quadratically with window length, forcing practitioners to choose between context richness and inference speed. Linear-time alternatives would eliminate this trade-off entirely.

State Space Models (SSMs) offer a principled resolution. Rooted in control theory, SSMs model sequences through a latent state that evolves according to learned dynamics. Classical linear SSMs (S4, S5) demonstrated that structured state spaces could match or exceed Transformer performance on long-range benchmarks while maintaining O(T) complexity. However, these models used fixed state transition matrices — the same dynamics applied to every input token regardless of content. This input-independence is a fundamental limitation for financial data, where the appropriate memory horizon varies dramatically between consolidation periods (retain long history) and breakout events (rapidly update state).

Mamba (Gu & Dao, 2023) introduced selective scan: input-dependent state transition parameters that allow the model to dynamically control what to remember and what to forget. The step size Δ, input matrix B, and output matrix C are all functions of the current input, enabling content-aware state updates while preserving O(T) complexity through a hardware-efficient parallel scan algorithm.

We propose GoldSSM, a selective state space model based on the Mamba architecture that processes sequences in O(T) time while maintaining the ability to selectively attend to relevant historical patterns. GoldSSM serves as a drop-in replacement for our existing Transformer-based TrendMRModel, sharing identical forward signatures and output tuples, enabling direct A/B comparison without infrastructure changes. The architecture consists of four components: a Variable Selection Network for per-timestep feature gating, a Mamba block stack for linear-time sequence modeling, temporal attention pooling for learned aggregation, and regime-specialist output heads for trade decisions.

2. Architecture

The GoldSSM architecture consists of four main components arranged in a sequential pipeline:

Pipeline: Input $(B, T, F{=}107)$ → Variable Selection Network → Mamba Block Stack ($\times 2$) → Temporal Attention Pooling → Output Heads → $(p_{\text{trade}}, p_{\text{up}}, p_{\text{down}}, p_{\text{hold}}, \text{recon})$


Figure 1: GoldSSM architecture. Input features pass through a Variable Selection Network for per-timestep gating, a stack of two Mamba blocks for linear-time sequence modeling, temporal attention pooling for learned aggregation, and five regime-specialist output heads.


Figure 2: Variable Selection Network feature alignment visualised in 3D, showing how the VSN projects heterogeneous input features into a uniform embedding space where feature importance can be dynamically gated.

2.1 Variable Selection Network (VSN)

Financial feature sets are inherently heterogeneous: some features (e.g., residual z-score, acceleration) carry strong signal at certain times and pure noise at others. Rather than relying on offline feature selection via AUC pruning — which produces a static feature set that cannot adapt to changing market regimes — we employ a Variable Selection Network that performs per-timestep soft feature gating. The VSN is inspired by the variable selection mechanism in Temporal Fusion Transformers (Lim et al., 2021), but adapted for the SSM context with several important modifications.

The VSN operates through two parallel paths that interact via a gating mechanism:

Value Path (Input Projection): The raw input tensor x of shape (B, T, F) where $F{=}107$ is projected through a linear layer to the embedding dimension: $\mathbf{v} = W_v \cdot \mathbf{x} + b_v$, producing a tensor of shape (B, T, embed_dim) where embed_dim=128. This projection maps the heterogeneous feature space (which mixes z-scores, binary indicators, ratios, and raw prices) into a uniform representation space where features can be meaningfully compared and combined.

Prototype Path (Learnable Feature Embeddings): A separate learnable embedding matrix P of shape (F, embed_dim) stores a "prototype" vector for each of the 107 input features. These prototypes are not input-dependent — they are learned during training and represent the model's prior belief about each feature's typical informativeness and role. The prototype matrix is projected through a selector MLP with hidden dimension VSN_HID=64: $\mathbf{g} = \text{MLP}(P \cdot W_p)$, where the MLP consists of two linear layers with SiLU activation. The selector MLP transforms the static prototypes into gate logits that determine feature importance.

Per-Timestep Soft Gating: The gating mechanism combines the value and prototype paths via element-wise multiplication after sigmoid activation: $\mathbf{o} = \sigma(\mathbf{g}) \odot \mathbf{v}$. The sigmoid produces gate values in [0, 1] for each feature dimension at each timestep, allowing smooth interpolation between fully passing (gate=1) and fully suppressing (gate=0) each feature's contribution. Unlike hard attention or top-k selection, soft gating is fully differentiable and produces gradients for all features, enabling end-to-end learning of feature importance.

Context Modulation: The regime embedding — a learned vector representing the current market state (trending, mean-reverting, high-volatility, etc.) — conditions the gate biases. Specifically, the context vector ctx of shape (B, embed_dim) is projected to a bias vector that is added to the gate logits before the sigmoid: $\mathbf{g}' = \mathbf{g} + W_{\text{ctx}} \cdot \mathbf{c}$. This allows the VSN to modulate feature importance based on the detected regime. For example, the Hurst exponent (a complexity measure) should receive high gate values during trending regimes where it is informative, but low values during mean-reverting consolidation where it contributes noise.

Memory Efficiency: A naive implementation of feature selection might materialize a full (B, T, F, embed_dim) tensor representing per-feature, per-timestep embeddings before gating. The VSN avoids this by operating in the projected space: the value path projects F features down to embed_dim before gating, so the working memory is O(B × T × max(F, embed_dim)). For our configuration with F=107 and embed_dim=128, this means peak memory is O(B × T × 128), never the prohibitive O(B × T × F × embed_dim) = O(B × T × 13,696) that a naive per-feature embedding would require.

The output is a tensor of shape (B, T, 128), where each timestep's representation reflects only the features deemed relevant by the network given the current regime context. Empirically, we observe that the VSN learns to suppress features known to be noisy (e.g., features with AUC near 0.500) while amplifying high-AUC features, but crucially, the gating weights vary across timesteps — a feature suppressed during consolidation may be amplified during a breakout.
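
The two-path gating described above can be sketched in a few lines of PyTorch. This is a simplified reconstruction, not the production module: the class and attribute names (`VSNSketch`, `selector`, `ctx_proj`) are illustrative, and the selector MLP here emits one logit per feature, folding the $W_p$ projection into the MLP itself.

```python
import torch
import torch.nn as nn

class VSNSketch(nn.Module):
    """Minimal Variable Selection Network sketch (hypothetical names).

    Prototype path: learned (n_feat, d_embed) prototypes -> selector MLP ->
    one gate logit per feature. Context path: the regime embedding biases
    the logits before the sigmoid. Value path: gated features are projected
    to d_embed, so peak memory stays O(B*T*max(n_feat, d_embed)) rather
    than O(B*T*n_feat*d_embed).
    """
    def __init__(self, n_feat=107, d_embed=128, vsn_hid=64):
        super().__init__()
        self.prototypes = nn.Parameter(0.02 * torch.randn(n_feat, d_embed))
        self.selector = nn.Sequential(
            nn.Linear(d_embed, vsn_hid), nn.SiLU(), nn.Linear(vsn_hid, 1))
        self.ctx_proj = nn.Linear(d_embed, n_feat)   # regime -> gate bias
        self.value_proj = nn.Linear(n_feat, d_embed)

    def forward(self, x, ctx):
        # x: (B, T, n_feat), ctx: (B, d_embed)
        logits = self.selector(self.prototypes).squeeze(-1)  # (n_feat,)
        logits = logits + self.ctx_proj(ctx)                 # (B, n_feat)
        gate = torch.sigmoid(logits).unsqueeze(1)            # (B, 1, n_feat)
        return self.value_proj(x * gate)                     # (B, T, d_embed)

vsn = VSNSketch()
out = vsn(torch.randn(2, 30, 107), torch.randn(2, 128))
```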

2.2 Mamba Block Stack

The core sequence modeling component is a stack of two Mamba blocks, each implementing the selective state space model with input-dependent state transitions. Understanding the selective scan algorithm requires first reviewing the classical SSM formulation, then examining how Mamba makes it adaptive.

Classical SSM (Continuous-Time): A linear state space model defines a continuous-time dynamical system with state h(t), input x(t), and output y(t):

The continuous-time dynamics:

$$h'(t) = A \cdot h(t) + B \cdot x(t)$$ $$y(t) = C \cdot h(t)$$

where $A$ is the state matrix ($d_{\text{state}} \times d_{\text{state}}$), $B$ is the input matrix ($d_{\text{state}} \times 1$), and $C$ is the output matrix ($1 \times d_{\text{state}}$). For discrete sequences, these continuous parameters must be discretized using a step size $\Delta$:

Discretization:

$$\bar{A} = \exp(\Delta A), \quad \bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I) \cdot \Delta B$$

Discrete recurrence:

$$h_t = \bar{A} \cdot h_{t-1} + \bar{B} \cdot x_t$$ $$y_t = C \cdot h_t$$

In classical SSMs (S4, S5), A, B, C, and Δ are fixed parameters learned during training but constant across all inputs. This means the model applies identical dynamics to every timestep — a consolidation bar and a breakout bar receive the same state transition.
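
The ZOH discretization and discrete recurrence above are easy to check numerically. The sketch below uses a diagonal, stable $A$ so the matrix exponential reduces to an elementwise exp; all sizes here are toy values, not the production configuration.

```python
import numpy as np

# Zero-order-hold discretization of a classical (fixed-parameter) SSM.
# Diagonal A makes exp(dt*A) and (dt*A)^-1 elementwise operations.
d_state, T, dt = 8, 16, 0.1
rng = np.random.default_rng(0)
A = -np.abs(rng.standard_normal(d_state))        # stable (negative) diagonal
B = np.ones(d_state)
C = rng.standard_normal(d_state)

Abar = np.exp(dt * A)                            # exp(ΔA)
Bbar = (Abar - 1.0) / (dt * A) * (dt * B)        # (ΔA)^-1 (exp(ΔA) - I) ΔB

x = rng.standard_normal(T)                       # scalar input sequence
h = np.zeros(d_state)
ys = []
for t in range(T):
    h = Abar * h + Bbar * x[t]                   # h_t = Ābar h_{t-1} + B̄bar x_t
    ys.append(C @ h)                             # y_t = C h_t
y = np.array(ys)
```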

Mamba's Selective Scan Innovation: Mamba makes the discretization parameters input-dependent. Specifically:

  • Input-dependent Δ (step size): $\Delta_t = \text{softplus}(W_\Delta \cdot x_t + b_\Delta)$. Different inputs produce different step sizes, controlling how much the state is updated. A large Δ means "pay attention to this input and update the state significantly"; a small Δ means "mostly ignore this input and retain the previous state." For financial data, high-volatility bars naturally produce larger Δ values, causing faster state updates during regime transitions.
  • Input-dependent B (input matrix): $B_t = W_B \cdot x_t$. The input projection into the state space varies per timestep, allowing the model to selectively route different aspects of the input into the latent state.
  • Input-dependent C (output matrix): $C_t = W_C \cdot x_t$. The readout from the state space also varies per timestep, allowing the model to extract different state components depending on the current context.
  • Fixed A (state matrix): The state transition matrix A remains fixed (learned but input-independent). It is initialized using the HiPPO (High-order Polynomial Projection Operator) framework, which provides optimal initialization for capturing long-range dependencies.

State Update (Discrete Recurrence): At each timestep t, the selective scan performs:

The selective scan at each timestep $t$:

$$\Delta_t = \text{softplus}(W_\Delta \cdot x_t + b_\Delta)$$ $$B_t = W_B \cdot x_t, \quad C_t = W_C \cdot x_t$$ $$\bar{A}_t = \exp(\Delta_t \cdot A), \quad \bar{B}_t = \Delta_t \cdot B_t$$ $$h_t = \bar{A}_t \cdot h_{t-1} + \bar{B}_t \cdot x_t$$ $$y_t = C_t \cdot h_t$$

The state vector h_t has dimension d_state=8, meaning the model maintains an 8-dimensional summary of all past inputs, updated selectively based on input content. This is dramatically more compact than a Transformer's key-value cache, which stores the full T × d representation.
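
A reference implementation of the selective scan, written as a plain sequential loop that follows the per-timestep equations above and nothing more. The weight names (`W_delta`, `W_B`, `W_C`) and random initializations are illustrative, not the production parameterization.

```python
import torch
import torch.nn.functional as func  # deliberately not "F" (see Section 3.3)

def selective_scan(x, A, W_delta, b_delta, W_B, W_C):
    """Sequential reference selective scan.

    x: (B, T, d); A: (d, N) fixed state matrix. Delta_t, B_t, C_t are
    computed from x_t at every step, as in the equations above.
    """
    bsz, T, d = x.shape
    N = A.shape[1]
    h = x.new_zeros(bsz, d, N)                        # running state
    ys = []
    for t in range(T):
        xt = x[:, t]                                  # (B, d)
        delta = func.softplus(xt @ W_delta + b_delta) # (B, d) step sizes
        Bt, Ct = xt @ W_B, xt @ W_C                   # (B, N) each
        Abar = torch.exp(delta.unsqueeze(-1) * A)     # (B, d, N)
        Bbar = delta.unsqueeze(-1) * Bt.unsqueeze(1)  # (B, d, N)
        h = Abar * h + Bbar * xt.unsqueeze(-1)        # state update
        ys.append(torch.einsum("bdn,bn->bd", h, Ct))  # y_t = C_t · h_t
    return torch.stack(ys, dim=1)                     # (B, T, d)

d, N = 128, 8
A = -torch.rand(d, N) - 0.1                           # negative: decaying state
y = selective_scan(torch.randn(2, 16, d), A,
                   0.02 * torch.randn(d, d), torch.zeros(d),
                   0.02 * torch.randn(d, N), 0.02 * torch.randn(d, N))
```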

Causal Convolution (d_conv=4): Before the SSM scan, each Mamba block applies a depth-wise causal convolution with kernel size d_conv=4 over the sequence dimension. This provides local context — the current token can see the 3 preceding tokens through the convolution — before the SSM processes the full sequence. The convolution serves an analogous role to the TCN frontend in the Transformer variant: it captures local patterns (price bars within the same candle pattern, adjacent volume spikes) that the SSM can then integrate over longer horizons.
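
Causality in the convolution is enforced by padding and trimming; a minimal sketch (layer configuration follows the text, variable names are illustrative):

```python
import torch
import torch.nn as nn

# Depth-wise causal convolution with d_conv=4: pad both sides by d_conv-1,
# then trim the right overhang so position t sees only inputs t-3..t.
# groups=d makes the convolution depth-wise (one filter per channel).
d, d_conv = 128, 4
conv = nn.Conv1d(d, d, kernel_size=d_conv, groups=d, padding=d_conv - 1)

x = torch.randn(2, 30, d)                        # (B, T, d)
h = conv(x.transpose(1, 2))[..., : x.shape[1]]   # trim overhang -> causal
h = h.transpose(1, 2)                            # back to (B, T, d)
```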

Dual Implementation Strategy: The selective scan can be computed in two ways:

  • Sequential JIT scan: A simple loop over T timesteps, compiled with TorchScript for efficiency. Used when $T > 60$ (MID, LONG, SLOW scales). Time complexity O(T · d · d_state), memory O(d_state) for the running state.
  • Parallel associative scan: Exploits the associative property of the recurrence to compute all timesteps in parallel on GPU, trading memory for speed. Available via the PyTorch associative scan operator but currently disabled (_PARALLEL_SCAN_MAX_T=0) due to numerical stability concerns with half-precision training. When enabled, would be used for T ≤ 60 (SHORT scale).

Between the two stacked Mamba blocks, dropout of 0.15 is applied. This rate was selected empirically: 0.05 was insufficient for regularization on noisy financial features, while 0.25 impaired the model's ability to learn regime transitions.

| Parameter | Value | Description |
|---|---|---|
| $d_{\text{model}}$ | 128 | Model hidden (embedding) dimension |
| $d_{\text{state}}$ | 8 | SSM state dimension (latent state size per channel) |
| $d_{\text{conv}}$ | 4 | Local convolution width (causal, depth-wise) |
| expand | 1 | Inner dimension expansion factor (no expansion) |
| $n_{\text{layers}}$ | 2 | Stacked Mamba blocks with residual connections |
| dropout | 0.15 | Applied after each block (between layers and after stack) |
| VSN hidden dimension | 64 | Selector MLP hidden dimension in VSN |

For financial time series, the selective scan mechanism is a natural fit. During low-volatility consolidation periods, the model should retain longer-term trend information (small Δ, slow state updates). During high-volatility breakouts, it should rapidly update its state to reflect the new regime (large Δ, fast state updates). The selective mechanism enables this adaptive behavior without explicit regime detection — the model learns the appropriate Δ dynamics end-to-end from the training signal.


Figure 4: Regime gate activation heatmap showing how the VSN dynamically modulates feature importance across different market regimes. Brighter values indicate higher gate activation (feature passed through); darker values indicate suppression.

2.3 Multi-Scale Streams

Financial markets exhibit different patterns at different temporal horizons. A 30-bar window captures microstructure dynamics (candle patterns, tick momentum), while a 240-bar window reveals intraday trend structure and session transitions. Processing all horizons with a single sequence length forces the model to compromise between local precision and global context. Multi-scale processing eliminates this compromise.

GoldSSM processes four temporal scales independently, each with its own VSN and Mamba block stack:

| Stream | Sequence Length | Temporal Span (M2) | Captures |
|---|---|---|---|
| SHORT | 30 bars | ~1 hour | Microstructure, candle patterns, immediate momentum |
| MID | 60 bars | ~2 hours | Intra-session trends, volatility clustering |
| LONG | 120 bars | ~4 hours | Session-level structure, London/NY transitions |
| SLOW | 240 bars | ~8 hours | Full session context, multi-session regime |

Each stream applies the identical architecture — VSN followed by two Mamba layers followed by temporal attention pooling — but with independent weights. The VSN in the SHORT stream may learn to gate different features than the SLOW stream's VSN; the Mamba blocks in each stream learn scale-appropriate state dynamics. After pooling, each stream produces a vector of shape (B, 128).

Multi-Scale Fusion: The four stream outputs are concatenated to form a (B, 512) tensor, then projected through a fusion layer: $\mathbf{f} = \text{SiLU}(W_{\text{fuse}} \cdot [\mathbf{s}_{\text{short}}; \mathbf{s}_{\text{mid}}; \mathbf{s}_{\text{long}}; \mathbf{s}_{\text{slow}}])$, producing a $(B, 128)$ representation. The fusion layer learns how to weight and combine information from different temporal scales. This is preferable to averaging or attention-based fusion because the scales carry fundamentally different types of information — the model should learn to extract trend direction from SLOW, entry timing from SHORT, and volatility regime from LONG, rather than treating them as exchangeable views.
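
The fusion step above amounts to a concatenation followed by a SiLU-activated projection; a minimal sketch (the layer name `fuse` is illustrative):

```python
import torch
import torch.nn as nn

# Multi-scale fusion: four pooled stream vectors (B, 128) concatenated
# into (B, 512), projected back to (B, 128) with SiLU activation.
fuse = nn.Sequential(nn.Linear(4 * 128, 128), nn.SiLU())

s_short, s_mid, s_long, s_slow = (torch.randn(2, 128) for _ in range(4))
f = fuse(torch.cat([s_short, s_mid, s_long, s_slow], dim=-1))  # (B, 128)
```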

Why multi-scale matters: In our Transformer baseline, the four ContextTCNTransformer modules serve the same multi-scale role with TCN frontends. The structural analogy is deliberate — it ensures that any performance differences between GoldSSM and the Transformer are attributable to the sequence modeling paradigm (SSM vs. attention), not to differences in multi-scale design.

2.4 Temporal Attention Pooling

After the Mamba stack produces a sequence of hidden states (B, T, 128), these must be aggregated into a fixed-length representation for the output heads. Standard approaches use mean pooling, which treats every timestep as equally informative, or last-token extraction, which discards everything but the final state. Both are suboptimal for financial sequences, where information density varies dramatically: a breakout bar, a session open, or a news-driven spike contains far more decision-relevant information than a consolidation bar in the middle of a quiet range.

We instead use temporal attention pooling with four learned query vectors that attend over the full time dimension. The mechanism operates as follows:

  1. Learned Queries: Four learnable query vectors $Q \in \mathbb{R}^{4 \times 128}$ are initialized randomly and trained end-to-end. Each query learns to attend to a different aspect of the temporal sequence. Unlike Transformer self-attention where queries come from the input itself, these queries are global — they represent the model's learned notion of "what temporal patterns matter for trading decisions."
  2. Multi-Head Cross-Attention: Standard scaled dot-product attention is computed: $\text{Attention}(Q, K{=}H, V{=}H) = \text{softmax}(Q \cdot H^T / \sqrt{d}) \cdot H$, where H is the Mamba output sequence of shape (B, T, 128). This produces four attended vectors, each a weighted sum over all T timesteps, with weights determined by the learned queries. The attention uses n_heads=4 heads with d_k=32 per head.
  3. Concatenation and Projection: The four resulting attended vectors (each 128-dim) are concatenated to form a (B, 512) vector, then projected through a linear layer to the final hidden dimension (B, 128).
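
The three steps above can be sketched as a single module. For clarity this is a single-head version; the full mechanism splits the same computation into n_heads=4 with d_k=32. The class name is illustrative.

```python
import torch
import torch.nn as nn

class TemporalPoolSketch(nn.Module):
    """Temporal attention pooling with learned queries (single-head sketch)."""
    def __init__(self, d=128, n_queries=4):
        super().__init__()
        self.queries = nn.Parameter(0.02 * torch.randn(n_queries, d))
        self.out_proj = nn.Linear(n_queries * d, d)

    def forward(self, H):                                     # H: (B, T, d)
        # scaled dot-product attention of global queries over timesteps
        scores = torch.einsum("qd,btd->bqt", self.queries, H)
        attn = torch.softmax(scores / H.shape[-1] ** 0.5, dim=-1)
        pooled = torch.einsum("bqt,btd->bqd", attn, H)        # (B, 4, d)
        return self.out_proj(pooled.flatten(1))               # (B, d)

z = TemporalPoolSketch()(torch.randn(2, 240, 128))
```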

This mechanism allows the model to learn that recent bars and inflection points (reversals, breakouts) should receive higher weight than bars in the middle of a consolidation range. The four query vectors can specialize: empirically, we observe that different queries attend to different temporal patterns. One query typically focuses on the most recent 5–10 bars (recency bias), another attends to bars with the largest-magnitude hidden states (event detection), a third distributes attention more uniformly (context aggregation), and the fourth shows variable attention patterns that correlate with regime transitions.

Interpretability: The attention weights from temporal pooling are exportable and provide a direct answer to the question "which bars mattered for this prediction?" This is valuable for post-hoc analysis of model decisions — if the model predicts a trade but the attention concentrates on bars from 4 hours ago rather than recent price action, this suggests the model may be relying on stale information and the prediction should be discounted. The attention weights can be logged alongside each prediction for systematic monitoring.

2.5 Output Heads

The final fused representation feeds into five output heads, matching the TrendMRModel interface exactly. The heads are designed as regime specialist heads: in the general case, K heads per output can be instantiated (one per regime cluster), with the active head selected based on the regime embedding. In the default configuration, K=1 (matching REGIME_CLUSTER_K=1), meaning a single head handles all regimes.

| Head | Output | Activation | Purpose | Training Loss |
|---|---|---|---|---|
| $p_{\text{trade}}$ | Scalar in [0,1] | Sigmoid | Trade gating: should we trade at all? Filters out low-conviction periods | Binary cross-entropy against trade/no-trade labels |
| $p_{\text{up}}$ | Scalar in [0,1] | Softmax (joint over 3 direction heads) | Probability of upward move exceeding threshold | Categorical cross-entropy over 3-class direction |
| $p_{\text{down}}$ | Scalar in [0,1] | Softmax (joint) | Probability of downward move exceeding threshold | (shared 3-class loss) |
| $p_{\text{hold}}$ | Scalar in [0,1] | Softmax (joint) | Probability of no significant move (within threshold) | (shared 3-class loss) |
| recon | Vector $(F,)$ | Linear (no activation) | Input reconstruction for anomaly detection | MSE against original input features |

The p_trade head serves as a gating function that precedes direction prediction. In live execution, a trade is only taken when p_trade exceeds a configurable threshold (typically 0.6). This allows the model to express uncertainty — when the input features are ambiguous or the regime is transitioning, the model can output low p_trade to suppress trading, even if the directional heads show a lean.

The direction heads (p_up, p_down, p_hold) are jointly normalized via softmax, ensuring they sum to 1.0. This enforces a coherent probability distribution over outcomes: the model cannot simultaneously predict high probability of both up and down moves. In earlier iterations, these were independent sigmoids, which led to the "hedging" failure mode where p_up and p_down were both driven toward 0.5 by excessive weight decay.
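
The joint-normalization constraint is worth making concrete: softmax over the three direction logits guarantees the probabilities sum to 1, which independent sigmoids do not. The 0.6 gating threshold follows the text; variable names are illustrative.

```python
import torch

# Joint softmax over the (up, down, hold) logits: the model cannot assign
# high probability to both up and down moves, eliminating the "hedging"
# failure mode of independent sigmoids. p_trade gates execution separately.
logits = torch.randn(2, 3)                     # direction logits, batch of 2
p_up, p_down, p_hold = torch.softmax(logits, dim=-1).unbind(-1)
p_trade = torch.sigmoid(torch.randn(2))

take_trade = p_trade > 0.6                     # gate before direction is read
```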

The recon head serves a dual purpose: it acts as an auxiliary training objective (reconstruction loss) that encourages the learned representation to retain information about the input features, and as a runtime anomaly detector. If reconstruction error exceeds a threshold (calibrated on the training set), the input data is likely out-of-distribution — perhaps due to a data feed error, a flash crash, or a regime not seen during training — and the model's directional predictions should be discounted or suppressed.

The drop-in compatibility requirement means GoldSSM returns the same 5-tuple as TrendMRModel: $(p_{\text{trade}}, p_{\text{up}}, p_{\text{down}}, p_{\text{hold}}, \text{recon})$. This allows swapping models in production with zero changes to the execution layer, risk management, or logging infrastructure.

3. Implementation Details

3.1 Hyperparameter Table

| Hyperparameter | Value | Rationale |
|---|---|---|
| $d_{\text{embed}}$ | 128 | Matches Transformer baseline for fair comparison; sufficient for 107 features |
| $d_{\text{state}}$ | 8 | Compact state; financial patterns have low intrinsic dimensionality |
| $d_{\text{conv}}$ | 4 | 4-bar local context; analogous to TCN frontend in Transformer |
| expand | 1 | No inner dimension expansion; keeps parameter count low |
| $n_{\text{layers}}$ | 2 | Sufficient depth for financial patterns; more layers overfit on noisy data |
| $n_{\text{queries}}$ | 4 | Temporal attention pooling queries; matches number of attention heads |
| VSN hidden dimension | 64 | Selector MLP hidden dim; small enough to avoid overfitting the gate |
| dropout | 0.15 | Between 0.05 (too low for noisy features) and 0.25 (impairs learning) |
| $K_{\text{regime}}$ | 1 | Single regime cluster; K=6 fragments data excessively |
| learning rate | 1e-4 | Standard for SSMs; LR=0.01 is catastrophically high for sequence models |
| weight decay | 0.005 | Moderate; higher values cause hedging (p_up, p_down → 0.5) |
| warmup epochs | 3 | Less sensitive than Transformer (which requires 5); linear warmup schedule |

3.2 Forward Signature

GoldSSM accepts the same multi-scale input structure as TrendMRModel:

The model accepts four input tensors corresponding to the four temporal scales — SHORT $(B, 30, 107)$, MID $(B, 60, 107)$, LONG $(B, 120, 107)$, and SLOW $(B, 240, 107)$ — plus a context embedding vector $(B, d_{\text{embed}})$. It returns the five-element output tuple: $(p_{\text{trade}}, p_{\text{up}}, p_{\text{down}}, p_{\text{hold}}, \text{recon})$.

Each scale is processed by its own VSN + Mamba stack, and the four resulting representations are fused before the output heads. This drop-in compatibility means GoldSSM can be swapped into an existing training and inference pipeline with zero code changes outside the model instantiation.
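
An interface stub makes the compatibility contract explicit: any module accepting these five inputs and returning this 5-tuple can be swapped in. Input shapes follow the text; the stub's outputs are dummy values, and the class name is hypothetical.

```python
import torch
import torch.nn as nn

# Drop-in interface stub for the TrendMRModel forward signature.
class GoldSSMStub(nn.Module):
    def forward(self, x_short, x_mid, x_long, x_slow, ctx):
        bsz = x_short.shape[0]
        p = torch.rand(bsz, 4)           # placeholder head outputs
        recon = torch.zeros(bsz, 107)    # placeholder reconstruction
        return p[:, 0], p[:, 1], p[:, 2], p[:, 3], recon

out = GoldSSMStub()(torch.randn(2, 30, 107), torch.randn(2, 60, 107),
                    torch.randn(2, 120, 107), torch.randn(2, 240, 107),
                    torch.randn(2, 128))
```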

3.3 Implementation Notes

Naming Collision (Critical): One naming collision in the implementation required careful attention: the common idiom of unpacking a tensor's shape into variables named B, T, F shadows the module-level functional import (`torch.nn.functional as F`). Throughout the GoldSSM codebase, the feature dimension is therefore named "n_feat" rather than "F". This is enforced by code review and a linting rule, because the consequences of shadowing are subtle: the code runs but uses an integer where a module is expected, producing cryptic type errors.
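
The failure mode is easy to reproduce in isolation:

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 30, 107)

B, T, n_feat = x.shape        # safe: the module alias F is untouched
y = F.softplus(x)             # works as expected

def buggy(t):
    B, T, F = t.shape         # F is rebound to an int inside this scope...
    return F.softplus(t)      # ...so this raises AttributeError at runtime
```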

JIT Compatibility: Additionally, activation function calls are written using fully qualified module paths in regions near JIT-compiled code to avoid scope ambiguity. The JIT compiler captures the local scope at compilation time, and if the functional module alias has been shadowed by a shape unpacking in a calling function, the compiled code will fail at runtime with a misleading type error.

3.4 Parameter Count

| Component | Parameters | % of Total |
|---|---|---|
| Variable Selection Network (×4 streams) | ~420K | 21% |
| Mamba Block Stack (2 layers ×4 streams) | ~980K | 49% |
| Temporal Attention Pooling (×4 streams) | ~270K | 14% |
| Output Heads (5 heads) | ~330K | 16% |
| Total | ~2.0M | 100% |

At 2.0M parameters, GoldSSM is approximately 6× lighter than the Transformer Macro Regimes model it replaces (12.5M parameters). The reduction comes primarily from eliminating the multi-head self-attention layers, which contain O(d²) parameters per head per layer. The SSM-specific parameters of a Mamba block scale as O(d · d_state + d · d_conv), which for d=128, d_state=8, d_conv=4 yields approximately 1.5K parameters per block excluding the shared input/output projections (versus ~65K for a single Transformer self-attention layer with 4 heads).


Figure 6: GoldSSM achieves 6.2x parameter reduction compared to the Transformer Macro Regimes baseline, primarily by replacing multi-head self-attention with the selective scan mechanism.

3.5 Validation

The implementation passes a full self-test suite: CUDA forward pass on random inputs at all four scales, backward pass with gradient accumulation, and output shape verification. The self-test confirms that gradients flow through the selective scan operation (which uses a custom CUDA kernel in the optimized path, or a JIT-compiled Python loop in the fallback path) without numerical issues. The test generates random inputs at each scale, runs a full forward pass, computes a dummy loss (sum of all outputs), and verifies that backpropagation produces non-zero gradients for all learnable parameters.
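
A minimal version of the gradient-flow check looks like the following; the small `nn.Sequential` stands in for the full GoldSSM model, and the pattern (random input, dummy sum loss, per-parameter gradient check) follows the description above.

```python
import torch
import torch.nn as nn

# Gradient-flow self-test sketch: forward on random input, dummy loss =
# sum of outputs, then verify every learnable parameter received a
# nonzero gradient through backpropagation.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(107, 128), nn.SiLU(), nn.Linear(128, 4))

loss = model(torch.randn(2, 30, 107)).sum()
loss.backward()

grads_ok = all(p.grad is not None and p.grad.abs().sum().item() > 0
               for p in model.parameters())
```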

4. Complexity Analysis

The computational advantage of GoldSSM over the Transformer baseline can be analyzed formally by comparing the dominant operations in each architecture: $$\mathcal{O}(T \cdot d \cdot d_{\text{state}}) \text{ (SSM) vs } \mathcal{O}(T^2 \cdot d) \text{ (Transformer)}$$

4.1 Transformer Self-Attention Complexity

For a single self-attention layer with sequence length T and embedding dimension d:

  • QKV projection: 3 linear layers, each O(T · d²) → total O(T · d²)
  • Attention scores: Q · Kᵀ = O(T² · d)
  • Attention output: softmax(scores) · V = O(T² · d)
  • FFN: two linear layers with hidden dim 4d: O(T · d · 4d) = O(T · d²)
  • Total per layer: O(T² · d + T · d²)

For T=240 (SLOW scale), d=128: the attention term dominates with T² · d = 240² × 128 ≈ 7.4M multiply-adds per head per layer.

4.2 GoldSSM Selective Scan Complexity

For a single Mamba block with sequence length T, embedding dimension d, and state dimension N:

  • Input projections (Δ, B, C): O(T · d · N) total
  • Causal convolution: O(T · d · d_conv) = O(T · d) for constant d_conv
  • State update scan: O(T · d · N) for the full recurrence
  • Output projection: O(T · d²) for the gated output
  • Total per layer: O(T · d · N + T · d²) = O(T · d · max(N, d))

For T=240, d=128, N=8: the scan term is T · d · N = 240 × 128 × 8 ≈ 0.25M multiply-adds. Even including the output projection (T · d² ≈ 3.9M), the total is substantially less than the Transformer's attention term alone.
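
These counts can be verified with a few lines of arithmetic:

```python
# Multiply-add counts at the SLOW operating point, reproducing the
# figures quoted in Sections 4.1 and 4.2.
T, d, N = 240, 128, 8

attn_madds = T * T * d   # Transformer attention scores, per head per layer
scan_madds = T * d * N   # selective scan recurrence
proj_madds = T * d * d   # Mamba gated output projection

# attn_madds = 7,372,800 (~7.4M); scan_madds = 245,760 (~0.25M);
# proj_madds = 3,932,160 (~3.9M); attention/scan ratio = 30x.
```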

4.3 Comparison at Operating Scale

| Metric | Transformer (T=240, d=128) | GoldSSM (T=240, d=128, N=8) |
|---|---|---|
| Dominant operation | QKᵀ: O(T² · d) | SSM scan: O(T · d · N) |
| Multiply-adds (per layer) | ~7.4M (attention only) | ~0.25M (scan only) |
| Memory (activations) | O(T²) attention matrix cached | O(N) state vector only |
| Memory at T=240 | 57,600 floats (attention matrix) | 8 floats (state vector) |
| Scaling: double T | 4× cost | 2× cost |
| Scaling: double d | 2× cost (attention), 4× (FFN) | 2× cost (scan), 4× (output proj) |

The ~30× reduction in scan operations at T=240 (and larger savings at longer sequences) translates directly to faster inference and lower GPU memory consumption, enabling real-time execution on commodity hardware. The memory advantage is even more dramatic: the Transformer must store the full T × T attention matrix for backpropagation, while the SSM stores only the d_state-dimensional running state. For T=240, this is a 7,200× reduction in activation memory for the sequence modeling component.

5. Comparison

| Property | Transformer (Macro Regimes) | GoldSSM |
|---|---|---|
| Total parameters | 12.5M | 2.0M |
| Sequence complexity | O(T²) | O(T) |
| Memory at T=240 | O(T²) = 57.6K floats per head | O(N) = 8 floats per channel |
| Sequence scales | 4 (SHORT/MID/LONG/SLOW) | 4 (identical) |
| Feature selection | Static (offline AUC pruning) | Dynamic (VSN, per-timestep) |
| Temporal aggregation | Mean pooling | Learned attention pooling (4 queries) |
| Positional encoding | Sinusoidal / learned positional embeddings | Implicit (causal recurrence encodes position) |
| Local context | TCN frontend (temporal conv) | d_conv=4 causal convolution (built into Mamba block) |
| Interpretability | Attention weights (T×T matrix per head) | VSN gate weights + temporal pooling attention |
| Output interface | (p_trade, p_up, p_down, p_hold, recon) | Identical |
| Drop-in compatible | n/a (defines the reference interface) | Yes |
| Encoder architecture | TCN + 2-layer Transformer (×4 scales) | VSN + 2-layer Mamba (×4 scales) |
| Warmup requirement | 5 epochs (critical: raising LR past E2 destroys signal) | 3 epochs (less sensitive to warmup schedule) |
| Pooling method | Mean pooling (all bars weighted equally) | Attention pooling (learned bar importance) |

The Transformer model uses four ContextTCNTransformer modules, each combining a TCN frontend with a 2-layer Transformer encoder. GoldSSM replaces each with a VSN frontend and 2-layer Mamba stack. The TCN frontend in the Transformer model serves a similar local-context role as the $d_{\text{conv}}{=}4$ local convolution within the Mamba block, making the architectures structurally analogous at the component level despite the different sequence modeling paradigm.

A notable difference is in positional encoding. Transformers require explicit positional embeddings (sinusoidal or learned) because self-attention is permutation-invariant — without positional information, the model cannot distinguish bar ordering. GoldSSM's recurrent structure inherently encodes position through the causal scan: h_t is computed from h_{t-1}, so temporal order is implicit. This eliminates a potential source of error (incorrect positional encoding) and a set of additional parameters.


Figure 5: Hidden state timeline from Markov analysis, illustrating how the model's latent state evolves through different market regimes over the course of a trading day.

6. Conclusion

GoldSSM demonstrates that selective state space models are a viable and efficient alternative to Transformers for intraday financial time series forecasting. At 2.0M parameters (6× fewer than the Transformer baseline) and O(T) sequence complexity (versus O(T²)), the architecture is suitable for real-time inference on M1 and M2 bars across multiple temporal scales simultaneously.

The architecture addresses three specific limitations of the Transformer baseline: (1) quadratic scaling that constrains context window length, (2) static feature selection that cannot adapt to changing regimes, and (3) naive mean pooling that treats all timesteps as equally informative. The VSN provides regime-adaptive feature gating, the Mamba stack provides linear-time sequence modeling with content-aware state dynamics, and temporal attention pooling provides learned aggregation with interpretable weights.

The drop-in compatibility with existing model interfaces enables direct comparison without infrastructure changes, facilitating systematic evaluation of SSM architectures in production trading environments. The selective scan mechanism — where the model learns to control its own memory dynamics based on input content — is particularly well-suited to financial data, where the optimal memory horizon varies continuously with market conditions.

The design philosophy of GoldSSM prioritizes computational parsimony over architectural novelty. Every component serves a specific, measurable purpose: VSN for adaptive feature selection, Mamba for linear-time sequence modeling, temporal attention for learned aggregation. No component was added for its own sake. The architecture's value lies not in any single innovation, but in the disciplined integration of components that collectively address the real constraints of production intraday trading: limited inference time, noisy features, and regime-dependent dynamics.