Contributor
Gabriel Mongaras, Eric Larson
Subject Area
Computer Science
Abstract
Softmax attention (Vaswani et al. [2017], Bahdanau et al. [2015]) has become one of the most important components of modern machine learning models. Despite this wide adoption, exploration of softmax attention itself has been neglected. While softmax attention is a very powerful architectural component, it has the major drawback of being quadratic in sequence length. More specifically, softmax attention has quadratic complexity during training and linear complexity (when using a KV cache) during inference, which limits the sequence length it can model due to system memory constraints. Linear attention (Katharopoulos et al. [2020]) was proposed to overcome this quadratic bottleneck, having linear complexity during training and constant complexity during inference. While the complexity of linear attention is very favorable, the accuracy of naive linear attention models is not. Naive linear attention replaces the exponential in softmax with a decomposable kernel function. Popular choices of this kernel function, such as ReLU, lead to significantly worse modeling accuracy than softmax attention when trained under the same model and data configuration. The first question this thesis attempts to answer is why this accuracy gap exists. Specifically, the gap can be seen via the Taylor expansion of the softmax numerator, while the softmax denominator can be interpreted as a choice of normalization. Understanding the accuracy gap between softmax and linear attention is important; however, one is still left with a choice between a very accurate method that is costly in memory and a method that is much cheaper in memory but much less accurate. Mamba2 (Dao and Gu [2024]) bridges the gap between these two methods. By relating State Space Models (SSMs) (Gu et al. [2022]) to linear attention, Mamba2 adds a decay gate (or A-mask) to the hidden state, among other additions.
The second part of this thesis uses insights from the first part to improve Mamba2 so that it is as accurate as softmax attention while retaining linear training complexity. Code is provided for both parts (Section 3 and Section 4).
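
The complexity contrast described in the abstract can be illustrated with a minimal sketch. It is not taken from the thesis code: the function names, the ReLU feature map, the sum normalization, and the `eps` stabilizer are all assumptions chosen for illustration. The sketch computes causal linear attention in two ways, once in the quadratic form that revisits every earlier token and once as a recurrence over a fixed-size state, and the two should agree up to floating-point error.

```python
import math

def relu(x):
    # Assumed kernel feature map; the abstract names ReLU as one popular choice.
    return [max(0.0, v) for v in x]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax_attention(Q, K, V):
    """Causal softmax attention: token t rescans keys 0..t (quadratic overall)."""
    out = []
    for t in range(len(Q)):
        scores = [dot(Q[t], K[s]) for s in range(t + 1)]
        m = max(scores)  # subtract the max for numerical stability
        w = [math.exp(sc - m) for sc in scores]
        z = sum(w)
        out.append([sum(w[s] * V[s][j] for s in range(t + 1)) / z
                    for j in range(len(V[0]))])
    return out

def linear_attention_quadratic(Q, K, V, phi=relu, eps=1e-6):
    """Same loop as above, with exp(q.k) replaced by phi(q).phi(k)."""
    out = []
    for t in range(len(Q)):
        fq = phi(Q[t])
        w = [dot(fq, phi(K[s])) for s in range(t + 1)]
        z = sum(w) + eps  # eps is an assumed guard against a zero denominator
        out.append([sum(w[s] * V[s][j] for s in range(t + 1)) / z
                    for j in range(len(V[0]))])
    return out

def linear_attention_recurrent(Q, K, V, phi=relu, eps=1e-6):
    """Recurrent form: a d_k x d_v state S and a d_k vector z, updated per token.

    The state size does not depend on sequence length, which gives
    constant cost per step at inference and linear cost over a sequence.
    """
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    z = [0.0] * d_k                        # running sum of phi(k)
    out = []
    for q, k, v in zip(Q, K, V):
        fk = phi(k)
        for i in range(d_k):
            z[i] += fk[i]
            for j in range(d_v):
                S[i][j] += fk[i] * v[j]
        fq = phi(q)
        denom = dot(fq, z) + eps
        out.append([sum(fq[i] * S[i][j] for i in range(d_k)) / denom
                    for j in range(d_v)])
    return out
```

Calling `linear_attention_quadratic` and `linear_attention_recurrent` on the same small `Q`, `K`, `V` (lists of equal-length float vectors) should give matching outputs, while `softmax_attention` on the same inputs generally gives different values, which is the accuracy gap the thesis investigates. The factorization works because the kernel score splits into a product of per-token features, whereas the exponential of a dot product does not split this way with a finite feature map.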
Degree Date
5-2026
Document Type
Thesis
Degree Name
M.S.
Department
Lyle
Advisor
Dr. Eric Larson
Number of Pages
84
Format
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial 4.0 License
Recommended Citation
Mongaras, Gabriel and Larson, Eric, "Softmax Expressiveness With Linear Complexity By Investigating The Linear-Softmax Attention Accuracy Gap" (2026). Computer Science and Engineering Theses and Dissertations. 58.
https://scholar.smu.edu/engineering_compsci_etds/58

Notes
https://github.com/gmongaras/On-the-Expressiveness-of-Softmax-Attention-A-Recurrent-Neural-Network-Perspective
https://github.com/gmongaras/2Mamba2Furious