Computer Science and Engineering Theses and Dissertations

Softmax Expressiveness With Linear Complexity By Investigating The Linear-Softmax Attention Accuracy Gap

Contributor

Gabriel Mongaras, Eric Larson

Subject Area

Computer Science

Abstract

Softmax attention (Vaswani et al. [2017]; Bahdanau et al. [2015]) has become one of the most important components of modern machine learning models. Despite being the most widely adopted modern machine learning algorithm, softmax attention itself has received little direct study. While softmax attention is a very powerful architectural component, it has a major drawback: its cost is quadratic in sequence length. More specifically, softmax attention has quadratic complexity during training and linear complexity (when using a KV cache) during inference, which limits the sequence length it can model due to system memory constraints. Linear attention (Katharopoulos et al. [2020]) was proposed to overcome this quadratic bottleneck, having linear complexity during training and constant complexity during inference. While the complexity of linear attention is very favorable, the accuracy of naive linear attention models is not. Naive linear attention replaces the exponential in softmax with a decomposable kernel function. Popular choices of this kernel function, such as ReLU, lead to significantly worse modeling accuracy than softmax attention when trained under the same model and data configuration. The first question this thesis attempts to answer is why this accuracy gap exists. Specifically, the gap can be seen via the Taylor expansion of the softmax numerator, while the softmax denominator can be interpreted as a choice of normalization. Understanding the accuracy gap between softmax and linear attention is important; however, one is still left with a choice between a very accurate method that is costly in memory and a method that is far cheaper in memory but much less accurate. Mamba2 (Dao and Gu [2024]) bridges the gap between these two methods. By relating State Space Models (SSMs) (Gu et al. [2022]) to linear attention, Mamba2 adds a decay gate (or A-mask) to the hidden state, among other additions.
The second part of this thesis uses insights from the first part to improve Mamba2 until it is as accurate as softmax attention while retaining linear training complexity. Code is provided for both parts (Section 3 and Section 4).
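
As a rough illustration of the complexity claims above, the following NumPy sketch computes causal linear attention with a ReLU kernel in two ways: a parallel form that builds the full n-by-n score matrix, and a recurrent form that carries only a fixed-size state from token to token. This is a minimal sketch under stated assumptions, not the thesis code: the function names, the `eps` stabilizer, and the sum-based normalization are choices made for this example, and the linked repositories should be consulted for the actual implementations.

```python
import numpy as np


def relu(x):
    """Kernel feature map; ReLU is the example named in the abstract."""
    return np.maximum(x, 0.0)


def linear_attention_parallel(Q, K, V, eps=1e-6):
    """Causal linear attention via the full n x n score matrix (quadratic in n)."""
    n = Q.shape[0]
    # Kernel scores replace exp(q . k); the lower-triangular mask enforces causality.
    scores = (relu(Q) @ relu(K).T) * np.tril(np.ones((n, n)))
    # Normalize each row by its sum (eps is an assumption to avoid division by zero).
    return (scores @ V) / (scores.sum(axis=-1, keepdims=True) + eps)


def linear_attention_recurrent(Q, K, V, eps=1e-6):
    """Same computation as a recurrence with a fixed-size state (constant per step)."""
    n, d = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d, d_v))  # running sum of outer(phi(k_s), v_s)
    z = np.zeros(d)         # running sum of phi(k_s), used for normalization
    out = np.empty((n, d_v))
    for t in range(n):
        q, k = relu(Q[t]), relu(K[t])
        S += np.outer(k, V[t])
        z += k
        out[t] = (q @ S) / (q @ z + eps)
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, K, V = rng.normal(size=(3, 16, 8))  # three (16, 8) arrays
    same = np.allclose(
        linear_attention_parallel(Q, K, V),
        linear_attention_recurrent(Q, K, V),
    )
    print("parallel and recurrent forms agree:", same)
```

The recurrent form works because the kernel is decomposable: the score between a query and a key factors into a product of feature maps, so the sum over past keys can be accumulated once in `S` and `z`. The exponential in softmax does not factor this way with a finite feature map, which is why softmax attention needs a KV cache that grows with sequence length.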

Degree Date

5-2026

Document Type

Thesis

Degree Name

M.S.

Department

Computer Science

Advisor

Dr. Eric Larson

Notes

https://github.com/gmongaras/On-the-Expressiveness-of-Softmax-Attention-A-Recurrent-Neural-Network-Perspective

https://github.com/gmongaras/2Mamba2Furious

Format

.pdf

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial 4.0 License

Recommended Citation

Mongaras, Gabriel and Larson, Eric, "Softmax Expressiveness With Linear Complexity By Investigating The Linear-Softmax Attention Accuracy Gap" (2026). Computer Science and Engineering Theses and Dissertations. 58.
https://scholar.smu.edu/engineering_compsci_etds/58

Master_s_Thesis_SMU_Format.pdf (16105 kB)
