Softmax function - Revision history
https://en.wikipedia.org/w/index.php?action=history&feed=atom&title=Softmax_function
Revision history for this page on the wiki · 2025-06-12T03:44:43Z · MediaWiki 1.45.0-wmf.4

Revision as of 19:53, 29 May 2025 by Quantling
https://en.wikipedia.org/w/index.php?title=Softmax_function&diff=1292950500&oldid=prev
Edit summary: /* Definition */ prevent break between math and punctuation

Line 9:
  The softmax function takes as input a tuple {{mvar|z}} of {{mvar|K}} real numbers, and normalizes it into a [[probability distribution]] consisting of {{mvar|K}} probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some tuple components could be negative, or greater than one; and might not sum to 1; but after applying softmax, each component will be in the [[Interval (mathematics)|interval]] <math>(0, 1)</math>, and the components will add up to 1, so that they can be interpreted as probabilities. Furthermore, the larger input components will correspond to larger probabilities.
− Formally, the standard (unit) softmax function <math>\sigma\colon \R^K \to (0, 1)^K</math>, where <math>K > 1</math>, takes a tuple <math>\mathbf{z} = (z_1, \dotsc, z_K) \in \R^K</math> and computes each component of vector <math>\sigma(\mathbf{z}) \in (0, 1)^K</math> with
+ Formally, the standard (unit) softmax function <math>\sigma\colon \R^K \to (0, 1)^K</math>, where {{tmath|K > 1}}, takes a tuple <math>\mathbf{z} = (z_1, \dotsc, z_K) \in \R^K</math> and computes each component of vector <math>\sigma(\mathbf{z}) \in (0, 1)^K</math> with
  <math display="block">\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}\,.</math>
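The definition quoted in this diff is easy to check numerically. The following is a minimal NumPy sketch, not part of the article's wikitext; the function name and the sample input are chosen only for illustration.

<syntaxhighlight lang="python">
import numpy as np

def softmax(z):
    """Standard (unit) softmax: sigma(z)_i = exp(z_i) / sum_j exp(z_j)."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z)        # exponentiate every component
    return e / e.sum()   # normalize so the components sum to 1

p = softmax([1.0, 2.0, 8.0])
print(p)          # approximately [0.0009, 0.0025, 0.9966]
print(p.sum())    # 1.0; every entry lies in the interval (0, 1)
</syntaxhighlight>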
Revision as of 15:22, 27 May 2025 by Quantling
https://en.wikipedia.org/w/index.php?title=Softmax_function&diff=1292548509&oldid=prev
Edit summary: A vector has magnitude and direction, but what would the magnitude of an input to softmax be? Let's instead call it a [[tuple]].

Line 3:
  {{Redirect|Softmax|the Korean video game company|ESA (company)}}
  {{Machine learning}}
− The '''softmax function,''' also known as '''softargmax'''<ref>{{cite book |last1=Goodfellow |first1=Ian |url=http://www.deeplearningbook.org |title=Deep Learning |last2=Bengio |first2=Yoshua |last3=Courville |first3=Aaron |publisher=MIT Press |year=2016 |isbn=978-0-26203561-3 |pages=180–184 |section=6.2.2.3 Softmax Units for Multinoulli Output Distributions |chapter-url=https://www.deeplearningbook.org/contents/mlp.html#pf15 |authorlink1=Ian Goodfellow |authorlink2=Yoshua Bengio}}</ref>{{rp|184}} or '''normalized exponential function''',<ref name="bishop" />{{rp|198}} converts a [[Vector (mathematics and physics)|vector]] of {{mvar|K}} real numbers into a [[probability distribution]] of {{mvar|K}} possible outcomes. It is a generalization of the [[logistic function]] to multiple dimensions, and is used in [[multinomial logistic regression]]. The softmax function is often used as the last [[activation function]] of a [[Artificial neural network|neural network]] to normalize the output of a network to a [[probability distribution]] over predicted output classes.
+ The '''softmax function,''' also known as '''softargmax'''<ref>{{cite book |last1=Goodfellow |first1=Ian |url=http://www.deeplearningbook.org |title=Deep Learning |last2=Bengio |first2=Yoshua |last3=Courville |first3=Aaron |publisher=MIT Press |year=2016 |isbn=978-0-26203561-3 |pages=180–184 |section=6.2.2.3 Softmax Units for Multinoulli Output Distributions |chapter-url=https://www.deeplearningbook.org/contents/mlp.html#pf15 |authorlink1=Ian Goodfellow |authorlink2=Yoshua Bengio}}</ref>{{rp|184}} or '''normalized exponential function''',<ref name="bishop" />{{rp|198}} converts a [[tuple]] of {{mvar|K}} real numbers into a [[probability distribution]] of {{mvar|K}} possible outcomes. It is a generalization of the [[logistic function]] to multiple dimensions, and is used in [[multinomial logistic regression]]. The softmax function is often used as the last [[activation function]] of a [[Artificial neural network|neural network]] to normalize the output of a network to a [[probability distribution]] over predicted output classes.

  == Definition ==

− The softmax function takes as input a vector {{mvar|z}} of {{mvar|K}} real numbers, and normalizes it into a [[probability distribution]] consisting of {{mvar|K}} probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some vector components could be negative, or greater than one; and might not sum to 1; but after applying softmax, each component will be in the [[Interval (mathematics)|interval]] <math>(0, 1)</math>, and the components will add up to 1, so that they can be interpreted as probabilities. Furthermore, the larger input components will correspond to larger probabilities.
+ The softmax function takes as input a tuple {{mvar|z}} of {{mvar|K}} real numbers, and normalizes it into a [[probability distribution]] consisting of {{mvar|K}} probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some tuple components could be negative, or greater than one; and might not sum to 1; but after applying softmax, each component will be in the [[Interval (mathematics)|interval]] <math>(0, 1)</math>, and the components will add up to 1, so that they can be interpreted as probabilities. Furthermore, the larger input components will correspond to larger probabilities.

− Formally, the standard (unit) softmax function <math>\sigma\colon \R^K \to (0, 1)^K</math>, where <math>K > 1</math>, takes a vector <math>\mathbf{z} = (z_1, \dotsc, z_K) \in \R^K</math> and computes each component of vector <math>\sigma(\mathbf{z}) \in (0, 1)^K</math> with
+ Formally, the standard (unit) softmax function <math>\sigma\colon \R^K \to (0, 1)^K</math>, where <math>K > 1</math>, takes a tuple <math>\mathbf{z} = (z_1, \dotsc, z_K) \in \R^K</math> and computes each component of vector <math>\sigma(\mathbf{z}) \in (0, 1)^K</math> with

  <math display="block">\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}\,.</math>

− In words, the softmax applies the standard [[exponential function]] to each element <math>z_i</math> of the input vector <math>\mathbf z</math> (consisting of <math>K</math> real numbers), and normalizes these values by dividing by the sum of all these exponentials. The normalization ensures that the sum of the components of the output vector <math>\sigma(\mathbf z)</math> is 1. The term "softmax" derives from the amplifying effects of the exponential on any maxima in the input vector. For example, the standard softmax of <math>(1,2,8)</math> is approximately <math>(0.001,0.002,0.997)</math>, which amounts to assigning almost all of the total unit weight in the result to the position of the vector's maximal element (of 8).
+ In words, the softmax applies the standard [[exponential function]] to each element <math>z_i</math> of the input tuple <math>\mathbf z</math> (consisting of <math>K</math> real numbers), and normalizes these values by dividing by the sum of all these exponentials. The normalization ensures that the sum of the components of the output vector <math>\sigma(\mathbf z)</math> is 1. The term "softmax" derives from the amplifying effects of the exponential on any maxima in the input tuple. For example, the standard softmax of <math>(1,2,8)</math> is approximately <math>(0.001,0.002,0.997)</math>, which amounts to assigning almost all of the total unit weight in the result to the position of the tuple's maximal element (of 8).

  In general, instead of {{mvar|e}} a different [[Base (exponentiation)|base]] {{math|b > 0}} can be used. As above, if {{math|b > 1}} then larger input components will result in larger output probabilities, and increasing the value of {{mvar|b}} will create probability distributions that are more concentrated around the positions of the largest input values. Conversely, if {{math|0 < b < 1}} then smaller input components will result in larger output probabilities, and decreasing the value of {{mvar|b}} will create probability distributions that are more concentrated around the positions of the smallest input values. Writing <math>b = e^\beta</math> or <math>b = e^{-\beta}</math>{{efn|Positive {{mvar|β}} corresponds to the maximum convention, and is usual in machine learning, corresponding to the highest score having highest probability. The negative {{math|−β}} corresponds to the minimum convention, and is conventional in thermodynamics, corresponding to the lowest energy state having the highest probability; this matches the convention in the [[Gibbs distribution]], interpreting {{mvar|β}} as [[coldness]].}} (for real {{mvar|β}}){{efn|1=The notation {{mvar|β}} is for the [[thermodynamic beta]], which is inverse [[temperature]]: <math>\beta = 1/T</math>, <math>T = 1/\beta.</math>}} yields the expressions:{{efn|For <math>\beta = 0</math> ([[coldness]] zero, infinite temperature), <math>b = e^\beta = e^0 = 1</math>, and this becomes the constant function {{tmath|(1/n, \dots, 1/n)}}, corresponding to the [[discrete uniform distribution]].}}
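The base-{{mvar|b}} (equivalently {{mvar|β}}) behaviour described in the context paragraph above can be illustrated with a small sketch; the function name softmax_beta and the parameter name beta are not from the article and are used here only as placeholders.

<syntaxhighlight lang="python">
import numpy as np

def softmax_beta(z, beta=1.0):
    """Softmax with base b = e**beta, i.e. sigma(z)_i proportional to exp(beta * z_i)."""
    z = np.asarray(z, dtype=float)
    e = np.exp(beta * z)
    return e / e.sum()

z = [1.0, 2.0, 8.0]
print(softmax_beta(z, beta=0.0))   # [1/3, 1/3, 1/3]: b = 1 gives the discrete uniform distribution
print(softmax_beta(z, beta=1.0))   # ~ [0.001, 0.002, 0.997], as in the article's example
print(softmax_beta(z, beta=4.0))   # larger beta: even more concentrated on the largest input
print(softmax_beta(z, beta=-1.0))  # 0 < b < 1: concentrates on the smallest input instead
</syntaxhighlight>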
Line 27:
  === Smooth arg max ===
  {{See also|Arg max}}
− The Softmax function is a smooth approximation to the [[arg max]] function: the function whose value is the ''index'' of a vector's largest element. The name "softmax" may be misleading. Softmax is not a [[smooth maximum]] (that is, a [[smooth approximation]] to the [[maximum]] function). The term "softmax" is also used for the closely related [[LogSumExp]] function, which is a smooth maximum. For this reason, some prefer the more accurate term "softargmax", though the term "softmax" is conventional in machine learning.<ref name="sako2018"/>{{sfn|Goodfellow|Bengio|Courville|2016|pp=183–184|ps=: The name "softmax" can be somewhat confusing. The function is more closely related to the arg max function than the max function. The term "soft" derives from the fact that the softmax function is continuous and differentiable. The arg max function, with its result represented as a one-hot vector, is not continuous nor differentiable. The softmax function thus provides a "softened" version of the arg max. The corresponding soft version of the maximum function is <math>\operatorname{softmax}(\mathbf{z})^\top \mathbf{z}</math>. It would perhaps be better to call the softmax function "softargmax," but the current name is an entrenched convention.}} This section uses the term "softargmax" for clarity.
+ The Softmax function is a smooth approximation to the [[arg max]] function: the function whose value is the ''index'' of a tuple's largest element. The name "softmax" may be misleading. Softmax is not a [[smooth maximum]] (that is, a [[smooth approximation]] to the [[maximum]] function). The term "softmax" is also used for the closely related [[LogSumExp]] function, which is a smooth maximum. For this reason, some prefer the more accurate term "softargmax", though the term "softmax" is conventional in machine learning.<ref name="sako2018"/>{{sfn|Goodfellow|Bengio|Courville|2016|pp=183–184|ps=: The name "softmax" can be somewhat confusing. The function is more closely related to the arg max function than the max function. The term "soft" derives from the fact that the softmax function is continuous and differentiable. The arg max function, with its result represented as a one-hot vector, is not continuous nor differentiable. The softmax function thus provides a "softened" version of the arg max. The corresponding soft version of the maximum function is <math>\operatorname{softmax}(\mathbf{z})^\top \mathbf{z}</math>. It would perhaps be better to call the softmax function "softargmax," but the current name is an entrenched convention.}} This section uses the term "softargmax" for clarity.

  Formally, instead of considering the arg max as a function with categorical output <math>1, \dots, n</math> (corresponding to the index), consider the arg max function with [[one-hot]] representation of the output (assuming there is a unique maximum arg):
class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Applications ==</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Applications ==</div></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The softmax function is used in various [[multiclass classification]] methods, such as [[multinomial logistic regression]] (also known as softmax regression),&lt;ref name="bishop"&gt;{{cite book |first=Christopher M. |last=Bishop |year=2006 |title=Pattern Recognition and Machine Learning |publisher=Springer |isbn=0-387-31073-8 }}&lt;/ref&gt;{{rp|206–209}}&lt;ref&gt;{{Cite web |title=Unsupervised Feature Learning and Deep Learning Tutorial |url=http://ufldl.stanford.edu/tutorial/supervised/SoftmaxRegression/ |access-date=2024-03-25 |website=ufldl.stanford.edu}}&lt;/ref&gt; multiclass [[linear discriminant analysis]], [[naive Bayes classifier]]s, and [[artificial neural network]]s.&lt;ref&gt;ai-faq [http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-12.html What is a softmax activation function?]&lt;/ref&gt; Specifically, in multinomial logistic regression and linear discriminant analysis, the input to the function is the result of {{mvar|K}} distinct [[linear function]]s, and the predicted probability for the {{mvar|j}}th class given a sample <del style="font-weight: bold; text-decoration: none;">vector</del> {{math|'''x'''}} and a weighting vector {{math|'''w'''}} is:</div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The softmax function is used in various [[multiclass classification]] methods, such as [[multinomial logistic regression]] (also known as softmax regression),&lt;ref name="bishop"&gt;{{cite book |first=Christopher M. 
|last=Bishop |year=2006 |title=Pattern Recognition and Machine Learning |publisher=Springer |isbn=0-387-31073-8 }}&lt;/ref&gt;{{rp|206–209}}&lt;ref&gt;{{Cite web |title=Unsupervised Feature Learning and Deep Learning Tutorial |url=http://ufldl.stanford.edu/tutorial/supervised/SoftmaxRegression/ |access-date=2024-03-25 |website=ufldl.stanford.edu}}&lt;/ref&gt; multiclass [[linear discriminant analysis]], [[naive Bayes classifier]]s, and [[artificial neural network]]s.&lt;ref&gt;ai-faq [http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-12.html What is a softmax activation function?]&lt;/ref&gt; Specifically, in multinomial logistic regression and linear discriminant analysis, the input to the function is the result of {{mvar|K}} distinct [[linear function]]s, and the predicted probability for the {{mvar|j}}th class given a sample <ins style="font-weight: bold; text-decoration: none;">tuple</ins> {{math|'''x'''}} and a weighting vector {{math|'''w'''}} is:</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>&lt;math display="block"&gt;P(y=j\mid \mathbf{x}) = \frac{e^{\mathbf{x}^\mathsf{T}\mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^\mathsf{T}\mathbf{w}_k}}&lt;/math&gt;</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>&lt;math display="block"&gt;P(y=j\mid \mathbf{x}) = \frac{e^{\mathbf{x}^\mathsf{T}\mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^\mathsf{T}\mathbf{w}_k}}&lt;/math&gt;</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>This can be seen as the [[function composition|composition]] of {{mvar|K}} linear functions &lt;math&gt;\mathbf{x} \mapsto \mathbf{x}^\mathsf{T}\mathbf{w}_1, \ldots, \mathbf{x} \mapsto \mathbf{x}^\mathsf{T}\mathbf{w}_K&lt;/math&gt; and the softmax function (where &lt;math&gt;\mathbf{x}^\mathsf{T}\mathbf{w}&lt;/math&gt; denotes the inner product of &lt;math&gt;\mathbf{x}&lt;/math&gt; and 
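The predicted-probability formula quoted just above can be evaluated with a short sketch; the names class_probabilities, W and x below are illustrative placeholders rather than anything defined in the article.

<syntaxhighlight lang="python">
import numpy as np

def class_probabilities(x, W):
    """P(y = j | x) = exp(x^T w_j) / sum_k exp(x^T w_k), for columns w_j of W."""
    scores = x @ W                      # K linear functions x -> x^T w_j
    e = np.exp(scores - scores.max())   # shift by the max for numerical stability (result unchanged)
    return e / e.sum()

x = np.array([0.5, -1.2, 3.0])                     # sample with 3 features (illustrative)
W = np.random.default_rng(0).normal(size=(3, 4))   # weight vectors for K = 4 classes
p = class_probabilities(x, W)
print(p, p.sum())                                  # K probabilities that sum to 1
</syntaxhighlight>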
− This can be seen as the [[function composition|composition]] of {{mvar|K}} linear functions <math>\mathbf{x} \mapsto \mathbf{x}^\mathsf{T}\mathbf{w}_1, \ldots, \mathbf{x} \mapsto \mathbf{x}^\mathsf{T}\mathbf{w}_K</math> and the softmax function (where <math>\mathbf{x}^\mathsf{T}\mathbf{w}</math> denotes the inner product of <math>\mathbf{x}</math> and <math>\mathbf{w}</math>). The operation is equivalent to applying a linear operator defined by <math>\mathbf{w}</math> to vectors <math>\mathbf{x}</math>, thus transforming the original, probably highly-dimensional, input to vectors in a {{mvar|K}}-dimensional space <math>\mathbb{R}^K</math>.
+ This can be seen as the [[function composition|composition]] of {{mvar|K}} linear functions <math>\mathbf{x} \mapsto \mathbf{x}^\mathsf{T}\mathbf{w}_1, \ldots, \mathbf{x} \mapsto \mathbf{x}^\mathsf{T}\mathbf{w}_K</math> and the softmax function (where <math>\mathbf{x}^\mathsf{T}\mathbf{w}</math> denotes the inner product of <math>\mathbf{x}</math> and <math>\mathbf{w}</math>). The operation is equivalent to applying a linear operator defined by <math>\mathbf{w}</math> to tuples <math>\mathbf{x}</math>, thus transforming the original, probably highly-dimensional, input to vectors in a {{mvar|K}}-dimensional space <math>\mathbb{R}^K</math>.

  === Neural networks ===
  The standard softmax function is often used in the final layer of a neural network-based classifier. Such networks are commonly trained under a [[Cross entropy|log loss]] (or [[cross-entropy]]) regime, giving a non-linear variant of multinomial logistic regression.

− Since the function maps a vector and a specific index <math>i</math> to a real value, the derivative needs to take the index into account:
+ Since the function maps a tuple and a specific index <math>i</math> to a real value, the derivative needs to take the index into account:

  <math display="block">\frac{\partial}{\partial q_k}\sigma(\textbf{q}, i) = \sigma(\textbf{q}, i)(\delta_{ik} - \sigma(\textbf{q}, k)).</math>

Line 89:
  Here, the [[Kronecker delta]] is used for simplicity (cf. the derivative of a [[sigmoid function]], being expressed via the function itself).
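The partial-derivative identity quoted above can be assembled into a full Jacobian and checked numerically; softmax_jacobian below is an illustrative helper under that identity, not code from the article.

<syntaxhighlight lang="python">
import numpy as np

def softmax(q):
    e = np.exp(q - np.max(q))
    return e / e.sum()

def softmax_jacobian(q):
    """J[i, k] = d sigma(q)_i / d q_k = sigma_i * (delta_ik - sigma_k)."""
    s = softmax(q)
    return np.diag(s) - np.outer(s, s)

q = np.array([1.0, 2.0, 8.0])
J = softmax_jacobian(q)
print(np.allclose(J.sum(axis=1), 0.0))  # each row sums to 0, since the outputs always sum to 1
</syntaxhighlight>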
− To ensure stable numerical computations subtracting the maximum value from the input vector is common. This approach, while not altering the output or the derivative theoretically, enhances stability by directly controlling the maximum exponent value computed.
+ To ensure stable numerical computations subtracting the maximum value from the input tuple is common. This approach, while not altering the output or the derivative theoretically, enhances stability by directly controlling the maximum exponent value computed.

  If the function is scaled with the parameter <math>\beta</math>, then these expressions must be multiplied by <math>\beta</math>.
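In outline, the max-subtraction trick described in that paragraph looks like the following sketch (the function name stable_softmax is a placeholder, not the article's own code):

<syntaxhighlight lang="python">
import numpy as np

def stable_softmax(z, beta=1.0):
    """Subtract the maximum before exponentiating; the output is unchanged because
    the common factor exp(-beta * max(z)) cancels between numerator and denominator."""
    z = np.asarray(z, dtype=float)
    shifted = beta * (z - z.max())   # every exponent is now <= 0, so exp(...) <= 1
    e = np.exp(shifted)
    return e / e.sum()

big = np.array([1000.0, 1001.0, 1002.0])
print(stable_softmax(big))  # finite result, whereas np.exp(1002.0) alone overflows to inf
</syntaxhighlight>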
data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>Geometrically the softmax function maps the [[<ins style="font-weight: bold; text-decoration: none;">Euclidean</ins> space]] &lt;math&gt;\mathbb{R}^K&lt;/math&gt; to the [[Boundary (topology)|boundary]] of the [[standard simplex|standard &lt;math&gt;(K-1)&lt;/math&gt;-simplex]], cutting the dimension by one (the range is a &lt;math&gt;(K - 1)&lt;/math&gt;-dimensional simplex in &lt;math&gt;K&lt;/math&gt;-dimensional space), due to the [[linear constraint]] that all output sum to 1 meaning it lies on a [[hyperplane]].</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Along the main diagonal &lt;math&gt;(x,\, x,\, \dots,\, x),&lt;/math&gt; softmax is just the uniform distribution on outputs, &lt;math&gt;(1/n, \dots, 1/n)&lt;/math&gt;: equal scores yield equal probabilities.</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Along the main diagonal &lt;math&gt;(x,\, x,\, \dots,\, x),&lt;/math&gt; softmax is just the uniform distribution on outputs, &lt;math&gt;(1/n, \dots, 1/n)&lt;/math&gt;: equal scores yield equal probabilities.</div></td> </tr> <!-- diff cache key enwiki:diff:1.41:old-1292459821:rev-1292548509:wikidiff2=table:1.14.1:ff290eae --> </table> Quantling https://en.wikipedia.org/w/index.php?title=Softmax_function&diff=1292459821&oldid=prev 98.110.119.175: Missing link 2025-05-27T02:11:37Z <p>Missing link</p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Previous revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 02:11, 27 May 2025</td> </tr><tr> <td colspan="2" class="diff-lineno">Line 3:</td> <td colspan="2" class="diff-lineno">Line 3:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>{{Redirect|Softmax|the Korean video game company|ESA (company)}}</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; 
border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>{{Redirect|Softmax|the Korean video game company|ESA (company)}}</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>{{Machine learning}}</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>{{Machine learning}}</div></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The '''softmax function,''' also known as '''softargmax'''&lt;ref&gt;{{cite book |last1=Goodfellow |first1=Ian |url=http://www.deeplearningbook.org |title=Deep Learning |last2=Bengio |first2=Yoshua |last3=Courville |first3=Aaron |publisher=MIT Press |year=2016 |isbn=978-0-26203561-3 |pages=180–184 |section=6.2.2.3 Softmax Units for Multinoulli Output Distributions |chapter-url=https://www.deeplearningbook.org/contents/mlp.html#pf15 |authorlink1=Ian Goodfellow |authorlink2=Yoshua Bengio}}&lt;/ref&gt;{{rp|184}} or '''normalized exponential function''',&lt;ref name="bishop" /&gt;{{rp|198}} converts a vector of {{mvar|K}} real numbers into a [[probability distribution]] of {{mvar|K}} possible outcomes. It is a generalization of the [[logistic function]] to multiple dimensions, and is used in [[multinomial logistic regression]]. The softmax function is often used as the last [[activation function]] of a [[Artificial neural network|neural network]] to normalize the output of a network to a [[probability distribution]] over predicted output classes.</div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The '''softmax function,''' also known as '''softargmax'''&lt;ref&gt;{{cite book |last1=Goodfellow |first1=Ian |url=http://www.deeplearningbook.org |title=Deep Learning |last2=Bengio |first2=Yoshua |last3=Courville |first3=Aaron |publisher=MIT Press |year=2016 |isbn=978-0-26203561-3 |pages=180–184 |section=6.2.2.3 Softmax Units for Multinoulli Output Distributions |chapter-url=https://www.deeplearningbook.org/contents/mlp.html#pf15 |authorlink1=Ian Goodfellow |authorlink2=Yoshua Bengio}}&lt;/ref&gt;{{rp|184}} or '''normalized exponential function''',&lt;ref name="bishop" /&gt;{{rp|198}} converts a <ins style="font-weight: bold; text-decoration: none;">[[Vector (mathematics and physics)|</ins>vector<ins style="font-weight: bold; text-decoration: none;">]]</ins> of {{mvar|K}} real numbers into a [[probability distribution]] of {{mvar|K}} possible outcomes. It is a generalization of the [[logistic function]] to multiple dimensions, and is used in [[multinomial logistic regression]]. 
The softmax function is often used as the last [[activation function]] of a [[Artificial neural network|neural network]] to normalize the output of a network to a [[probability distribution]] over predicted output classes.</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Definition ==</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Definition ==</div></td> </tr> </table> 98.110.119.175 https://en.wikipedia.org/w/index.php?title=Softmax_function&diff=1288063974&oldid=prev Citation bot: Altered template type. Add: class, date, title, eprint, authors 1-2. Removed parameters. Some additions/deletions were parameter name changes. | Use this bot. Report bugs. | Suggested by Headbomb | #UCB_toolbar 2025-04-30T05:31:30Z <p>Altered template type. Add: class, date, title, eprint, authors 1-2. Removed parameters. Some additions/deletions were parameter name changes. | <a href="/wiki/Wikipedia:UCB" class="mw-redirect" title="Wikipedia:UCB">Use this bot</a>. <a href="/wiki/Wikipedia:DBUG" class="mw-redirect" title="Wikipedia:DBUG">Report bugs</a>. | Suggested by Headbomb | #UCB_toolbar</p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Previous revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 05:31, 30 April 2025</td> </tr><tr> <td colspan="2" class="diff-lineno">Line 111:</td> <td colspan="2" class="diff-lineno">Line 111:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The standard softmax is numerically unstable because of large exponentiations. The '''safe softmax''' method calculates instead&lt;math display="block"&gt;\sigma(\mathbf{z})_i = \frac{e^{\beta (z_i-m)}}{\sum_{j=1}^K e^{\beta (z_j-m)}}&lt;/math&gt;where &lt;math&gt;m = \max_i z_i &lt;/math&gt; is the largest factor involved. 
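The following is a minimal sketch of the safe softmax described in the revision text above, assuming NumPy; the function name `safe_softmax` and the `beta` keyword are illustrative choices, not from the article. Shifting by the maximum leaves the result unchanged (the common factor cancels between numerator and denominator) while keeping every exponent at or below zero, so no overflow occurs.

```python
import numpy as np

def safe_softmax(z, beta=1.0):
    """Numerically stable softmax: exp(beta*(z - max z)) / sum of the same."""
    z = np.asarray(z, dtype=float)
    m = np.max(z)                    # largest input component
    e = np.exp(beta * (z - m))       # every entry lies in (0, 1]
    return e / np.sum(e)

# Example: without the shift, exp(1000.0) would overflow to inf.
print(safe_softmax([1000.0, 1001.0, 1002.0]))   # ~[0.090, 0.245, 0.665]
```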
Subtracting by it guarantees that the exponentiations result in at most 1.</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The standard softmax is numerically unstable because of large exponentiations. The '''safe softmax''' method calculates instead&lt;math display="block"&gt;\sigma(\mathbf{z})_i = \frac{e^{\beta (z_i-m)}}{\sum_{j=1}^K e^{\beta (z_j-m)}}&lt;/math&gt;where &lt;math&gt;m = \max_i z_i &lt;/math&gt; is the largest factor involved. Subtracting by it guarantees that the exponentiations result in at most 1.</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The [[Attention (machine learning)|attention mechanism]] in [[Transformer (deep learning architecture)|Transformers]] takes three arguments: a "query vector" &lt;math&gt;q&lt;/math&gt;, a list of "key vectors" &lt;math&gt;k_1, \dots, k_N&lt;/math&gt;, and a list of "value vectors" &lt;math&gt;v_1, \dots, v_N&lt;/math&gt;, and outputs a softmax-weighted sum over value vectors:&lt;math display="block"&gt;o = \sum_{i=1}^N \frac{e^{q^T k_i - m}}{\sum_{j=1}^N e^{q^T k_j - m}} v_i&lt;/math&gt;The standard softmax method involves several loops over the inputs, which would be [[Memory bottleneck|bottlenecked by memory bandwidth]]. The [[FlashAttention]] method is a [[communication-avoiding algorithm]] that fuses these operations into a single loop, increasing the [[arithmetic intensity]]. 
It is an [[online algorithm]] that computes the following quantities:&lt;ref&gt;{{cite <del style="font-weight: bold; text-decoration: none;">arxiv</del> |<del style="font-weight: bold; text-decoration: none;">arxiv</del>=1805.02867}}&lt;/ref&gt;&lt;ref&gt;{{Cite journal |<del style="font-weight: bold; text-decoration: none;">last</del>=Dao |<del style="font-weight: bold; text-decoration: none;">first</del>=Tri |last2=Fu |first2=Dan |last3=Ermon |first3=Stefano |last4=Rudra |first4=Atri |last5=Ré |first5=Christopher |date=2022-12-06 |title=FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness |url=https://proceedings.neurips.cc/paper_files/paper/2022/hash/67d57c32e20fd0a7a302cb81d36e40d5-Abstract-Conference.html |journal=Advances in Neural Information Processing Systems |language=en |volume=35 |pages=16344–16359}}&lt;/ref&gt;&lt;math display="block"&gt;\begin{aligned}</div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The [[Attention (machine learning)|attention mechanism]] in [[Transformer (deep learning architecture)|Transformers]] takes three arguments: a "query vector" &lt;math&gt;q&lt;/math&gt;, a list of "key vectors" &lt;math&gt;k_1, \dots, k_N&lt;/math&gt;, and a list of "value vectors" &lt;math&gt;v_1, \dots, v_N&lt;/math&gt;, and outputs a softmax-weighted sum over value vectors:&lt;math display="block"&gt;o = \sum_{i=1}^N \frac{e^{q^T k_i - m}}{\sum_{j=1}^N e^{q^T k_j - m}} v_i&lt;/math&gt;The standard softmax method involves several loops over the inputs, which would be [[Memory bottleneck|bottlenecked by memory bandwidth]]. The [[FlashAttention]] method is a [[communication-avoiding algorithm]] that fuses these operations into a single loop, increasing the [[arithmetic intensity]]. 
It is an [[online algorithm]] that computes the following quantities:&lt;ref&gt;{{cite <ins style="font-weight: bold; text-decoration: none;">arXiv</ins> |<ins style="font-weight: bold; text-decoration: none;">eprint</ins>=1805.02867<ins style="font-weight: bold; text-decoration: none;"> |last1=Milakov |first1=Maxim |last2=Gimelshein |first2=Natalia |title=Online normalizer calculation for softmax |date=2018 |class=cs.PF </ins>}}&lt;/ref&gt;&lt;ref&gt;{{Cite journal |<ins style="font-weight: bold; text-decoration: none;">last1</ins>=Dao |<ins style="font-weight: bold; text-decoration: none;">first1</ins>=Tri |last2=Fu |first2=Dan |last3=Ermon |first3=Stefano |last4=Rudra |first4=Atri |last5=Ré |first5=Christopher |date=2022-12-06 |title=FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness |url=https://proceedings.neurips.cc/paper_files/paper/2022/hash/67d57c32e20fd0a7a302cb81d36e40d5-Abstract-Conference.html |journal=Advances in Neural Information Processing Systems |language=en |volume=35 |pages=16344–16359}}&lt;/ref&gt;&lt;math display="block"&gt;\begin{aligned}</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>z_i &amp;= q^T k_i &amp;\\</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>z_i &amp;= q^T k_i &amp;\\</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>m_i &amp;= \max(z_1, \dots, z_i) &amp;=&amp; \max(m_{i-1}, z_i)\\</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>m_i &amp;= \max(z_1, \dots, z_i) &amp;=&amp; \max(m_{i-1}, z_i)\\</div></td> </tr> </table> Citation bot https://en.wikipedia.org/w/index.php?title=Softmax_function&diff=1288063842&oldid=prev Headbomb: clean up 2025-04-30T05:29:17Z <p>clean up</p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Previous revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 05:29, 30 April 2025</td> </tr><tr> <td colspan="2" class="diff-lineno">Line 7:</td> <td colspan="2" class="diff-lineno">Line 7:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Definition ==</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 
0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Definition ==</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The softmax function takes as input a vector {{mvar|z}} of {{mvar|K}} real numbers, and normalizes it into a [[probability distribution]] consisting of {{mvar|K}} probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some vector components could be negative, or greater than one; and might not sum to 1; but after applying softmax, each component will be in the [[Interval (mathematics)|interval]] &lt;math&gt;(0, 1)&lt;/math&gt;, and the components will add up to 1, so that they can be interpreted as probabilities. Furthermore, the larger input components will correspond to larger probabilities.<del style="font-weight: bold; text-decoration: none;"> </del></div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The softmax function takes as input a vector {{mvar|z}} of {{mvar|K}} real numbers, and normalizes it into a [[probability distribution]] consisting of {{mvar|K}} probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some vector components could be negative, or greater than one; and might not sum to 1; but after applying softmax, each component will be in the [[Interval (mathematics)|interval]] &lt;math&gt;(0, 1)&lt;/math&gt;, and the components will add up to 1, so that they can be interpreted as probabilities. 
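As a small numerical illustration of the properties just described (a sketch assuming NumPy; the names are illustrative), applying the standard softmax to a tuple containing negative entries and entries greater than one yields components that each lie in (0, 1), sum to 1, and preserve the ordering of the inputs:

```python
import numpy as np

def softmax(z):
    """Standard (unit) softmax: sigma(z)_i = exp(z_i) / sum_j exp(z_j)."""
    e = np.exp(np.asarray(z, dtype=float))
    return e / e.sum()

z = [-1.0, 0.5, 3.0]      # components may be negative or exceed 1
p = softmax(z)
print(p)                   # ~[0.017, 0.075, 0.909]: each component in (0, 1)
print(p.sum())             # 1.0: a valid probability distribution
```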
Furthermore, the larger input components will correspond to larger probabilities.</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Formally, the standard (unit) softmax function &lt;math&gt;\sigma\colon \R^K \to (0, 1)^K&lt;/math&gt;, where &lt;math&gt;K &gt; 1&lt;/math&gt;, takes a vector &lt;math&gt;\mathbf{z} = (z_1, \dotsc, z_K) \in \R^K&lt;/math&gt; and computes each component of vector &lt;math&gt;\sigma(\mathbf{z}) \in (0, 1)^K&lt;/math&gt; with</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Formally, the standard (unit) softmax function &lt;math&gt;\sigma\colon \R^K \to (0, 1)^K&lt;/math&gt;, where &lt;math&gt;K &gt; 1&lt;/math&gt;, takes a vector &lt;math&gt;\mathbf{z} = (z_1, \dotsc, z_K) \in \R^K&lt;/math&gt; and computes each component of vector &lt;math&gt;\sigma(\mathbf{z}) \in (0, 1)^K&lt;/math&gt; with</div></td> </tr> <tr> <td colspan="2" class="diff-lineno">Line 87:</td> <td colspan="2" class="diff-lineno">Line 87:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>&lt;math display="block"&gt; \frac{\partial}{\partial q_k}\sigma(\textbf{q}, i) = \sigma(\textbf{q}, k)(\delta_{ik} - \sigma(\textbf{q}, i)).&lt;/math&gt;</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>&lt;math display="block"&gt; \frac{\partial}{\partial q_k}\sigma(\textbf{q}, i) = \sigma(\textbf{q}, k)(\delta_{ik} - \sigma(\textbf{q}, i)).&lt;/math&gt;</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>Here, the [[Kronecker delta]] is used 
for simplicity (cf. the derivative of a [[sigmoid function]], being expressed via the function itself).<del style="font-weight: bold; text-decoration: none;"> </del></div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>Here, the [[Kronecker delta]] is used for simplicity (cf. the derivative of a [[sigmoid function]], being expressed via the function itself).</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>To ensure stable numerical computations subtracting the maximum value from the input vector is common. This approach, while not altering the output or the derivative theoretically, enhances stability by directly controlling the maximum exponent value computed.</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>To ensure stable numerical computations subtracting the maximum value from the input vector is common. This approach, while not altering the output or the derivative theoretically, enhances stability by directly controlling the maximum exponent value computed.</div></td> </tr> <tr> <td colspan="2" class="diff-lineno">Line 104:</td> <td colspan="2" class="diff-lineno">Line 104:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>In neural network applications, the number {{mvar|K}} of possible outcomes is often large, e.g. in case of [[Language model|neural language models]] that predict the most likely outcome out of a vocabulary which might contain millions of possible words.&lt;ref name=":0"&gt;{{Cite journal |last1=Onal |first1=Kezban Dilek |last2=Zhang |first2=Ye |last3=Altingovde |first3=Ismail Sengor |last4=Rahman |first4=Md Mustafizur |last5=Karagoz |first5=Pinar |last6=Braylan |first6=Alex |last7=Dang |first7=Brandon |last8=Chang |first8=Heng-Lu |last9=Kim |first9=Henna |last10=McNamara |first10=Quinten |last11=Angert |first11=Aaron |date=2018-06-01 |title=Neural information retrieval: at the end of the early years |journal=Information Retrieval Journal |language=en |volume=21 |issue=2 |pages=111–182 |doi=10.1007/s10791-017-9321-y |s2cid=21684923 |issn=1573-7659|doi-access=free |hdl=11245.1/008d6e8f-df13-4abf-8ae9-6ff2e17377f3 |hdl-access=free }}&lt;/ref&gt; This can make the calculations for the softmax layer (i.e. 
the matrix multiplications to determine the &lt;math&gt;z_i&lt;/math&gt;, followed by the application of the softmax function itself) computationally expensive.&lt;ref name=":0" /&gt;&lt;ref name=":1"&gt;{{Cite journal |last1=Chen |first1=Wenlin |last2=Grangier |first2=David |last3=Auli |first3=Michael |date=August 2016 |title=Strategies for Training Large Vocabulary Neural Language Models |url=https://aclanthology.org/P16-1186 |journal=Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) |location=Berlin, Germany |publisher=Association for Computational Linguistics |pages=1975–1985 |doi=10.18653/v1/P16-1186|s2cid=6035643 |doi-access=free |arxiv=1512.04906 }}&lt;/ref&gt; What's more, the [[gradient descent]] [[backpropagation]] method for training such a neural network involves calculating the softmax for every training example, and the number of training examples can also become large. The computational effort for the softmax became a major limiting factor in the development of larger neural language models, motivating various remedies to reduce training times.&lt;ref name=":0" /&gt;&lt;ref name=":1" /&gt;</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>In neural network applications, the number {{mvar|K}} of possible outcomes is often large, e.g. in case of [[Language model|neural language models]] that predict the most likely outcome out of a vocabulary which might contain millions of possible words.&lt;ref name=":0"&gt;{{Cite journal |last1=Onal |first1=Kezban Dilek |last2=Zhang |first2=Ye |last3=Altingovde |first3=Ismail Sengor |last4=Rahman |first4=Md Mustafizur |last5=Karagoz |first5=Pinar |last6=Braylan |first6=Alex |last7=Dang |first7=Brandon |last8=Chang |first8=Heng-Lu |last9=Kim |first9=Henna |last10=McNamara |first10=Quinten |last11=Angert |first11=Aaron |date=2018-06-01 |title=Neural information retrieval: at the end of the early years |journal=Information Retrieval Journal |language=en |volume=21 |issue=2 |pages=111–182 |doi=10.1007/s10791-017-9321-y |s2cid=21684923 |issn=1573-7659|doi-access=free |hdl=11245.1/008d6e8f-df13-4abf-8ae9-6ff2e17377f3 |hdl-access=free }}&lt;/ref&gt; This can make the calculations for the softmax layer (i.e. the matrix multiplications to determine the &lt;math&gt;z_i&lt;/math&gt;, followed by the application of the softmax function itself) computationally expensive.&lt;ref name=":0" /&gt;&lt;ref name=":1"&gt;{{Cite journal |last1=Chen |first1=Wenlin |last2=Grangier |first2=David |last3=Auli |first3=Michael |date=August 2016 |title=Strategies for Training Large Vocabulary Neural Language Models |url=https://aclanthology.org/P16-1186 |journal=Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) |location=Berlin, Germany |publisher=Association for Computational Linguistics |pages=1975–1985 |doi=10.18653/v1/P16-1186|s2cid=6035643 |doi-access=free |arxiv=1512.04906 }}&lt;/ref&gt; What's more, the [[gradient descent]] [[backpropagation]] method for training such a neural network involves calculating the softmax for every training example, and the number of training examples can also become large. 
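For the derivative identity quoted a few paragraphs above, &lt;math&gt;\frac{\partial}{\partial q_k}\sigma(\textbf{q}, i) = \sigma(\textbf{q}, k)(\delta_{ik} - \sigma(\textbf{q}, i))&lt;/math&gt;, the full Jacobian can be checked numerically. The sketch below assumes NumPy, uses illustrative names, and compares the closed form against a finite-difference estimate.

```python
import numpy as np

def softmax(q):
    e = np.exp(q - q.max())            # shift by the max for stability
    return e / e.sum()

def softmax_jacobian(q):
    """J[i, k] = sigma_k * (delta_ik - sigma_i), i.e. diag(s) - s s^T."""
    s = softmax(q)
    return np.diag(s) - np.outer(s, s)

q = np.array([0.2, -1.0, 3.0])
J = softmax_jacobian(q)

# Finite-difference check of one entry, d sigma_0 / d q_2.
eps = 1e-6
q_plus = q.copy(); q_plus[2] += eps
approx = (softmax(q_plus)[0] - softmax(q)[0]) / eps
print(J[0, 2], approx)                 # the two values agree closely
```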
The computational effort for the softmax became a major limiting factor in the development of larger neural language models, motivating various remedies to reduce training times.&lt;ref name=":0" /&gt;&lt;ref name=":1" /&gt;</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>Approaches that reorganize the softmax layer for more efficient calculation include the '''hierarchical softmax''' and the '''differentiated softmax'''.&lt;ref name=":0" /&gt; The hierarchical softmax (introduced by Morin and [[Yoshua Bengio|Bengio]] in 2005) uses a binary tree structure where the outcomes (vocabulary words) are the leaves and the intermediate nodes are suitably selected "classes" of outcomes, forming [[<del style="font-weight: bold; text-decoration: none;">Latent</del> variable<del style="font-weight: bold; text-decoration: none;">|latent variables</del>]].&lt;ref name=":1" /&gt;&lt;ref name=":2"&gt;{{Cite journal |last1=Morin |first1=Frederic |last2=Bengio |first2=Yoshua |date=2005-01-06 |title=Hierarchical Probabilistic Neural Network Language Model |url=https://www.iro.umontreal.ca/~lisa/pointeurs/hierarchical-nnlm-aistats05.pdf |journal=International Workshop on Artificial Intelligence and Statistics |language=en |publisher=PMLR |pages=246–252}}&lt;/ref&gt; The desired probability (softmax value) of a leaf (outcome) can then be calculated as the product of the probabilities of all nodes on the path from the root to that leaf.&lt;ref name=":1" /&gt; Ideally, when the tree is balanced, this would reduce the [[computational complexity]] from &lt;math&gt;O(K)&lt;/math&gt; to &lt;math&gt;O(\log_2 K)&lt;/math&gt;.&lt;ref name=":2" /&gt; In practice, results depend on choosing a good strategy for clustering the outcomes into classes.&lt;ref name=":1" /&gt;&lt;ref name=":2" /&gt; A [[Huffman tree]] was used for this in Google's [[word2vec]] models (introduced in 2013) to achieve scalability.&lt;ref name=":0" /&gt;</div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>Approaches that reorganize the softmax layer for more efficient calculation include the '''hierarchical softmax''' and the '''differentiated softmax'''.&lt;ref name=":0" /&gt; The hierarchical softmax (introduced by Morin and [[Yoshua Bengio|Bengio]] in 2005) uses a binary tree structure where the outcomes (vocabulary words) are the leaves and the intermediate nodes are suitably selected "classes" of outcomes, forming [[<ins style="font-weight: bold; text-decoration: none;">latent</ins> variable]]<ins style="font-weight: bold; text-decoration: none;">s</ins>.&lt;ref name=":1" /&gt;&lt;ref name=":2"&gt;{{Cite journal |last1=Morin 
|first1=Frederic |last2=Bengio |first2=Yoshua |date=2005-01-06 |title=Hierarchical Probabilistic Neural Network Language Model |url=https://www.iro.umontreal.ca/~lisa/pointeurs/hierarchical-nnlm-aistats05.pdf |journal=International Workshop on Artificial Intelligence and Statistics |language=en |publisher=PMLR |pages=246–252}}&lt;/ref&gt; The desired probability (softmax value) of a leaf (outcome) can then be calculated as the product of the probabilities of all nodes on the path from the root to that leaf.&lt;ref name=":1" /&gt; Ideally, when the tree is balanced, this would reduce the [[computational complexity]] from &lt;math&gt;O(K)&lt;/math&gt; to &lt;math&gt;O(\log_2 K)&lt;/math&gt;.&lt;ref name=":2" /&gt; In practice, results depend on choosing a good strategy for clustering the outcomes into classes.&lt;ref name=":1" /&gt;&lt;ref name=":2" /&gt; A [[Huffman tree]] was used for this in Google's [[word2vec]] models (introduced in 2013) to achieve scalability.&lt;ref name=":0" /&gt;</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>A second kind of remedies is based on approximating the softmax (during training) with modified loss functions that avoid the calculation of the full normalization factor.&lt;ref name=":0" /&gt; These include methods that restrict the normalization sum to a sample of outcomes (e.g. Importance Sampling, Target Sampling).&lt;ref name=":0" /&gt;&lt;ref name=":1" /&gt;</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>A second kind of remedies is based on approximating the softmax (during training) with modified loss functions that avoid the calculation of the full normalization factor.&lt;ref name=":0" /&gt; These include methods that restrict the normalization sum to a sample of outcomes (e.g. Importance Sampling, Target Sampling).&lt;ref name=":0" /&gt;&lt;ref name=":1" /&gt;</div></td> </tr> <tr> <td colspan="2" class="diff-lineno">Line 111:</td> <td colspan="2" class="diff-lineno">Line 111:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The standard softmax is numerically unstable because of large exponentiations. The '''safe softmax''' method calculates instead&lt;math display="block"&gt;\sigma(\mathbf{z})_i = \frac{e^{\beta (z_i-m)}}{\sum_{j=1}^K e^{\beta (z_j-m)}}&lt;/math&gt;where &lt;math&gt;m = \max_i z_i &lt;/math&gt; is the largest factor involved. 
Subtracting by it guarantees that the exponentiations result in at most 1.</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The standard softmax is numerically unstable because of large exponentiations. The '''safe softmax''' method calculates instead&lt;math display="block"&gt;\sigma(\mathbf{z})_i = \frac{e^{\beta (z_i-m)}}{\sum_{j=1}^K e^{\beta (z_j-m)}}&lt;/math&gt;where &lt;math&gt;m = \max_i z_i &lt;/math&gt; is the largest factor involved. Subtracting by it guarantees that the exponentiations result in at most 1.</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The [[Attention (machine learning)|attention mechanism]] in [[Transformer (deep learning architecture)|Transformers]] takes three arguments: a "query vector" &lt;math&gt;q&lt;/math&gt;, a list of "key vectors" &lt;math&gt;k_1, \dots, k_N&lt;/math&gt;, and a list of "value vectors" &lt;math&gt;v_1, \dots, v_N&lt;/math&gt;, and outputs a softmax-weighted sum over value vectors:&lt;math display="block"&gt;o = \sum_{i=1}^N \frac{e^{q^T k_i - m}}{\sum_{j=1}^N e^{q^T k_j - m}} v_i&lt;/math&gt;The standard softmax method involves several loops over the inputs, which would be [[Memory bottleneck|bottlenecked by memory bandwidth]]. The [[FlashAttention]] method is a [[communication-avoiding algorithm]] that fuses these operations into a single loop, increasing the [[arithmetic intensity]]. 
It is an [[online algorithm]] that computes the following quantities:&lt;ref&gt;{{<del style="font-weight: bold; text-decoration: none;">Citation</del> <del style="font-weight: bold; text-decoration: none;">|last=Milakov</del> |<del style="font-weight: bold; text-decoration: none;">first=Maxim |title=Online normalizer calculation for softmax |date=2018-07-28 |url=https://</del>arxiv<del style="font-weight: bold; text-decoration: none;">.org/abs/1805.02867 |publisher</del>=<del style="font-weight: bold; text-decoration: none;">arXiv |doi=10.48550/arXiv.</del>1805.02867<del style="font-weight: bold; text-decoration: none;"> |id=arXiv:1805.02867 |last2=Gimelshein |first2=Natalia</del>}}&lt;/ref&gt;&lt;ref&gt;{{Cite journal |last=Dao |first=Tri |last2=Fu |first2=Dan |last3=Ermon |first3=Stefano |last4=Rudra |first4=Atri |last5=Ré |first5=Christopher |date=2022-12-06 |title=FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness |url=https://proceedings.neurips.cc/paper_files/paper/2022/hash/67d57c32e20fd0a7a302cb81d36e40d5-Abstract-Conference.html |journal=Advances in Neural Information Processing Systems |language=en |volume=35 |pages=16344–16359}}&lt;/ref&gt;&lt;math display="block"&gt;\begin{aligned}</div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The [[Attention (machine learning)|attention mechanism]] in [[Transformer (deep learning architecture)|Transformers]] takes three arguments: a "query vector" &lt;math&gt;q&lt;/math&gt;, a list of "key vectors" &lt;math&gt;k_1, \dots, k_N&lt;/math&gt;, and a list of "value vectors" &lt;math&gt;v_1, \dots, v_N&lt;/math&gt;, and outputs a softmax-weighted sum over value vectors:&lt;math display="block"&gt;o = \sum_{i=1}^N \frac{e^{q^T k_i - m}}{\sum_{j=1}^N e^{q^T k_j - m}} v_i&lt;/math&gt;The standard softmax method involves several loops over the inputs, which would be [[Memory bottleneck|bottlenecked by memory bandwidth]]. The [[FlashAttention]] method is a [[communication-avoiding algorithm]] that fuses these operations into a single loop, increasing the [[arithmetic intensity]]. 
It is an [[online algorithm]] that computes the following quantities:&lt;ref&gt;{{<ins style="font-weight: bold; text-decoration: none;">cite</ins> <ins style="font-weight: bold; text-decoration: none;">arxiv</ins> |arxiv=1805.02867}}&lt;/ref&gt;&lt;ref&gt;{{Cite journal |last=Dao |first=Tri |last2=Fu |first2=Dan |last3=Ermon |first3=Stefano |last4=Rudra |first4=Atri |last5=Ré |first5=Christopher |date=2022-12-06 |title=FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness |url=https://proceedings.neurips.cc/paper_files/paper/2022/hash/67d57c32e20fd0a7a302cb81d36e40d5-Abstract-Conference.html |journal=Advances in Neural Information Processing Systems |language=en |volume=35 |pages=16344–16359}}&lt;/ref&gt;&lt;math display="block"&gt;\begin{aligned}</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>z_i &amp;= q^T k_i &amp;\\</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>z_i &amp;= q^T k_i &amp;\\</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>m_i &amp;= \max(z_1, \dots, z_i) &amp;=&amp; \max(m_{i-1}, z_i)\\</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>m_i &amp;= \max(z_1, \dots, z_i) &amp;=&amp; \max(m_{i-1}, z_i)\\</div></td> </tr> <tr> <td colspan="2" class="diff-lineno">Line 150:</td> <td colspan="2" class="diff-lineno">Line 150:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>}}&lt;/ref&gt;</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>}}&lt;/ref&gt;</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The use of the softmax in [[decision theory]] is credited to [[R. 
Duncan Luce]],&lt;ref name="Gao"&gt;{{cite arXiv|eprint=1704.00805|last1=Gao|first1=Bolin|title=On the Properties of the Softmax Function with Application in Game Theory and Reinforcement Learning|last2=Pavel|first2=Lacra|author2-link=Lacra Pavel|class=math.OC|year=2017}}&lt;/ref&gt;{{rp|1}} who used the axiom of [[independence of irrelevant alternatives]] in [[rational choice theory]] to deduce the softmax in [[Luce's choice axiom]] for relative preferences.{{<del style="font-weight: bold; text-decoration: none;">Cn</del>|date=March 2024}}</div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The use of the softmax in [[decision theory]] is credited to [[R. Duncan Luce]],&lt;ref name="Gao"&gt;{{cite arXiv|eprint=1704.00805|last1=Gao|first1=Bolin|title=On the Properties of the Softmax Function with Application in Game Theory and Reinforcement Learning|last2=Pavel|first2=Lacra|author2-link=Lacra Pavel|class=math.OC|year=2017}}&lt;/ref&gt;{{rp|1}} who used the axiom of [[independence of irrelevant alternatives]] in [[rational choice theory]] to deduce the softmax in [[Luce's choice axiom]] for relative preferences.{{<ins style="font-weight: bold; text-decoration: none;">Citation needed</ins>|date=March 2024}}</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>In machine learning, the term "softmax" is credited to John S. Bridle in two 1989 conference papers, {{harvtxt|Bridle|1990a}}:&lt;ref name="Gao"/&gt;{{rp|1}} and {{harvtxt|Bridle|1990b}}:&lt;ref name="sako2018"/&gt;</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>In machine learning, the term "softmax" is credited to John S. 
Bridle in two 1989 conference papers, {{harvtxt|Bridle|1990a}}:&lt;ref name="Gao"/&gt;{{rp|1}} and {{harvtxt|Bridle|1990b}}:&lt;ref name="sako2018"/&gt;</div></td> </tr> <tr> <td colspan="2" class="diff-empty diff-side-deleted"></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>{{blockquote</div></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>{{quote</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>|We are concerned with feed-forward non-linear networks (multi-layer perceptrons, or MLPs) with multiple outputs. We wish to treat the outputs of the network as probabilities of alternatives (''e.g.'' pattern classes), conditioned on the inputs. We look for appropriate output non-linearities and for appropriate criteria for adaptation of the parameters of the network (''e.g.'' weights). We explain two modifications: probability scoring, which is an alternative to squared error minimisation, and a normalised exponential ('''softmax''') multi-input generalisation of the logistic non-linearity.&lt;ref&gt;{{cite conference</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>|We are concerned with feed-forward non-linear networks (multi-layer perceptrons, or MLPs) with multiple outputs. We wish to treat the outputs of the network as probabilities of alternatives (''e.g.'' pattern classes), conditioned on the inputs. We look for appropriate output non-linearities and for appropriate criteria for adaptation of the parameters of the network (''e.g.'' weights). We explain two modifications: probability scoring, which is an alternative to squared error minimisation, and a normalised exponential ('''softmax''') multi-input generalisation of the logistic non-linearity.&lt;ref&gt;{{cite conference</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>|first=John S. |last=Bridle</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>|first=John S. 
|last=Bridle</div></td> </tr> <tr> <td colspan="2" class="diff-lineno">Line 169:</td> <td colspan="2" class="diff-lineno">Line 169:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>}}&lt;/ref&gt;{{rp|227}}</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>}}&lt;/ref&gt;{{rp|227}}</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>}}</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>}}</div></td> </tr> <tr> <td colspan="2" class="diff-empty diff-side-deleted"></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>{{blockquote</div></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>{{quote</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>|For any input, the outputs must all be positive and they must sum to unity. ...</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>|For any input, the outputs must all be positive and they must sum to unity. 
...</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> </table> Headbomb https://en.wikipedia.org/w/index.php?title=Softmax_function&diff=1277701851&oldid=prev 2601:40F:4480:DB30:A910:2ED7:313:DBF7: /* Probability theory */ 2025-02-26T05:44:51Z <p><span class="autocomment">Probability theory</span></p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Previous revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 05:44, 26 February 2025</td> </tr><tr> <td colspan="2" class="diff-lineno">Line 44:</td> <td colspan="2" class="diff-lineno">Line 44:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>&lt;math display="block"&gt;\sigma_{1/100}(0,\, 10) = \left(1/\left(1 + e^{1/10}\right),\, e^{1/10}/\left(1 + e^{1/10}\right)\right) \approx (0.475,\, 0.525).&lt;/math&gt;</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>&lt;math display="block"&gt;\sigma_{1/100}(0,\, 10) = \left(1/\left(1 + e^{1/10}\right),\, e^{1/10}/\left(1 + e^{1/10}\right)\right) \approx (0.475,\, 0.525).&lt;/math&gt;</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>As {{tmath|\beta \to \infty}}, temperature goes to zero, &lt;math&gt;T = 1/\beta \to 0&lt;/math&gt;, so eventually all differences become large (relative to a shrinking temperature), which gives another interpretation for the limit behavior.</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>As {{tmath|\beta \to \infty}}, temperature goes to zero, &lt;math&gt;T = 1/\beta \to 0&lt;/math&gt;, so eventually all differences become large (relative to a shrinking temperature), which gives another interpretation for the limit behavior.</div></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><br /></td> <td colspan="2" class="diff-empty 
diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>=== Probability theory ===</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>In [[probability theory]], the output of the softargmax function can be used to represent a [[categorical distribution]] – that is, a [[probability distribution]] over {{mvar|K}} different possible outcomes.</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>=== Statistical mechanics ===</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>=== Statistical mechanics ===</div></td> </tr> </table> 2601:40F:4480:DB30:A910:2ED7:313:DBF7 https://en.wikipedia.org/w/index.php?title=Softmax_function&diff=1275705830&oldid=prev Omegatron: link to page 2025-02-14T15:59:49Z <p>link to page</p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Previous revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 15:59, 14 February 2025</td> </tr><tr> <td colspan="2" class="diff-lineno">Line 3:</td> <td colspan="2" class="diff-lineno">Line 3:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>{{Redirect|Softmax|the Korean video game company|ESA (company)}}</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>{{Redirect|Softmax|the Korean video game company|ESA (company)}}</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; 
border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>{{Machine learning}}</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>{{Machine learning}}</div></td> </tr> <tr> <td colspan="2" class="diff-empty diff-side-deleted"></td> <td class="diff-marker"><a class="mw-diff-movedpara-right" title="Paragraph was moved. Click to jump to old location." href="#movedpara_3_10_lhs">&#x26AB;</a></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><a name="movedpara_1_0_rhs"></a><ins style="font-weight: bold; text-decoration: none;">The '''softmax function,''' also known as '''softargmax'''&lt;ref&gt;{{cite book |last1=Goodfellow |first1=Ian |url=http://www.deeplearningbook.org |title=Deep Learning |last2=Bengio |first2=Yoshua |last3=Courville |first3=Aaron |publisher=MIT Press |year=2016 |isbn=978-0-26203561-3 |pages=180–184 |section=6.2.2.3 Softmax Units for Multinoulli Output Distributions |chapter-url=https://www.deeplearningbook.org/contents/mlp.html#pf15 |authorlink1=Ian Goodfellow |authorlink2=Yoshua Bengio</ins>}}&lt;/ref&gt;{{rp|184}} or '''normalized exponential function''',&lt;ref name="bishop" /&gt;{{rp|198}} converts a vector of {{mvar|K}} real numbers into a [[probability distribution]] of {{mvar|K}} possible outcomes. It is a generalization of the [[logistic function]] to multiple dimensions, and is used in [[multinomial logistic regression]]. 
The softmax function is often used as the last [[activation function]] of a [[Artificial neural network|neural network]] to normalize the output of a network to a [[probability distribution]] over predicted output classes.</div></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The '''softmax function,''' also known as '''softargmax'''&lt;ref&gt;{{cite book</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>|title=Deep Learning |year=2016</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>|first1=Ian |last1=Goodfellow |authorlink1=Ian Goodfellow</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>|first2=Yoshua |last2=Bengio |authorlink2=Yoshua Bengio</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>|first3=Aaron |last3=Courville</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>|publisher=MIT Press</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>|url=http://www.deeplearningbook.org</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>|isbn=978-0-26203561-3</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>|chapter-url=https://www.deeplearningbook.org/contents/mlp.html</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td 
style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>|section=6.2.2.3 Softmax Units for Multinoulli Output Distributions</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>|pages=180–184</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker"><a class="mw-diff-movedpara-left" title="Paragraph was moved. Click to jump to new location." href="#movedpara_1_0_rhs">&#x26AB;</a></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><a name="movedpara_3_10_lhs"></a>}}&lt;/ref&gt;{{rp|184}} or '''normalized exponential function''',&lt;ref name="bishop" /&gt;{{rp|198}} converts a vector of {{mvar|K}} real numbers into a [[probability distribution]] of {{mvar|K}} possible outcomes. It is a generalization of the [[logistic function]] to multiple dimensions, and is used in [[multinomial logistic regression]]. The softmax function is often used as the last [[activation function]] of a [[Artificial neural network|neural network]] to normalize the output of a network to a [[probability distribution]] over predicted output classes.</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Definition ==</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Definition ==</div></td> </tr> </table> Omegatron https://en.wikipedia.org/w/index.php?title=Softmax_function&diff=1273599451&oldid=prev Cosmia Nebula: /* Numerical algorithms */ 2025-02-03T03:44:25Z <p><span class="autocomment">Numerical algorithms</span></p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Previous revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 03:44, 3 February 2025</td> </tr><tr> <td colspan="2" class="diff-lineno">Line 130:</td> <td colspan="2" 
class="diff-lineno">Line 130:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>l_i &amp;= e^{z_1 - m_i} + \dots + e^{z_i - m_i} &amp;=&amp; e^{m_{i-1}-m_i} l_{i-1} + e^{z_i - m_i}\\</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>l_i &amp;= e^{z_1 - m_i} + \dots + e^{z_i - m_i} &amp;=&amp; e^{m_{i-1}-m_i} l_{i-1} + e^{z_i - m_i}\\</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>o_i &amp;= e^{z_1 - m_i} v_1 + \dots + e^{z_i - m_i}v_i &amp;=&amp; e^{m_{i-1}-m_i} o_{i-1} + e^{z_i - m_i}v_i</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>o_i &amp;= e^{z_1 - m_i} v_1 + \dots + e^{z_i - m_i}v_i &amp;=&amp; e^{m_{i-1}-m_i} o_{i-1} + e^{z_i - m_i}v_i</div></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>\end{aligned}&lt;/math&gt;and returns &lt;math&gt;o_N/l_N&lt;/math&gt;. In practice, FlashAttention operates over multiple queries and keys per loop iteration, in a similar way as [[Communication-avoiding algorithm#Blocked (tiled) matrix multiplication|blocked matrix multiplication]]. If [[backpropagation]] is needed, then the output vectors and the intermediate arrays &lt;math&gt;[m_1, \dots, m_N], [l_1, \dots, l_N]&lt;/math&gt; are cached, and during the backward pass, attention matrices are [[Rematerialization|rematerialized]] from these.</div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>\end{aligned}&lt;/math&gt;and returns &lt;math&gt;o_N/l_N&lt;/math&gt;. In practice, FlashAttention operates over multiple queries and keys per loop iteration, in a similar way as [[Communication-avoiding algorithm#Blocked (tiled) matrix multiplication|blocked matrix multiplication]]. 
If [[backpropagation]] is needed, then the output vectors and the intermediate arrays &lt;math&gt;[m_1, \dots, m_N], [l_1, \dots, l_N]&lt;/math&gt; are cached, and during the backward pass, attention matrices are [[Rematerialization|rematerialized]] from these<ins style="font-weight: bold; text-decoration: none;">, making it a form of gradient checkpointing</ins>.</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Mathematical properties ==</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Mathematical properties ==</div></td> </tr> </table> Cosmia Nebula https://en.wikipedia.org/w/index.php?title=Softmax_function&diff=1273599357&oldid=prev Cosmia Nebula: /* Numerical algorithms */ 2025-02-03T03:43:40Z <p><span class="autocomment">Numerical algorithms</span></p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Previous revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 03:43, 3 February 2025</td> </tr><tr> <td colspan="2" class="diff-lineno">Line 130:</td> <td colspan="2" class="diff-lineno">Line 130:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>l_i &amp;= e^{z_1 - m_i} + \dots + e^{z_i - m_i} &amp;=&amp; e^{m_{i-1}-m_i} l_{i-1} + e^{z_i - m_i}\\</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>l_i &amp;= e^{z_1 - m_i} + \dots + e^{z_i - m_i} &amp;=&amp; e^{m_{i-1}-m_i} l_{i-1} + e^{z_i - m_i}\\</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>o_i &amp;= e^{z_1 - m_i} v_1 + \dots + e^{z_i - m_i}v_i &amp;=&amp; e^{m_{i-1}-m_i} o_{i-1} + e^{z_i - m_i}v_i</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 
1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>o_i &amp;= e^{z_1 - m_i} v_1 + \dots + e^{z_i - m_i}v_i &amp;=&amp; e^{m_{i-1}-m_i} o_{i-1} + e^{z_i - m_i}v_i</div></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>\end{aligned}&lt;/math&gt;and returns &lt;math&gt;o_N/l_N&lt;/math&gt;. If [[backpropagation]] is needed, then the arrays &lt;math&gt;[m_1, \dots, m_N], [l_1, \dots, l_N]&lt;/math&gt; are cached, and during the backward pass, attention matrices are [[Rematerialization|rematerialized]]<del style="font-weight: bold; text-decoration: none;">.</del> <del style="font-weight: bold; text-decoration: none;">In practice, FlashAttention operates</del> <del style="font-weight: bold; text-decoration: none;">over multiple queries and keys per loop iteration, in a similar way as [[Communication-avoiding algorithm#Blocked (tiled) matrix multiplication|blocked matrix multiplication]]</del>.</div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>\end{aligned}&lt;/math&gt;and returns &lt;math&gt;o_N/l_N&lt;/math&gt;<ins style="font-weight: bold; text-decoration: none;">. In practice, FlashAttention operates over multiple queries and keys per loop iteration, in a similar way as [[Communication-avoiding algorithm#Blocked (tiled) matrix multiplication|blocked matrix multiplication]]</ins>. If [[backpropagation]] is needed, then the<ins style="font-weight: bold; text-decoration: none;"> output vectors and the intermediate</ins> arrays &lt;math&gt;[m_1, \dots, m_N], [l_1, \dots, l_N]&lt;/math&gt; are cached, and during the backward pass, attention matrices are [[Rematerialization|rematerialized]] <ins style="font-weight: bold; text-decoration: none;">from</ins> <ins style="font-weight: bold; text-decoration: none;">these</ins>.</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Mathematical properties ==</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Mathematical properties ==</div></td> </tr> </table> Cosmia Nebula https://en.wikipedia.org/w/index.php?title=Softmax_function&diff=1273599247&oldid=prev Cosmia Nebula: /* Numerical algorithms */ backprop 2025-02-03T03:42:51Z <p><span class="autocomment">Numerical 
algorithms: </span> backprop</p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Previous revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 03:42, 3 February 2025</td> </tr><tr> <td colspan="2" class="diff-lineno">Line 128:</td> <td colspan="2" class="diff-lineno">Line 128:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>z_i &amp;= q^T k_i &amp;\\</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>z_i &amp;= q^T k_i &amp;\\</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>m_i &amp;= \max(z_1, \dots, z_i) &amp;=&amp; \max(m_{i-1}, z_i)\\</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>m_i &amp;= \max(z_1, \dots, z_i) &amp;=&amp; \max(m_{i-1}, z_i)\\</div></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;">s_i</del> &amp;= e^{z_1 - m_i} + \dots + e^{z_i - m_i} &amp;=&amp; e^{m_{i-1}-m_i} <del style="font-weight: bold; text-decoration: none;">s_</del>{i-1} + e^{z_i - m_i}\\</div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">l_i</ins> &amp;= e^{z_1 - m_i} + \dots + e^{z_i - m_i} &amp;=&amp; e^{m_{i-1}-m_i} <ins style="font-weight: bold; text-decoration: none;">l_</ins>{i-1} + e^{z_i - m_i}\\</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>o_i &amp;= e^{z_1 - m_i} v_1 + \dots + e^{z_i - m_i}v_i &amp;=&amp; e^{m_{i-1}-m_i} o_{i-1} + e^{z_i - m_i}v_i</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>o_i &amp;= e^{z_1 - m_i} v_1 + \dots + e^{z_i - m_i}v_i &amp;=&amp; e^{m_{i-1}-m_i} o_{i-1} + e^{z_i - 
m_i}v_i</div></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>\end{aligned}&lt;/math&gt;and returns &lt;math&gt;o_N/<del style="font-weight: bold; text-decoration: none;">s_N</del>&lt;/math&gt;. In practice, FlashAttention operates over multiple queries and keys per loop iteration, in a similar way as [[Communication-avoiding algorithm#Blocked (tiled) matrix multiplication|blocked matrix multiplication]].</div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>\end{aligned}&lt;/math&gt;and returns &lt;math&gt;o_N/<ins style="font-weight: bold; text-decoration: none;">l_N</ins>&lt;/math&gt;<ins style="font-weight: bold; text-decoration: none;">. If [[backpropagation]] is needed, then the arrays &lt;math&gt;[m_1, \dots, m_N], [l_1, \dots, l_N]&lt;/math&gt; are cached, and during the backward pass, attention matrices are [[Rematerialization|rematerialized]]</ins>. In practice, FlashAttention operates over multiple queries and keys per loop iteration, in a similar way as [[Communication-avoiding algorithm#Blocked (tiled) matrix multiplication|blocked matrix multiplication]].</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Mathematical properties ==</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Mathematical properties ==</div></td> </tr> </table> Cosmia Nebula
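The diffs above quote the article's one-pass ("online") softmax recurrence used by FlashAttention: running maximum m_i = max(m_{i-1}, z_i), running normalizer l_i = e^{m_{i-1}-m_i} l_{i-1} + e^{z_i - m_i}, and running weighted sum o_i = e^{m_{i-1}-m_i} o_{i-1} + e^{z_i - m_i} v_i, with the result o_N / l_N. Below is a minimal Python sketch of that recurrence, not taken from the article: the function name is hypothetical, values are kept as scalars for brevity (FlashAttention uses vectors and processes tiles of queries/keys per iteration rather than one key at a time), and no caching of m and l for the backward-pass rematerialization mentioned in the last diff is shown.

```python
import math

def online_softmax_attention(q, keys, values):
    """Streaming softmax-weighted average following the quoted recurrence.

    q      : query vector (list of floats)
    keys   : list of key vectors k_i (each the same length as q)
    values : list of scalar values v_i (scalars only to keep the sketch short)

    Returns o_N / l_N = sum_i softmax(z)_i * v_i, computed in one pass
    without materializing the full score vector z.
    """
    m = float("-inf")  # m_i = max(z_1, ..., z_i)
    l = 0.0            # l_i = sum_{j<=i} exp(z_j - m_i)
    o = 0.0            # o_i = sum_{j<=i} exp(z_j - m_i) * v_j

    for k, v in zip(keys, values):
        z = sum(qj * kj for qj, kj in zip(q, k))  # z_i = q^T k_i
        m_new = max(m, z)                         # m_i = max(m_{i-1}, z_i)
        scale = math.exp(m - m_new)               # e^{m_{i-1} - m_i}; 0.0 on the first step
        l = scale * l + math.exp(z - m_new)       # l_i = e^{m_{i-1}-m_i} l_{i-1} + e^{z_i - m_i}
        o = scale * o + math.exp(z - m_new) * v   # o_i = e^{m_{i-1}-m_i} o_{i-1} + e^{z_i - m_i} v_i
        m = m_new

    return o / l


# Example usage (illustrative numbers only): the result equals the naive
# softmax-weighted average of the values, but no score ever exceeds exp(0)
# in the intermediate arithmetic, avoiding overflow for large z_i.
if __name__ == "__main__":
    q = [1.0, 2.0]
    keys = [[0.5, 1.0], [2.0, -1.0], [1.5, 0.5]]
    values = [10.0, 20.0, 30.0]
    print(online_softmax_attention(q, keys, values))
```

Subtracting the running maximum before exponentiating is what keeps the computation numerically stable; rescaling the previous l and o by e^{m_{i-1}-m_i} whenever the maximum increases is what lets the same trick work in a single streaming pass.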