Softmax function - Revision history
https://en.wikipedia.org/w/index.php?action=history&feed=atom&title=Softmax_function
Revision history for this page on the wiki · 2025-06-12T03:44:43Z · MediaWiki 1.45.0-wmf.4

Revision as of 19:53, 29 May 2025 by Quantling
https://en.wikipedia.org/w/index.php?title=Softmax_function&diff=1292950500&oldid=prev
Edit summary: /* Definition */ prevent break between math and punctuation

Line 9:
  The softmax function takes as input a tuple {{mvar|z}} of {{mvar|K}} real numbers, and normalizes it into a [[probability distribution]] consisting of {{mvar|K}} probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some tuple components could be negative, or greater than one; and might not sum to 1; but after applying softmax, each component will be in the [[Interval (mathematics)|interval]] <math>(0, 1)</math>, and the components will add up to 1, so that they can be interpreted as probabilities. Furthermore, the larger input components will correspond to larger probabilities.
− Formally, the standard (unit) softmax function <math>\sigma\colon \R^K \to (0, 1)^K</math>, where <math>K > 1</math>, takes a tuple <math>\mathbf{z} = (z_1, \dotsc, z_K) \in \R^K</math> and computes each component of vector <math>\sigma(\mathbf{z}) \in (0, 1)^K</math> with
+ Formally, the standard (unit) softmax function <math>\sigma\colon \R^K \to (0, 1)^K</math>, where {{tmath|K > 1}}, takes a tuple <math>\mathbf{z} = (z_1, \dotsc, z_K) \in \R^K</math> and computes each component of vector <math>\sigma(\mathbf{z}) \in (0, 1)^K</math> with
  <math display="block">\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}\,.</math>
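The definition quoted in this diff is easy to check numerically. The following is a minimal NumPy sketch, not part of the article's wikitext; the function name and the sample input are chosen only for illustration.

<syntaxhighlight lang="python">
import numpy as np

def softmax(z):
    """Standard (unit) softmax: sigma(z)_i = exp(z_i) / sum_j exp(z_j)."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z)        # exponentiate every component
    return e / e.sum()   # normalize so the components sum to 1

p = softmax([1.0, 2.0, 8.0])
print(p)          # approximately [0.0009, 0.0025, 0.9966]
print(p.sum())    # 1.0; every entry lies in the interval (0, 1)
</syntaxhighlight>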
Revision as of 15:22, 27 May 2025 by Quantling
https://en.wikipedia.org/w/index.php?title=Softmax_function&diff=1292548509&oldid=prev
Edit summary: A vector has magnitude and direction, but what would the magnitude of an input to softmax be? Let's instead call it a [[tuple]].

Line 3:
  {{Redirect|Softmax|the Korean video game company|ESA (company)}}
  {{Machine learning}}
− The '''softmax function,''' also known as '''softargmax'''<ref>{{cite book |last1=Goodfellow |first1=Ian |url=http://www.deeplearningbook.org |title=Deep Learning |last2=Bengio |first2=Yoshua |last3=Courville |first3=Aaron |publisher=MIT Press |year=2016 |isbn=978-0-26203561-3 |pages=180–184 |section=6.2.2.3 Softmax Units for Multinoulli Output Distributions |chapter-url=https://www.deeplearningbook.org/contents/mlp.html#pf15 |authorlink1=Ian Goodfellow |authorlink2=Yoshua Bengio}}</ref>{{rp|184}} or '''normalized exponential function''',<ref name="bishop" />{{rp|198}} converts a [[Vector (mathematics and physics)|vector]] of {{mvar|K}} real numbers into a [[probability distribution]] of {{mvar|K}} possible outcomes. It is a generalization of the [[logistic function]] to multiple dimensions, and is used in [[multinomial logistic regression]]. The softmax function is often used as the last [[activation function]] of a [[Artificial neural network|neural network]] to normalize the output of a network to a [[probability distribution]] over predicted output classes.
+ The '''softmax function,''' also known as '''softargmax'''<ref>{{cite book |last1=Goodfellow |first1=Ian |url=http://www.deeplearningbook.org |title=Deep Learning |last2=Bengio |first2=Yoshua |last3=Courville |first3=Aaron |publisher=MIT Press |year=2016 |isbn=978-0-26203561-3 |pages=180–184 |section=6.2.2.3 Softmax Units for Multinoulli Output Distributions |chapter-url=https://www.deeplearningbook.org/contents/mlp.html#pf15 |authorlink1=Ian Goodfellow |authorlink2=Yoshua Bengio}}</ref>{{rp|184}} or '''normalized exponential function''',<ref name="bishop" />{{rp|198}} converts a [[tuple]] of {{mvar|K}} real numbers into a [[probability distribution]] of {{mvar|K}} possible outcomes. It is a generalization of the [[logistic function]] to multiple dimensions, and is used in [[multinomial logistic regression]]. The softmax function is often used as the last [[activation function]] of a [[Artificial neural network|neural network]] to normalize the output of a network to a [[probability distribution]] over predicted output classes.

  == Definition ==

− The softmax function takes as input a vector {{mvar|z}} of {{mvar|K}} real numbers, and normalizes it into a [[probability distribution]] consisting of {{mvar|K}} probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some vector components could be negative, or greater than one; and might not sum to 1; but after applying softmax, each component will be in the [[Interval (mathematics)|interval]] <math>(0, 1)</math>, and the components will add up to 1, so that they can be interpreted as probabilities. Furthermore, the larger input components will correspond to larger probabilities.
+ The softmax function takes as input a tuple {{mvar|z}} of {{mvar|K}} real numbers, and normalizes it into a [[probability distribution]] consisting of {{mvar|K}} probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some tuple components could be negative, or greater than one; and might not sum to 1; but after applying softmax, each component will be in the [[Interval (mathematics)|interval]] <math>(0, 1)</math>, and the components will add up to 1, so that they can be interpreted as probabilities. Furthermore, the larger input components will correspond to larger probabilities.

− Formally, the standard (unit) softmax function <math>\sigma\colon \R^K \to (0, 1)^K</math>, where <math>K > 1</math>, takes a vector <math>\mathbf{z} = (z_1, \dotsc, z_K) \in \R^K</math> and computes each component of vector <math>\sigma(\mathbf{z}) \in (0, 1)^K</math> with
+ Formally, the standard (unit) softmax function <math>\sigma\colon \R^K \to (0, 1)^K</math>, where <math>K > 1</math>, takes a tuple <math>\mathbf{z} = (z_1, \dotsc, z_K) \in \R^K</math> and computes each component of vector <math>\sigma(\mathbf{z}) \in (0, 1)^K</math> with

  <math display="block">\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}\,.</math>

− In words, the softmax applies the standard [[exponential function]] to each element <math>z_i</math> of the input vector <math>\mathbf z</math> (consisting of <math>K</math> real numbers), and normalizes these values by dividing by the sum of all these exponentials. The normalization ensures that the sum of the components of the output vector <math>\sigma(\mathbf z)</math> is 1. The term "softmax" derives from the amplifying effects of the exponential on any maxima in the input vector. For example, the standard softmax of <math>(1,2,8)</math> is approximately <math>(0.001,0.002,0.997)</math>, which amounts to assigning almost all of the total unit weight in the result to the position of the vector's maximal element (of 8).
+ In words, the softmax applies the standard [[exponential function]] to each element <math>z_i</math> of the input tuple <math>\mathbf z</math> (consisting of <math>K</math> real numbers), and normalizes these values by dividing by the sum of all these exponentials. The normalization ensures that the sum of the components of the output vector <math>\sigma(\mathbf z)</math> is 1. The term "softmax" derives from the amplifying effects of the exponential on any maxima in the input tuple. For example, the standard softmax of <math>(1,2,8)</math> is approximately <math>(0.001,0.002,0.997)</math>, which amounts to assigning almost all of the total unit weight in the result to the position of the tuple's maximal element (of 8).

  In general, instead of {{mvar|e}} a different [[Base (exponentiation)|base]] {{math|b > 0}} can be used. As above, if {{math|b > 1}} then larger input components will result in larger output probabilities, and increasing the value of {{mvar|b}} will create probability distributions that are more concentrated around the positions of the largest input values. Conversely, if {{math|0 < b < 1}} then smaller input components will result in larger output probabilities, and decreasing the value of {{mvar|b}} will create probability distributions that are more concentrated around the positions of the smallest input values. Writing <math>b = e^\beta</math> or <math>b = e^{-\beta}</math>{{efn|Positive {{mvar|β}} corresponds to the maximum convention, and is usual in machine learning, corresponding to the highest score having highest probability. The negative {{math|−β}} corresponds to the minimum convention, and is conventional in thermodynamics, corresponding to the lowest energy state having the highest probability; this matches the convention in the [[Gibbs distribution]], interpreting {{mvar|β}} as [[coldness]].}} (for real {{mvar|β}}){{efn|1=The notation {{mvar|β}} is for the [[thermodynamic beta]], which is inverse [[temperature]]: <math>\beta = 1/T</math>, <math>T = 1/\beta.</math>}} yields the expressions:{{efn|For <math>\beta = 0</math> ([[coldness]] zero, infinite temperature), <math>b = e^\beta = e^0 = 1</math>, and this becomes the constant function {{tmath|(1/n, \dots, 1/n)}}, corresponding to the [[discrete uniform distribution]].}}
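The base-{{mvar|b}} (equivalently {{mvar|β}}) behaviour described in the context paragraph above can be illustrated with a small sketch; the function name softmax_beta and the parameter name beta are not from the article and are used here only as placeholders.

<syntaxhighlight lang="python">
import numpy as np

def softmax_beta(z, beta=1.0):
    """Softmax with base b = e**beta, i.e. sigma(z)_i proportional to exp(beta * z_i)."""
    z = np.asarray(z, dtype=float)
    e = np.exp(beta * z)
    return e / e.sum()

z = [1.0, 2.0, 8.0]
print(softmax_beta(z, beta=0.0))   # [1/3, 1/3, 1/3]: b = 1 gives the discrete uniform distribution
print(softmax_beta(z, beta=1.0))   # ~ [0.001, 0.002, 0.997], as in the article's example
print(softmax_beta(z, beta=4.0))   # larger beta: even more concentrated on the largest input
print(softmax_beta(z, beta=-1.0))  # 0 < b < 1: concentrates on the smallest input instead
</syntaxhighlight>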
Line 27:
  === Smooth arg max ===
  {{See also|Arg max}}
− The Softmax function is a smooth approximation to the [[arg max]] function: the function whose value is the ''index'' of a vector's largest element. The name "softmax" may be misleading. Softmax is not a [[smooth maximum]] (that is, a [[smooth approximation]] to the [[maximum]] function). The term "softmax" is also used for the closely related [[LogSumExp]] function, which is a smooth maximum. For this reason, some prefer the more accurate term "softargmax", though the term "softmax" is conventional in machine learning.<ref name="sako2018"/>{{sfn|Goodfellow|Bengio|Courville|2016|pp=183–184|ps=: The name "softmax" can be somewhat confusing. The function is more closely related to the arg max function than the max function. The term "soft" derives from the fact that the softmax function is continuous and differentiable. The arg max function, with its result represented as a one-hot vector, is not continuous nor differentiable. The softmax function thus provides a "softened" version of the arg max. The corresponding soft version of the maximum function is <math>\operatorname{softmax}(\mathbf{z})^\top \mathbf{z}</math>. It would perhaps be better to call the softmax function "softargmax," but the current name is an entrenched convention.}} This section uses the term "softargmax" for clarity.
+ The Softmax function is a smooth approximation to the [[arg max]] function: the function whose value is the ''index'' of a tuple's largest element. The name "softmax" may be misleading. Softmax is not a [[smooth maximum]] (that is, a [[smooth approximation]] to the [[maximum]] function). The term "softmax" is also used for the closely related [[LogSumExp]] function, which is a smooth maximum. For this reason, some prefer the more accurate term "softargmax", though the term "softmax" is conventional in machine learning.<ref name="sako2018"/>{{sfn|Goodfellow|Bengio|Courville|2016|pp=183–184|ps=: The name "softmax" can be somewhat confusing. The function is more closely related to the arg max function than the max function. The term "soft" derives from the fact that the softmax function is continuous and differentiable. The arg max function, with its result represented as a one-hot vector, is not continuous nor differentiable. The softmax function thus provides a "softened" version of the arg max. The corresponding soft version of the maximum function is <math>\operatorname{softmax}(\mathbf{z})^\top \mathbf{z}</math>. It would perhaps be better to call the softmax function "softargmax," but the current name is an entrenched convention.}} This section uses the term "softargmax" for clarity.

  Formally, instead of considering the arg max as a function with categorical output <math>1, \dots, n</math> (corresponding to the index), consider the arg max function with [[one-hot]] representation of the output (assuming there is a unique maximum arg):
class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Applications ==</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Applications ==</div></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The softmax function is used in various [[multiclass classification]] methods, such as [[multinomial logistic regression]] (also known as softmax regression),&lt;ref name="bishop"&gt;{{cite book |first=Christopher M. |last=Bishop |year=2006 |title=Pattern Recognition and Machine Learning |publisher=Springer |isbn=0-387-31073-8 }}&lt;/ref&gt;{{rp|206–209}}&lt;ref&gt;{{Cite web |title=Unsupervised Feature Learning and Deep Learning Tutorial |url=http://ufldl.stanford.edu/tutorial/supervised/SoftmaxRegression/ |access-date=2024-03-25 |website=ufldl.stanford.edu}}&lt;/ref&gt; multiclass [[linear discriminant analysis]], [[naive Bayes classifier]]s, and [[artificial neural network]]s.&lt;ref&gt;ai-faq [http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-12.html What is a softmax activation function?]&lt;/ref&gt; Specifically, in multinomial logistic regression and linear discriminant analysis, the input to the function is the result of {{mvar|K}} distinct [[linear function]]s, and the predicted probability for the {{mvar|j}}th class given a sample <del style="font-weight: bold; text-decoration: none;">vector</del> {{math|'''x'''}} and a weighting vector {{math|'''w'''}} is:</div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The softmax function is used in various [[multiclass classification]] methods, such as [[multinomial logistic regression]] (also known as softmax regression),&lt;ref name="bishop"&gt;{{cite book |first=Christopher M. 
|last=Bishop |year=2006 |title=Pattern Recognition and Machine Learning |publisher=Springer |isbn=0-387-31073-8 }}&lt;/ref&gt;{{rp|206–209}}&lt;ref&gt;{{Cite web |title=Unsupervised Feature Learning and Deep Learning Tutorial |url=http://ufldl.stanford.edu/tutorial/supervised/SoftmaxRegression/ |access-date=2024-03-25 |website=ufldl.stanford.edu}}&lt;/ref&gt; multiclass [[linear discriminant analysis]], [[naive Bayes classifier]]s, and [[artificial neural network]]s.&lt;ref&gt;ai-faq [http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-12.html What is a softmax activation function?]&lt;/ref&gt; Specifically, in multinomial logistic regression and linear discriminant analysis, the input to the function is the result of {{mvar|K}} distinct [[linear function]]s, and the predicted probability for the {{mvar|j}}th class given a sample <ins style="font-weight: bold; text-decoration: none;">tuple</ins> {{math|'''x'''}} and a weighting vector {{math|'''w'''}} is:</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>&lt;math display="block"&gt;P(y=j\mid \mathbf{x}) = \frac{e^{\mathbf{x}^\mathsf{T}\mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^\mathsf{T}\mathbf{w}_k}}&lt;/math&gt;</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>&lt;math display="block"&gt;P(y=j\mid \mathbf{x}) = \frac{e^{\mathbf{x}^\mathsf{T}\mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^\mathsf{T}\mathbf{w}_k}}&lt;/math&gt;</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>This can be seen as the [[function composition|composition]] of {{mvar|K}} linear functions &lt;math&gt;\mathbf{x} \mapsto \mathbf{x}^\mathsf{T}\mathbf{w}_1, \ldots, \mathbf{x} \mapsto \mathbf{x}^\mathsf{T}\mathbf{w}_K&lt;/math&gt; and the softmax function (where &lt;math&gt;\mathbf{x}^\mathsf{T}\mathbf{w}&lt;/math&gt; denotes the inner product of &lt;math&gt;\mathbf{x}&lt;/math&gt; and 
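The predicted-probability formula quoted just above can be evaluated with a short sketch; the names class_probabilities, W and x below are illustrative placeholders rather than anything defined in the article.

<syntaxhighlight lang="python">
import numpy as np

def class_probabilities(x, W):
    """P(y = j | x) = exp(x^T w_j) / sum_k exp(x^T w_k), for columns w_j of W."""
    scores = x @ W                      # K linear functions x -> x^T w_j
    e = np.exp(scores - scores.max())   # shift by the max for numerical stability (result unchanged)
    return e / e.sum()

x = np.array([0.5, -1.2, 3.0])                     # sample with 3 features (illustrative)
W = np.random.default_rng(0).normal(size=(3, 4))   # weight vectors for K = 4 classes
p = class_probabilities(x, W)
print(p, p.sum())                                  # K probabilities that sum to 1
</syntaxhighlight>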
− This can be seen as the [[function composition|composition]] of {{mvar|K}} linear functions <math>\mathbf{x} \mapsto \mathbf{x}^\mathsf{T}\mathbf{w}_1, \ldots, \mathbf{x} \mapsto \mathbf{x}^\mathsf{T}\mathbf{w}_K</math> and the softmax function (where <math>\mathbf{x}^\mathsf{T}\mathbf{w}</math> denotes the inner product of <math>\mathbf{x}</math> and <math>\mathbf{w}</math>). The operation is equivalent to applying a linear operator defined by <math>\mathbf{w}</math> to vectors <math>\mathbf{x}</math>, thus transforming the original, probably highly-dimensional, input to vectors in a {{mvar|K}}-dimensional space <math>\mathbb{R}^K</math>.
+ This can be seen as the [[function composition|composition]] of {{mvar|K}} linear functions <math>\mathbf{x} \mapsto \mathbf{x}^\mathsf{T}\mathbf{w}_1, \ldots, \mathbf{x} \mapsto \mathbf{x}^\mathsf{T}\mathbf{w}_K</math> and the softmax function (where <math>\mathbf{x}^\mathsf{T}\mathbf{w}</math> denotes the inner product of <math>\mathbf{x}</math> and <math>\mathbf{w}</math>). The operation is equivalent to applying a linear operator defined by <math>\mathbf{w}</math> to tuples <math>\mathbf{x}</math>, thus transforming the original, probably highly-dimensional, input to vectors in a {{mvar|K}}-dimensional space <math>\mathbb{R}^K</math>.

  === Neural networks ===
  The standard softmax function is often used in the final layer of a neural network-based classifier. Such networks are commonly trained under a [[Cross entropy|log loss]] (or [[cross-entropy]]) regime, giving a non-linear variant of multinomial logistic regression.

− Since the function maps a vector and a specific index <math>i</math> to a real value, the derivative needs to take the index into account:
+ Since the function maps a tuple and a specific index <math>i</math> to a real value, the derivative needs to take the index into account:

  <math display="block">\frac{\partial}{\partial q_k}\sigma(\textbf{q}, i) = \sigma(\textbf{q}, i)(\delta_{ik} - \sigma(\textbf{q}, k)).</math>

Line 89:
  Here, the [[Kronecker delta]] is used for simplicity (cf. the derivative of a [[sigmoid function]], being expressed via the function itself).
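The partial-derivative identity quoted above can be assembled into a full Jacobian and checked numerically; softmax_jacobian below is an illustrative helper under that identity, not code from the article.

<syntaxhighlight lang="python">
import numpy as np

def softmax(q):
    e = np.exp(q - np.max(q))
    return e / e.sum()

def softmax_jacobian(q):
    """J[i, k] = d sigma(q)_i / d q_k = sigma_i * (delta_ik - sigma_k)."""
    s = softmax(q)
    return np.diag(s) - np.outer(s, s)

q = np.array([1.0, 2.0, 8.0])
J = softmax_jacobian(q)
print(np.allclose(J.sum(axis=1), 0.0))  # each row sums to 0, since the outputs always sum to 1
</syntaxhighlight>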
− To ensure stable numerical computations subtracting the maximum value from the input vector is common. This approach, while not altering the output or the derivative theoretically, enhances stability by directly controlling the maximum exponent value computed.
+ To ensure stable numerical computations subtracting the maximum value from the input tuple is common. This approach, while not altering the output or the derivative theoretically, enhances stability by directly controlling the maximum exponent value computed.

  If the function is scaled with the parameter <math>\beta</math>, then these expressions must be multiplied by <math>\beta</math>.
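In outline, the max-subtraction trick described in that paragraph looks like the following sketch (the function name stable_softmax is a placeholder, not the article's own code):

<syntaxhighlight lang="python">
import numpy as np

def stable_softmax(z, beta=1.0):
    """Subtract the maximum before exponentiating; the output is unchanged because
    the common factor exp(-beta * max(z)) cancels between numerator and denominator."""
    z = np.asarray(z, dtype=float)
    shifted = beta * (z - z.max())   # every exponent is now <= 0, so exp(...) <= 1
    e = np.exp(shifted)
    return e / e.sum()

big = np.array([1000.0, 1001.0, 1002.0])
print(stable_softmax(big))  # finite result, whereas np.exp(1002.0) alone overflows to inf
</syntaxhighlight>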
data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>Geometrically the softmax function maps the [[<ins style="font-weight: bold; text-decoration: none;">Euclidean</ins> space]] &lt;math&gt;\mathbb{R}^K&lt;/math&gt; to the [[Boundary (topology)|boundary]] of the [[standard simplex|standard &lt;math&gt;(K-1)&lt;/math&gt;-simplex]], cutting the dimension by one (the range is a &lt;math&gt;(K - 1)&lt;/math&gt;-dimensional simplex in &lt;math&gt;K&lt;/math&gt;-dimensional space), due to the [[linear constraint]] that all output sum to 1 meaning it lies on a [[hyperplane]].</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Along the main diagonal &lt;math&gt;(x,\, x,\, \dots,\, x),&lt;/math&gt; softmax is just the uniform distribution on outputs, &lt;math&gt;(1/n, \dots, 1/n)&lt;/math&gt;: equal scores yield equal probabilities.</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Along the main diagonal &lt;math&gt;(x,\, x,\, \dots,\, x),&lt;/math&gt; softmax is just the uniform distribution on outputs, &lt;math&gt;(1/n, \dots, 1/n)&lt;/math&gt;: equal scores yield equal probabilities.</div></td> </tr> <!-- diff cache key enwiki:diff:1.41:old-1292459821:rev-1292548509:wikidiff2=table:1.14.1:ff290eae --> </table> Quantling https://en.wikipedia.org/w/index.php?title=Softmax_function&diff=1292459821&oldid=prev 98.110.119.175: Missing link 2025-05-27T02:11:37Z <p>Missing link</p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Previous revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 02:11, 27 May 2025</td> </tr><tr> <td colspan="2" class="diff-lineno">Line 3:</td> <td colspan="2" class="diff-lineno">Line 3:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>{{Redirect|Softmax|the Korean video game company|ESA (company)}}</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; 
border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>{{Redirect|Softmax|the Korean video game company|ESA (company)}}</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>{{Machine learning}}</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>{{Machine learning}}</div></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The '''softmax function,''' also known as '''softargmax'''&lt;ref&gt;{{cite book |last1=Goodfellow |first1=Ian |url=http://www.deeplearningbook.org |title=Deep Learning |last2=Bengio |first2=Yoshua |last3=Courville |first3=Aaron |publisher=MIT Press |year=2016 |isbn=978-0-26203561-3 |pages=180–184 |section=6.2.2.3 Softmax Units for Multinoulli Output Distributions |chapter-url=https://www.deeplearningbook.org/contents/mlp.html#pf15 |authorlink1=Ian Goodfellow |authorlink2=Yoshua Bengio}}&lt;/ref&gt;{{rp|184}} or '''normalized exponential function''',&lt;ref name="bishop" /&gt;{{rp|198}} converts a vector of {{mvar|K}} real numbers into a [[probability distribution]] of {{mvar|K}} possible outcomes. It is a generalization of the [[logistic function]] to multiple dimensions, and is used in [[multinomial logistic regression]]. The softmax function is often used as the last [[activation function]] of a [[Artificial neural network|neural network]] to normalize the output of a network to a [[probability distribution]] over predicted output classes.</div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The '''softmax function,''' also known as '''softargmax'''&lt;ref&gt;{{cite book |last1=Goodfellow |first1=Ian |url=http://www.deeplearningbook.org |title=Deep Learning |last2=Bengio |first2=Yoshua |last3=Courville |first3=Aaron |publisher=MIT Press |year=2016 |isbn=978-0-26203561-3 |pages=180–184 |section=6.2.2.3 Softmax Units for Multinoulli Output Distributions |chapter-url=https://www.deeplearningbook.org/contents/mlp.html#pf15 |authorlink1=Ian Goodfellow |authorlink2=Yoshua Bengio}}&lt;/ref&gt;{{rp|184}} or '''normalized exponential function''',&lt;ref name="bishop" /&gt;{{rp|198}} converts a <ins style="font-weight: bold; text-decoration: none;">[[Vector (mathematics and physics)|</ins>vector<ins style="font-weight: bold; text-decoration: none;">]]</ins> of {{mvar|K}} real numbers into a [[probability distribution]] of {{mvar|K}} possible outcomes. It is a generalization of the [[logistic function]] to multiple dimensions, and is used in [[multinomial logistic regression]]. 
The softmax function is often used as the last [[activation function]] of a [[Artificial neural network|neural network]] to normalize the output of a network to a [[probability distribution]] over predicted output classes.</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Definition ==</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Definition ==</div></td> </tr> </table> 98.110.119.175 https://en.wikipedia.org/w/index.php?title=Softmax_function&diff=1288063974&oldid=prev Citation bot: Altered template type. Add: class, date, title, eprint, authors 1-2. Removed parameters. Some additions/deletions were parameter name changes. | Use this bot. Report bugs. | Suggested by Headbomb | #UCB_toolbar 2025-04-30T05:31:30Z <p>Altered template type. Add: class, date, title, eprint, authors 1-2. Removed parameters. Some additions/deletions were parameter name changes. | <a href="/wiki/Wikipedia:UCB" class="mw-redirect" title="Wikipedia:UCB">Use this bot</a>. <a href="/wiki/Wikipedia:DBUG" class="mw-redirect" title="Wikipedia:DBUG">Report bugs</a>. | Suggested by Headbomb | #UCB_toolbar</p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Previous revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 05:31, 30 April 2025</td> </tr><tr> <td colspan="2" class="diff-lineno">Line 111:</td> <td colspan="2" class="diff-lineno">Line 111:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The standard softmax is numerically unstable because of large exponentiations. The '''safe softmax''' method calculates instead&lt;math display="block"&gt;\sigma(\mathbf{z})_i = \frac{e^{\beta (z_i-m)}}{\sum_{j=1}^K e^{\beta (z_j-m)}}&lt;/math&gt;where &lt;math&gt;m = \max_i z_i &lt;/math&gt; is the largest factor involved. 
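The following is a minimal sketch of the safe softmax described in the revision text above, assuming NumPy; the function name `safe_softmax` and the `beta` keyword are illustrative choices, not from the article. Shifting by the maximum leaves the result unchanged (the common factor cancels between numerator and denominator) while keeping every exponent at or below zero, so no overflow occurs.

```python
import numpy as np

def safe_softmax(z, beta=1.0):
    """Numerically stable softmax: exp(beta*(z - max z)) / sum of the same."""
    z = np.asarray(z, dtype=float)
    m = np.max(z)                    # largest input component
    e = np.exp(beta * (z - m))       # every entry lies in (0, 1]
    return e / np.sum(e)

# Example: without the shift, exp(1000.0) would overflow to inf.
print(safe_softmax([1000.0, 1001.0, 1002.0]))   # ~[0.090, 0.245, 0.665]
```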
Subtracting by it guarantees that the exponentiations result in at most 1.</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The standard softmax is numerically unstable because of large exponentiations. The '''safe softmax''' method calculates instead&lt;math display="block"&gt;\sigma(\mathbf{z})_i = \frac{e^{\beta (z_i-m)}}{\sum_{j=1}^K e^{\beta (z_j-m)}}&lt;/math&gt;where &lt;math&gt;m = \max_i z_i &lt;/math&gt; is the largest factor involved. Subtracting by it guarantees that the exponentiations result in at most 1.</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The [[Attention (machine learning)|attention mechanism]] in [[Transformer (deep learning architecture)|Transformers]] takes three arguments: a "query vector" &lt;math&gt;q&lt;/math&gt;, a list of "key vectors" &lt;math&gt;k_1, \dots, k_N&lt;/math&gt;, and a list of "value vectors" &lt;math&gt;v_1, \dots, v_N&lt;/math&gt;, and outputs a softmax-weighted sum over value vectors:&lt;math display="block"&gt;o = \sum_{i=1}^N \frac{e^{q^T k_i - m}}{\sum_{j=1}^N e^{q^T k_j - m}} v_i&lt;/math&gt;The standard softmax method involves several loops over the inputs, which would be [[Memory bottleneck|bottlenecked by memory bandwidth]]. The [[FlashAttention]] method is a [[communication-avoiding algorithm]] that fuses these operations into a single loop, increasing the [[arithmetic intensity]]. 
It is an [[online algorithm]] that computes the following quantities:&lt;ref&gt;{{cite <del style="font-weight: bold; text-decoration: none;">arxiv</del> |<del style="font-weight: bold; text-decoration: none;">arxiv</del>=1805.02867}}&lt;/ref&gt;&lt;ref&gt;{{Cite journal |<del style="font-weight: bold; text-decoration: none;">last</del>=Dao |<del style="font-weight: bold; text-decoration: none;">first</del>=Tri |last2=Fu |first2=Dan |last3=Ermon |first3=Stefano |last4=Rudra |first4=Atri |last5=Ré |first5=Christopher |date=2022-12-06 |title=FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness |url=https://proceedings.neurips.cc/paper_files/paper/2022/hash/67d57c32e20fd0a7a302cb81d36e40d5-Abstract-Conference.html |journal=Advances in Neural Information Processing Systems |language=en |volume=35 |pages=16344–16359}}&lt;/ref&gt;&lt;math display="block"&gt;\begin{aligned}</div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The [[Attention (machine learning)|attention mechanism]] in [[Transformer (deep learning architecture)|Transformers]] takes three arguments: a "query vector" &lt;math&gt;q&lt;/math&gt;, a list of "key vectors" &lt;math&gt;k_1, \dots, k_N&lt;/math&gt;, and a list of "value vectors" &lt;math&gt;v_1, \dots, v_N&lt;/math&gt;, and outputs a softmax-weighted sum over value vectors:&lt;math display="block"&gt;o = \sum_{i=1}^N \frac{e^{q^T k_i - m}}{\sum_{j=1}^N e^{q^T k_j - m}} v_i&lt;/math&gt;The standard softmax method involves several loops over the inputs, which would be [[Memory bottleneck|bottlenecked by memory bandwidth]]. The [[FlashAttention]] method is a [[communication-avoiding algorithm]] that fuses these operations into a single loop, increasing the [[arithmetic intensity]]. 
It is an [[online algorithm]] that computes the following quantities:&lt;ref&gt;{{cite <ins style="font-weight: bold; text-decoration: none;">arXiv</ins> |<ins style="font-weight: bold; text-decoration: none;">eprint</ins>=1805.02867<ins style="font-weight: bold; text-decoration: none;"> |last1=Milakov |first1=Maxim |last2=Gimelshein |first2=Natalia |title=Online normalizer calculation for softmax |date=2018 |class=cs.PF </ins>}}&lt;/ref&gt;&lt;ref&gt;{{Cite journal |<ins style="font-weight: bold; text-decoration: none;">last1</ins>=Dao |<ins style="font-weight: bold; text-decoration: none;">first1</ins>=Tri |last2=Fu |first2=Dan |last3=Ermon |first3=Stefano |last4=Rudra |first4=Atri |last5=Ré |first5=Christopher |date=2022-12-06 |title=FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness |url=https://proceedings.neurips.cc/paper_files/paper/2022/hash/67d57c32e20fd0a7a302cb81d36e40d5-Abstract-Conference.html |journal=Advances in Neural Information Processing Systems |language=en |volume=35 |pages=16344–16359}}&lt;/ref&gt;&lt;math display="block"&gt;\begin{aligned}</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>z_i &amp;= q^T k_i &amp;\\</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>z_i &amp;= q^T k_i &amp;\\</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>m_i &amp;= \max(z_1, \dots, z_i) &amp;=&amp; \max(m_{i-1}, z_i)\\</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>m_i &amp;= \max(z_1, \dots, z_i) &amp;=&amp; \max(m_{i-1}, z_i)\\</div></td> </tr> </table> Citation bot https://en.wikipedia.org/w/index.php?title=Softmax_function&diff=1288063842&oldid=prev Headbomb: clean up 2025-04-30T05:29:17Z <p>clean up</p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Previous revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 05:29, 30 April 2025</td> </tr><tr> <td colspan="2" class="diff-lineno">Line 7:</td> <td colspan="2" class="diff-lineno">Line 7:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Definition ==</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 
0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Definition ==</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The softmax function takes as input a vector {{mvar|z}} of {{mvar|K}} real numbers, and normalizes it into a [[probability distribution]] consisting of {{mvar|K}} probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some vector components could be negative, or greater than one; and might not sum to 1; but after applying softmax, each component will be in the [[Interval (mathematics)|interval]] &lt;math&gt;(0, 1)&lt;/math&gt;, and the components will add up to 1, so that they can be interpreted as probabilities. Furthermore, the larger input components will correspond to larger probabilities.<del style="font-weight: bold; text-decoration: none;"> </del></div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The softmax function takes as input a vector {{mvar|z}} of {{mvar|K}} real numbers, and normalizes it into a [[probability distribution]] consisting of {{mvar|K}} probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some vector components could be negative, or greater than one; and might not sum to 1; but after applying softmax, each component will be in the [[Interval (mathematics)|interval]] &lt;math&gt;(0, 1)&lt;/math&gt;, and the components will add up to 1, so that they can be interpreted as probabilities. 
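As a small numerical illustration of the properties just described (a sketch assuming NumPy; the names are illustrative), applying the standard softmax to a tuple containing negative entries and entries greater than one yields components that each lie in (0, 1), sum to 1, and preserve the ordering of the inputs:

```python
import numpy as np

def softmax(z):
    """Standard (unit) softmax: sigma(z)_i = exp(z_i) / sum_j exp(z_j)."""
    e = np.exp(np.asarray(z, dtype=float))
    return e / e.sum()

z = [-1.0, 0.5, 3.0]      # components may be negative or exceed 1
p = softmax(z)
print(p)                   # ~[0.017, 0.075, 0.909]: each component in (0, 1)
print(p.sum())             # 1.0: a valid probability distribution
```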
Furthermore, the larger input components will correspond to larger probabilities.</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Formally, the standard (unit) softmax function &lt;math&gt;\sigma\colon \R^K \to (0, 1)^K&lt;/math&gt;, where &lt;math&gt;K &gt; 1&lt;/math&gt;, takes a vector &lt;math&gt;\mathbf{z} = (z_1, \dotsc, z_K) \in \R^K&lt;/math&gt; and computes each component of vector &lt;math&gt;\sigma(\mathbf{z}) \in (0, 1)^K&lt;/math&gt; with</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Formally, the standard (unit) softmax function &lt;math&gt;\sigma\colon \R^K \to (0, 1)^K&lt;/math&gt;, where &lt;math&gt;K &gt; 1&lt;/math&gt;, takes a vector &lt;math&gt;\mathbf{z} = (z_1, \dotsc, z_K) \in \R^K&lt;/math&gt; and computes each component of vector &lt;math&gt;\sigma(\mathbf{z}) \in (0, 1)^K&lt;/math&gt; with</div></td> </tr> <tr> <td colspan="2" class="diff-lineno">Line 87:</td> <td colspan="2" class="diff-lineno">Line 87:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>&lt;math display="block"&gt; \frac{\partial}{\partial q_k}\sigma(\textbf{q}, i) = \sigma(\textbf{q}, k)(\delta_{ik} - \sigma(\textbf{q}, i)).&lt;/math&gt;</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>&lt;math display="block"&gt; \frac{\partial}{\partial q_k}\sigma(\textbf{q}, i) = \sigma(\textbf{q}, k)(\delta_{ik} - \sigma(\textbf{q}, i)).&lt;/math&gt;</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>Here, the [[Kronecker delta]] is used 
for simplicity (cf. the derivative of a [[sigmoid function]], being expressed via the function itself).<del style="font-weight: bold; text-decoration: none;"> </del></div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>Here, the [[Kronecker delta]] is used for simplicity (cf. the derivative of a [[sigmoid function]], being expressed via the function itself).</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>To ensure stable numerical computations subtracting the maximum value from the input vector is common. This approach, while not altering the output or the derivative theoretically, enhances stability by directly controlling the maximum exponent value computed.</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>To ensure stable numerical computations subtracting the maximum value from the input vector is common. This approach, while not altering the output or the derivative theoretically, enhances stability by directly controlling the maximum exponent value computed.</div></td> </tr> <tr> <td colspan="2" class="diff-lineno">Line 104:</td> <td colspan="2" class="diff-lineno">Line 104:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>In neural network applications, the number {{mvar|K}} of possible outcomes is often large, e.g. in case of [[Language model|neural language models]] that predict the most likely outcome out of a vocabulary which might contain millions of possible words.&lt;ref name=":0"&gt;{{Cite journal |last1=Onal |first1=Kezban Dilek |last2=Zhang |first2=Ye |last3=Altingovde |first3=Ismail Sengor |last4=Rahman |first4=Md Mustafizur |last5=Karagoz |first5=Pinar |last6=Braylan |first6=Alex |last7=Dang |first7=Brandon |last8=Chang |first8=Heng-Lu |last9=Kim |first9=Henna |last10=McNamara |first10=Quinten |last11=Angert |first11=Aaron |date=2018-06-01 |title=Neural information retrieval: at the end of the early years |journal=Information Retrieval Journal |language=en |volume=21 |issue=2 |pages=111–182 |doi=10.1007/s10791-017-9321-y |s2cid=21684923 |issn=1573-7659|doi-access=free |hdl=11245.1/008d6e8f-df13-4abf-8ae9-6ff2e17377f3 |hdl-access=free }}&lt;/ref&gt; This can make the calculations for the softmax layer (i.e. 
the matrix multiplications to determine the &lt;math&gt;z_i&lt;/math&gt;, followed by the application of the softmax function itself) computationally expensive.&lt;ref name=":0" /&gt;&lt;ref name=":1"&gt;{{Cite journal |last1=Chen |first1=Wenlin |last2=Grangier |first2=David |last3=Auli |first3=Michael |date=August 2016 |title=Strategies for Training Large Vocabulary Neural Language Models |url=https://aclanthology.org/P16-1186 |journal=Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) |location=Berlin, Germany |publisher=Association for Computational Linguistics |pages=1975–1985 |doi=10.18653/v1/P16-1186|s2cid=6035643 |doi-access=free |arxiv=1512.04906 }}&lt;/ref&gt; What's more, the [[gradient descent]] [[backpropagation]] method for training such a neural network involves calculating the softmax for every training example, and the number of training examples can also become large. The computational effort for the softmax became a major limiting factor in the development of larger neural language models, motivating various remedies to reduce training times.&lt;ref name=":0" /&gt;&lt;ref name=":1" /&gt;</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>In neural network applications, the number {{mvar|K}} of possible outcomes is often large, e.g. in case of [[Language model|neural language models]] that predict the most likely outcome out of a vocabulary which might contain millions of possible words.&lt;ref name=":0"&gt;{{Cite journal |last1=Onal |first1=Kezban Dilek |last2=Zhang |first2=Ye |last3=Altingovde |first3=Ismail Sengor |last4=Rahman |first4=Md Mustafizur |last5=Karagoz |first5=Pinar |last6=Braylan |first6=Alex |last7=Dang |first7=Brandon |last8=Chang |first8=Heng-Lu |last9=Kim |first9=Henna |last10=McNamara |first10=Quinten |last11=Angert |first11=Aaron |date=2018-06-01 |title=Neural information retrieval: at the end of the early years |journal=Information Retrieval Journal |language=en |volume=21 |issue=2 |pages=111–182 |doi=10.1007/s10791-017-9321-y |s2cid=21684923 |issn=1573-7659|doi-access=free |hdl=11245.1/008d6e8f-df13-4abf-8ae9-6ff2e17377f3 |hdl-access=free }}&lt;/ref&gt; This can make the calculations for the softmax layer (i.e. the matrix multiplications to determine the &lt;math&gt;z_i&lt;/math&gt;, followed by the application of the softmax function itself) computationally expensive.&lt;ref name=":0" /&gt;&lt;ref name=":1"&gt;{{Cite journal |last1=Chen |first1=Wenlin |last2=Grangier |first2=David |last3=Auli |first3=Michael |date=August 2016 |title=Strategies for Training Large Vocabulary Neural Language Models |url=https://aclanthology.org/P16-1186 |journal=Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) |location=Berlin, Germany |publisher=Association for Computational Linguistics |pages=1975–1985 |doi=10.18653/v1/P16-1186|s2cid=6035643 |doi-access=free |arxiv=1512.04906 }}&lt;/ref&gt; What's more, the [[gradient descent]] [[backpropagation]] method for training such a neural network involves calculating the softmax for every training example, and the number of training examples can also become large. 
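For the derivative identity quoted a few paragraphs above, &lt;math&gt;\frac{\partial}{\partial q_k}\sigma(\textbf{q}, i) = \sigma(\textbf{q}, k)(\delta_{ik} - \sigma(\textbf{q}, i))&lt;/math&gt;, the full Jacobian can be checked numerically. The sketch below assumes NumPy, uses illustrative names, and compares the closed form against a finite-difference estimate.

```python
import numpy as np

def softmax(q):
    e = np.exp(q - q.max())            # shift by the max for stability
    return e / e.sum()

def softmax_jacobian(q):
    """J[i, k] = sigma_k * (delta_ik - sigma_i), i.e. diag(s) - s s^T."""
    s = softmax(q)
    return np.diag(s) - np.outer(s, s)

q = np.array([0.2, -1.0, 3.0])
J = softmax_jacobian(q)

# Finite-difference check of one entry, d sigma_0 / d q_2.
eps = 1e-6
q_plus = q.copy(); q_plus[2] += eps
approx = (softmax(q_plus)[0] - softmax(q)[0]) / eps
print(J[0, 2], approx)                 # the two values agree closely
```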
The computational effort for the softmax became a major limiting factor in the development of larger neural language models, motivating various remedies to reduce training times.&lt;ref name=":0" /&gt;&lt;ref name=":1" /&gt;</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>Approaches that reorganize the softmax layer for more efficient calculation include the '''hierarchical softmax''' and the '''differentiated softmax'''.&lt;ref name=":0" /&gt; The hierarchical softmax (introduced by Morin and [[Yoshua Bengio|Bengio]] in 2005) uses a binary tree structure where the outcomes (vocabulary words) are the leaves and the intermediate nodes are suitably selected "classes" of outcomes, forming [[<del style="font-weight: bold; text-decoration: none;">Latent</del> variable<del style="font-weight: bold; text-decoration: none;">|latent variables</del>]].&lt;ref name=":1" /&gt;&lt;ref name=":2"&gt;{{Cite journal |last1=Morin |first1=Frederic |last2=Bengio |first2=Yoshua |date=2005-01-06 |title=Hierarchical Probabilistic Neural Network Language Model |url=https://www.iro.umontreal.ca/~lisa/pointeurs/hierarchical-nnlm-aistats05.pdf |journal=International Workshop on Artificial Intelligence and Statistics |language=en |publisher=PMLR |pages=246–252}}&lt;/ref&gt; The desired probability (softmax value) of a leaf (outcome) can then be calculated as the product of the probabilities of all nodes on the path from the root to that leaf.&lt;ref name=":1" /&gt; Ideally, when the tree is balanced, this would reduce the [[computational complexity]] from &lt;math&gt;O(K)&lt;/math&gt; to &lt;math&gt;O(\log_2 K)&lt;/math&gt;.&lt;ref name=":2" /&gt; In practice, results depend on choosing a good strategy for clustering the outcomes into classes.&lt;ref name=":1" /&gt;&lt;ref name=":2" /&gt; A [[Huffman tree]] was used for this in Google's [[word2vec]] models (introduced in 2013) to achieve scalability.&lt;ref name=":0" /&gt;</div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>Approaches that reorganize the softmax layer for more efficient calculation include the '''hierarchical softmax''' and the '''differentiated softmax'''.&lt;ref name=":0" /&gt; The hierarchical softmax (introduced by Morin and [[Yoshua Bengio|Bengio]] in 2005) uses a binary tree structure where the outcomes (vocabulary words) are the leaves and the intermediate nodes are suitably selected "classes" of outcomes, forming [[<ins style="font-weight: bold; text-decoration: none;">latent</ins> variable]]<ins style="font-weight: bold; text-decoration: none;">s</ins>.&lt;ref name=":1" /&gt;&lt;ref name=":2"&gt;{{Cite journal |last1=Morin 
|first1=Frederic |last2=Bengio |first2=Yoshua |date=2005-01-06 |title=Hierarchical Probabilistic Neural Network Language Model |url=https://www.iro.umontreal.ca/~lisa/pointeurs/hierarchical-nnlm-aistats05.pdf |journal=International Workshop on Artificial Intelligence and Statistics |language=en |publisher=PMLR |pages=246–252}}&lt;/ref&gt; The desired probability (softmax value) of a leaf (outcome) can then be calculated as the product of the probabilities of all nodes on the path from the root to that leaf.&lt;ref name=":1" /&gt; Ideally, when the tree is balanced, this would reduce the [[computational complexity]] from &lt;math&gt;O(K)&lt;/math&gt; to &lt;math&gt;O(\log_2 K)&lt;/math&gt;.&lt;ref name=":2" /&gt; In practice, results depend on choosing a good strategy for clustering the outcomes into classes.&lt;ref name=":1" /&gt;&lt;ref name=":2" /&gt; A [[Huffman tree]] was used for this in Google's [[word2vec]] models (introduced in 2013) to achieve scalability.&lt;ref name=":0" /&gt;</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>A second kind of remedies is based on approximating the softmax (during training) with modified loss functions that avoid the calculation of the full normalization factor.&lt;ref name=":0" /&gt; These include methods that restrict the normalization sum to a sample of outcomes (e.g. Importance Sampling, Target Sampling).&lt;ref name=":0" /&gt;&lt;ref name=":1" /&gt;</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>A second kind of remedies is based on approximating the softmax (during training) with modified loss functions that avoid the calculation of the full normalization factor.&lt;ref name=":0" /&gt; These include methods that restrict the normalization sum to a sample of outcomes (e.g. Importance Sampling, Target Sampling).&lt;ref name=":0" /&gt;&lt;ref name=":1" /&gt;</div></td> </tr> <tr> <td colspan="2" class="diff-lineno">Line 111:</td> <td colspan="2" class="diff-lineno">Line 111:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The standard softmax is numerically unstable because of large exponentiations. The '''safe softmax''' method calculates instead&lt;math display="block"&gt;\sigma(\mathbf{z})_i = \frac{e^{\beta (z_i-m)}}{\sum_{j=1}^K e^{\beta (z_j-m)}}&lt;/math&gt;where &lt;math&gt;m = \max_i z_i &lt;/math&gt; is the largest factor involved. 
Subtracting by it guarantees that the exponentiations result in at most 1.</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The standard softmax is numerically unstable because of large exponentiations. The '''safe softmax''' method calculates instead&lt;math display="block"&gt;\sigma(\mathbf{z})_i = \frac{e^{\beta (z_i-m)}}{\sum_{j=1}^K e^{\beta (z_j-m)}}&lt;/math&gt;where &lt;math&gt;m = \max_i z_i &lt;/math&gt; is the largest factor involved. Subtracting by it guarantees that the exponentiations result in at most 1.</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The [[Attention (machine learning)|attention mechanism]] in [[Transformer (deep learning architecture)|Transformers]] takes three arguments: a "query vector" &lt;math&gt;q&lt;/math&gt;, a list of "key vectors" &lt;math&gt;k_1, \dots, k_N&lt;/math&gt;, and a list of "value vectors" &lt;math&gt;v_1, \dots, v_N&lt;/math&gt;, and outputs a softmax-weighted sum over value vectors:&lt;math display="block"&gt;o = \sum_{i=1}^N \frac{e^{q^T k_i - m}}{\sum_{j=1}^N e^{q^T k_j - m}} v_i&lt;/math&gt;The standard softmax method involves several loops over the inputs, which would be [[Memory bottleneck|bottlenecked by memory bandwidth]]. The [[FlashAttention]] method is a [[communication-avoiding algorithm]] that fuses these operations into a single loop, increasing the [[arithmetic intensity]]. 
It is an [[online algorithm]] that computes the following quantities:&lt;ref&gt;{{<del style="font-weight: bold; text-decoration: none;">Citation</del> <del style="font-weight: bold; text-decoration: none;">|last=Milakov</del> |<del style="font-weight: bold; text-decoration: none;">first=Maxim |title=Online normalizer calculation for softmax |date=2018-07-28 |url=https://</del>arxiv<del style="font-weight: bold; text-decoration: none;">.org/abs/1805.02867 |publisher</del>=<del style="font-weight: bold; text-decoration: none;">arXiv |doi=10.48550/arXiv.</del>1805.02867<del style="font-weight: bold; text-decoration: none;"> |id=arXiv:1805.02867 |last2=Gimelshein |first2=Natalia</del>}}&lt;/ref&gt;&lt;ref&gt;{{Cite journal |last=Dao |first=Tri |last2=Fu |first2=Dan |last3=Ermon |first3=Stefano |last4=Rudra |first4=Atri |last5=Ré |first5=Christopher |date=2022-12-06 |title=FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness |url=https://proceedings.neurips.cc/paper_files/paper/2022/hash/67d57c32e20fd0a7a302cb81d36e40d5-Abstract-Conference.html |journal=Advances in Neural Information Processing Systems |language=en |volume=35 |pages=16344–16359}}&lt;/ref&gt;&lt;math display="block"&gt;\begin{aligned}</div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The [[Attention (machine learning)|attention mechanism]] in [[Transformer (deep learning architecture)|Transformers]] takes three arguments: a "query vector" &lt;math&gt;q&lt;/math&gt;, a list of "key vectors" &lt;math&gt;k_1, \dots, k_N&lt;/math&gt;, and a list of "value vectors" &lt;math&gt;v_1, \dots, v_N&lt;/math&gt;, and outputs a softmax-weighted sum over value vectors:&lt;math display="block"&gt;o = \sum_{i=1}^N \frac{e^{q^T k_i - m}}{\sum_{j=1}^N e^{q^T k_j - m}} v_i&lt;/math&gt;The standard softmax method involves several loops over the inputs, which would be [[Memory bottleneck|bottlenecked by memory bandwidth]]. The [[FlashAttention]] method is a [[communication-avoiding algorithm]] that fuses these operations into a single loop, increasing the [[arithmetic intensity]]. 
It is an [[online algorithm]] that computes the following quantities:&lt;ref&gt;{{<ins style="font-weight: bold; text-decoration: none;">cite</ins> <ins style="font-weight: bold; text-decoration: none;">arxiv</ins> |arxiv=1805.02867}}&lt;/ref&gt;&lt;ref&gt;{{Cite journal |last=Dao |first=Tri |last2=Fu |first2=Dan |last3=Ermon |first3=Stefano |last4=Rudra |first4=Atri |last5=Ré |first5=Christopher |date=2022-12-06 |title=FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness |url=https://proceedings.neurips.cc/paper_files/paper/2022/hash/67d57c32e20fd0a7a302cb81d36e40d5-Abstract-Conference.html |journal=Advances in Neural Information Processing Systems |language=en |volume=35 |pages=16344–16359}}&lt;/ref&gt;&lt;math display="block"&gt;\begin{aligned}</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>z_i &amp;= q^T k_i &amp;\\</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>z_i &amp;= q^T k_i &amp;\\</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>m_i &amp;= \max(z_1, \dots, z_i) &amp;=&amp; \max(m_{i-1}, z_i)\\</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>m_i &amp;= \max(z_1, \dots, z_i) &amp;=&amp; \max(m_{i-1}, z_i)\\</div></td> </tr> <tr> <td colspan="2" class="diff-lineno">Line 150:</td> <td colspan="2" class="diff-lineno">Line 150:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>}}&lt;/ref&gt;</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>}}&lt;/ref&gt;</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The use of the softmax in [[decision theory]] is credited to [[R. 
Duncan Luce]],&lt;ref name="Gao"&gt;{{cite arXiv|eprint=1704.00805|last1=Gao|first1=Bolin|title=On the Properties of the Softmax Function with Application in Game Theory and Reinforcement Learning|last2=Pavel|first2=Lacra|author2-link=Lacra Pavel|class=math.OC|year=2017}}&lt;/ref&gt;{{rp|1}} who used the axiom of [[independence of irrelevant alternatives]] in [[rational choice theory]] to deduce the softmax in [[Luce's choice axiom]] for relative preferences.{{<del style="font-weight: bold; text-decoration: none;">Cn</del>|date=March 2024}}</div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The use of the softmax in [[decision theory]] is credited to [[R. Duncan Luce]],&lt;ref name="Gao"&gt;{{cite arXiv|eprint=1704.00805|last1=Gao|first1=Bolin|title=On the Properties of the Softmax Function with Application in Game Theory and Reinforcement Learning|last2=Pavel|first2=Lacra|author2-link=Lacra Pavel|class=math.OC|year=2017}}&lt;/ref&gt;{{rp|1}} who used the axiom of [[independence of irrelevant alternatives]] in [[rational choice theory]] to deduce the softmax in [[Luce's choice axiom]] for relative preferences.{{<ins style="font-weight: bold; text-decoration: none;">Citation needed</ins>|date=March 2024}}</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>In machine learning, the term "softmax" is credited to John S. Bridle in two 1989 conference papers, {{harvtxt|Bridle|1990a}}:&lt;ref name="Gao"/&gt;{{rp|1}} and {{harvtxt|Bridle|1990b}}:&lt;ref name="sako2018"/&gt;</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>In machine learning, the term "softmax" is credited to John S. 
Bridle in two 1989 conference papers, {{harvtxt|Bridle|1990a}}:&lt;ref name="Gao"/&gt;{{rp|1}} and {{harvtxt|Bridle|1990b}}:&lt;ref name="sako2018"/&gt;</div></td> </tr> <tr> <td colspan="2" class="diff-empty diff-side-deleted"></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>{{blockquote</div></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>{{quote</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>|We are concerned with feed-forward non-linear networks (multi-layer perceptrons, or MLPs) with multiple outputs. We wish to treat the outputs of the network as probabilities of alternatives (''e.g.'' pattern classes), conditioned on the inputs. We look for appropriate output non-linearities and for appropriate criteria for adaptation of the parameters of the network (''e.g.'' weights). We explain two modifications: probability scoring, which is an alternative to squared error minimisation, and a normalised exponential ('''softmax''') multi-input generalisation of the logistic non-linearity.&lt;ref&gt;{{cite conference</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>|We are concerned with feed-forward non-linear networks (multi-layer perceptrons, or MLPs) with multiple outputs. We wish to treat the outputs of the network as probabilities of alternatives (''e.g.'' pattern classes), conditioned on the inputs. We look for appropriate output non-linearities and for appropriate criteria for adaptation of the parameters of the network (''e.g.'' weights). We explain two modifications: probability scoring, which is an alternative to squared error minimisation, and a normalised exponential ('''softmax''') multi-input generalisation of the logistic non-linearity.&lt;ref&gt;{{cite conference</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>|first=John S. |last=Bridle</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>|first=John S. 
|last=Bridle</div></td> </tr> <tr> <td colspan="2" class="diff-lineno">Line 169:</td> <td colspan="2" class="diff-lineno">Line 169:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>}}&lt;/ref&gt;{{rp|227}}</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>}}&lt;/ref&gt;{{rp|227}}</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>}}</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>}}</div></td> </tr> <tr> <td colspan="2" class="diff-empty diff-side-deleted"></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>{{blockquote</div></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>{{quote</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>|For any input, the outputs must all be positive and they must sum to unity. ...</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>|For any input, the outputs must all be positive and they must sum to unity. 
...</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> </table> Headbomb https://en.wikipedia.org/w/index.php?title=Softmax_function&diff=1277701851&oldid=prev 2601:40F:4480:DB30:A910:2ED7:313:DBF7: /* Probability theory */ 2025-02-26T05:44:51Z <p><span class="autocomment">Probability theory</span></p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Previous revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 05:44, 26 February 2025</td> </tr><tr> <td colspan="2" class="diff-lineno">Line 44:</td> <td colspan="2" class="diff-lineno">Line 44:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>&lt;math display="block"&gt;\sigma_{1/100}(0,\, 10) = \left(1/\left(1 + e^{1/10}\right),\, e^{1/10}/\left(1 + e^{1/10}\right)\right) \approx (0.475,\, 0.525).&lt;/math&gt;</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>&lt;math display="block"&gt;\sigma_{1/100}(0,\, 10) = \left(1/\left(1 + e^{1/10}\right),\, e^{1/10}/\left(1 + e^{1/10}\right)\right) \approx (0.475,\, 0.525).&lt;/math&gt;</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>As {{tmath|\beta \to \infty}}, temperature goes to zero, &lt;math&gt;T = 1/\beta \to 0&lt;/math&gt;, so eventually all differences become large (relative to a shrinking temperature), which gives another interpretation for the limit behavior.</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>As {{tmath|\beta \to \infty}}, temperature goes to zero, &lt;math&gt;T = 1/\beta \to 0&lt;/math&gt;, so eventually all differences become large (relative to a shrinking temperature), which gives another interpretation for the limit behavior.</div></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><br /></td> <td colspan="2" class="diff-empty 
diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>=== Probability theory ===</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>In [[probability theory]], the output of the softargmax function can be used to represent a [[categorical distribution]] – that is, a [[probability distribution]] over {{mvar|K}} different possible outcomes.</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>=== Statistical mechanics ===</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>=== Statistical mechanics ===</div></td> </tr> </table> 2601:40F:4480:DB30:A910:2ED7:313:DBF7 https://en.wikipedia.org/w/index.php?title=Softmax_function&diff=1275705830&oldid=prev Omegatron: link to page 2025-02-14T15:59:49Z <p>link to page</p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Previous revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 15:59, 14 February 2025</td> </tr><tr> <td colspan="2" class="diff-lineno">Line 3:</td> <td colspan="2" class="diff-lineno">Line 3:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>{{Redirect|Softmax|the Korean video game company|ESA (company)}}</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>{{Redirect|Softmax|the Korean video game company|ESA (company)}}</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; 
border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>{{Machine learning}}</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>{{Machine learning}}</div></td> </tr> <tr> <td colspan="2" class="diff-empty diff-side-deleted"></td> <td class="diff-marker"><a class="mw-diff-movedpara-right" title="Paragraph was moved. Click to jump to old location." href="#movedpara_3_10_lhs">&#x26AB;</a></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><a name="movedpara_1_0_rhs"></a><ins style="font-weight: bold; text-decoration: none;">The '''softmax function,''' also known as '''softargmax'''&lt;ref&gt;{{cite book |last1=Goodfellow |first1=Ian |url=http://www.deeplearningbook.org |title=Deep Learning |last2=Bengio |first2=Yoshua |last3=Courville |first3=Aaron |publisher=MIT Press |year=2016 |isbn=978-0-26203561-3 |pages=180–184 |section=6.2.2.3 Softmax Units for Multinoulli Output Distributions |chapter-url=https://www.deeplearningbook.org/contents/mlp.html#pf15 |authorlink1=Ian Goodfellow |authorlink2=Yoshua Bengio</ins>}}&lt;/ref&gt;{{rp|184}} or '''normalized exponential function''',&lt;ref name="bishop" /&gt;{{rp|198}} converts a vector of {{mvar|K}} real numbers into a [[probability distribution]] of {{mvar|K}} possible outcomes. It is a generalization of the [[logistic function]] to multiple dimensions, and is used in [[multinomial logistic regression]]. 
The softmax function is often used as the last [[activation function]] of a [[Artificial neural network|neural network]] to normalize the output of a network to a [[probability distribution]] over predicted output classes.</div></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The '''softmax function,''' also known as '''softargmax'''&lt;ref&gt;{{cite book</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>|title=Deep Learning |year=2016</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>|first1=Ian |last1=Goodfellow |authorlink1=Ian Goodfellow</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>|first2=Yoshua |last2=Bengio |authorlink2=Yoshua Bengio</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>|first3=Aaron |last3=Courville</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>|publisher=MIT Press</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>|url=http://www.deeplearningbook.org</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>|isbn=978-0-26203561-3</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>|chapter-url=https://www.deeplearningbook.org/contents/mlp.html</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td 
style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>|section=6.2.2.3 Softmax Units for Multinoulli Output Distributions</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>|pages=180–184</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker"><a class="mw-diff-movedpara-left" title="Paragraph was moved. Click to jump to new location." href="#movedpara_1_0_rhs">&#x26AB;</a></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><a name="movedpara_3_10_lhs"></a>}}&lt;/ref&gt;{{rp|184}} or '''normalized exponential function''',&lt;ref name="bishop" /&gt;{{rp|198}} converts a vector of {{mvar|K}} real numbers into a [[probability distribution]] of {{mvar|K}} possible outcomes. It is a generalization of the [[logistic function]] to multiple dimensions, and is used in [[multinomial logistic regression]]. The softmax function is often used as the last [[activation function]] of a [[Artificial neural network|neural network]] to normalize the output of a network to a [[probability distribution]] over predicted output classes.</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Definition ==</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Definition ==</div></td> </tr> </table> Omegatron https://en.wikipedia.org/w/index.php?title=Softmax_function&diff=1273599451&oldid=prev Cosmia Nebula: /* Numerical algorithms */ 2025-02-03T03:44:25Z <p><span class="autocomment">Numerical algorithms</span></p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Previous revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 03:44, 3 February 2025</td> </tr><tr> <td colspan="2" class="diff-lineno">Line 130:</td> <td colspan="2" 
class="diff-lineno">Line 130:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>l_i &amp;= e^{z_1 - m_i} + \dots + e^{z_i - m_i} &amp;=&amp; e^{m_{i-1}-m_i} l_{i-1} + e^{z_i - m_i}\\</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>l_i &amp;= e^{z_1 - m_i} + \dots + e^{z_i - m_i} &amp;=&amp; e^{m_{i-1}-m_i} l_{i-1} + e^{z_i - m_i}\\</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>o_i &amp;= e^{z_1 - m_i} v_1 + \dots + e^{z_i - m_i}v_i &amp;=&amp; e^{m_{i-1}-m_i} o_{i-1} + e^{z_i - m_i}v_i</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>o_i &amp;= e^{z_1 - m_i} v_1 + \dots + e^{z_i - m_i}v_i &amp;=&amp; e^{m_{i-1}-m_i} o_{i-1} + e^{z_i - m_i}v_i</div></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>\end{aligned}&lt;/math&gt;and returns &lt;math&gt;o_N/l_N&lt;/math&gt;. In practice, FlashAttention operates over multiple queries and keys per loop iteration, in a similar way as [[Communication-avoiding algorithm#Blocked (tiled) matrix multiplication|blocked matrix multiplication]]. If [[backpropagation]] is needed, then the output vectors and the intermediate arrays &lt;math&gt;[m_1, \dots, m_N], [l_1, \dots, l_N]&lt;/math&gt; are cached, and during the backward pass, attention matrices are [[Rematerialization|rematerialized]] from these.</div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>\end{aligned}&lt;/math&gt;and returns &lt;math&gt;o_N/l_N&lt;/math&gt;. In practice, FlashAttention operates over multiple queries and keys per loop iteration, in a similar way as [[Communication-avoiding algorithm#Blocked (tiled) matrix multiplication|blocked matrix multiplication]]. 
If [[backpropagation]] is needed, then the output vectors and the intermediate arrays &lt;math&gt;[m_1, \dots, m_N], [l_1, \dots, l_N]&lt;/math&gt; are cached, and during the backward pass, attention matrices are [[Rematerialization|rematerialized]] from these<ins style="font-weight: bold; text-decoration: none;">, making it a form of gradient checkpointing</ins>.</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Mathematical properties ==</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Mathematical properties ==</div></td> </tr> </table> Cosmia Nebula https://en.wikipedia.org/w/index.php?title=Softmax_function&diff=1273599357&oldid=prev Cosmia Nebula: /* Numerical algorithms */ 2025-02-03T03:43:40Z <p><span class="autocomment">Numerical algorithms</span></p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Previous revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 03:43, 3 February 2025</td> </tr><tr> <td colspan="2" class="diff-lineno">Line 130:</td> <td colspan="2" class="diff-lineno">Line 130:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>l_i &amp;= e^{z_1 - m_i} + \dots + e^{z_i - m_i} &amp;=&amp; e^{m_{i-1}-m_i} l_{i-1} + e^{z_i - m_i}\\</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>l_i &amp;= e^{z_1 - m_i} + \dots + e^{z_i - m_i} &amp;=&amp; e^{m_{i-1}-m_i} l_{i-1} + e^{z_i - m_i}\\</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>o_i &amp;= e^{z_1 - m_i} v_1 + \dots + e^{z_i - m_i}v_i &amp;=&amp; e^{m_{i-1}-m_i} o_{i-1} + e^{z_i - m_i}v_i</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 
1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>o_i &amp;= e^{z_1 - m_i} v_1 + \dots + e^{z_i - m_i}v_i &amp;=&amp; e^{m_{i-1}-m_i} o_{i-1} + e^{z_i - m_i}v_i</div></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>\end{aligned}&lt;/math&gt;and returns &lt;math&gt;o_N/l_N&lt;/math&gt;. If [[backpropagation]] is needed, then the arrays &lt;math&gt;[m_1, \dots, m_N], [l_1, \dots, l_N]&lt;/math&gt; are cached, and during the backward pass, attention matrices are [[Rematerialization|rematerialized]]<del style="font-weight: bold; text-decoration: none;">.</del> <del style="font-weight: bold; text-decoration: none;">In practice, FlashAttention operates</del> <del style="font-weight: bold; text-decoration: none;">over multiple queries and keys per loop iteration, in a similar way as [[Communication-avoiding algorithm#Blocked (tiled) matrix multiplication|blocked matrix multiplication]]</del>.</div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>\end{aligned}&lt;/math&gt;and returns &lt;math&gt;o_N/l_N&lt;/math&gt;<ins style="font-weight: bold; text-decoration: none;">. In practice, FlashAttention operates over multiple queries and keys per loop iteration, in a similar way as [[Communication-avoiding algorithm#Blocked (tiled) matrix multiplication|blocked matrix multiplication]]</ins>. If [[backpropagation]] is needed, then the<ins style="font-weight: bold; text-decoration: none;"> output vectors and the intermediate</ins> arrays &lt;math&gt;[m_1, \dots, m_N], [l_1, \dots, l_N]&lt;/math&gt; are cached, and during the backward pass, attention matrices are [[Rematerialization|rematerialized]] <ins style="font-weight: bold; text-decoration: none;">from</ins> <ins style="font-weight: bold; text-decoration: none;">these</ins>.</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Mathematical properties ==</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Mathematical properties ==</div></td> </tr> </table> Cosmia Nebula https://en.wikipedia.org/w/index.php?title=Softmax_function&diff=1273599247&oldid=prev Cosmia Nebula: /* Numerical algorithms */ backprop 2025-02-03T03:42:51Z <p><span class="autocomment">Numerical 
algorithms: </span> backprop</p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Previous revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 03:42, 3 February 2025</td> </tr><tr> <td colspan="2" class="diff-lineno">Line 128:</td> <td colspan="2" class="diff-lineno">Line 128:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>z_i &amp;= q^T k_i &amp;\\</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>z_i &amp;= q^T k_i &amp;\\</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>m_i &amp;= \max(z_1, \dots, z_i) &amp;=&amp; \max(m_{i-1}, z_i)\\</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>m_i &amp;= \max(z_1, \dots, z_i) &amp;=&amp; \max(m_{i-1}, z_i)\\</div></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;">s_i</del> &amp;= e^{z_1 - m_i} + \dots + e^{z_i - m_i} &amp;=&amp; e^{m_{i-1}-m_i} <del style="font-weight: bold; text-decoration: none;">s_</del>{i-1} + e^{z_i - m_i}\\</div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">l_i</ins> &amp;= e^{z_1 - m_i} + \dots + e^{z_i - m_i} &amp;=&amp; e^{m_{i-1}-m_i} <ins style="font-weight: bold; text-decoration: none;">l_</ins>{i-1} + e^{z_i - m_i}\\</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>o_i &amp;= e^{z_1 - m_i} v_1 + \dots + e^{z_i - m_i}v_i &amp;=&amp; e^{m_{i-1}-m_i} o_{i-1} + e^{z_i - m_i}v_i</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>o_i &amp;= e^{z_1 - m_i} v_1 + \dots + e^{z_i - m_i}v_i &amp;=&amp; e^{m_{i-1}-m_i} o_{i-1} + e^{z_i - 
m_i}v_i</div></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>\end{aligned}&lt;/math&gt;and returns &lt;math&gt;o_N/<del style="font-weight: bold; text-decoration: none;">s_N</del>&lt;/math&gt;. In practice, FlashAttention operates over multiple queries and keys per loop iteration, in a similar way as [[Communication-avoiding algorithm#Blocked (tiled) matrix multiplication|blocked matrix multiplication]].</div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>\end{aligned}&lt;/math&gt;and returns &lt;math&gt;o_N/<ins style="font-weight: bold; text-decoration: none;">l_N</ins>&lt;/math&gt;<ins style="font-weight: bold; text-decoration: none;">. If [[backpropagation]] is needed, then the arrays &lt;math&gt;[m_1, \dots, m_N], [l_1, \dots, l_N]&lt;/math&gt; are cached, and during the backward pass, attention matrices are [[Rematerialization|rematerialized]]</ins>. In practice, FlashAttention operates over multiple queries and keys per loop iteration, in a similar way as [[Communication-avoiding algorithm#Blocked (tiled) matrix multiplication|blocked matrix multiplication]].</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Mathematical properties ==</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Mathematical properties ==</div></td> </tr> </table> Cosmia Nebula
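The diffs above quote the article's one-pass ("online") softmax recurrence used by FlashAttention: running maximum m_i = max(m_{i-1}, z_i), running normalizer l_i = e^{m_{i-1}-m_i} l_{i-1} + e^{z_i - m_i}, and running weighted sum o_i = e^{m_{i-1}-m_i} o_{i-1} + e^{z_i - m_i} v_i, with the result o_N / l_N. Below is a minimal Python sketch of that recurrence, not taken from the article: the function name is hypothetical, values are kept as scalars for brevity (FlashAttention uses vectors and processes tiles of queries/keys per iteration rather than one key at a time), and no caching of m and l for the backward-pass rematerialization mentioned in the last diff is shown.

```python
import math

def online_softmax_attention(q, keys, values):
    """Streaming softmax-weighted average following the quoted recurrence.

    q      : query vector (list of floats)
    keys   : list of key vectors k_i (each the same length as q)
    values : list of scalar values v_i (scalars only to keep the sketch short)

    Returns o_N / l_N = sum_i softmax(z)_i * v_i, computed in one pass
    without materializing the full score vector z.
    """
    m = float("-inf")  # m_i = max(z_1, ..., z_i)
    l = 0.0            # l_i = sum_{j<=i} exp(z_j - m_i)
    o = 0.0            # o_i = sum_{j<=i} exp(z_j - m_i) * v_j

    for k, v in zip(keys, values):
        z = sum(qj * kj for qj, kj in zip(q, k))  # z_i = q^T k_i
        m_new = max(m, z)                         # m_i = max(m_{i-1}, z_i)
        scale = math.exp(m - m_new)               # e^{m_{i-1} - m_i}; 0.0 on the first step
        l = scale * l + math.exp(z - m_new)       # l_i = e^{m_{i-1}-m_i} l_{i-1} + e^{z_i - m_i}
        o = scale * o + math.exp(z - m_new) * v   # o_i = e^{m_{i-1}-m_i} o_{i-1} + e^{z_i - m_i} v_i
        m = m_new

    return o / l


# Example usage (illustrative numbers only): the result equals the naive
# softmax-weighted average of the values, but no score ever exceeds exp(0)
# in the intermediate arithmetic, avoiding overflow for large z_i.
if __name__ == "__main__":
    q = [1.0, 2.0]
    keys = [[0.5, 1.0], [2.0, -1.0], [1.5, 0.5]]
    values = [10.0, 20.0, 30.0]
    print(online_softmax_attention(q, keys, values))
```

Subtracting the running maximum before exponentiating is what keeps the computation numerically stable; rescaling the previous l and o by e^{m_{i-1}-m_i} whenever the maximum increases is what lets the same trick work in a single streaming pass.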