Expectation–maximization algorithm - Revision history
Revision history for this page on the wiki
Feed: https://en.wikipedia.org/w/index.php?action=history&feed=atom&title=Expectation%E2%80%93maximization_algorithm
Updated: 2025-05-30T07:02:08Z (MediaWiki 1.45.0-wmf.3)

Revision as of 10:00, 10 April 2025, by Wbm1058
Edit summary: Bayes' theorem (via WP:JWB)
Diff: https://en.wikipedia.org/w/index.php?title=Expectation%E2%80%93maximization_algorithm&diff=1284886684&oldid=prev
Change (region beginning at line 317, section "==== E step ===="): the wikilink [[Bayes theorem]] was corrected to [[Bayes' theorem]] in the sentence
  "Given our current estimate of the parameters ''θ''<sup>(''t'')</sup>, the conditional distribution of the ''Z''<sub>''i''</sub> is determined by [[Bayes' theorem]] to be the proportional height of the normal [[probability density function|density]] weighted by ''τ'':"
Unchanged context, the E-step responsibility formula:
  : <math>T_{j,i}^{(t)} := \operatorname{P}(Z_i=j \mid X_i=\mathbf{x}_i ;\theta^{(t)}) = \frac{\tau_j^{(t)} \ f(\mathbf{x}_i;\boldsymbol{\mu}_j^{(t)},\Sigma_j^{(t)})}{\tau_1^{(t)} \ f(\mathbf{x}_i;\boldsymbol{\mu}_1^{(t)},\Sigma_1^{(t)}) + \tau_2^{(t)} \ f(\mathbf{x}_i;\boldsymbol{\mu}_2^{(t)},\Sigma_2^{(t)})}.</math>
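The formula just quoted is straightforward to mirror in code. The sketch below is not part of the article; the function name e_step and its argument layout are assumptions, and it uses SciPy's multivariate normal density for f:

    import numpy as np
    from scipy.stats import multivariate_normal

    def e_step(X, tau, mus, Sigmas):
        """E step of a two-component Gaussian mixture.
        X: (n, d) data; tau: length-2 mixing weights; mus, Sigmas: per-component
        mean vectors and covariance matrices. Returns T with shape (2, n)."""
        # Numerators tau_j * f(x_i; mu_j, Sigma_j) for each component j.
        weighted = np.vstack([
            tau[j] * multivariate_normal.pdf(X, mean=mus[j], cov=Sigmas[j])
            for j in range(2)
        ])
        # Normalise over the two components, as in the quoted formula.
        return weighted / weighted.sum(axis=0, keepdims=True)

Each column of the returned array sums to one, matching the reading of T_{j,i}^{(t)} as the posterior probability that observation i came from component j under the current parameters.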
Revision as of 02:45, 11 February 2025, by Quantling
Edit summary: /* Introduction */ Remove article with zero citations according to Google scholar. This one needs to age a bit before it can be considered encyclopedic.
Diff: https://en.wikipedia.org/w/index.php?title=Expectation%E2%80%93maximization_algorithm&diff=1275108110&oldid=prev
Unchanged context (region beginning at line 57): "Finding a maximum likelihood solution typically requires taking the [[derivative]]s of the [[likelihood function]] with respect to all the unknown values, the parameters and the latent variables, and simultaneously solving the resulting equations. In statistical models with latent variables, this is usually impossible. Instead, the result is typically a set of interlocking equations in which the solution to the parameters requires the values of the latent variables and vice versa, but substituting one set of equations into the other produces an unsolvable equation."
Change: the closing sentence and its reference were removed from the paragraph
  "The EM algorithm proceeds from the observation that there is a way to solve these two sets of equations numerically. One can simply pick arbitrary values for one of the two sets of unknowns, use them to estimate the second set, then use these new values to find a better estimate of the first set, and then keep alternating between the two until the resulting values both converge to fixed points. It's not obvious that this will work, but it can be proven in this context. Additionally, it can be proven that the derivative of the likelihood is (arbitrarily close to) zero at that point, which in turn means that the point is either a local maximum or a [[saddle point]].<ref name="Wu" /> In general, multiple maxima may occur, with no guarantee that the global maximum will be found. Some likelihoods also have [[Mathematical singularity|singularities]] in them, i.e., nonsensical maxima. For example, one of the ''solutions'' that may be found by EM in a mixture model involves setting one of the components to have zero variance and the mean parameter for the same component to be equal to one of the data points."
Removed text:
  − "The convergence of expectation-maximization (EM)-based algorithms typically requires continuity of the likelihood function with respect to all the unknown parameters (referred to as optimization variables).<ref>{{Cite journal |last=Joseph |first=Geethu |date=April 24, 2024 |title=Convergence of Expectation-Maximization Algorithm With Mixed-Integer Optimization |url=https://ieeexplore.ieee.org/document/10508082 |journal=IEEE Signal Processing Letters |volume=31 |pages=1229–1233|doi=10.1109/LSP.2024.3393352 |arxiv=2401.17763 }}</ref>"
Unchanged context: "== Description =="
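The alternating scheme described in the paragraph quoted above can be written as a short driver loop. This is only an illustrative sketch; the helper callables e_step and m_step and the convergence test are assumptions, not pseudocode from the article:

    import numpy as np

    def run_em(X, theta0, e_step, m_step, tol=1e-6, max_iter=500):
        """Alternate between estimating the latent variables (E step) and
        re-estimating the parameters (M step) until a fixed point is reached.
        e_step(X, theta) and m_step(X, stats) are user-supplied callables;
        theta is assumed to be a NumPy array of parameters."""
        theta = np.asarray(theta0, dtype=float)
        for _ in range(max_iter):
            stats = e_step(X, theta)        # expected latent quantities given theta
            theta_new = m_step(X, stats)    # parameters maximising the expected likelihood
            if np.max(np.abs(theta_new - theta)) < tol:  # both sets have stopped moving
                return theta_new
            theta = theta_new
        return theta  # may be only a local maximum or a saddle point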
Revision as of 02:39, 11 February 2025, by Quantling
Edit summary: /* Proof of correctness */ Remove dubious cite of 2024 paper for result from 1987
Diff: https://en.wikipedia.org/w/index.php?title=Expectation%E2%80%93maximization_algorithm&diff=1275107357&oldid=prev
Change (region beginning at line 103, section "== Proof of correctness =="): the sentence
  "Expectation-Maximization works to improve <math>Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)})</math> rather than directly improving <math>\log p(\mathbf{X}\mid\boldsymbol\theta)</math>. Here it is shown that improvements to the former imply improvements to the latter."
keeps its citation of Little & Rubin (1987), <ref name="Little1987">{{cite book |last1=Little |first1= Roderick J.A. |last2= Rubin |first2= Donald B. |author2-link= Donald Rubin |title= Statistical Analysis with Missing Data |url=https://archive.org/details/statisticalanaly00litt |url-access=limited | series = Wiley Series in Probability and Mathematical Statistics |year= 1987 |publisher= John Wiley & Sons |location= New York |isbn= 978-0-471-80254-9 |pages= [https://archive.org/details/statisticalanaly00litt/page/n145 134]–136}}</ref>, while the second reference was removed:
  − <ref>{{Cite journal |last1=Song |first1=Yuhang |last2=Millidge |first2=Beren |last3=Salvatori |first3=Tommaso |last4=Lukasiewicz |first4=Thomas |last5=Xu |first5=Zhenghua |last6=Bogacz |first6=Rafal |date=February 2024 |title=Inferring neural activity before plasticity as a foundation for learning beyond backpropagation |journal=Nature Neuroscience |language=en |volume=27 |issue=2 |pages=348–358 |doi=10.1038/s41593-023-01514-1 |pmid=38172438 |issn=1546-1726|pmc=7615830 }}</ref>
Unchanged context: "For any <math>\mathbf{Z}</math> with non-zero probability <math>p(\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta)</math>, we can write"
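The sentence kept by this edit claims that improving Q(θ | θ^(t)) also improves log p(X | θ). That claim rests on a standard decomposition of the observed-data log-likelihood; the sketch below is the textbook argument in the same notation, not text quoted from the article:

    % Decompose the observed-data log-likelihood using an expectation over
    % Z ~ p(. | X, theta^(t)):
    \log p(\mathbf{X}\mid\boldsymbol\theta)
      = Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)})
        + H(\boldsymbol\theta\mid\boldsymbol\theta^{(t)}),
    \qquad
    H(\boldsymbol\theta\mid\boldsymbol\theta^{(t)})
      = -\operatorname{E}_{\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta^{(t)}}
          \bigl[\log p(\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta)\bigr].
    % Gibbs' inequality gives H(theta | theta^(t)) >= H(theta^(t) | theta^(t)),
    % so any theta with Q(theta | theta^(t)) >= Q(theta^(t) | theta^(t)) also
    % satisfies log p(X | theta) >= log p(X | theta^(t)).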
Revision as of 07:36, 11 December 2024, by Citation bot
Edit summary: Alter: url, pages. URLs might have been anonymized. Add: pmid, arxiv, doi, authors 1-1. Removed URL that duplicated identifier. Removed parameters. Formatted dashes. Some additions/deletions were parameter name changes. | Use this bot. Report bugs. | Suggested by Dominic3203 | Category:Machine learning algorithms | #UCB_Category 37/84
Diff: https://en.wikipedia.org/w/index.php?title=Expectation%E2%80%93maximization_algorithm&diff=1262410657&oldid=prev
Change (region beginning at line 17, the 1977 citation in [[Journal of the Royal Statistical Society, Series B]], volume 39, issue 1, pages 1–38): a DOI was added,
  − |jstor=2984875 |mr= 0501537
  + |doi= 10.1111/j.2517-6161.1977.tb01600.x |jstor=2984875 |mr= 0501537
Unchanged context following that citation: "They pointed out that the method had been "proposed many times in special circumstances" by earlier authors. One of the earliest is the gene-counting method for estimating allele frequencies by [[Cedric Smith (statistician)|Cedric Smith]].<ref>{{cite journal |last1=Ceppelini |first1=R.M. |title=The estimation of gene frequencies in a random-mating population |journal=Ann. Hum. Genet. |volume=20 |issue=2 |pages=97–115 |doi=10.1111/j.1469-1809.1955.tb01360.x|pmid=13268982 |year=1955 |s2cid=38625779 }}</ref> Another was proposed by [[Herman Otto Hartley|H.O. Hartley]] in 1958, and Hartley and Hocking in 1977, from which many of the ideas in the Dempster–Laird–Rubin paper originated.<ref>{{Cite journal |last=Hartley |first=Herman Otto |date=1958 |title=Maximum Likelihood estimation from incomplete data |journal=Biometrics |volume=14 |issue=2 |pages=174–194|doi=10.2307/2527783 |jstor=2527783 }}</ref> Another one by S.K Ng, Thriyambakam Krishnan and G.J McLachlan in 1977.<ref>{{Citation |last1=Ng |first1=Shu Kay |title=The EM Algorithm |date=2011-12-21 |url=http://dx.doi.org/10.1007/978-3-642-21551-3_6 |work=Handbook of Computational Statistics |pages=139–172 |place=Berlin, Heidelberg |publisher=Springer Berlin Heidelberg |isbn=978-3-642-21550-6 |access-date=2022-10-15 |last2=Krishnan |first2=Thriyambakam |last3=McLachlan |first3=Geoffrey J.|doi=10.1007/978-3-642-21551-3_6 |s2cid=59942212 }}</ref> Hartley’s ideas can be broadened to any grouped discrete distribution. A very detailed treatment of the EM method for exponential families was published by Rolf Sundberg in his thesis and several papers,<ref name="Sundberg1974">{{cite journal"
Change (region beginning at line 57, the Joseph 2024 reference at the end of the "The EM algorithm proceeds..." paragraph quoted above): the URL https://ieeexplore.ieee.org/abstract/document/10508082 was replaced by https://ieeexplore.ieee.org/document/10508082, the page range 1229-1233 was reformatted as 1229–1233, and |doi=10.1109/LSP.2024.3393352 |arxiv=2401.17763 were added.
Change (region beginning at line 103, the Song et al. 2024 reference in "Proof of correctness"): |last=Song |first=Yuhang was renamed to |last1=Song |first1=Yuhang, the URL https://www.nature.com/articles/s41593-023-01514-1 (duplicating the DOI) was removed, and |pmid=38172438 was added.
Revision as of 14:31, 6 November 2024, by OAbot
Edit summary: Open access bot: pmc updated in citation with #oabot.
Diff: https://en.wikipedia.org/w/index.php?title=Expectation%E2%80%93maximization_algorithm&diff=1255758497&oldid=prev
Change (region beginning at line 103, the Song et al. 2024 reference in "Proof of correctness"): |pmc=7615830 was added after |issn=1546-1726.
Unchanged context: "For any <math>\mathbf{Z}</math> with non-zero probability <math>p(\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta)</math>, we can write"
Revision as of 03:31, 23 October 2024, by 2601:47:4501:F8A0:E092:8871:D799:D016
Edit summary: (none)
Diff: https://en.wikipedia.org/w/index.php?title=Expectation%E2%80%93maximization_algorithm&diff=1252830333&oldid=prev
Change (region beginning at line 57, the "The EM algorithm proceeds..." paragraph): the typo "referred ro as optimization variables" was corrected to "referred to as optimization variables".
Unchanged context: the "Finding a maximum likelihood solution..." paragraph and "== Description ==".
Revision as of 17:23, 19 October 2024, by Ira Leviton
Edit summary: Fixed a reference. Please see Category:CS1 errors: dates.
Diff: https://en.wikipedia.org/w/index.php?title=Expectation%E2%80%93maximization_algorithm&diff=1252075368&oldid=prev
Change (region beginning at line 103, the Song et al. 2024 reference in "Proof of correctness"): |date=2024-02 was changed to |date=February 2024.
Unchanged context: "For any <math>\mathbf{Z}</math> with non-zero probability <math>p(\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta)</math>, we can write"

Revision as of 08:29, 19 October 2024, by Fredklaus2278
Edit summary: adding relevant cite
Diff: https://en.wikipedia.org/w/index.php?title=Expectation%E2%80%93maximization_algorithm&diff=1252009223&oldid=prev
Change (region beginning at line 103, section "== Proof of correctness =="); the previous version of the changed line, as far as the feed entry extends:
  "Expectation-Maximization works to improve <math>Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)})</math> rather than directly improving <math>\log p(\mathbf{X}\mid\boldsymbol\theta)</math>. Here it is shown that improvements to the former imply improvements to the latter.<ref name="Little1987">{{cite book |last1=Little |first1= Roderick J.A. |last2= Rubin |first2= Donald B."
|author2-link= Donald Rubin |title= Statistical Analysis with Missing Data |url=https://archive.org/details/statisticalanaly00litt |url-access=limited | series = Wiley Series in Probability and Mathematical Statistics |year= 1987 |publisher= John Wiley &amp; Sons |location= New York |isbn= 978-0-471-80254-9 |pages= [https://archive.org/details/statisticalanaly00litt/page/n145 134]–136}}&lt;/ref&gt;</div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>Expectation-Maximization works to improve &lt;math&gt;Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)})&lt;/math&gt; rather than directly improving &lt;math&gt;\log p(\mathbf{X}\mid\boldsymbol\theta)&lt;/math&gt;. Here it is shown that improvements to the former imply improvements to the latter.&lt;ref name="Little1987"&gt;{{cite book |last1=Little |first1= Roderick J.A. |last2= Rubin |first2= Donald B. |author2-link= Donald Rubin |title= Statistical Analysis with Missing Data |url=https://archive.org/details/statisticalanaly00litt |url-access=limited | series = Wiley Series in Probability and Mathematical Statistics |year= 1987 |publisher= John Wiley &amp; Sons |location= New York |isbn= 978-0-471-80254-9 |pages= [https://archive.org/details/statisticalanaly00litt/page/n145 134]–136<ins style="font-weight: bold; text-decoration: none;">}}&lt;/ref&gt;&lt;ref&gt;{{Cite journal |last=Song |first=Yuhang |last2=Millidge |first2=Beren |last3=Salvatori |first3=Tommaso |last4=Lukasiewicz |first4=Thomas |last5=Xu |first5=Zhenghua |last6=Bogacz |first6=Rafal |date=2024-02 |title=Inferring neural activity before plasticity as a foundation for learning beyond backpropagation |url=https://www.nature.com/articles/s41593-023-01514-1 |journal=Nature Neuroscience |language=en |volume=27 |issue=2 |pages=348–358 |doi=10.1038/s41593-023-01514-1 |issn=1546-1726</ins>}}&lt;/ref&gt;</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>For any &lt;math&gt;\mathbf{Z}&lt;/math&gt; with non-zero probability &lt;math&gt;p(\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta)&lt;/math&gt;, we can write</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>For any &lt;math&gt;\mathbf{Z}&lt;/math&gt; with non-zero probability &lt;math&gt;p(\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta)&lt;/math&gt;, we can write</div></td> </tr> </table> Fredklaus2278 https://en.wikipedia.org/w/index.php?title=Expectation%E2%80%93maximization_algorithm&diff=1252008989&oldid=prev 
Fredklaus2278: clarifying convergence (EM)-based algorithms 2024-10-19T08:27:59Z <p>clarifying convergence (EM)-based algorithms</p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Previous revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 08:27, 19 October 2024</td> </tr><tr> <td colspan="2" class="diff-lineno">Line 57:</td> <td colspan="2" class="diff-lineno">Line 57:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Finding a maximum likelihood solution typically requires taking the [[derivative]]s of the [[likelihood function]] with respect to all the unknown values, the parameters and the latent variables, and simultaneously solving the resulting equations. In statistical models with latent variables, this is usually impossible. Instead, the result is typically a set of interlocking equations in which the solution to the parameters requires the values of the latent variables and vice versa, but substituting one set of equations into the other produces an unsolvable equation.</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Finding a maximum likelihood solution typically requires taking the [[derivative]]s of the [[likelihood function]] with respect to all the unknown values, the parameters and the latent variables, and simultaneously solving the resulting equations. In statistical models with latent variables, this is usually impossible. Instead, the result is typically a set of interlocking equations in which the solution to the parameters requires the values of the latent variables and vice versa, but substituting one set of equations into the other produces an unsolvable equation.</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The EM algorithm proceeds from the observation that there is a way to solve these two sets of equations numerically. One can simply pick arbitrary values for one of the two sets of unknowns, use them to estimate the second set, then use these new values to find a better estimate of the first set, and then keep alternating between the two until the resulting values both converge to fixed points. 
It's not obvious that this will work, but it can be proven in this context. Additionally, it can be proven that the derivative of the likelihood is (arbitrarily close to) zero at that point, which in turn means that the point is either a local maximum or a [[saddle point]].&lt;ref name="Wu" /&gt; In general, multiple maxima may occur, with no guarantee that the global maximum will be found. Some likelihoods also have [[Mathematical singularity|singularities]] in them, i.e., nonsensical maxima. For example, one of the ''solutions'' that may be found by EM in a mixture model involves setting one of the components to have zero variance and the mean parameter for the same component to be equal to one of the data points.</div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The EM algorithm proceeds from the observation that there is a way to solve these two sets of equations numerically. One can simply pick arbitrary values for one of the two sets of unknowns, use them to estimate the second set, then use these new values to find a better estimate of the first set, and then keep alternating between the two until the resulting values both converge to fixed points. It's not obvious that this will work, but it can be proven in this context. Additionally, it can be proven that the derivative of the likelihood is (arbitrarily close to) zero at that point, which in turn means that the point is either a local maximum or a [[saddle point]].&lt;ref name="Wu" /&gt; In general, multiple maxima may occur, with no guarantee that the global maximum will be found. Some likelihoods also have [[Mathematical singularity|singularities]] in them, i.e., nonsensical maxima. 
For example, one of the ''solutions'' that may be found by EM in a mixture model involves setting one of the components to have zero variance and the mean parameter for the same component to be equal to one of the data points.<ins style="font-weight: bold; text-decoration: none;"> The convergence of expectation-maximization (EM)-based algorithms typically requires continuity of the likelihood function with respect to all the unknown parameters (referred ro as optimization variables).&lt;ref&gt;{{Cite journal |last=Joseph |first=Geethu |date=April 24, 2024 |title=Convergence of Expectation-Maximization Algorithm With Mixed-Integer Optimization |url=https://ieeexplore.ieee.org/abstract/document/10508082 |journal=IEEE Signal Processing Letters |volume=31 |pages=1229-1233}}&lt;/ref&gt;</ins></div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Description ==</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Description ==</div></td> </tr> </table> Fredklaus2278 https://en.wikipedia.org/w/index.php?title=Expectation%E2%80%93maximization_algorithm&diff=1244458420&oldid=prev Phraowin: /* Alternatives */style 2024-09-07T07:06:28Z <p><span class="autocomment">Alternatives: </span>style</p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Previous revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 07:06, 7 September 2024</td> </tr><tr> <td colspan="2" class="diff-lineno">Line 366:</td> <td colspan="2" class="diff-lineno">Line 366:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Alternatives ==</div></td> <td 
class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Alternatives ==</div></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>EM typically converges to a local optimum, not necessarily the global optimum, with no bound on the convergence rate in general. It is possible that it can be arbitrarily poor in high dimensions and there can be an exponential number of local optima. Hence, a need exists for alternative methods for guaranteed learning, especially in the high-dimensional setting. Alternatives to EM exist with better guarantees for consistency, which are termed ''moment-based approaches''&lt;ref&gt;{{Cite journal|last=Pearson|first=Karl|date=1894|title=Contributions to the Mathematical Theory of Evolution|journal=Philosophical Transactions of the Royal Society of London A|volume=185|pages=71–110|issn=0264-3820|jstor=90667|doi=10.1098/rsta.1894.0003|bibcode=1894RSPTA.185...71P|doi-access=free}}&lt;/ref&gt; or the so-called ''spectral techniques''.&lt;ref&gt;{{Cite journal|last1=Shaban|first1=Amirreza|last2=Mehrdad|first2=Farajtabar|last3=Bo|first3=Xie|last4=Le|first4=Song|last5=Byron|first5=Boots|date=2015|title=Learning Latent Variable Models by Improving Spectral Solutions with Exterior Point Method|url=https://www.cc.gatech.edu/~bboots3/files/SpectralExteriorPoint-NIPSWorkshop.pdf|journal=UAI|pages=792–801|access-date=2019-06-12|archive-date=2016-12-24|archive-url=https://web.archive.org/web/20161224102320/https://www.cc.gatech.edu/~bboots3/files/SpectralExteriorPoint-NIPSWorkshop.pdf|url-status=dead}}&lt;/ref&gt;&lt;ref&gt;{{Cite book|title=Local Loss Optimization in Operator Models: A New Insight into Spectral Learning|last=Balle, Borja Quattoni, Ariadna Carreras, Xavier|date=2012-06-27|oclc=815865081}}&lt;/ref&gt; Moment-based approaches to learning the parameters of a probabilistic model enjoy guarantees such as global convergence under certain conditions unlike EM which is often plagued by the issue of getting stuck in local optima. Algorithms with guarantees for learning can be derived for a number of important models such as mixture models, HMMs etc. For these spectral methods, no spurious local optima occur, and the true parameters can be consistently estimated under some regularity conditions{{Citation needed|date=April 2019}}<del style="font-weight: bold; text-decoration: none;">.</del></div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>EM typically converges to a local optimum, not necessarily the global optimum, with no bound on the convergence rate in general. It is possible that it can be arbitrarily poor in high dimensions and there can be an exponential number of local optima. Hence, a need exists for alternative methods for guaranteed learning, especially in the high-dimensional setting. 
Alternatives to EM exist with better guarantees for consistency, which are termed ''moment-based approaches''&lt;ref&gt;{{Cite journal|last=Pearson|first=Karl|date=1894|title=Contributions to the Mathematical Theory of Evolution|journal=Philosophical Transactions of the Royal Society of London A|volume=185|pages=71–110|issn=0264-3820|jstor=90667|doi=10.1098/rsta.1894.0003|bibcode=1894RSPTA.185...71P|doi-access=free}}&lt;/ref&gt; or the so-called ''spectral techniques''.&lt;ref&gt;{{Cite journal|last1=Shaban|first1=Amirreza|last2=Mehrdad|first2=Farajtabar|last3=Bo|first3=Xie|last4=Le|first4=Song|last5=Byron|first5=Boots|date=2015|title=Learning Latent Variable Models by Improving Spectral Solutions with Exterior Point Method|url=https://www.cc.gatech.edu/~bboots3/files/SpectralExteriorPoint-NIPSWorkshop.pdf|journal=UAI|pages=792–801|access-date=2019-06-12|archive-date=2016-12-24|archive-url=https://web.archive.org/web/20161224102320/https://www.cc.gatech.edu/~bboots3/files/SpectralExteriorPoint-NIPSWorkshop.pdf|url-status=dead}}&lt;/ref&gt;&lt;ref&gt;{{Cite book|title=Local Loss Optimization in Operator Models: A New Insight into Spectral Learning|last=Balle, Borja Quattoni, Ariadna Carreras, Xavier|date=2012-06-27|oclc=815865081}}&lt;/ref&gt; Moment-based approaches to learning the parameters of a probabilistic model enjoy guarantees such as global convergence under certain conditions unlike EM which is often plagued by the issue of getting stuck in local optima. Algorithms with guarantees for learning can be derived for a number of important models such as mixture models, HMMs etc. For these spectral methods, no spurious local optima occur, and the true parameters can be consistently estimated under some regularity conditions<ins style="font-weight: bold; text-decoration: none;">.</ins>{{Citation needed|date=April 2019}}</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== See also ==</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== See also ==</div></td> </tr> </table> Phraowin