Expectation–maximization algorithm - Revision history
Revision history for this page on the wiki
Feed: https://en.wikipedia.org/w/index.php?action=history&feed=atom&title=Expectation%E2%80%93maximization_algorithm
Updated: 2025-05-30T07:02:08Z (MediaWiki 1.45.0-wmf.3)

Revision as of 10:00, 10 April 2025, by Wbm1058
Edit summary: Bayes' theorem (via WP:JWB)
Diff: https://en.wikipedia.org/w/index.php?title=Expectation%E2%80%93maximization_algorithm&diff=1284886684&oldid=prev
Change (region beginning at line 317, section "==== E step ===="): the wikilink [[Bayes theorem]] was corrected to [[Bayes' theorem]] in the sentence
  "Given our current estimate of the parameters ''θ''<sup>(''t'')</sup>, the conditional distribution of the ''Z''<sub>''i''</sub> is determined by [[Bayes' theorem]] to be the proportional height of the normal [[probability density function|density]] weighted by ''τ'':"
Unchanged context, the E-step responsibility formula:
  : <math>T_{j,i}^{(t)} := \operatorname{P}(Z_i=j \mid X_i=\mathbf{x}_i ;\theta^{(t)}) = \frac{\tau_j^{(t)} \ f(\mathbf{x}_i;\boldsymbol{\mu}_j^{(t)},\Sigma_j^{(t)})}{\tau_1^{(t)} \ f(\mathbf{x}_i;\boldsymbol{\mu}_1^{(t)},\Sigma_1^{(t)}) + \tau_2^{(t)} \ f(\mathbf{x}_i;\boldsymbol{\mu}_2^{(t)},\Sigma_2^{(t)})}.</math>
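The formula just quoted is straightforward to mirror in code. The sketch below is not part of the article; the function name e_step and its argument layout are assumptions, and it uses SciPy's multivariate normal density for f:

    import numpy as np
    from scipy.stats import multivariate_normal

    def e_step(X, tau, mus, Sigmas):
        """E step of a two-component Gaussian mixture.
        X: (n, d) data; tau: length-2 mixing weights; mus, Sigmas: per-component
        mean vectors and covariance matrices. Returns T with shape (2, n)."""
        # Numerators tau_j * f(x_i; mu_j, Sigma_j) for each component j.
        weighted = np.vstack([
            tau[j] * multivariate_normal.pdf(X, mean=mus[j], cov=Sigmas[j])
            for j in range(2)
        ])
        # Normalise over the two components, as in the quoted formula.
        return weighted / weighted.sum(axis=0, keepdims=True)

Each column of the returned array sums to one, matching the reading of T_{j,i}^{(t)} as the posterior probability that observation i came from component j under the current parameters.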
Revision as of 02:45, 11 February 2025, by Quantling
Edit summary: /* Introduction */ Remove article with zero citations according to Google scholar. This one needs to age a bit before it can be considered encyclopedic.
Diff: https://en.wikipedia.org/w/index.php?title=Expectation%E2%80%93maximization_algorithm&diff=1275108110&oldid=prev
Unchanged context (region beginning at line 57): "Finding a maximum likelihood solution typically requires taking the [[derivative]]s of the [[likelihood function]] with respect to all the unknown values, the parameters and the latent variables, and simultaneously solving the resulting equations. In statistical models with latent variables, this is usually impossible. Instead, the result is typically a set of interlocking equations in which the solution to the parameters requires the values of the latent variables and vice versa, but substituting one set of equations into the other produces an unsolvable equation."
Change: the closing sentence and its reference were removed from the paragraph
  "The EM algorithm proceeds from the observation that there is a way to solve these two sets of equations numerically. One can simply pick arbitrary values for one of the two sets of unknowns, use them to estimate the second set, then use these new values to find a better estimate of the first set, and then keep alternating between the two until the resulting values both converge to fixed points. It's not obvious that this will work, but it can be proven in this context. Additionally, it can be proven that the derivative of the likelihood is (arbitrarily close to) zero at that point, which in turn means that the point is either a local maximum or a [[saddle point]].<ref name="Wu" /> In general, multiple maxima may occur, with no guarantee that the global maximum will be found. Some likelihoods also have [[Mathematical singularity|singularities]] in them, i.e., nonsensical maxima. For example, one of the ''solutions'' that may be found by EM in a mixture model involves setting one of the components to have zero variance and the mean parameter for the same component to be equal to one of the data points."
Removed text:
  − "The convergence of expectation-maximization (EM)-based algorithms typically requires continuity of the likelihood function with respect to all the unknown parameters (referred to as optimization variables).<ref>{{Cite journal |last=Joseph |first=Geethu |date=April 24, 2024 |title=Convergence of Expectation-Maximization Algorithm With Mixed-Integer Optimization |url=https://ieeexplore.ieee.org/document/10508082 |journal=IEEE Signal Processing Letters |volume=31 |pages=1229–1233|doi=10.1109/LSP.2024.3393352 |arxiv=2401.17763 }}</ref>"
Unchanged context: "== Description =="
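The alternating scheme described in the paragraph quoted above can be written as a short driver loop. This is only an illustrative sketch; the helper callables e_step and m_step and the convergence test are assumptions, not pseudocode from the article:

    import numpy as np

    def run_em(X, theta0, e_step, m_step, tol=1e-6, max_iter=500):
        """Alternate between estimating the latent variables (E step) and
        re-estimating the parameters (M step) until a fixed point is reached.
        e_step(X, theta) and m_step(X, stats) are user-supplied callables;
        theta is assumed to be a NumPy array of parameters."""
        theta = np.asarray(theta0, dtype=float)
        for _ in range(max_iter):
            stats = e_step(X, theta)        # expected latent quantities given theta
            theta_new = m_step(X, stats)    # parameters maximising the expected likelihood
            if np.max(np.abs(theta_new - theta)) < tol:  # both sets have stopped moving
                return theta_new
            theta = theta_new
        return theta  # may be only a local maximum or a saddle point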
Revision as of 02:39, 11 February 2025, by Quantling
Edit summary: /* Proof of correctness */ Remove dubious cite of 2024 paper for result from 1987
Diff: https://en.wikipedia.org/w/index.php?title=Expectation%E2%80%93maximization_algorithm&diff=1275107357&oldid=prev
Change (region beginning at line 103, section "== Proof of correctness =="): the sentence
  "Expectation-Maximization works to improve <math>Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)})</math> rather than directly improving <math>\log p(\mathbf{X}\mid\boldsymbol\theta)</math>. Here it is shown that improvements to the former imply improvements to the latter."
keeps its citation of Little & Rubin (1987), <ref name="Little1987">{{cite book |last1=Little |first1= Roderick J.A. |last2= Rubin |first2= Donald B. |author2-link= Donald Rubin |title= Statistical Analysis with Missing Data |url=https://archive.org/details/statisticalanaly00litt |url-access=limited | series = Wiley Series in Probability and Mathematical Statistics |year= 1987 |publisher= John Wiley & Sons |location= New York |isbn= 978-0-471-80254-9 |pages= [https://archive.org/details/statisticalanaly00litt/page/n145 134]–136}}</ref>, while the second reference was removed:
  − <ref>{{Cite journal |last1=Song |first1=Yuhang |last2=Millidge |first2=Beren |last3=Salvatori |first3=Tommaso |last4=Lukasiewicz |first4=Thomas |last5=Xu |first5=Zhenghua |last6=Bogacz |first6=Rafal |date=February 2024 |title=Inferring neural activity before plasticity as a foundation for learning beyond backpropagation |journal=Nature Neuroscience |language=en |volume=27 |issue=2 |pages=348–358 |doi=10.1038/s41593-023-01514-1 |pmid=38172438 |issn=1546-1726|pmc=7615830 }}</ref>
Unchanged context: "For any <math>\mathbf{Z}</math> with non-zero probability <math>p(\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta)</math>, we can write"
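The sentence kept by this edit claims that improving Q(θ | θ^(t)) also improves log p(X | θ). That claim rests on a standard decomposition of the observed-data log-likelihood; the sketch below is the textbook argument in the same notation, not text quoted from the article:

    % Decompose the observed-data log-likelihood using an expectation over
    % Z ~ p(. | X, theta^(t)):
    \log p(\mathbf{X}\mid\boldsymbol\theta)
      = Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)})
        + H(\boldsymbol\theta\mid\boldsymbol\theta^{(t)}),
    \qquad
    H(\boldsymbol\theta\mid\boldsymbol\theta^{(t)})
      = -\operatorname{E}_{\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta^{(t)}}
          \bigl[\log p(\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta)\bigr].
    % Gibbs' inequality gives H(theta | theta^(t)) >= H(theta^(t) | theta^(t)),
    % so any theta with Q(theta | theta^(t)) >= Q(theta^(t) | theta^(t)) also
    % satisfies log p(X | theta) >= log p(X | theta^(t)).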
Revision as of 07:36, 11 December 2024, by Citation bot
Edit summary: Alter: url, pages. URLs might have been anonymized. Add: pmid, arxiv, doi, authors 1-1. Removed URL that duplicated identifier. Removed parameters. Formatted dashes. Some additions/deletions were parameter name changes. | Use this bot. Report bugs. | Suggested by Dominic3203 | Category:Machine learning algorithms | #UCB_Category 37/84
Diff: https://en.wikipedia.org/w/index.php?title=Expectation%E2%80%93maximization_algorithm&diff=1262410657&oldid=prev
Change (region beginning at line 17, the 1977 citation in [[Journal of the Royal Statistical Society, Series B]], volume 39, issue 1, pages 1–38): a DOI was added,
  − |jstor=2984875 |mr= 0501537
  + |doi= 10.1111/j.2517-6161.1977.tb01600.x |jstor=2984875 |mr= 0501537
Unchanged context following that citation: "They pointed out that the method had been "proposed many times in special circumstances" by earlier authors. One of the earliest is the gene-counting method for estimating allele frequencies by [[Cedric Smith (statistician)|Cedric Smith]].<ref>{{cite journal |last1=Ceppelini |first1=R.M. |title=The estimation of gene frequencies in a random-mating population |journal=Ann. Hum. Genet. |volume=20 |issue=2 |pages=97–115 |doi=10.1111/j.1469-1809.1955.tb01360.x|pmid=13268982 |year=1955 |s2cid=38625779 }}</ref> Another was proposed by [[Herman Otto Hartley|H.O. Hartley]] in 1958, and Hartley and Hocking in 1977, from which many of the ideas in the Dempster–Laird–Rubin paper originated.<ref>{{Cite journal |last=Hartley |first=Herman Otto |date=1958 |title=Maximum Likelihood estimation from incomplete data |journal=Biometrics |volume=14 |issue=2 |pages=174–194|doi=10.2307/2527783 |jstor=2527783 }}</ref> Another one by S.K Ng, Thriyambakam Krishnan and G.J McLachlan in 1977.<ref>{{Citation |last1=Ng |first1=Shu Kay |title=The EM Algorithm |date=2011-12-21 |url=http://dx.doi.org/10.1007/978-3-642-21551-3_6 |work=Handbook of Computational Statistics |pages=139–172 |place=Berlin, Heidelberg |publisher=Springer Berlin Heidelberg |isbn=978-3-642-21550-6 |access-date=2022-10-15 |last2=Krishnan |first2=Thriyambakam |last3=McLachlan |first3=Geoffrey J.|doi=10.1007/978-3-642-21551-3_6 |s2cid=59942212 }}</ref> Hartley’s ideas can be broadened to any grouped discrete distribution. A very detailed treatment of the EM method for exponential families was published by Rolf Sundberg in his thesis and several papers,<ref name="Sundberg1974">{{cite journal"
Change (region beginning at line 57, the Joseph 2024 reference at the end of the "The EM algorithm proceeds..." paragraph quoted above): the URL https://ieeexplore.ieee.org/abstract/document/10508082 was replaced by https://ieeexplore.ieee.org/document/10508082, the page range 1229-1233 was reformatted as 1229–1233, and |doi=10.1109/LSP.2024.3393352 |arxiv=2401.17763 were added.
Change (region beginning at line 103, the Song et al. 2024 reference in "Proof of correctness"): |last=Song |first=Yuhang was renamed to |last1=Song |first1=Yuhang, the URL https://www.nature.com/articles/s41593-023-01514-1 (duplicating the DOI) was removed, and |pmid=38172438 was added.
Revision as of 14:31, 6 November 2024, by OAbot
Edit summary: Open access bot: pmc updated in citation with #oabot.
Diff: https://en.wikipedia.org/w/index.php?title=Expectation%E2%80%93maximization_algorithm&diff=1255758497&oldid=prev
Change (region beginning at line 103, the Song et al. 2024 reference in "Proof of correctness"): |pmc=7615830 was added after |issn=1546-1726.
Unchanged context: "For any <math>\mathbf{Z}</math> with non-zero probability <math>p(\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta)</math>, we can write"
Revision as of 03:31, 23 October 2024, by 2601:47:4501:F8A0:E092:8871:D799:D016
Edit summary: (none)
Diff: https://en.wikipedia.org/w/index.php?title=Expectation%E2%80%93maximization_algorithm&diff=1252830333&oldid=prev
Change (region beginning at line 57, the "The EM algorithm proceeds..." paragraph): the typo "referred ro as optimization variables" was corrected to "referred to as optimization variables".
Unchanged context: the "Finding a maximum likelihood solution..." paragraph and "== Description ==".
Revision as of 17:23, 19 October 2024, by Ira Leviton
Edit summary: Fixed a reference. Please see Category:CS1 errors: dates.
Diff: https://en.wikipedia.org/w/index.php?title=Expectation%E2%80%93maximization_algorithm&diff=1252075368&oldid=prev
Change (region beginning at line 103, the Song et al. 2024 reference in "Proof of correctness"): |date=2024-02 was changed to |date=February 2024.
Unchanged context: "For any <math>\mathbf{Z}</math> with non-zero probability <math>p(\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta)</math>, we can write"

Revision as of 08:29, 19 October 2024, by Fredklaus2278
Edit summary: adding relevant cite
Diff: https://en.wikipedia.org/w/index.php?title=Expectation%E2%80%93maximization_algorithm&diff=1252009223&oldid=prev
Change (region beginning at line 103, section "== Proof of correctness =="); the previous version of the changed line, as far as the feed entry extends:
  "Expectation-Maximization works to improve <math>Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)})</math> rather than directly improving <math>\log p(\mathbf{X}\mid\boldsymbol\theta)</math>. Here it is shown that improvements to the former imply improvements to the latter.<ref name="Little1987">{{cite book |last1=Little |first1= Roderick J.A. |last2= Rubin |first2= Donald B."
|author2-link= Donald Rubin |title= Statistical Analysis with Missing Data |url=https://archive.org/details/statisticalanaly00litt |url-access=limited | series = Wiley Series in Probability and Mathematical Statistics |year= 1987 |publisher= John Wiley &amp; Sons |location= New York |isbn= 978-0-471-80254-9 |pages= [https://archive.org/details/statisticalanaly00litt/page/n145 134]–136}}&lt;/ref&gt;</div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>Expectation-Maximization works to improve &lt;math&gt;Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)})&lt;/math&gt; rather than directly improving &lt;math&gt;\log p(\mathbf{X}\mid\boldsymbol\theta)&lt;/math&gt;. Here it is shown that improvements to the former imply improvements to the latter.&lt;ref name="Little1987"&gt;{{cite book |last1=Little |first1= Roderick J.A. |last2= Rubin |first2= Donald B. |author2-link= Donald Rubin |title= Statistical Analysis with Missing Data |url=https://archive.org/details/statisticalanaly00litt |url-access=limited | series = Wiley Series in Probability and Mathematical Statistics |year= 1987 |publisher= John Wiley &amp; Sons |location= New York |isbn= 978-0-471-80254-9 |pages= [https://archive.org/details/statisticalanaly00litt/page/n145 134]–136<ins style="font-weight: bold; text-decoration: none;">}}&lt;/ref&gt;&lt;ref&gt;{{Cite journal |last=Song |first=Yuhang |last2=Millidge |first2=Beren |last3=Salvatori |first3=Tommaso |last4=Lukasiewicz |first4=Thomas |last5=Xu |first5=Zhenghua |last6=Bogacz |first6=Rafal |date=2024-02 |title=Inferring neural activity before plasticity as a foundation for learning beyond backpropagation |url=https://www.nature.com/articles/s41593-023-01514-1 |journal=Nature Neuroscience |language=en |volume=27 |issue=2 |pages=348–358 |doi=10.1038/s41593-023-01514-1 |issn=1546-1726</ins>}}&lt;/ref&gt;</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>For any &lt;math&gt;\mathbf{Z}&lt;/math&gt; with non-zero probability &lt;math&gt;p(\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta)&lt;/math&gt;, we can write</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>For any &lt;math&gt;\mathbf{Z}&lt;/math&gt; with non-zero probability &lt;math&gt;p(\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta)&lt;/math&gt;, we can write</div></td> </tr> </table> Fredklaus2278 https://en.wikipedia.org/w/index.php?title=Expectation%E2%80%93maximization_algorithm&diff=1252008989&oldid=prev 
Fredklaus2278: clarifying convergence (EM)-based algorithms 2024-10-19T08:27:59Z <p>clarifying convergence (EM)-based algorithms</p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Previous revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 08:27, 19 October 2024</td> </tr><tr> <td colspan="2" class="diff-lineno">Line 57:</td> <td colspan="2" class="diff-lineno">Line 57:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Finding a maximum likelihood solution typically requires taking the [[derivative]]s of the [[likelihood function]] with respect to all the unknown values, the parameters and the latent variables, and simultaneously solving the resulting equations. In statistical models with latent variables, this is usually impossible. Instead, the result is typically a set of interlocking equations in which the solution to the parameters requires the values of the latent variables and vice versa, but substituting one set of equations into the other produces an unsolvable equation.</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Finding a maximum likelihood solution typically requires taking the [[derivative]]s of the [[likelihood function]] with respect to all the unknown values, the parameters and the latent variables, and simultaneously solving the resulting equations. In statistical models with latent variables, this is usually impossible. Instead, the result is typically a set of interlocking equations in which the solution to the parameters requires the values of the latent variables and vice versa, but substituting one set of equations into the other produces an unsolvable equation.</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The EM algorithm proceeds from the observation that there is a way to solve these two sets of equations numerically. One can simply pick arbitrary values for one of the two sets of unknowns, use them to estimate the second set, then use these new values to find a better estimate of the first set, and then keep alternating between the two until the resulting values both converge to fixed points. 
It's not obvious that this will work, but it can be proven in this context. Additionally, it can be proven that the derivative of the likelihood is (arbitrarily close to) zero at that point, which in turn means that the point is either a local maximum or a [[saddle point]].&lt;ref name="Wu" /&gt; In general, multiple maxima may occur, with no guarantee that the global maximum will be found. Some likelihoods also have [[Mathematical singularity|singularities]] in them, i.e., nonsensical maxima. For example, one of the ''solutions'' that may be found by EM in a mixture model involves setting one of the components to have zero variance and the mean parameter for the same component to be equal to one of the data points.</div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The EM algorithm proceeds from the observation that there is a way to solve these two sets of equations numerically. One can simply pick arbitrary values for one of the two sets of unknowns, use them to estimate the second set, then use these new values to find a better estimate of the first set, and then keep alternating between the two until the resulting values both converge to fixed points. It's not obvious that this will work, but it can be proven in this context. Additionally, it can be proven that the derivative of the likelihood is (arbitrarily close to) zero at that point, which in turn means that the point is either a local maximum or a [[saddle point]].&lt;ref name="Wu" /&gt; In general, multiple maxima may occur, with no guarantee that the global maximum will be found. Some likelihoods also have [[Mathematical singularity|singularities]] in them, i.e., nonsensical maxima. 
For example, one of the ''solutions'' that may be found by EM in a mixture model involves setting one of the components to have zero variance and the mean parameter for the same component to be equal to one of the data points.<ins style="font-weight: bold; text-decoration: none;"> The convergence of expectation-maximization (EM)-based algorithms typically requires continuity of the likelihood function with respect to all the unknown parameters (referred ro as optimization variables).&lt;ref&gt;{{Cite journal |last=Joseph |first=Geethu |date=April 24, 2024 |title=Convergence of Expectation-Maximization Algorithm With Mixed-Integer Optimization |url=https://ieeexplore.ieee.org/abstract/document/10508082 |journal=IEEE Signal Processing Letters |volume=31 |pages=1229-1233}}&lt;/ref&gt;</ins></div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Description ==</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Description ==</div></td> </tr> </table> Fredklaus2278 https://en.wikipedia.org/w/index.php?title=Expectation%E2%80%93maximization_algorithm&diff=1244458420&oldid=prev Phraowin: /* Alternatives */style 2024-09-07T07:06:28Z <p><span class="autocomment">Alternatives: </span>style</p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Previous revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 07:06, 7 September 2024</td> </tr><tr> <td colspan="2" class="diff-lineno">Line 366:</td> <td colspan="2" class="diff-lineno">Line 366:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Alternatives ==</div></td> <td 
class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Alternatives ==</div></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>EM typically converges to a local optimum, not necessarily the global optimum, with no bound on the convergence rate in general. It is possible that it can be arbitrarily poor in high dimensions and there can be an exponential number of local optima. Hence, a need exists for alternative methods for guaranteed learning, especially in the high-dimensional setting. Alternatives to EM exist with better guarantees for consistency, which are termed ''moment-based approaches''&lt;ref&gt;{{Cite journal|last=Pearson|first=Karl|date=1894|title=Contributions to the Mathematical Theory of Evolution|journal=Philosophical Transactions of the Royal Society of London A|volume=185|pages=71–110|issn=0264-3820|jstor=90667|doi=10.1098/rsta.1894.0003|bibcode=1894RSPTA.185...71P|doi-access=free}}&lt;/ref&gt; or the so-called ''spectral techniques''.&lt;ref&gt;{{Cite journal|last1=Shaban|first1=Amirreza|last2=Mehrdad|first2=Farajtabar|last3=Bo|first3=Xie|last4=Le|first4=Song|last5=Byron|first5=Boots|date=2015|title=Learning Latent Variable Models by Improving Spectral Solutions with Exterior Point Method|url=https://www.cc.gatech.edu/~bboots3/files/SpectralExteriorPoint-NIPSWorkshop.pdf|journal=UAI|pages=792–801|access-date=2019-06-12|archive-date=2016-12-24|archive-url=https://web.archive.org/web/20161224102320/https://www.cc.gatech.edu/~bboots3/files/SpectralExteriorPoint-NIPSWorkshop.pdf|url-status=dead}}&lt;/ref&gt;&lt;ref&gt;{{Cite book|title=Local Loss Optimization in Operator Models: A New Insight into Spectral Learning|last=Balle, Borja Quattoni, Ariadna Carreras, Xavier|date=2012-06-27|oclc=815865081}}&lt;/ref&gt; Moment-based approaches to learning the parameters of a probabilistic model enjoy guarantees such as global convergence under certain conditions unlike EM which is often plagued by the issue of getting stuck in local optima. Algorithms with guarantees for learning can be derived for a number of important models such as mixture models, HMMs etc. For these spectral methods, no spurious local optima occur, and the true parameters can be consistently estimated under some regularity conditions{{Citation needed|date=April 2019}}<del style="font-weight: bold; text-decoration: none;">.</del></div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>EM typically converges to a local optimum, not necessarily the global optimum, with no bound on the convergence rate in general. It is possible that it can be arbitrarily poor in high dimensions and there can be an exponential number of local optima. Hence, a need exists for alternative methods for guaranteed learning, especially in the high-dimensional setting. 
Alternatives to EM exist with better guarantees for consistency, which are termed ''moment-based approaches''&lt;ref&gt;{{Cite journal|last=Pearson|first=Karl|date=1894|title=Contributions to the Mathematical Theory of Evolution|journal=Philosophical Transactions of the Royal Society of London A|volume=185|pages=71–110|issn=0264-3820|jstor=90667|doi=10.1098/rsta.1894.0003|bibcode=1894RSPTA.185...71P|doi-access=free}}&lt;/ref&gt; or the so-called ''spectral techniques''.&lt;ref&gt;{{Cite journal|last1=Shaban|first1=Amirreza|last2=Mehrdad|first2=Farajtabar|last3=Bo|first3=Xie|last4=Le|first4=Song|last5=Byron|first5=Boots|date=2015|title=Learning Latent Variable Models by Improving Spectral Solutions with Exterior Point Method|url=https://www.cc.gatech.edu/~bboots3/files/SpectralExteriorPoint-NIPSWorkshop.pdf|journal=UAI|pages=792–801|access-date=2019-06-12|archive-date=2016-12-24|archive-url=https://web.archive.org/web/20161224102320/https://www.cc.gatech.edu/~bboots3/files/SpectralExteriorPoint-NIPSWorkshop.pdf|url-status=dead}}&lt;/ref&gt;&lt;ref&gt;{{Cite book|title=Local Loss Optimization in Operator Models: A New Insight into Spectral Learning|last=Balle, Borja Quattoni, Ariadna Carreras, Xavier|date=2012-06-27|oclc=815865081}}&lt;/ref&gt; Moment-based approaches to learning the parameters of a probabilistic model enjoy guarantees such as global convergence under certain conditions unlike EM which is often plagued by the issue of getting stuck in local optima. Algorithms with guarantees for learning can be derived for a number of important models such as mixture models, HMMs etc. For these spectral methods, no spurious local optima occur, and the true parameters can be consistently estimated under some regularity conditions<ins style="font-weight: bold; text-decoration: none;">.</ins>{{Citation needed|date=April 2019}}</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== See also ==</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== See also ==</div></td> </tr> </table> Phraowin