Yarowsky algorithm - Revision history

OAbot: Open access bot: doi added to citation with #oabot.

2023-01-28T18:00:12Z

Open access bot: doi added to citation with #oabot.

← Previous revision		Revision as of 18:00, 28 January 2023
Line 42:		Line 42:
	\| access-date = 1 November 2022		\| access-date = 1 November 2022

			\| doi-access= free
	}}		}}

209.64.107.75: /* Application */

2022-11-28T23:31:50Z

Application

← Previous revision		Revision as of 23:31, 28 November 2022
Line 11:		Line 11:
	A [[decision list]] algorithm is then used to identify other reliable collocations. This training algorithm calculates the probability Pr(Sense \| Collocation), and the decision list is ranked by the log-likelihood ratio:		A [[decision list]] algorithm is then used to identify other reliable collocations. This training algorithm calculates the probability Pr(Sense \| Collocation), and the decision list is ranked by the log-likelihood ratio:

	: <math> \log \left( \frac{\Pr(\text{Sense}_A | \text{Collocation}_i)}{ \Pr( \text{Sense}_B | \text{Collocation}_i)} \right) </math>		: <math> \log \left( \frac{\Pr(\text{Sense}_A \mid \text{Collocation}_i)}{ \Pr( \text{Sense}_B \mid \text{Collocation}_i)} \right) </math>

	A [[smoothing]] algorithm will then be used to avoid 0 values. The decision-list algorithm resolves many problems in a large set of non-independent evidence source by using only the most reliable piece of evidence rather than the whole matching collocation set.		A [[smoothing]] algorithm will then be used to avoid 0 values. The decision-list algorithm resolves many problems in a large set of non-independent evidence source by using only the most reliable piece of evidence rather than the whole matching collocation set.
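
The ranking described in this hunk can be illustrated with a minimal Python sketch. It is not taken from Yarowsky's paper: the function names, the additive smoothing constant `alpha`, and the tie-breaking rule are assumptions made for illustration. Because Pr(Sense_A | Collocation_i) and Pr(Sense_B | Collocation_i) share the same denominator, the ratio reduces to a ratio of (smoothed) counts.

```python
import math
from collections import Counter

def build_decision_list(tagged_examples, alpha=0.1):
    """Rank collocations by |log(Pr(A|c) / Pr(B|c))|.

    tagged_examples: iterable of (collocations, sense), sense in {"A", "B"}.
    alpha: hypothetical additive smoothing constant that avoids 0 values.
    """
    counts = {"A": Counter(), "B": Counter()}
    for collocations, sense in tagged_examples:
        counts[sense].update(set(collocations))

    rules = []
    for c in set(counts["A"]) | set(counts["B"]):
        # The shared denominator count(c) cancels, leaving smoothed counts.
        llr = math.log((counts["A"][c] + alpha) / (counts["B"][c] + alpha))
        if llr == 0:
            continue  # collocation carries no evidence for either sense
        rules.append((abs(llr), c, "A" if llr > 0 else "B"))
    # Strongest evidence first; ties broken alphabetically (an assumption).
    rules.sort(key=lambda r: (-r[0], r[1]))
    return rules

def classify(rules, collocations):
    """Use only the single most reliable matching rule, as a decision list does."""
    present = set(collocations)
    for score, c, sense in rules:
        if c in present:
            return sense, score
    return None, 0.0

examples = [
    (["life", "animal"], "A"),
    (["life"], "A"),
    (["manufacturing"], "B"),
    (["manufacturing", "animal"], "B"),
]
rules = build_decision_list(examples)
```

Here "animal" occurs equally often with both senses, so it is dropped, and `classify` decides on the first matching rule alone rather than combining all matching collocations.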

Aryamccarthy: Use template for citation; update categories

2022-11-01T15:12:13Z

Use template for citation; update categories

@@ Line 29: / Line 29: @@
 ==References==
 [[Category:Corpus linguistics]]
 [[Category:Word-sense disambiguation]]

PrimeBOT: Task 24: remove a maintenance template following a TFD

2022-01-18T22:29:09Z

Task 24: remove a maintenance template following a TFD

← Previous revision		Revision as of 22:29, 18 January 2022
Line 1:		Line 1:
	{{short description\|Method in computational linguistics}}		{{short description\|Method in computational linguistics}}

		In [[computational linguistics]] the '''Yarowsky algorithm''' is an [[unsupervised learning]] [[algorithm]] for [[word sense disambiguation]] that uses the "one sense per [[collocation]]" and the "one sense per discourse" properties of [[natural language\|human languages]] for word sense disambiguation. From observation, words tend to exhibit only one sense in most given discourse and in a given collocation.
	{{Expert-subject\|linguistics\|date=January 2009}}

	In [[computational linguistics]] the '''Yarowsky algorithm''' is an [[unsupervised learning]] [[algorithm]] for [[word sense disambiguation]] that uses the "one sense per [[collocation]]" and the "one sense per discourse" properties of [[natural language\|human languages]] for word sense disambiguation. From observation, words tend to exhibit only one sense in most given discourse and in a given collocation.

	== Application ==		== Application ==
Line 17:		Line 15:
	A [[smoothing]] algorithm will then be used to avoid 0 values. The decision-list algorithm resolves many problems in a large set of non-independent evidence source by using only the most reliable piece of evidence rather than the whole matching collocation set.		A [[smoothing]] algorithm will then be used to avoid 0 values. The decision-list algorithm resolves many problems in a large set of non-independent evidence source by using only the most reliable piece of evidence rather than the whole matching collocation set.

	The new resulting classifier will then be applied to the whole sample set. Add those examples in the [[errors and residuals in statistics\|residual]] that are tagged as A or B with probability above a reasonable threshold to the seed sets. The decision-list algorithm and the above adding step are applied [[iteration\|iteratively]]. As more newly-learned collocations are added to the seed sets, the sense A or sense B set will grow, and the original residual will shrink. However, these collocations stay in the seed sets only if their probability of classification remains above the threshold, otherwise they are returned to the residual for later classification. At the end of each iteration, the "one sense per discourse" property can be used to help preventing initially mistagged collocates and hence improving the purity of the seed sets.		The new resulting classifier will then be applied to the whole sample set. Add those examples in the [[errors and residuals in statistics\|residual]] that are tagged as A or B with probability above a reasonable threshold to the seed sets. The decision-list algorithm and the above adding step are applied [[iteration\|iteratively]]. As more newly-learned collocations are added to the seed sets, the sense A or sense B set will grow, and the original residual will shrink. However, these collocations stay in the seed sets only if their probability of classification remains above the threshold, otherwise they are returned to the residual for later classification. At the end of each iteration, the "one sense per discourse" property can be used to help preventing initially mistagged collocates and hence improving the purity of the seed sets.

	In order to avoid strong collocates becoming indicators for the wrong class, the class-inclusion threshold needs to be randomly altered. For the same purpose, after intermediate convergence the algorithm will also need to increase the width of the context window.		In order to avoid strong collocates becoming indicators for the wrong class, the class-inclusion threshold needs to be randomly altered. For the same purpose, after intermediate convergence the algorithm will also need to increase the width of the context window.
Line 31:		Line 29:
	==References==		==References==

	* Yarowsky, D. "Unsupervised Word Sense Disambiguation Rivaling Supervised Methods". ''Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics''. Cambridge, MA, pp. 189–196, 1995.		* Yarowsky, D. "Unsupervised Word Sense Disambiguation Rivaling Supervised Methods". ''Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics''. Cambridge, MA, pp. 189–196, 1995.

	[[Category:Corpus linguistics]]		[[Category:Corpus linguistics]]

GVnayR: +short description

2020-08-20T19:47:49Z

+short description

← Previous revision		Revision as of 19:47, 20 August 2020
Line 1:		Line 1:
			{{short description\|Method in computational linguistics}}

	{{Expert-subject\|linguistics\|date=January 2009}}		{{Expert-subject\|linguistics\|date=January 2009}}

Cydebot: Robot - Speedily moving category Word sense disambiguation to :Category:Word-sense disambiguation per CFDS.

2012-01-02T04:00:25Z

Robot - Speedily moving category Word sense disambiguation to Category:Word-sense disambiguation per CFDS.

← Previous revision		Revision as of 04:00, 2 January 2012
Line 32:		Line 32:

	[[Category:Corpus linguistics]]		[[Category:Corpus linguistics]]
	[[Category:Word sense disambiguation]]		[[Category:Word-sense disambiguation]]

R'n'B: Fix links to disambiguation page Predicate

2011-11-28T15:41:22Z

Fix links to disambiguation page Predicate

← Previous revision		Revision as of 15:41, 28 November 2011
Line 7:		Line 7:
	The algorithm starts with a large, untagged [[text corpus\|corpus]], in which it identifies examples of the given [[polysemy\|polysemous]] word, and stores all the relevant [[Sentence (linguistics)\|sentence]]s as lines. For instance, Yarowsky uses the word "plant" in his 1995 paper to demonstrate the algorithm. If it is assumed that there are two possible senses of the word, the next step is to identify a small number of seed collocations representative of each sense, give each sense a label (i.e. sense A and B), then assign the appropriate label to all training examples containing the seed collocations. In this case, the words "life" and "manufacturing" are chosen as initial seed collocations for senses A and B respectively. The residual examples (85%–98% according to Yarowsky) remain untagged.		The algorithm starts with a large, untagged [[text corpus\|corpus]], in which it identifies examples of the given [[polysemy\|polysemous]] word, and stores all the relevant [[Sentence (linguistics)\|sentence]]s as lines. For instance, Yarowsky uses the word "plant" in his 1995 paper to demonstrate the algorithm. If it is assumed that there are two possible senses of the word, the next step is to identify a small number of seed collocations representative of each sense, give each sense a label (i.e. sense A and B), then assign the appropriate label to all training examples containing the seed collocations. In this case, the words "life" and "manufacturing" are chosen as initial seed collocations for senses A and B respectively. The residual examples (85%–98% according to Yarowsky) remain untagged.

	The algorithm should initially choose seed collocations representative that will distinguish sense A and B accurately and productively. This can be done by selecting seed words from a [[dictionary]]’s entry for that sense. The collocations tend to have stronger effect if they are adjacent to the target word, the effect weakens with distance. According to the criteria given in Yarowsky (1993), seed words that appear in the most reliable collocational relationships with the target word will be selected. The effect is much stronger for words in a [[predicate]]-argument relationship than for arbitrary associations at the same distance to the target word, and is much stronger for collocations with content words than with function words. Having said this, a collocation word can have several collocational relationships with the target word throughout the corpus. This could give the word different rankings or even different classifications. Alternatively, it can be done by identifying a single defining collocate for each class, and using for seeds only those contexts containing one of these defining words. A publicly available database [[WordNet]] can be used as an automatic source for such defining terms. In addition, words that occur near the target word in great frequency can be selected as seed collocations representative. This approach is not fully automatic, a human judge must decide which word will be selected for each target word’s sense, the outputs will be reliable indicators of the senses.		The algorithm should initially choose seed collocations representative that will distinguish sense A and B accurately and productively. This can be done by selecting seed words from a [[dictionary]]’s entry for that sense. The collocations tend to have stronger effect if they are adjacent to the target word, the effect weakens with distance. 
According to the criteria given in Yarowsky (1993), seed words that appear in the most reliable collocational relationships with the target word will be selected. The effect is much stronger for words in a [[wikt:predicate\|predicate]]-argument relationship than for arbitrary associations at the same distance to the target word, and is much stronger for collocations with content words than with function words. Having said this, a collocation word can have several collocational relationships with the target word throughout the corpus. This could give the word different rankings or even different classifications. Alternatively, it can be done by identifying a single defining collocate for each class, and using for seeds only those contexts containing one of these defining words. A publicly available database [[WordNet]] can be used as an automatic source for such defining terms. In addition, words that occur near the target word in great frequency can be selected as seed collocations representative. This approach is not fully automatic, a human judge must decide which word will be selected for each target word’s sense, the outputs will be reliable indicators of the senses.

	A [[decision list]] algorithm is then used to identify other reliable collocations. This training algorithm calculates the probability Pr(Sense \| Collocation), and the decision list is ranked by the log-likelihood ratio:		A [[decision list]] algorithm is then used to identify other reliable collocations. This training algorithm calculates the probability Pr(Sense \| Collocation), and the decision list is ranked by the log-likelihood ratio:

Mieaou at 10:27, 15 June 2011

2011-06-15T10:27:42Z

← Previous revision		Revision as of 10:27, 15 June 2011
Line 9:		Line 9:
	The algorithm should initially choose seed collocations representative that will distinguish sense A and B accurately and productively. This can be done by selecting seed words from a [[dictionary]]’s entry for that sense. The collocations tend to have stronger effect if they are adjacent to the target word, the effect weakens with distance. According to the criteria given in Yarowsky (1993), seed words that appear in the most reliable collocational relationships with the target word will be selected. The effect is much stronger for words in a [[predicate]]-argument relationship than for arbitrary associations at the same distance to the target word, and is much stronger for collocations with content words than with function words. Having said this, a collocation word can have several collocational relationships with the target word throughout the corpus. This could give the word different rankings or even different classifications. Alternatively, it can be done by identifying a single defining collocate for each class, and using for seeds only those contexts containing one of these defining words. A publicly available database [[WordNet]] can be used as an automatic source for such defining terms. In addition, words that occur near the target word in great frequency can be selected as seed collocations representative. This approach is not fully automatic, a human judge must decide which word will be selected for each target word’s sense, the outputs will be reliable indicators of the senses.		The algorithm should initially choose seed collocations representative that will distinguish sense A and B accurately and productively. This can be done by selecting seed words from a [[dictionary]]’s entry for that sense. The collocations tend to have stronger effect if they are adjacent to the target word, the effect weakens with distance. 
According to the criteria given in Yarowsky (1993), seed words that appear in the most reliable collocational relationships with the target word will be selected. The effect is much stronger for words in a [[predicate]]-argument relationship than for arbitrary associations at the same distance to the target word, and is much stronger for collocations with content words than with function words. Having said this, a collocation word can have several collocational relationships with the target word throughout the corpus. This could give the word different rankings or even different classifications. Alternatively, it can be done by identifying a single defining collocate for each class, and using for seeds only those contexts containing one of these defining words. A publicly available database [[WordNet]] can be used as an automatic source for such defining terms. In addition, words that occur near the target word in great frequency can be selected as seed collocations representative. This approach is not fully automatic, a human judge must decide which word will be selected for each target word’s sense, the outputs will be reliable indicators of the senses.

	A decision-list algorithm is then used to identify other reliable collocations. This training algorithm calculates the probability Pr(Sense \| Collocation), and the decision list is ranked by the log-likelihood ratio:		A [[decision list]] algorithm is then used to identify other reliable collocations. This training algorithm calculates the probability Pr(Sense \| Collocation), and the decision list is ranked by the log-likelihood ratio:

	: <math> \log \left( \frac{\Pr(\text{Sense}_A | \text{Collocation}_i)}{ \Pr( \text{Sense}_B | \text{Collocation}_i)} \right) </math>		: <math> \log \left( \frac{\Pr(\text{Sense}_A | \text{Collocation}_i)}{ \Pr( \text{Sense}_B | \text{Collocation}_i)} \right) </math>

Michael Hardy: dab

2011-05-05T20:12:13Z

dab

← Previous revision		Revision as of 20:12, 5 May 2011
Line 15:		Line 15:
	A [[smoothing]] algorithm will then be used to avoid 0 values. The decision-list algorithm resolves many problems in a large set of non-independent evidence source by using only the most reliable piece of evidence rather than the whole matching collocation set.		A [[smoothing]] algorithm will then be used to avoid 0 values. The decision-list algorithm resolves many problems in a large set of non-independent evidence source by using only the most reliable piece of evidence rather than the whole matching collocation set.

	The new resulting classifier will then be applied to the whole sample set. Add those examples in the [[residual]] that are tagged as A or B with probability above a reasonable threshold to the seed sets. The decision-list algorithm and the above adding step are applied [[iteration\|iteratively]]. As more newly-learned collocations are added to the seed sets, the sense A or sense B set will grow, and the original residual will shrink. However, these collocations stay in the seed sets only if their probability of classification remains above the threshold, otherwise they are returned to the residual for later classification. At the end of each iteration, the "one sense per discourse" property can be used to help preventing initially mistagged collocates and hence improving the purity of the seed sets.		The new resulting classifier will then be applied to the whole sample set. Add those examples in the [[errors and residuals in statistics\|residual]] that are tagged as A or B with probability above a reasonable threshold to the seed sets. The decision-list algorithm and the above adding step are applied [[iteration\|iteratively]]. As more newly-learned collocations are added to the seed sets, the sense A or sense B set will grow, and the original residual will shrink. However, these collocations stay in the seed sets only if their probability of classification remains above the threshold, otherwise they are returned to the residual for later classification. At the end of each iteration, the "one sense per discourse" property can be used to help preventing initially mistagged collocates and hence improving the purity of the seed sets.

	In order to avoid strong collocates becoming indicators for the wrong class, the class-inclusion threshold needs to be randomly altered. For the same purpose, after intermediate convergence the algorithm will also need to increase the width of the context window.		In order to avoid strong collocates becoming indicators for the wrong class, the class-inclusion threshold needs to be randomly altered. For the same purpose, after intermediate convergence the algorithm will also need to increase the width of the context window.
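
The iterative step described in this hunk can be sketched roughly as follows. This is a simplified illustration, not the published procedure: the names `train` and `bootstrap`, the smoothing constant, the threshold value, and the stopping rule are all assumptions, and the "one sense per discourse" filter, the random threshold changes, and the widening of the context window are left out.

```python
import math
from collections import Counter

def train(labels, contexts, alpha=0.1):
    """Return {collocation: (sense, score)} from the currently tagged examples."""
    counts = {"A": Counter(), "B": Counter()}
    for i, sense in labels.items():
        counts[sense].update(contexts[i])
    rules = {}
    for c in set(counts["A"]) | set(counts["B"]):
        llr = math.log((counts["A"][c] + alpha) / (counts["B"][c] + alpha))
        if llr != 0:
            rules[c] = ("A" if llr > 0 else "B", abs(llr))
    return rules

def bootstrap(contexts, seeds, threshold=2.0, max_iter=20):
    """contexts: list of sets of collocations; seeds: {collocation: sense}."""
    # Initial seed sets: examples containing a seed collocation.
    labels = {}
    for i, ctx in enumerate(contexts):
        for c, sense in seeds.items():
            if c in ctx:
                labels[i] = sense
    for _ in range(max_iter):
        rules = train(labels, contexts)
        # Re-tag the whole sample set; anything below the threshold
        # falls back into the residual (it is simply left unlabeled).
        new_labels = {}
        for i, ctx in enumerate(contexts):
            best = max((rules[c] for c in ctx if c in rules),
                       key=lambda r: r[1], default=None)
            if best is not None and best[1] >= threshold:
                new_labels[i] = best[0]
        if new_labels == labels:
            break  # no example changed set: treat as converged
        labels = new_labels
    return labels

contexts = [
    {"life", "animal"}, {"life", "species"},
    {"manufacturing", "equipment"}, {"manufacturing", "worker"},
    {"animal", "species"}, {"equipment", "worker"}, {"unrelated"},
]
result = bootstrap(contexts, {"life": "A", "manufacturing": "B"})
```

In this toy run the two seedless contexts are pulled into the sense A and sense B sets through collocations learned in the first pass, while the last context stays in the residual.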

The Interior: /* Application */ Repairing links to disambiguation pages - You can help!

2010-09-23T03:14:48Z

Application: Repairing links to disambiguation pages - You can help!

← Previous revision		Revision as of 03:14, 23 September 2010
Line 5:		Line 5:
	== Application ==		== Application ==

	The algorithm starts with a large, untagged [[corpus]], in which it identifies examples of the given [[polysemy\|polysemous]] word, and stores all the relevant [[Sentence (linguistics)\|sentence]]s as lines. For instance, Yarowsky uses the word "plant" in his 1995 paper to demonstrate the algorithm. If it is assumed that there are two possible senses of the word, the next step is to identify a small number of seed collocations representative of each sense, give each sense a label (i.e. sense A and B), then assign the appropriate label to all training examples containing the seed collocations. In this case, the words "life" and "manufacturing" are chosen as initial seed collocations for senses A and B respectively. The residual examples (85%–98% according to Yarowsky) remain untagged.		The algorithm starts with a large, untagged [[text corpus\|corpus]], in which it identifies examples of the given [[polysemy\|polysemous]] word, and stores all the relevant [[Sentence (linguistics)\|sentence]]s as lines. For instance, Yarowsky uses the word "plant" in his 1995 paper to demonstrate the algorithm. If it is assumed that there are two possible senses of the word, the next step is to identify a small number of seed collocations representative of each sense, give each sense a label (i.e. sense A and B), then assign the appropriate label to all training examples containing the seed collocations. In this case, the words "life" and "manufacturing" are chosen as initial seed collocations for senses A and B respectively. The residual examples (85%–98% according to Yarowsky) remain untagged.

	The algorithm should initially choose seed collocations representative that will distinguish sense A and B accurately and productively. This can be done by selecting seed words from a [[dictionary]]’s entry for that sense. The collocations tend to have stronger effect if they are adjacent to the target word, the effect weakens with distance. According to the criteria given in Yarowsky (1993), seed words that appear in the most reliable collocational relationships with the target word will be selected. The effect is much stronger for words in a [[predicate]]-argument relationship than for arbitrary associations at the same distance to the target word, and is much stronger for collocations with content words than with function words. Having said this, a collocation word can have several collocational relationships with the target word throughout the corpus. This could give the word different rankings or even different classifications. Alternatively, it can be done by identifying a single defining collocate for each class, and using for seeds only those contexts containing one of these defining words. A publicly available database [[WordNet]] can be used as an automatic source for such defining terms. In addition, words that occur near the target word in great frequency can be selected as seed collocations representative. This approach is not fully automatic, a human judge must decide which word will be selected for each target word’s sense, the outputs will be reliable indicators of the senses.		The algorithm should initially choose seed collocations representative that will distinguish sense A and B accurately and productively. This can be done by selecting seed words from a [[dictionary]]’s entry for that sense. The collocations tend to have stronger effect if they are adjacent to the target word, the effect weakens with distance. 
According to the criteria given in Yarowsky (1993), seed words that appear in the most reliable collocational relationships with the target word will be selected. The effect is much stronger for words in a [[predicate]]-argument relationship than for arbitrary associations at the same distance to the target word, and is much stronger for collocations with content words than with function words. Having said this, a collocation word can have several collocational relationships with the target word throughout the corpus. This could give the word different rankings or even different classifications. Alternatively, it can be done by identifying a single defining collocate for each class, and using for seeds only those contexts containing one of these defining words. A publicly available database [[WordNet]] can be used as an automatic source for such defining terms. In addition, words that occur near the target word in great frequency can be selected as seed collocations representative. This approach is not fully automatic, a human judge must decide which word will be selected for each target word’s sense, the outputs will be reliable indicators of the senses.
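
As a rough illustration of the seed-tagging step described in this hunk, the following Python sketch labels every example that contains a seed collocation and leaves the rest as the residual. The seed words "life" and "manufacturing" come from the text above; the function name, the whitespace tokenization, and the choice to leave examples matching both senses untagged are assumptions.

```python
def tag_with_seeds(sentences, seeds):
    """Split sentences into seed-tagged examples and an untagged residual.

    seeds: {seed_word: sense_label}, e.g. {"life": "A", "manufacturing": "B"}.
    """
    tagged, residual = [], []
    for s in sentences:
        tokens = set(s.lower().split())  # naive tokenization (assumption)
        senses = {sense for word, sense in seeds.items() if word in tokens}
        if len(senses) == 1:
            tagged.append((s, senses.pop()))
        else:
            # No seed present, or conflicting seeds: leave for later iterations.
            residual.append(s)
    return tagged, residual

sentences = [
    "plant life thrives in the valley",
    "the manufacturing plant closed",
    "the plant was large",
]
tagged, residual = tag_with_seeds(sentences, {"life": "A", "manufacturing": "B"})
```

With these three sentences, the first two are tagged A and B respectively and the third remains in the residual.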