A '''cache language model''' is a type of statistical [[language model]]. Such models occur in the [[natural language processing]] subfield of [[computer science]] and assign [[probability|probabilities]] to given sequences of words by means of a [[probability distribution]]. Statistical language models are key components of [[speech recognition]] systems and of many [[machine translation]] systems: they tell such systems which possible output word sequences are probable and which are improbable. The particular characteristic of a cache language model is that it contains a [[Cache (computing)|cache component]] and assigns relatively high probabilities to words or word sequences that occur elsewhere in a given text. The primary, but by no means sole, use of cache language models is in speech recognition systems.{{Citation needed|date=September 2011}}

To understand why it is a good idea for a statistical language model to contain a cache component, one might consider someone who is dictating a letter about elephants to a speech recognition system. Standard (non-cache) [[N-gram]] language models will assign a very low probability to the word "elephant" because it is a very rare word in [[English language|English]]. If the speech recognition system does not contain a cache component, the person dictating the letter may be annoyed: each time the word "elephant" is spoken, another sequence of words with a higher probability according to the N-gram language model may be recognized instead (e.g., "tell a plan"). These erroneous sequences have to be deleted manually and replaced in the text by "elephant" each time "elephant" is spoken. If the system has a cache language model, "elephant" will still probably be misrecognized the first time it is spoken and will have to be entered into the text manually; however, from this point on the system is aware that "elephant" is likely to occur again – the estimated probability of occurrence of "elephant" has been increased, making it more likely that, if it is spoken, it will be recognized correctly. Once "elephant" has occurred several times, the system is likely to recognize it correctly every time it is spoken until the letter has been completely dictated. This increase in the probability assigned to the occurrence of "elephant" is an example of a consequence of [[machine learning]] and more specifically of [[pattern recognition]].

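The following is a minimal sketch of this mechanism, assuming the simplest possible formulation: a fixed background distribution interpolated with a unigram cache of recently observed words. The class name, cache size and interpolation weight are illustrative choices rather than those of any system cited in this article, and a practical recognizer would interpolate with a conditional N-gram probability and typically decay cache counts over time.

<syntaxhighlight lang="python">
from collections import Counter, deque

class CacheUnigramLM:
    """Toy cache language model: a fixed background distribution interpolated
    with a unigram cache of the most recently observed words."""

    def __init__(self, background, cache_size=200, lam=0.9):
        self.background = background           # word -> P_background(word)
        self.lam = lam                          # weight on the background model
        self.cache = deque(maxlen=cache_size)   # most recently observed words
        self.counts = Counter()

    def observe(self, word):
        """Add a recognized word to the cache, evicting the oldest if full."""
        if len(self.cache) == self.cache.maxlen:
            self.counts[self.cache[0]] -= 1     # this word is about to be evicted
        self.cache.append(word)
        self.counts[word] += 1

    def prob(self, word):
        """P(word) = lam * P_background(word) + (1 - lam) * P_cache(word)."""
        p_bg = self.background.get(word, 1e-8)
        p_cache = self.counts[word] / len(self.cache) if self.cache else 0.0
        return self.lam * p_bg + (1.0 - self.lam) * p_cache

# "elephant" is rare in the background model, so it starts out very unlikely...
lm = CacheUnigramLM({"elephant": 1e-6, "tell": 0.01, "a": 0.06, "plan": 0.001})
print(lm.prob("elephant"))                      # about 1e-6
# ...but after it has been dictated (and corrected) a couple of times,
# the cache component raises its probability by several orders of magnitude.
for w in ["the", "elephant", "ate", "the", "elephant", "grass"]:
    lm.observe(w)
print(lm.prob("elephant"))                      # about 0.03
</syntaxhighlight>
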
There exist variants of the cache language model in which not only single words but also multi-word sequences that have occurred previously are assigned higher probabilities (e.g., if "San Francisco" occurred near the beginning of the text, subsequent instances of it would be assigned a higher probability).{{Citation needed|date=September 2011}}

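Such a variant could, for example, maintain a second cache over recently seen word pairs. The sketch below is again only illustrative (hypothetical class and parameter names, no count decay) and is not the formulation of any specific published system.

<syntaxhighlight lang="python">
from collections import Counter, deque

class BigramCache:
    """Toy multi-word cache: tracks recently seen word pairs such as
    ("San", "Francisco") and scores how likely a word is to follow the
    previous one, given the text produced so far."""

    def __init__(self, cache_size=200):
        self.pairs = deque(maxlen=cache_size)
        self.counts = Counter()
        self.prev = None

    def observe(self, word):
        if self.prev is not None:
            if len(self.pairs) == self.pairs.maxlen:
                self.counts[self.pairs[0]] -= 1   # oldest pair is evicted
            self.pairs.append((self.prev, word))
            self.counts[(self.prev, word)] += 1
        self.prev = word

    def prob(self, word, prev):
        """P_cache(word | prev): fraction of recent occurrences of prev
        that were followed by word."""
        total = sum(c for (p, _), c in self.counts.items() if p == prev)
        return self.counts[(prev, word)] / total if total else 0.0
</syntaxhighlight>

As with the single-word cache, this conditional cache probability would be interpolated with the background N-gram probability rather than used on its own.
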
The cache language model was first proposed in a paper published in 1990,<ref>{{cite journal | last1=Kuhn | first1=R. | last2=De Mori | first2=R. | author-link2=Renato de Mori | title=A Cache-Based Natural Language Model for Speech Recognition | journal=[[IEEE Transactions on Pattern Analysis and Machine Intelligence]] | date=June 1990 | volume=12 | issue=6 | pages=570–583 | url=http://visgraph.cs.ust.hk/biometrics/Papers/Voice/pami1990-06-01.pdf | issn=1939-3539 | doi=10.1109/34.56193 | access-date=2011-09-24 | archive-url=https://web.archive.org/web/20110901154408/http://visgraph.cs.ust.hk/biometrics/Papers/Voice/pami1990-06-01.pdf | archive-date=2011-09-01 | url-status=dead }} ([https://www.computer.org/csdl/trans/tp/1990/06/i0570-abs.html Abstract])</ref> after which the [[IBM]] speech-recognition group experimented with the concept. The group found that implementing a form of cache language model yielded a 24% drop in [[Word error rate|word-error rates]] once the first few hundred words of a document had been dictated.<ref>{{cite journal | url=http://acl.ldc.upenn.edu/H/H91/H91-1057.pdf | title=A Dynamic Language Model for Speech Recognition | author=F. Jelinek |author2=B. Merialdo |author3=S. Roukos |author4=M. Strauss |name-list-style=amp | journal=The Journal of the Acoustical Society of America | year=1991 | volume=98 | issue=2 | pages=293–295 | doi=10.3115/112405.112464 | s2cid=11601499 | archive-url=https://web.archive.org/web/20060614121245/http://acl.ldc.upenn.edu/H/H91/H91-1057.pdf | archive-date=June 14, 2006 | url-status=dead}} Conference: Speech and Natural Language, Proceedings of a Workshop held at Pacific Grove, California, USA, February 19–22, 1991.</ref> A detailed survey of language modeling techniques concluded that the cache language model was one of the few new techniques that yielded improvements over the standard N-gram approach: "Our caching results show that caching is by far the most useful technique for perplexity reduction at small and medium [[Training set|training data]] sizes".<ref>{{cite book |author=Joshua T. Goodman |year=2001 |title=A Bit of Progress in Language Modeling: Extended Version |publisher=Microsoft Research |location=Redmond, WA (US) |id=Technical report MSR-TR-2001-72 |arxiv=cs/0108005v1 |bibcode=2001cs........8005G}}</ref>

The development of the cache language model has generated considerable interest among those concerned with [[computational linguistics]] in general and [[statistical natural language processing]] in particular: recently, there has been interest in applying the cache language model in the field of statistical machine translation.<ref>{{cite conference | url=http://www.aclweb.org/anthology/W/W10/W10-2602.pdf | title=Context Adaptation in Statistical Machine Translation Using Models with Exponentially Decaying Cache | author=Tiedemann, Jorg | conference=Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing, ACL 2010 | year=2010 | pages=8–15 | location=Uppsala, Sweden | publisher=Association for Computational Linguistics}}</ref>

The success of the cache language model in improving [[word prediction]] rests on the human tendency to use words in a "bursty" fashion: when one is discussing a certain topic in a certain context, the frequency with which one uses certain words will be quite different from their frequencies when one is discussing other topics in other contexts. The traditional N-gram language models, which rely entirely on information from a very small number (four, three, or two) of words preceding the word to which a probability is to be assigned, do not adequately model this "burstiness".{{Citation needed|date=September 2011}}

Recently, the cache language model concept – originally conceived for the N-gram statistical language model paradigm – has been adapted for use in the neural paradigm. For instance, recent work on continuous cache language models in the [[recurrent neural network]] (RNN) setting has applied the cache concept to much larger contexts than before, yielding significant reductions in perplexity.<ref>{{cite conference |url=https://dl.acm.org/citation.cfm?id=3295353 |title=Unbounded cache model for online language modeling with open vocabulary |author=Edouard Grave |author2=Moustapha Cisse |author3=Armand Joulin |book-title=NIPS'17 Proceedings of the 31st International Conference on Neural Information Processing Systems |year=2017 |pages=6044–6054 |location=Long Beach, California |publisher=Association for Computing Machinery |isbn=978-1-5108-6096-4}}</ref> Another recent line of research involves incorporating a cache component in a [[Feedforward neural network|feed-forward]] neural language model (FN-LM) to achieve rapid domain adaptation.<ref>{{cite conference |title=i-Vectors in Language Modeling: An Efficient Way of Domain Adaptation for Feed-Forward Models |author=Karel Benes |author2=Santosh Kesiraju |author3=Lukas Burget |s2cid=52192034 |conference=Interspeech 2018 |year=2018 |pages=3383–3387 |location=Hyderabad, India |publisher=Interspeech |doi=10.21437/Interspeech.2018-1070}}</ref>

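The sketch below illustrates the idea behind such a continuous cache, loosely following this line of work: past hidden states of the RNN act as keys, the words that followed them act as values, and a softmax-like score over the stored states yields a cache distribution that is interpolated with the ordinary RNN prediction. The function names, the flatness parameter ''theta'' and the interpolation weight are illustrative assumptions; the unbounded cache in the paper cited above additionally relies on approximate nearest-neighbour search so that very large numbers of past states can be stored.

<syntaxhighlight lang="python">
import numpy as np

def cache_distribution(h_t, past_states, past_words, vocab_size, theta=0.3):
    """Continuous-cache distribution over the vocabulary (illustrative sketch).

    h_t         -- current RNN hidden state, shape (d,)
    past_states -- previously stored hidden states, shape (t, d)
    past_words  -- id of the word that followed each stored state, length t
    """
    scores = np.exp(theta * past_states.dot(h_t))   # similarity to each stored state
    p = np.zeros(vocab_size)
    for word_id, score in zip(past_words, scores):
        p[word_id] += score                          # mass flows to the word that followed
    total = p.sum()
    return p / total if total > 0.0 else p

def predict(p_rnn, p_cache, lam=0.9):
    """Final next-word distribution: interpolate the RNN softmax with the cache."""
    return lam * p_rnn + (1.0 - lam) * p_cache
</syntaxhighlight>
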
The primary, but by no means sole, use of cache language models is in speech recognition systems.{{Citation needed|date=September 2011}}</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> </table> DarklitShadow https://en.wikipedia.org/w/index.php?title=Cache_language_model&diff=994022028&oldid=prev Monkbot: Task 18 (cosmetic): eval 7 templates: del empty params (1×); hyphenate params (4×); 2020-12-13T19:02:59Z <p><a href="/wiki/User:Monkbot/task_18" class="mw-redirect" title="User:Monkbot/task 18">Task 18 (cosmetic)</a>: eval 7 templates: del empty params (1×); hyphenate params (4×);</p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Previous revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 19:02, 13 December 2020</td> </tr><tr> <td colspan="2" class="diff-lineno">Line 5:</td> <td colspan="2" class="diff-lineno">Line 5:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>There exist variants of the cache language model in which not only single words but also multi-word sequences that have occurred previously are assigned higher probabilities (e.g., if "San Francisco" occurred near the beginning of the text subsequent instances of it would be assigned a higher probability).</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>There exist variants of the cache language model in which not only single words but also multi-word sequences that have occurred previously are assigned higher probabilities (e.g., if "San Francisco" occurred near the beginning of the text subsequent instances of it would be assigned a higher probability).</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: 
pre-wrap;"><div>The cache language model was first proposed in a paper published in 1990,&lt;ref&gt;{{cite journal | last1=Kuhn | first1=R. | last2=De Mori | first2=R. | <del style="font-weight: bold; text-decoration: none;">authorlink1= | authorlink2</del>=Renato de Mori | title=A Cache-Based Natural Language Model for Speech Recognition | journal=[[IEEE Transactions on Pattern Analysis and Machine Intelligence]] | date=June 1990 | volume=12 | issue=6 | pages=570–583 | url=http://visgraph.cs.ust.hk/biometrics/Papers/Voice/pami1990-06-01.pdf | issn=1939-3539 | doi=10.1109/34.56193 | access-date=2011-09-24 | archive-url=https://web.archive.org/web/20110901154408/http://visgraph.cs.ust.hk/biometrics/Papers/Voice/pami1990-06-01.pdf | archive-date=2011-09-01 | url-status=dead }} ([https://www.computer.org/csdl/trans/tp/1990/06/i0570-abs.html Abstract])&lt;/ref&gt; after which the [[IBM]] speech-recognition group experimented with the concept. The group found that implementation of a form of cache language model yielded a 24% drop in [[Word error rate|word-error rates]] once the first few hundred words of a document had been dictated.&lt;ref&gt;{{cite journal | url=http://acl.ldc.upenn.edu/H/H91/H91-1057.pdf | title=A Dynamic Language Model for Speech Recognition | author=F. Jelinek|author2=B. Merialdo|author3=S. Roukos|author4=M. Strauss|name-list-style=amp | journal=The Journal of the Acoustical Society of America | year=1991 | volume=98 | issue=2 | pages=293–295 | doi=10.3115/112405.112464 | s2cid=11601499 | <del style="font-weight: bold; text-decoration: none;">archiveurl</del>=https://web.archive.org/web/20060614121245/http://acl.ldc.upenn.edu/H/H91/H91-1057.pdf | <del style="font-weight: bold; text-decoration: none;">archivedate</del>=June 14, 2006 | url-status=dead}} Conference: Speech and Natural Language, Proceedings of a Workshop held at Pacific Grove, California, USA, February 19–22, 1999.&lt;/ref&gt; A detailed survey of language modeling techniques concluded that the cache language model was one of the few new language modeling techniques that yielded improvements over the standard N-gram approach: "Our caching results show that caching is by far the most useful technique for perplexity reduction at small and medium [[Training set|training data]] sizes".&lt;ref&gt;{{cite book|author=Joshua T. Goodman|year=2001|title=A Bit of Progress in Language Modeling: Extended Version|publisher=Microsoft Research|location=Redmond, WA (US)|id=Technical report MSR-TR-2001-72|arxiv=cs/0108005v1 |bibcode=2001cs........8005G}}&lt;/ref&gt;</div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The cache language model was first proposed in a paper published in 1990,&lt;ref&gt;{{cite journal | last1=Kuhn | first1=R. | last2=De Mori | first2=R. 
| <ins style="font-weight: bold; text-decoration: none;">author-link2</ins>=Renato de Mori | title=A Cache-Based Natural Language Model for Speech Recognition | journal=[[IEEE Transactions on Pattern Analysis and Machine Intelligence]] | date=June 1990 | volume=12 | issue=6 | pages=570–583 | url=http://visgraph.cs.ust.hk/biometrics/Papers/Voice/pami1990-06-01.pdf | issn=1939-3539 | doi=10.1109/34.56193 | access-date=2011-09-24 | archive-url=https://web.archive.org/web/20110901154408/http://visgraph.cs.ust.hk/biometrics/Papers/Voice/pami1990-06-01.pdf | archive-date=2011-09-01 | url-status=dead }} ([https://www.computer.org/csdl/trans/tp/1990/06/i0570-abs.html Abstract])&lt;/ref&gt; after which the [[IBM]] speech-recognition group experimented with the concept. The group found that implementation of a form of cache language model yielded a 24% drop in [[Word error rate|word-error rates]] once the first few hundred words of a document had been dictated.&lt;ref&gt;{{cite journal | url=http://acl.ldc.upenn.edu/H/H91/H91-1057.pdf | title=A Dynamic Language Model for Speech Recognition | author=F. Jelinek|author2=B. Merialdo|author3=S. Roukos|author4=M. Strauss|name-list-style=amp | journal=The Journal of the Acoustical Society of America | year=1991 | volume=98 | issue=2 | pages=293–295 | doi=10.3115/112405.112464 | s2cid=11601499 | <ins style="font-weight: bold; text-decoration: none;">archive-url</ins>=https://web.archive.org/web/20060614121245/http://acl.ldc.upenn.edu/H/H91/H91-1057.pdf | <ins style="font-weight: bold; text-decoration: none;">archive-date</ins>=June 14, 2006 | url-status=dead}} Conference: Speech and Natural Language, Proceedings of a Workshop held at Pacific Grove, California, USA, February 19–22, 1999.&lt;/ref&gt; A detailed survey of language modeling techniques concluded that the cache language model was one of the few new language modeling techniques that yielded improvements over the standard N-gram approach: "Our caching results show that caching is by far the most useful technique for perplexity reduction at small and medium [[Training set|training data]] sizes".&lt;ref&gt;{{cite book|author=Joshua T. 
Goodman|year=2001|title=A Bit of Progress in Language Modeling: Extended Version|publisher=Microsoft Research|location=Redmond, WA (US)|id=Technical report MSR-TR-2001-72|arxiv=cs/0108005v1 |bibcode=2001cs........8005G}}&lt;/ref&gt;</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The development of the cache language model has generated considerable interest among those concerned with [[computational linguistics]] in general and [[statistical natural language processing]] in particular: recently there has been interest in applying the cache language model in the field of statistical machine translation.&lt;ref&gt;{{cite conference | url=http://www.aclweb.org/anthology/W/W10/W10-2602.pdf | title=Context Adaptation in Statistical Machine Translation Using Models with Exponentially Decaying Cache | author=Tiedemann, Jorg | conference=Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing, ACL 2010 | year=2010 | pages=8–15 | location=Uppsala, Sweden | publisher=Association for Computational Linguistics}}&lt;/ref&gt;</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The development of the cache language model has generated considerable interest among those concerned with [[computational linguistics]] in general and [[statistical natural language processing]] in particular: recently there has been interest in applying the cache language model in the field of statistical machine translation.&lt;ref&gt;{{cite conference | url=http://www.aclweb.org/anthology/W/W10/W10-2602.pdf | title=Context Adaptation in Statistical Machine Translation Using Models with Exponentially Decaying Cache | author=Tiedemann, Jorg | conference=Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing, ACL 2010 | year=2010 | pages=8–15 | location=Uppsala, Sweden | publisher=Association for Computational Linguistics}}&lt;/ref&gt;</div></td> </tr> <tr> <td colspan="2" class="diff-lineno">Line 26:</td> <td colspan="2" class="diff-lineno">Line 26:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: 
#f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Further reading ==</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Further reading ==</div></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>*{{cite book | last=Jelinek | first=Frederick | <del style="font-weight: bold; text-decoration: none;">authorlink</del>=Frederick Jelinek | title=Statistical Methods for Speech Recognition | publisher=[[The MIT Press]] | url=http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&amp;tid=7447 | year=1997 | isbn=0-262-10066-5 | access-date=2011-09-24 | archive-url=https://web.archive.org/web/20110805015427/http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&amp;tid=7447 | archive-date=2011-08-05 | url-status=dead }}</div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>*{{cite book | last=Jelinek | first=Frederick | <ins style="font-weight: bold; text-decoration: none;">author-link</ins>=Frederick Jelinek | title=Statistical Methods for Speech Recognition | publisher=[[The MIT Press]] | url=http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&amp;tid=7447 | year=1997 | isbn=0-262-10066-5 | access-date=2011-09-24 | archive-url=https://web.archive.org/web/20110805015427/http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&amp;tid=7447 | archive-date=2011-08-05 | url-status=dead }}</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>[[Category:Language modeling]]</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>[[Category:Language modeling]]</div></td> </tr> </table> Monkbot https://en.wikipedia.org/w/index.php?title=Cache_language_model&diff=984443672&oldid=prev Monkbot: /* top */Task 17 (BRFA trial): replace deprecated: |last-author-amp= (1× replaced; usage: 1 of 4); 2020-10-20T03:36:25Z <p><span class="autocomment">top: </span><a href="/wiki/User:Monkbot/task_17:_remove_replace_deprecated_last-author-amp_params" 
title="User:Monkbot/task 17: remove replace deprecated last-author-amp params">Task 17</a> (<a href="/wiki/Wikipedia:Bots/Requests_for_approval/Monkbot_17" title="Wikipedia:Bots/Requests for approval/Monkbot 17">BRFA</a> trial): replace deprecated: |last-author-amp= (1× replaced; usage: 1 of 4);</p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Previous revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 03:36, 20 October 2020</td> </tr><tr> <td colspan="2" class="diff-lineno">Line 5:</td> <td colspan="2" class="diff-lineno">Line 5:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>There exist variants of the cache language model in which not only single words but also multi-word sequences that have occurred previously are assigned higher probabilities (e.g., if "San Francisco" occurred near the beginning of the text subsequent instances of it would be assigned a higher probability).</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>There exist variants of the cache language model in which not only single words but also multi-word sequences that have occurred previously are assigned higher probabilities (e.g., if "San Francisco" occurred near the beginning of the text subsequent instances of it would be assigned a higher probability).</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The cache language model was first proposed in a paper published in 1990,&lt;ref&gt;{{cite journal | last1=Kuhn | first1=R. | last2=De Mori | first2=R. 
| authorlink1= | authorlink2=Renato de Mori | title=A Cache-Based Natural Language Model for Speech Recognition | journal=[[IEEE Transactions on Pattern Analysis and Machine Intelligence]] | date=June 1990 | volume=12 | issue=6 | pages=570–583 | url=http://visgraph.cs.ust.hk/biometrics/Papers/Voice/pami1990-06-01.pdf | issn=1939-3539 | doi=10.1109/34.56193 | access-date=2011-09-24 | archive-url=https://web.archive.org/web/20110901154408/http://visgraph.cs.ust.hk/biometrics/Papers/Voice/pami1990-06-01.pdf | archive-date=2011-09-01 | url-status=dead }} ([https://www.computer.org/csdl/trans/tp/1990/06/i0570-abs.html Abstract])&lt;/ref&gt; after which the [[IBM]] speech-recognition group experimented with the concept. The group found that implementation of a form of cache language model yielded a 24% drop in [[Word error rate|word-error rates]] once the first few hundred words of a document had been dictated.&lt;ref&gt;{{cite journal | url=http://acl.ldc.upenn.edu/H/H91/H91-1057.pdf | title=A Dynamic Language Model for Speech Recognition | author=F. Jelinek|author2=B. Merialdo|author3=S. Roukos|author4=M. Strauss|<del style="font-weight: bold; text-decoration: none;">lastauthoramp</del>=<del style="font-weight: bold; text-decoration: none;">y</del> | journal=The Journal of the Acoustical Society of America | year=1991 | volume=98 | issue=2 | pages=293–295 | doi=10.3115/112405.112464 | s2cid=11601499 | archiveurl=https://web.archive.org/web/20060614121245/http://acl.ldc.upenn.edu/H/H91/H91-1057.pdf | archivedate=June 14, 2006 | url-status=dead}} Conference: Speech and Natural Language, Proceedings of a Workshop held at Pacific Grove, California, USA, February 19–22, 1999.&lt;/ref&gt; A detailed survey of language modeling techniques concluded that the cache language model was one of the few new language modeling techniques that yielded improvements over the standard N-gram approach: "Our caching results show that caching is by far the most useful technique for perplexity reduction at small and medium [[Training set|training data]] sizes".&lt;ref&gt;{{cite book|author=Joshua T. Goodman|year=2001|title=A Bit of Progress in Language Modeling: Extended Version|publisher=Microsoft Research|location=Redmond, WA (US)|id=Technical report MSR-TR-2001-72|arxiv=cs/0108005v1 |bibcode=2001cs........8005G}}&lt;/ref&gt;</div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The cache language model was first proposed in a paper published in 1990,&lt;ref&gt;{{cite journal | last1=Kuhn | first1=R. | last2=De Mori | first2=R. | authorlink1= | authorlink2=Renato de Mori | title=A Cache-Based Natural Language Model for Speech Recognition | journal=[[IEEE Transactions on Pattern Analysis and Machine Intelligence]] | date=June 1990 | volume=12 | issue=6 | pages=570–583 | url=http://visgraph.cs.ust.hk/biometrics/Papers/Voice/pami1990-06-01.pdf | issn=1939-3539 | doi=10.1109/34.56193 | access-date=2011-09-24 | archive-url=https://web.archive.org/web/20110901154408/http://visgraph.cs.ust.hk/biometrics/Papers/Voice/pami1990-06-01.pdf | archive-date=2011-09-01 | url-status=dead }} ([https://www.computer.org/csdl/trans/tp/1990/06/i0570-abs.html Abstract])&lt;/ref&gt; after which the [[IBM]] speech-recognition group experimented with the concept. 
The group found that implementation of a form of cache language model yielded a 24% drop in [[Word error rate|word-error rates]] once the first few hundred words of a document had been dictated.&lt;ref&gt;{{cite journal | url=http://acl.ldc.upenn.edu/H/H91/H91-1057.pdf | title=A Dynamic Language Model for Speech Recognition | author=F. Jelinek|author2=B. Merialdo|author3=S. Roukos|author4=M. Strauss|<ins style="font-weight: bold; text-decoration: none;">name-list-style</ins>=<ins style="font-weight: bold; text-decoration: none;">amp</ins> | journal=The Journal of the Acoustical Society of America | year=1991 | volume=98 | issue=2 | pages=293–295 | doi=10.3115/112405.112464 | s2cid=11601499 | archiveurl=https://web.archive.org/web/20060614121245/http://acl.ldc.upenn.edu/H/H91/H91-1057.pdf | archivedate=June 14, 2006 | url-status=dead}} Conference: Speech and Natural Language, Proceedings of a Workshop held at Pacific Grove, California, USA, February 19–22, 1999.&lt;/ref&gt; A detailed survey of language modeling techniques concluded that the cache language model was one of the few new language modeling techniques that yielded improvements over the standard N-gram approach: "Our caching results show that caching is by far the most useful technique for perplexity reduction at small and medium [[Training set|training data]] sizes".&lt;ref&gt;{{cite book|author=Joshua T. Goodman|year=2001|title=A Bit of Progress in Language Modeling: Extended Version|publisher=Microsoft Research|location=Redmond, WA (US)|id=Technical report MSR-TR-2001-72|arxiv=cs/0108005v1 |bibcode=2001cs........8005G}}&lt;/ref&gt;</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The development of the cache language model has generated considerable interest among those concerned with [[computational linguistics]] in general and [[statistical natural language processing]] in particular: recently there has been interest in applying the cache language model in the field of statistical machine translation.&lt;ref&gt;{{cite conference | url=http://www.aclweb.org/anthology/W/W10/W10-2602.pdf | title=Context Adaptation in Statistical Machine Translation Using Models with Exponentially Decaying Cache | author=Tiedemann, Jorg | conference=Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing, ACL 2010 | year=2010 | pages=8–15 | location=Uppsala, Sweden | publisher=Association for Computational Linguistics}}&lt;/ref&gt;</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The development of the cache language model has generated considerable interest among those concerned 
with [[computational linguistics]] in general and [[statistical natural language processing]] in particular: recently there has been interest in applying the cache language model in the field of statistical machine translation.&lt;ref&gt;{{cite conference | url=http://www.aclweb.org/anthology/W/W10/W10-2602.pdf | title=Context Adaptation in Statistical Machine Translation Using Models with Exponentially Decaying Cache | author=Tiedemann, Jorg | conference=Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing, ACL 2010 | year=2010 | pages=8–15 | location=Uppsala, Sweden | publisher=Association for Computational Linguistics}}&lt;/ref&gt;</div></td> </tr> </table> Monkbot https://en.wikipedia.org/w/index.php?title=Cache_language_model&diff=981775146&oldid=prev WikiCleanerBot: v2.03b - Bot T20 CW#61 - WP:WCW project (Reference before punctuation) 2020-10-04T10:57:33Z <p>v2.03b - <a href="/wiki/User:WikiCleanerBot#T20" title="User:WikiCleanerBot">Bot T20 CW#61</a> - <a href="/wiki/Wikipedia:WCW" class="mw-redirect" title="Wikipedia:WCW">WP:WCW</a> project (Reference before punctuation)</p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Previous revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 10:57, 4 October 2020</td> </tr><tr> <td colspan="2" class="diff-lineno">Line 12:</td> <td colspan="2" class="diff-lineno">Line 12:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Recently, the cache language model concept - originally conceived for the N-gram statistical language model paradigm - has been adapted for use in the neural paradigm. For instance, recent work on continuous cache language models in the [[recurrent neural network]] (RNN) setting has applied the cache concept to much larger contexts than before, yielding significant reductions in perplexity</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Recently, the cache language model concept - originally conceived for the N-gram statistical language model paradigm - has been adapted for use in the neural paradigm. 
For instance, recent work on continuous cache language models in the [[recurrent neural network]] (RNN) setting has applied the cache concept to much larger contexts than before, yielding significant reductions in perplexity</div></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>&lt;ref&gt;{{cite conference |url=https://dl.acm.org/citation.cfm?id=3295353 |title=Unbounded cache model for online language modeling with open vocabulary | author=Edouard Grave |author2= Moustapha Cisse |author3=Armand Joulin |book-title=NIPS'17 Proceedings of the 31st International Conference on Neural Information Processing Systems |year=2017 |pages=6044–6054 |location=Long Beach, California |publisher=Association for Computing Machinery |isbn=978-1-5108-6096-4}}&lt;/ref&gt;<del style="font-weight: bold; text-decoration: none;">.</del> Another recent line of research involves incorporating a cache component in a [[Feedforward neural network|feed-forward]] neural language model (FN-LM) to achieve rapid domain adaptation </div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">.</ins>&lt;ref&gt;{{cite conference |url=https://dl.acm.org/citation.cfm?id=3295353 |title=Unbounded cache model for online language modeling with open vocabulary | author=Edouard Grave |author2= Moustapha Cisse |author3=Armand Joulin |book-title=NIPS'17 Proceedings of the 31st International Conference on Neural Information Processing Systems |year=2017 |pages=6044–6054 |location=Long Beach, California |publisher=Association for Computing Machinery |isbn=978-1-5108-6096-4}}&lt;/ref&gt; Another recent line of research involves incorporating a cache component in a [[Feedforward neural network|feed-forward]] neural language model (FN-LM) to achieve rapid domain adaptation </div></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>&lt;ref&gt;{{cite conference | title=i-Vectors in Language Modeling: An Efficient Way of Domain Adaptation for Feed-Forward Models | author=Karel Benes|author2=Santosh Kesiraju |author3=Lukas Burget | s2cid=52192034 | conference=Interspeech 2018 | year=2018 | pages=3383–3387 | location=Hyderabad, India | publisher=Interspeech| doi=10.21437/Interspeech.2018-1070 }}&lt;/ref&gt;<del style="font-weight: bold; text-decoration: none;">.</del></div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">.</ins>&lt;ref&gt;{{cite conference | title=i-Vectors in Language Modeling: An Efficient Way of Domain Adaptation for Feed-Forward Models | author=Karel Benes|author2=Santosh Kesiraju |author3=Lukas Burget | s2cid=52192034 | conference=Interspeech 2018 | year=2018 | pages=3383–3387 | location=Hyderabad, India | publisher=Interspeech| 
doi=10.21437/Interspeech.2018-1070 }}&lt;/ref&gt;</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>==See also==</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>==See also==</div></td> </tr> </table> WikiCleanerBot https://en.wikipedia.org/w/index.php?title=Cache_language_model&diff=975866970&oldid=prev Citation bot: Add: s2cid. | You can use this bot yourself. Report bugs here. | Suggested by Amigao | Category:Computational linguistics | via #UCB_Category 2020-08-30T21:04:54Z <p>Add: s2cid. | You can <a href="/wiki/Wikipedia:UCB" class="mw-redirect" title="Wikipedia:UCB">use this bot</a> yourself. <a href="/wiki/Wikipedia:DBUG" class="mw-redirect" title="Wikipedia:DBUG">Report bugs here</a>. | Suggested by Amigao | <a href="/wiki/Category:Computational_linguistics" title="Category:Computational linguistics">Category:Computational linguistics</a> | via #UCB_Category</p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Previous revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 21:04, 30 August 2020</td> </tr><tr> <td colspan="2" class="diff-lineno">Line 5:</td> <td colspan="2" class="diff-lineno">Line 5:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>There exist variants of the cache language model in which not only single words but also multi-word sequences that have occurred previously are assigned higher probabilities (e.g., if "San Francisco" occurred near the beginning of the text subsequent instances of it would be assigned a higher probability).</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>There exist variants of the cache language model in which not only single words but also multi-word sequences that have occurred previously are assigned higher probabilities (e.g., if "San Francisco" occurred near the beginning of the text subsequent instances of it would be assigned a higher probability).</div></td> </tr> 
<tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The cache language model was first proposed in a paper published in 1990,&lt;ref&gt;{{cite journal | last1=Kuhn | first1=R. | last2=De Mori | first2=R. | authorlink1= | authorlink2=Renato de Mori | title=A Cache-Based Natural Language Model for Speech Recognition | journal=[[IEEE Transactions on Pattern Analysis and Machine Intelligence]] | date=June 1990 | volume=12 | issue=6 | pages=570–583 | url=http://visgraph.cs.ust.hk/biometrics/Papers/Voice/pami1990-06-01.pdf | issn=1939-3539 | doi=10.1109/34.56193 | access-date=2011-09-24 | archive-url=https://web.archive.org/web/20110901154408/http://visgraph.cs.ust.hk/biometrics/Papers/Voice/pami1990-06-01.pdf | archive-date=2011-09-01 | url-status=dead }} ([https://www.computer.org/csdl/trans/tp/1990/06/i0570-abs.html Abstract])&lt;/ref&gt; after which the [[IBM]] speech-recognition group experimented with the concept. The group found that implementation of a form of cache language model yielded a 24% drop in [[Word error rate|word-error rates]] once the first few hundred words of a document had been dictated.&lt;ref&gt;{{cite journal | url=http://acl.ldc.upenn.edu/H/H91/H91-1057.pdf | title=A Dynamic Language Model for Speech Recognition | author=F. Jelinek|author2=B. Merialdo|author3=S. Roukos|author4=M. Strauss|lastauthoramp=y | journal=The Journal of the Acoustical Society of America | year=1991 | volume=98 | issue=2 | pages=293–295 | doi=10.3115/112405.112464 | archiveurl=https://web.archive.org/web/20060614121245/http://acl.ldc.upenn.edu/H/H91/H91-1057.pdf | archivedate=June 14, 2006 | url-status=dead}} Conference: Speech and Natural Language, Proceedings of a Workshop held at Pacific Grove, California, USA, February 19–22, 1999.&lt;/ref&gt; A detailed survey of language modeling techniques concluded that the cache language model was one of the few new language modeling techniques that yielded improvements over the standard N-gram approach: "Our caching results show that caching is by far the most useful technique for perplexity reduction at small and medium [[Training set|training data]] sizes".&lt;ref&gt;{{cite book|author=Joshua T. Goodman|year=2001|title=A Bit of Progress in Language Modeling: Extended Version|publisher=Microsoft Research|location=Redmond, WA (US)|id=Technical report MSR-TR-2001-72|arxiv=cs/0108005v1 |bibcode=2001cs........8005G}}&lt;/ref&gt;</div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The cache language model was first proposed in a paper published in 1990,&lt;ref&gt;{{cite journal | last1=Kuhn | first1=R. | last2=De Mori | first2=R. 
| authorlink1= | authorlink2=Renato de Mori | title=A Cache-Based Natural Language Model for Speech Recognition | journal=[[IEEE Transactions on Pattern Analysis and Machine Intelligence]] | date=June 1990 | volume=12 | issue=6 | pages=570–583 | url=http://visgraph.cs.ust.hk/biometrics/Papers/Voice/pami1990-06-01.pdf | issn=1939-3539 | doi=10.1109/34.56193 | access-date=2011-09-24 | archive-url=https://web.archive.org/web/20110901154408/http://visgraph.cs.ust.hk/biometrics/Papers/Voice/pami1990-06-01.pdf | archive-date=2011-09-01 | url-status=dead }} ([https://www.computer.org/csdl/trans/tp/1990/06/i0570-abs.html Abstract])&lt;/ref&gt; after which the [[IBM]] speech-recognition group experimented with the concept. The group found that implementation of a form of cache language model yielded a 24% drop in [[Word error rate|word-error rates]] once the first few hundred words of a document had been dictated.&lt;ref&gt;{{cite journal | url=http://acl.ldc.upenn.edu/H/H91/H91-1057.pdf | title=A Dynamic Language Model for Speech Recognition | author=F. Jelinek|author2=B. Merialdo|author3=S. Roukos|author4=M. Strauss|lastauthoramp=y | journal=The Journal of the Acoustical Society of America | year=1991 | volume=98 | issue=2 | pages=293–295 | doi=10.3115/112405.112464<ins style="font-weight: bold; text-decoration: none;"> | s2cid=11601499</ins> | archiveurl=https://web.archive.org/web/20060614121245/http://acl.ldc.upenn.edu/H/H91/H91-1057.pdf | archivedate=June 14, 2006 | url-status=dead}} Conference: Speech and Natural Language, Proceedings of a Workshop held at Pacific Grove, California, USA, February 19–22, 1999.&lt;/ref&gt; A detailed survey of language modeling techniques concluded that the cache language model was one of the few new language modeling techniques that yielded improvements over the standard N-gram approach: "Our caching results show that caching is by far the most useful technique for perplexity reduction at small and medium [[Training set|training data]] sizes".&lt;ref&gt;{{cite book|author=Joshua T. 
Goodman|year=2001|title=A Bit of Progress in Language Modeling: Extended Version|publisher=Microsoft Research|location=Redmond, WA (US)|id=Technical report MSR-TR-2001-72|arxiv=cs/0108005v1 |bibcode=2001cs........8005G}}&lt;/ref&gt;</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The development of the cache language model has generated considerable interest among those concerned with [[computational linguistics]] in general and [[statistical natural language processing]] in particular: recently there has been interest in applying the cache language model in the field of statistical machine translation.&lt;ref&gt;{{cite conference | url=http://www.aclweb.org/anthology/W/W10/W10-2602.pdf | title=Context Adaptation in Statistical Machine Translation Using Models with Exponentially Decaying Cache | author=Tiedemann, Jorg | conference=Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing, ACL 2010 | year=2010 | pages=8–15 | location=Uppsala, Sweden | publisher=Association for Computational Linguistics}}&lt;/ref&gt;</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The development of the cache language model has generated considerable interest among those concerned with [[computational linguistics]] in general and [[statistical natural language processing]] in particular: recently there has been interest in applying the cache language model in the field of statistical machine translation.&lt;ref&gt;{{cite conference | url=http://www.aclweb.org/anthology/W/W10/W10-2602.pdf | title=Context Adaptation in Statistical Machine Translation Using Models with Exponentially Decaying Cache | author=Tiedemann, Jorg | conference=Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing, ACL 2010 | year=2010 | pages=8–15 | location=Uppsala, Sweden | publisher=Association for Computational Linguistics}}&lt;/ref&gt;</div></td> </tr> </table> Citation bot
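The article text diffed in the entries above describes the core mechanism of a cache language model: a static N-gram estimate is interpolated with a cache of recently observed words so that "bursty" repetitions (e.g. "San Francisco" recurring within a document) receive higher probability. As a purely illustrative aid, here is a minimal Python sketch of that interpolation; the bigram back-off, add-one smoothing, cache size, and interpolation weight are assumptions made for the sketch, not the formulation of the 1990 Kuhn and De Mori paper or of any other work cited in the diffs.

from collections import Counter, deque

class CacheBigramLM:
    """Toy bigram model interpolated with a unigram cache of recent words.
    Illustrative sketch only; hyper-parameters are arbitrary."""

    def __init__(self, corpus, vocab, cache_size=200, lam=0.8):
        self.vocab = set(vocab)
        self.lam = lam                          # weight on the static bigram estimate
        self.cache = deque(maxlen=cache_size)   # recently dictated/decoded words
        self.bigrams = Counter(zip(corpus, corpus[1:]))
        self.unigrams = Counter(corpus)

    def p_bigram(self, prev, word):
        # add-one smoothed static bigram probability
        return (self.bigrams[(prev, word)] + 1) / (self.unigrams[prev] + len(self.vocab))

    def p_cache(self, word):
        # relative frequency of the word in the recent-history cache;
        # variants instead weight each cached occurrence by recency
        # (e.g. an exponentially decaying weight), as in the cache models
        # cited above for statistical machine translation
        return self.cache.count(word) / len(self.cache) if self.cache else 0.0

    def prob(self, prev, word):
        # linear interpolation of the static and cache components
        return self.lam * self.p_bigram(prev, word) + (1 - self.lam) * self.p_cache(word)

    def observe(self, word):
        self.cache.append(word)

Once a document's first mentions of a topic word have been passed to observe(), prob() for later occurrences of that word rises relative to the static bigram estimate alone, which is the behaviour the diffed text attributes to cache models.

The diffed sentence on continuous cache language models in the recurrent neural network setting suggests a neural analogue: recent hidden states and the words that followed them are stored, and a cache distribution built from similarity to the current hidden state is mixed with the model's softmax over the vocabulary. The sketch below is one plausible reading of that idea under assumed hyper-parameters (theta for similarity sharpness, lam for the interpolation weight); it is not a reproduction of the cited Grave et al. formulation.

import numpy as np

def continuous_cache_prob(h_t, vocab_probs, cache_states, cache_words, vocab_size,
                          theta=0.3, lam=0.9):
    """Mix a model's softmax distribution with a cache distribution built from
    dot-product similarity to recently stored hidden states (illustrative sketch)."""
    p = lam * np.asarray(vocab_probs, dtype=float)
    if cache_states:
        sims = np.array([theta * np.dot(h_t, h_i) for h_i in cache_states])
        weights = np.exp(sims - sims.max())          # softmax over cached positions
        weights /= weights.sum()
        cache_dist = np.zeros(vocab_size)
        for w, wt in zip(cache_words, weights):
            cache_dist[w] += wt                      # mass flows to recently seen word ids
        p = p + (1 - lam) * cache_dist
    return p / p.sum()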