{{short description|Linear feedforward neural network model}}
The '''generalized Hebbian algorithm''', also known in the literature as '''Sanger's rule''', is a linear [[feedforward neural network]] for [[unsupervised learning]] with applications primarily in [[principal components analysis]]. First defined in 1989,<ref name="Sanger89">{{cite journal |last=Sanger |first=Terence D. |author-link=Terence Sanger |year=1989 |title= Optimal unsupervised learning in a single-layer linear feedforward neural network |journal=Neural Networks |volume=2 |issue=6 |pages=459–473 |url=http://courses.cs.washington.edu/courses/cse528/09sp/sanger_pca_nn.pdf |access-date= 2007-11-24 |doi= 10.1016/0893-6080(89)90044-0 |citeseerx=10.1.1.128.6893 }}</ref> it is similar to [[Oja's rule]] in its formulation and stability, except it can be applied to networks with multiple outputs. The name originates because of the similarity between the algorithm and a hypothesis made by [[Donald Hebb]]<ref name="Hebb 1949">{{Cite book|last=Hebb|first=D.O.|author-link=Donald Olding Hebb|title=The Organization of Behavior|publisher=Wiley & Sons|location=New York|year=1949|isbn=9781135631918|url=https://books.google.com/books?id=uyV5AgAAQBAJ}}</ref> about the way in which synaptic strengths in the brain are modified in response to experience, i.e., that changes are proportional to the correlation between the firing of pre- and post-synaptic [[neurons]].<ref name="Hertz, Krough, and Palmer, 1991">{{Cite book|last=Hertz|first=John|author2=Anders Krough |author3=Richard G. Palmer |title=Introduction to the Theory of Neural Computation|publisher=Addison-Wesley Publishing Company|location=Redwood City, CA|year=1991|isbn=978-0201515602}}</ref>


==Theory==
Consider the problem of learning a linear code for some data. Each data point is a multi-dimensional vector <math>x \in \R^n</math> and can be (approximately) represented as a linear combination of code vectors <math>w_1, \dots, w_m</math>. When <math>m = n</math>, the data can be represented exactly; when <math>m < n</math>, it can only be represented approximately. To minimize the L2 loss of the representation, <math>w_1, \dots, w_m</math> should be the highest principal component vectors.
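
The claim that the optimal code vectors are the leading principal components can be illustrated numerically. The following sketch is illustrative only; the synthetic data, dimensions, and variable names are arbitrary choices rather than part of the algorithm's definition. It compares the L2 reconstruction error of the top-<math>m</math> principal directions with that of a random code of the same size.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, m, N = 8, 3, 5000
A = rng.normal(size=(n, n))
x = rng.normal(size=(N, n)) @ A.T        # rows are correlated data vectors


def reconstruction_error(data, W):
    """Mean squared L2 error when coding rows of data with the rows of W."""
    codes = data @ W.T                   # y = W x for every sample
    return np.mean(np.sum((data - codes @ W) ** 2, axis=1))


# Top-m principal directions: leading right singular vectors of the data matrix.
_, _, Vt = np.linalg.svd(x, full_matrices=False)
W_pca = Vt[:m]

# A random orthonormal m-dimensional code, for comparison.
W_rand = np.linalg.qr(rng.normal(size=(n, m)))[0].T

print(reconstruction_error(x, W_pca))    # loss of the principal-component code
print(reconstruction_error(x, W_rand))   # typically noticeably larger
</syntaxhighlight>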


The generalized Hebbian algorithm is an iterative algorithm to find the highest principal component vectors, in an algorithmic form that resembles [[Unsupervised learning|unsupervised]] Hebbian learning in neural networks.


Consider a one-layered neural network with <math>n</math> input neurons and <math>m</math> output neurons <math>y_1, \dots, y_m</math>. The linear code vectors are the connection strengths, that is, <math>w_{ij}</math> is the [[synaptic weight]] or connection strength between the <math>j</math>-th input and <math>i</math>-th output neurons.


The generalized Hebbian algorithm learning rule is of the form


:<math>\,\Delta w_{ij} ~ = ~ \eta y_i \left(x_j - \sum_{k=1}^{i} w_{kj} y_k \right)</math>


where <math>\eta</math> is the ''[[learning rate]]'' parameter.<ref>{{Citation |last=Gorrell |first=Genevieve |title=Generalized Hebbian Algorithm for Incremental Singular Value Decomposition in Natural Language Processing. |journal=EACL |year=2006 |citeseerx=10.1.1.102.2084}}</ref>
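
For illustration, one update of this rule on a single input can be written in NumPy as follows. This is a sketch, not code from the cited sources; the weight array <code>w</code> (one row per output neuron) and the constant step size are illustrative choices.

<syntaxhighlight lang="python">
import numpy as np


def gha_update(w, x, eta=0.01):
    """One step of Sanger's rule: dw_ij = eta * y_i * (x_j - sum_{k<=i} w_kj * y_k)."""
    y = w @ x                                  # outputs y_i = sum_j w_ij x_j
    delta = np.zeros_like(w)
    for i in range(len(w)):
        # Residual of x after subtracting the parts reconstructed by
        # neurons 1..i (including neuron i itself).
        residual = x - w[: i + 1].T @ y[: i + 1]
        delta[i] = eta * y[i] * residual
    return w + delta
</syntaxhighlight>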


===Derivation===
In matrix form, Oja's rule can be written

:<math>\,\frac{dw(t)}{dt} ~ = ~ w(t)Q - \mathrm{diag}[w(t)Qw(t)^{\mathrm{T}}]w(t)</math>,

and the Gram-Schmidt algorithm is

:<math>\,\Delta w(t) ~ = ~ -\mathrm{lower}[w(t)w(t)^{\mathrm{T}}]w(t)</math>,

where {{math|''w''(''t'')}} is any matrix, in this case representing synaptic weights, <math>Q = \eta \mathbf{x} \mathbf{x}^{\mathrm{T}}</math> is the autocorrelation matrix, simply the outer product of inputs, {{math|diag}} is the function that diagonalizes a matrix, and {{math|lower}} is the function that sets all matrix elements on or above the diagonal equal to 0. We can combine these equations to get our original rule in matrix form,

:<math>\,\Delta w(t) ~ = ~ \eta(t) \left(\mathbf{y}(t)\mathbf{x}(t)^{\mathrm{T}} - \mathrm{LT}[\mathbf{y}(t)\mathbf{y}(t)^{\mathrm{T}}]w(t)\right)</math>,
where the function {{math|LT}} sets all matrix elements above the diagonal equal to 0, and note that our output {{math|'''y'''(''t'') {{=}} ''w''(''t'') '''x'''(''t'')}} is a linear neuron.<ref name="Sanger89"/>
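
For illustration, a single step of this matrix-form update can be sketched in NumPy as follows; it is not code from the cited sources, <code>np.tril</code> plays the role of {{math|LT}}, and a constant learning rate stands in for <math>\eta(t)</math>.

<syntaxhighlight lang="python">
import numpy as np


def gha_update_matrix(w, x, eta=0.01):
    """Matrix form of the update: dw = eta * (y x^T - LT[y y^T] w), with y = w x."""
    y = w @ x
    lt = np.tril(np.outer(y, y))   # LT[.]: zero out the elements above the diagonal
    return w + eta * (np.outer(y, x) - lt @ w)
</syntaxhighlight>

For any weight matrix and input, this produces the same change as the element-wise rule stated in the Theory section.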


===Stability and Principal Components Analysis===
[[Oja's rule]] is the special case where <math>m = 1</math>.<ref name="Oja82">{{cite journal |last=Oja |first=Erkki |author-link=Erkki Oja |date=November 1982 |title=Simplified neuron model as a principal component analyzer |journal=Journal of Mathematical Biology |volume=15 |issue=3 |pages=267–273 |doi=10.1007/BF00275687 |pmid=7153672 |s2cid=16577977 }}</ref> One can think of the generalized Hebbian algorithm as iterating Oja's rule.<ref name="Haykin98">{{cite book |last=Haykin |first=Simon |author-link=Simon Haykin |title=Neural Networks: A Comprehensive Foundation |edition=2 |year=1998 |publisher=Prentice Hall |isbn=978-0-13-273350-2 }}</ref>


With Oja's rule, <math>w_1</math> is learned: it has the same direction as the largest principal component vector, with length determined by <math>E[x_j] = E[w_{1j} y_1]</math> for all <math>j</math>, where the expectation is taken over all input-output pairs. In other words, the length of the vector <math>w_1</math> is such that we have an [[autoencoder]], with the latent code <math>y_1 = \sum_i w_{1i} x_i</math>, such that <math>E[\| x - y_1 w_1 \|^2]</math> is minimized.

When <math>m = 2</math>, the first neuron in the hidden layer of the autoencoder still learns as described, since it is unaffected by the second neuron. So, after the first neuron and its vector <math>w_1</math> have converged, the second neuron is effectively running another Oja's rule on the modified input vectors, defined by <math>x - w_1 y_1</math>, which is the input vector with the first principal component removed. Therefore, the second neuron learns to code for the second principal component.

By induction, this results in finding the top-<math>m</math> principal components for arbitrary <math>m</math>.
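
This convergence behaviour can be checked numerically. The sketch below is illustrative only; the synthetic data, learning rate, and number of samples are arbitrary choices, under which the rows of <code>w</code> typically end up closely aligned with the leading eigenvectors of the data's second-moment matrix.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, m, N, eta = 6, 3, 20000, 1e-3
A = rng.normal(size=(n, n))
data = rng.normal(size=(N, n)) @ A.T          # correlated samples, one per row

w = 0.01 * rng.normal(size=(m, n))
for x in data:
    y = w @ x
    w += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ w)

# Leading eigenvectors of the second-moment matrix, for comparison.
C = data.T @ data / N
_, vecs = np.linalg.eigh(C)                   # eigenvalues in ascending order
top = vecs[:, ::-1][:, :m].T                  # rows are the leading eigenvectors
for i in range(m):
    cosine = abs(w[i] @ top[i]) / np.linalg.norm(w[i])
    print(cosine)                             # close to 1 once converged
</syntaxhighlight>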


==Applications==
The generalized Hebbian algorithm is used in applications where a [[self-organizing map]] is necessary, or where a feature analysis or [[principal components analysis]] can be used. Examples of such cases include [[artificial intelligence]] and speech and image processing.


Its importance comes from the fact that learning is a single-layer process—that is, a synaptic weight changes only depending on the response of the inputs and outputs of that layer, thus avoiding the multi-layer dependence associated with the [[backpropagation]] algorithm. It also has a simple and predictable trade-off between learning speed and accuracy of convergence as set by the [[learning rate]] parameter {{math|η}}.<ref name="Haykin98"/>


[[File:Generalized Hebbian algorithm on 8-by-8 patches of Caltech101.png|thumb|Features learned by generalized Hebbian algorithm running on 8-by-8 patches of [[Caltech 101]].]]
[[File:Principal component analysis of Caltech101.png|thumb|Features found by [[Principal Component Analysis]] on the same Caltech 101 dataset.]]


As an example, Olshausen and Field (1996)<ref>{{Cite journal |last1=Olshausen |first1=Bruno A. |last2=Field |first2=David J. |date=June 1996 |title=Emergence of simple-cell receptive field properties by learning a sparse code for natural images |url=https://www.nature.com/articles/381607a0 |journal=Nature |language=en |volume=381 |issue=6583 |pages=607–609 |doi=10.1038/381607a0 |issn=1476-4687}}</ref> performed the generalized Hebbian algorithm on 8-by-8 patches of photos of natural scenes, and found that it results in Fourier-like features. The features are the same as the principal components found by principal components analysis, as expected, and are determined by the <math>64\times 64</math> variance matrix of the samples of 8-by-8 patches. In other words, they are determined by the second-order statistics of the pixels in images. They criticized this as insufficient to capture the higher-order statistics that are necessary to explain the Gabor-like features of simple cells in the [[primary visual cortex]].
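
In outline, this experiment can be reproduced as in the sketch below. It is illustrative only: synthetic correlated noise stands in for the natural-image data used by Olshausen and Field, so the learned features will only loosely resemble the Fourier-like ones described above, and the patch count, learning rate, and number of components are arbitrary choices.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for grayscale photographs (smoothly correlated noise);
# real natural images would be needed to reproduce the reported features.
images = np.cumsum(np.cumsum(rng.normal(size=(10, 128, 128)), axis=1), axis=2)

# Sample 8-by-8 patches and flatten them into 64-dimensional vectors.
patches = []
for _ in range(20000):
    img = images[rng.integers(len(images))]
    r, c = rng.integers(0, 121, size=2)
    patch = img[r:r + 8, c:c + 8].ravel()
    patches.append(patch - patch.mean())       # remove each patch's mean
patches = np.asarray(patches)
patches /= patches.std()

# Run the generalized Hebbian algorithm for the first 16 components.
m, eta = 16, 1e-3
w = 0.01 * rng.normal(size=(m, 64))
for x in patches:
    y = w @ x
    w += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ w)

features = w.reshape(m, 8, 8)                  # each row is an 8-by-8 feature
</syntaxhighlight>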


==See also==

==References==
{{reflist}}