<div>{{pp-semi|small=yes}}<br /> '''Compressed sensing''' (also known as '''compressive sensing''', '''compressive sampling''', or '''sparse sampling''') is a [[signal processing]] technique for efficiently acquiring and reconstructing a [[Signal (electronics)|signal]], by finding solutions to [[Underdetermined system|underdetermined linear systems]]. This is based on the principle that, through optimization, the sparsity of a signal can be exploited to recover it from far fewer samples than required by the [[Nyquist–Shannon sampling theorem|Shannon-Nyquist sampling theorem]]. There are two conditions under which recovery is possible.&lt;ref&gt;[http://nuit-blanche.blogspot.com/2009/09/cs.html CS: Compressed Genotyping, DNA Sudoku - Harnessing high throughput sequencing for multiplexed specimen analysis]&lt;/ref&gt; The first is [[sparsity]], which requires the signal to be sparse in some domain. The second is [[Coherence (signal processing)|incoherence]], which is applied through the restricted isometry property, a sufficient condition for sparse signals.&lt;ref&gt;{{cite journal | last1 = Donoho | first1 = David L | year = 2006 | title = For most large underdetermined systems of linear equations the minimal 1-norm solution is also the sparsest solution| url = | journal = Communications on pure and applied mathematics | volume = 59 | issue = | pages = 797–829 | doi = 10.1002/cpa.20132 }}&lt;/ref&gt;&lt;ref&gt;[http://www.brainshark.com/brainshark/brainshark.net/portal/title.aspx?pid=zCdz10BfTRz0z0 M. 
Davenport, &quot;The Fundamentals of Compressive Sensing&quot;, SigView, April 12, 2013.]&lt;/ref&gt;<br /> <br /> == Overview ==<br /> A common goal of the engineering field of [[signal processing]] is to reconstruct a signal from a series of sampling measurements. In general, this task is impossible because there is no way to reconstruct a signal during the times that the signal is not measured. Nevertheless, with prior knowledge or assumptions about the signal, it turns out to be possible to perfectly reconstruct a signal from a series of measurements. Over time, engineers have improved their understanding of which assumptions are practical and how they can be generalized.<br /> <br /> An early breakthrough in signal processing was the [[Nyquist–Shannon sampling theorem]]. It states that if the signal's highest frequency is less than half of the sampling rate, then the signal can be reconstructed perfectly. The main idea is that with prior knowledge about constraints on the signal’s frequencies, fewer samples are needed to reconstruct the signal.<br /> <br /> Around 2004, [[Emmanuel Candès]], [[Terence Tao]], and [[David Donoho]] proved that given knowledge about a signal's [[sparsity]], the signal may be reconstructed with even fewer samples than the sampling theorem requires.&lt;ref&gt;{{Cite journal|doi=10.1002/cpa.20124|url=http://www-stat.stanford.edu/~candes/papers/StableRecovery.pdf|title=Stable signal recovery from incomplete and inaccurate measurements|year=2006|last1=Candès|first1=Emmanuel J.|last2=Romberg|first2=Justin K.|last3=Tao|first3=Terence|journal=Communications on Pure and Applied Mathematics|volume=59|issue=8|pages=1207–1223}}&lt;/ref&gt;&lt;ref name=Donoho&gt;{{Cite journal|doi=10.1109/TIT.2006.871582|title=Compressed sensing|year=2006|last1=Donoho|first1=D.L.|journal=IEEE Transactions on Information Theory|volume=52|issue=4|pages=1289–1306}}&lt;/ref&gt; This idea is the basis of compressed sensing.<br /> <br /> ==History==<br /> Compressed 
sensing relies on [[Lp space|L1]] techniques, which several other scientific fields have used historically.&lt;ref&gt;[http://2.bp.blogspot.com/_0ZCyAOBrUtA/TTwqLEeLvdI/AAAAAAAAEXI/7S0_SnWoC0E/s1600/l1-minimization.JPG List of L1 regularization ideas] from Vivek Goyal, Alyson Fletcher, Sundeep Rangan, [http://www.math.uiuc.edu/%7Elaugesen/imaha10/goyal_talk.pdf The Optimistic Bayesian: Replica Method Analysis of Compressed Sensing]&lt;/ref&gt; In statistics, the [[least squares]] method was complemented by the [[Lp norm|&lt;math&gt;L^1&lt;/math&gt;-norm]], which was introduced by [[Pierre-Simon Laplace|Laplace]]. Following the introduction of [[linear programming]] and [[George Dantzig|Dantzig]]'s [[simplex algorithm]], the &lt;math&gt;L^1&lt;/math&gt;-norm was used in [[computational statistics]]. In statistical theory, the &lt;math&gt;L^1&lt;/math&gt;-norm was used by [[George W. Brown]] and later writers on [[median-unbiased estimator]]s. It was used by Peter J. Huber and others working on [[robust statistics]]. 
The &lt;math&gt;L^1&lt;/math&gt;-norm was also used in signal processing, for example, in the 1970s, when seismologists constructed images of reflective layers within the earth based on data that did not seem to satisfy the [[Nyquist–Shannon sampling theorem|Nyquist–Shannon criterion]].&lt;ref&gt;{{Cite journal |doi = 10.1511/2009.79.276 |title = The Best Bits |year = 2009 |last1 = Hayes |first1 = Brian |journal = American Scientist |volume = 97 |issue = 4 |pages = 276 }}&lt;/ref&gt; It was used in [[matching pursuit]] in 1993, the [[Lasso regression|LASSO estimator]] by [[Robert Tibshirani]] in 1996&lt;ref&gt;{{Cite journal |url = http://www-stat.stanford.edu/~tibs/lasso.html |first = Robert |last = Tibshirani |title = Regression shrinkage and selection via the lasso |journal = [[Journal of the Royal Statistical Society, Series B]] |volume = 58 |issue = 1 |pages = 267–288 }}&lt;/ref&gt; and [[basis pursuit]] in 1998.&lt;ref&gt;&quot;Atomic decomposition by basis pursuit&quot;, by Scott Shaobing Chen, David L. Donoho, Michael, A. Saunders. SIAM Journal on Scientific Computing&lt;/ref&gt; There were theoretical results describing when these algorithms recovered sparse solutions, but the required type and number of measurements were sub-optimal and subsequently greatly improved by compressed sensing.{{citation needed|date=May 2013}}<br /> <br /> At first glance, compressed sensing might seem to violate [[Nyquist–Shannon sampling theorem|the sampling theorem]], because compressed sensing depends on the [[Sparse matrix|sparsity]] of the signal in question and not its highest frequency. This is a misconception, because the sampling theorem guarantees perfect reconstruction given sufficient, not necessary, conditions. A sampling method fundamentally different from classical fixed-rate sampling cannot &quot;violate&quot; the sampling theorem. 
Sparse signals with high frequency components can be highly under-sampled using compressed sensing compared to classical fixed-rate sampling.&lt;ref&gt;{{Cite journal |url = http://www-stat.stanford.edu/~candes/papers/ExactRecovery.pdf |title = Robust Uncertainty Principles: Exact Signal Reconstruction from Highly Incomplete Fourier Information |year = 2006 |last1 = Candès |first1 = Emmanuel J. |last2 = Romberg |first2 = Justin K. |last3 = Tao |first3 = Terence |journal = IEEE Trans. Inf. Theory |volume = 52 |issue = 8 |pages = 489–509 |doi=10.1109/tit.2005.862083}}&lt;/ref&gt;<br /> <br /> ==Method==<br /> <br /> ===Underdetermined linear system===<br /> An [[underdetermined system]] of linear equations has more unknowns than equations and generally has an infinite number of solutions. In order to choose a solution to such a system, one must impose extra constraints or conditions (such as smoothness) as appropriate.<br /> <br /> In compressed sensing, one adds the constraint of sparsity, allowing only solutions which have a small number of nonzero coefficients. Not all underdetermined systems of linear equations have a sparse solution. However, if there is a unique sparse solution to the underdetermined system, then the compressed sensing framework allows the recovery of that solution.<br /> <br /> ===Solution / reconstruction method===<br /> Compressed sensing takes advantage of the redundancy in many interesting signals—they are not pure noise. 
In particular, many signals are [[sparse matrix|sparse]], that is, they contain many coefficients close to or equal to zero, when represented in some domain.&lt;ref&gt;Candès, E.J., &amp; Wakin, M.B., ''An Introduction To Compressive Sampling'', IEEE Signal Processing Magazine, V.21, March 2008 [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;arnumber=4472240&amp;isnumber=4472102]&lt;/ref&gt; This is the same insight used in many forms of [[lossy compression]].<br /> <br /> Compressed sensing typically starts with taking a weighted linear combination of samples also called compressive measurements in a [[Basis (linear algebra)|basis]] different from the basis in which the signal is known to be sparse. The results found by [[Emmanuel Candès]], [[Justin Romberg]], [[Terence Tao]] and [[David Donoho]], showed that the number of these compressive measurements can be small and still contain nearly all the useful information. Therefore, the task of converting the image back into the intended domain involves solving an underdetermined [[matrix equation]] since the number of compressive measurements taken is smaller than the number of pixels in the full image. However, adding the constraint that the initial signal is sparse enables one to solve this underdetermined [[system of linear equations]].<br /> <br /> The least-squares solution to such problems is to minimize the [[L2 norm|&lt;math&gt;L^2&lt;/math&gt; norm]]—that is, minimize the amount of energy in the system. This is usually simple mathematically (involving only a [[matrix multiplication]] by the [[pseudo-inverse]] of the basis sampled in). However, this leads to poor results for many practical applications, for which the unknown coefficients have nonzero energy.<br /> <br /> To enforce the sparsity constraint when solving for the underdetermined system of linear equations, one can minimize the number of nonzero components of the solution. 
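As a toy numerical sketch of this point (the matrix sizes and the 2-sparse signal below are arbitrary illustrative choices, not from any particular application), the minimum-energy solution obtained from the pseudo-inverse reproduces the measurements exactly but spreads energy over all coefficients, so it is not sparse:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 30, 12                        # unknowns vs. measurements: underdetermined
A = rng.standard_normal((m, n))      # toy measurement matrix
x_true = np.zeros(n)
x_true[[3, 17]] = [1.5, -2.0]        # a 2-sparse signal
y = A @ x_true                       # compressive measurements

# minimum-energy (least-squares) solution: x = A^+ y via the pseudo-inverse
x_l2 = np.linalg.pinv(A) @ y

# it satisfies A x = y exactly, yet essentially every coefficient is nonzero
nonzeros = np.count_nonzero(np.abs(x_l2) > 1e-8)
```

Here the sparse candidate with only 2 nonzeros also satisfies the same measurements, which is exactly the solution the sparsity constraint is meant to pick out.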
The function counting the number of non-zero components of a vector was called the [[L0 norm|&lt;math&gt;L^0&lt;/math&gt; &quot;norm&quot;]] by David Donoho{{refn|group=note|The quotation marks serve as a warning: the number-of-nonzeros &lt;math&gt;L^0&lt;/math&gt;-&quot;norm&quot; is not a proper [[F-space|F-norm]], because it is not continuous in its scalar argument: ''nnzs''(α''x'') is constant as α approaches zero. Unfortunately, authors now often neglect the quotation marks, an [[abuse of terminology]] that clashes with the established use of the &lt;math&gt;L^0&lt;/math&gt; norm for the space of measurable functions (equipped with an appropriate metric) or for the [[F-space|space]] of sequences with [[F-space|F–norm]] &lt;math&gt;(x_n) \mapsto \sum_n{2^{-n} x_n/(1+x_n)}&lt;/math&gt;.&lt;ref&gt;Stefan Rolewicz. ''Metric Linear Spaces''.&lt;/ref&gt;}}.<br /> <br /> [[Emmanuel Candès|Candès]] et al. proved that for many problems it is probable that the [[L1 norm|&lt;math&gt;L^1&lt;/math&gt; norm]] is equivalent to the [[L0 norm|&lt;math&gt;L^0&lt;/math&gt; norm]], in a technical sense: this equivalence result allows one to solve the &lt;math&gt;L^1&lt;/math&gt; problem, which is easier than the &lt;math&gt;L^0&lt;/math&gt; problem. 
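A minimal sketch of such an L1 recovery, posed as a linear program (the dimensions, the 3-sparse test signal, and the use of SciPy's linprog are illustrative assumptions, not a reference implementation):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, m = 30, 20
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[[4, 11, 25]] = [1.0, -2.0, 0.5]   # 3-sparse signal
y = A @ x_true

# basis pursuit: min ||x||_1  s.t.  Ax = y,
# rewritten as an LP with x = u - v, u >= 0, v >= 0, objective 1'u + 1'v
c = np.ones(2 * n)
res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=y)  # default bounds are (0, None)
x_hat = res.x[:n] - res.x[n:]
```

With enough random measurements relative to the sparsity level, the LP solution `x_hat` coincides with the sparse signal, even though the system is underdetermined.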
Finding the candidate with the smallest &lt;math&gt;L^1&lt;/math&gt; norm can be expressed relatively easily as a [[linear program]], for which efficient solution methods already exist.&lt;ref&gt;[http://www.acm.caltech.edu/l1magic/ L1-MAGIC is a collection of MATLAB routines]&lt;/ref&gt; When measurements may contain a finite amount of noise, [[basis pursuit denoising]] is preferred over linear programming, since it preserves sparsity in the face of noise and can be solved faster than an exact linear program.<br /> <br /> === Total Variation based CS reconstruction ===<br /> <br /> ==== Motivation and Applications ====<br /> <br /> ===== Role of TV regularization =====<br /> [[Total variation]] can be seen as a [[non-negative]] [[real number|real]]-valued [[functional (mathematics)|functional]] defined on the space of [[real number|real-valued]] [[function (mathematics)|function]]s (for the case of functions of one variable) or on the space of [[integrable function]]s (for the case of functions of several variables). For signals in particular, [[total variation]] refers to the integral of the absolute [[gradient]] of the signal. In signal and image reconstruction, it is applied as [[total variation regularization]], where the underlying principle is that signals with excessive detail have high total variation, and that removing this detail, while retaining important information such as edges, reduces the total variation of the signal and brings it closer to the original signal.<br /> <br /> For the purpose of signal and image reconstruction, &lt;math&gt;l_{1}&lt;/math&gt; minimization models are used. Other approaches also include least-squares, as discussed earlier in this article. These methods are slow and return imperfect reconstructions of the signal. 
The current CS regularization models attempt to address this problem by incorporating sparsity priors of the original image, one of which is the total variation (TV). Conventional TV approaches are designed to give piece-wise constant solutions. Some of these (discussed below) include constrained &lt;math&gt;l_{1}&lt;/math&gt;-minimization, which uses an iterative scheme. This method, though fast, leads to over-smoothing of edges, resulting in blurred image edges.&lt;ref name = &quot;EPTV&quot; /&gt; TV methods with iterative re-weighting have been implemented to reduce the influence of large gradient value magnitudes in the images. This has been used in [[Tomography|computed tomography]] (CT) reconstruction as a method known as edge-preserving total variation. However, as gradient magnitudes are used for estimation of relative penalty weights between the data fidelity and regularization terms, this method is neither robust to noise and artifacts nor accurate enough for CS image/signal reconstruction, and therefore fails to preserve smaller structures.<br /> <br /> Recent progress on this problem involves using an iteratively directional TV refinement for CS reconstruction.&lt;ref name = &quot;Orientation and directional refinement&quot; /&gt; This method has two stages: the first stage estimates and refines the initial orientation field, which is defined as a noisy point-wise initial estimate, obtained through edge detection, of the given image. In the second stage, the CS reconstruction model is presented by utilizing a directional TV regularizer. More details about these TV-based approaches (iteratively reweighted &lt;math&gt;l_{1}&lt;/math&gt; minimization, edge-preserving TV, and the iterative model using a directional orientation field and TV) are provided below.<br /> <br /> ==== Existing approaches ====<br /> <br /> =====Iteratively reweighted &lt;math&gt;l_{1}&lt;/math&gt; minimization &lt;ref name=&quot;Original source for IRLS&quot;&gt;{{cite journal | last1 = Candes | first1 = E. J. 
| last2 = Wakin | first2 = M. B. | last3 = Boyd | first3 = S. P. | year = 2008 | title = Enhancing sparsity by reweighted l1 minimization | url = | journal = J. Fourier Anal. Applicat | volume = 14 | issue = 5-6| pages = 877–905 | doi=10.1007/s00041-008-9045-x}}&lt;/ref&gt; =====<br /> [[File:IRLS.png|thumb|iteratively reweighted l1 minimization method for CS]]<br /> In the CS reconstruction models using constrained &lt;math&gt;l_{1}&lt;/math&gt; minimization, larger coefficients are penalized heavily in the &lt;math&gt;l_{1}&lt;/math&gt; norm. It was proposed to have a weighted formulation of &lt;math&gt;l_{1}&lt;/math&gt; minimization designed to more democratically penalize nonzero coefficients. An iterative algorithm is used for constructing the appropriate weights.&lt;ref name=&quot;Iteration&quot;&gt;Lange, K.: Optimization, Springer Texts in Statistics. Springer, New York (2004)&lt;/ref&gt; Each iteration requires solving one &lt;math&gt;l_{1}&lt;/math&gt; minimization problem by finding the local minimum of a concave penalty function that more closely resembles the &lt;math&gt;l_{0}&lt;/math&gt; norm. An additional parameter, usually chosen to avoid sharp transitions in the penalty function curve, is introduced into the iterative equation to ensure stability, so that a zero estimate in one iteration does not necessarily lead to a zero estimate in the next iteration. The method essentially involves using the current solution for computing the weights to be used in the next iteration.<br /> <br /> ====== Advantages and disadvantages ======<br /> Early iterations may find inaccurate sample estimates; however, this method down-weights these at a later stage to give more weight to the smaller non-zero signal estimates. One of the disadvantages is the need to define a valid starting point, as a global minimum might not be obtained every time due to the concavity of the penalty function. 
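The reweighting loop described above can be sketched as follows (a toy instance: the dimensions, the stabilizing parameter, and the use of SciPy's linprog for each weighted l1 sub-problem are illustrative assumptions, not the authors' implementation):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
n, m = 30, 18
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[[2, 9, 20]] = [1.0, -0.8, 2.0]
y = A @ x_true

eps = 0.1                 # stabilizing parameter: keeps a zero estimate from forcing zeros forever
w = np.ones(n)            # first pass reduces to plain l1 minimization
for _ in range(4):
    # weighted basis pursuit: min sum_i w_i |x_i|  s.t.  Ax = y  (LP split x = u - v)
    res = linprog(np.concatenate([w, w]), A_eq=np.hstack([A, -A]), b_eq=y)
    x = res.x[:n] - res.x[n:]
    w = 1.0 / (np.abs(x) + eps)   # large coefficients get small weights next round
```

The weights computed from the current solution make the next pass penalize small (likely spurious) coefficients more strongly, which is the "more democratic" penalization of nonzero coefficients the text describes.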
Another disadvantage is that this method tends to uniformly penalize the image gradient irrespective of the underlying image structures. This causes over-smoothing of edges, especially those in low-contrast regions, subsequently leading to loss of low-contrast information. The advantages of this method include: reduction of the sampling rate for sparse signals; reconstruction of the image while removing noise and other artifacts; and use of very few iterations. It can also help in recovering images with sparse gradients.<br /> <br /> In the figure shown below, '''P1''' refers to the first step of the iterative reconstruction process, based on the projection matrix '''P''' of the fan-beam geometry, which is constrained by the data fidelity term. This may contain noise and artifacts as no regularization is performed. The minimization of '''P1''' is solved through the conjugate gradient least squares method. '''P2''' refers to the second step of the iterative reconstruction process, which utilizes the edge-preserving total variation regularization term to remove noise and artifacts, and thus improve the quality of the reconstructed image/signal. The minimization of '''P2''' is done through a simple gradient descent method. Convergence is determined by testing, after each iteration, for image positivity, setting &lt;math&gt;f^{k-1} = 0&lt;/math&gt; wherever &lt;math&gt;f^{k-1} &lt; 0&lt;/math&gt; (note that &lt;math&gt;f&lt;/math&gt; refers to the different x-ray linear attenuation coefficients at different voxels of the patient image).<br /> <br /> =====Edge-preserving total variation (TV) based compressed sensing&lt;ref name =&quot;EPTV&quot;&gt;{{cite journal | last1 = Tian | first1 = Z. | last2 = Jia | first2 = X. | last3 = Yuan | first3 = K. | last4 = Pan | first4 = T. | last5 = Jiang | first5 = S. B. 
| year = 2011 | title = Low-dose CT reconstruction via edge preserving total variation regularization | url = | journal = Phys Med Biol. | volume = 56 | issue = 18| pages = 5949–5967 | doi=10.1088/0031-9155/56/18/011}}&lt;/ref&gt;=====<br /> [[File:Edge preserving TV.png|thumb|Flow diagram figure for edge preserving total variation method for compressed sensing]]<br /> This is an iterative CT reconstruction algorithm with edge-preserving TV regularization to reconstruct CT images from highly undersampled data obtained at low-dose CT through low current levels (milliampere). In order to reduce the imaging dose, one of the approaches used is to reduce the number of x-ray projections acquired by the scanner detectors. However, reconstructing the CT image from this insufficient projection data can cause streaking artifacts. Furthermore, using these insufficient projections in standard TV algorithms makes the problem under-determined, leading to infinitely many possible solutions. In this method, an additional penalty weighted function is assigned to the original TV norm. This allows for easier detection of sharp discontinuities in intensity in the images and thereby adapts the weight to store the recovered edge information during the process of signal/image reconstruction. The parameter &lt;math&gt;\sigma&lt;/math&gt; controls the amount of smoothing applied to the pixels at the edges to differentiate them from the non-edge pixels. The value of &lt;math&gt;\sigma&lt;/math&gt; is changed adaptively based on the values of the histogram of the gradient magnitude so that a certain percentage of pixels have gradient values larger than &lt;math&gt;\sigma&lt;/math&gt;. The edge-preserving total variation term thus becomes sparser, and this speeds up the implementation. 
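The adaptive choice of &lt;math&gt;\sigma&lt;/math&gt; from the gradient-magnitude histogram can be sketched as follows. This is only an illustrative sketch: the Gaussian falloff of the weights, the edge fraction, and the forward-difference gradient are assumptions, not the exact forms used in the cited paper.

```python
import numpy as np

def edge_weights(img, edge_fraction=0.02):
    # forward differences (last row/column padded by duplication)
    gx = np.diff(img, axis=1, append=img[:, -1:])
    gy = np.diff(img, axis=0, append=img[-1:, :])
    grad_mag = np.hypot(gx, gy)
    # pick sigma from the gradient-magnitude histogram so that roughly
    # `edge_fraction` of the pixels have gradient values above it
    sigma = np.percentile(grad_mag, 100 * (1 - edge_fraction))
    # assumed Gaussian falloff: edge pixels (large gradient) get small TV weights,
    # so the TV penalty smooths flat regions while preserving edges
    w = np.exp(-(grad_mag / max(sigma, 1e-12)) ** 2)
    return w, sigma

img = np.zeros((32, 32))
img[:, 16:] = 1.0                 # toy piecewise-constant image with one sharp edge
w, sigma = edge_weights(img)
```

On this toy image, the pixels along the vertical edge receive weights well below those of the flat regions, so a weighted TV term would penalize smoothing across the edge far less.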
A two-step iteration process known as forward-backward splitting algorithm is used.&lt;ref name = &quot;Forward-Backward&quot;&gt;{{cite journal | last1 = Combettes | first1 = P | last2 = Wajs | first2 = V | year = 2005 | title = Signal recovery by proximal forward-backward splitting | url = | journal = Multiscale Model Simul | volume = 4 | issue = | pages = 1168–200 | doi=10.1137/050626090}}&lt;/ref&gt; The optimization problem is split into two sub-problems which are then solved with the conjugate gradient least squares method&lt;ref name=&quot;CGLS&quot;&gt;{{cite journal | last1 = Hestenes | first1 = M | last2 = Stiefel | first2 = E | year = 1952 | title = Methods of conjugate gradients for solving linear systems | url = | journal = J Res Natl Bur Stand | volume = 49 | issue = | pages = 409–36 | doi=10.6028/jres.049.044}}&lt;/ref&gt; and the simple gradient descent method respectively. The method is stopped when the desired convergence has been achieved or if the maximum number of iterations is reached.<br /> <br /> ===== Advantages and disadvantages =====<br /> Some of the disadvantages of this method are the absence of smaller structures in the reconstructed image and degradation of image resolution. This edge preserving TV algorithm, however, requires fewer iterations than the conventional TV algorithm.&lt;ref name =&quot;EPTV&quot; /&gt; Analyzing the horizontal and vertical intensity profiles of the reconstructed images, it can be seen that there are sharp jumps at edge points and negligible, minor fluctuation at non-edge points. Thus, this method leads to low relative error and higher correlation as compared to the TV method. 
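The forward-backward splitting idea mentioned above can be illustrated on the simpler l1-regularized least-squares problem, where the backward (proximal) step reduces to soft-thresholding. This is an illustrative sketch, not the paper's TV sub-problems; the dimensions and parameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 40, 80
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[[5, 30, 61]] = [2.0, -1.0, 1.5]
y = A @ x_true

lam = 0.1                                 # l1 regularization weight
t = 1.0 / np.linalg.norm(A, 2) ** 2       # step size 1/L, L = squared spectral norm
x = np.zeros(n)
for _ in range(2000):
    z = x - t * (A.T @ (A @ x - y))       # forward step: gradient of 0.5*||Ax - y||^2
    x = np.sign(z) * np.maximum(np.abs(z) - t * lam, 0.0)  # backward step: prox of lam*||x||_1
```

Each iteration alternates a gradient step on the smooth data-fidelity term with a proximal step on the non-smooth regularizer, which is the same two-sub-problem structure the edge-preserving TV method solves with conjugate gradient least squares and gradient descent.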
The method also effectively suppresses image noise and image artifacts such as streaking.<br /> <br /> =====Iterative model using a directional orientation field and directional total variation&lt;ref name=&quot;Orientation and directional refinement&quot;&gt;http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6588871&lt;/ref&gt;=====<br /> This method is used to prevent over-smoothing of edges and texture details and to obtain a reconstructed CS image that is accurate and robust to noise and artifacts. First, an initial estimate of the noisy point-wise orientation field of the image &lt;math&gt;I&lt;/math&gt;, &lt;math&gt;\hat{d}&lt;/math&gt;, is obtained. This noisy orientation field is defined so that it can be refined at a later stage to reduce the noise influences in orientation field estimation. A coarse orientation field estimation is then introduced based on the structure tensor, which is formulated as:&lt;ref name=&quot;Structure tensor&quot;&gt;{{cite journal | last1 = Brox | first1 = T. | last2 = Weickert | first2 = J. | last3 = Burgeth | first3 = B. | last4 = Mrázek | first4 = P. | year = 2006 | title = Nonlinear structure tensors | url = | journal = Image Vis. Comput | volume = 24 | issue = 1| pages = 41–55 | doi=10.1016/j.imavis.2005.09.010}}&lt;/ref&gt; &lt;math&gt; J_\rho(\nabla I_{\sigma}) = G_\rho * (\nabla I_{\sigma} \otimes \nabla I_{\sigma}) = \begin{pmatrix}J_{11} &amp; J_{12}\\J_{12} &amp; J_{22}\end{pmatrix}&lt;/math&gt;. Here, &lt;math&gt; J_\rho &lt;/math&gt; refers to the structure tensor related with the image pixel point (i,j) having standard deviation &lt;math&gt;\rho&lt;/math&gt;. &lt;math&gt;G&lt;/math&gt; refers to the Gaussian kernel &lt;math&gt;(0, \rho ^2)&lt;/math&gt; with standard deviation &lt;math&gt;\rho&lt;/math&gt;. &lt;math&gt;\sigma&lt;/math&gt; refers to the manually defined parameter for the image &lt;math&gt;I&lt;/math&gt; below which the edge detection is insensitive to noise. 
&lt;math&gt;\nabla I_{\sigma}&lt;/math&gt; refers to the gradient of the image &lt;math&gt;I&lt;/math&gt; and &lt;math&gt;(\nabla I_{\sigma} \otimes \nabla I_{\sigma})&lt;/math&gt; refers to the tensor product obtained by using this gradient.<br /> <br /> The structure tensor obtained is convolved with a Gaussian kernel &lt;math&gt;G&lt;/math&gt; to improve the accuracy of the orientation estimate, with &lt;math&gt;\sigma&lt;/math&gt; being set to high values to account for the unknown noise levels. For every pixel (i,j) in the image, the structure tensor J is a symmetric and positive semi-definite matrix. Convolving all the pixels in the image with &lt;math&gt;G&lt;/math&gt; gives orthonormal eigenvectors ω and υ of the &lt;math&gt;J&lt;/math&gt; matrix. ω points in the direction of the dominant orientation having the largest contrast and υ points in the direction of the structure orientation having the smallest contrast. The coarse initial estimate of the orientation field, &lt;math&gt;\hat{d}&lt;/math&gt;, is defined as &lt;math&gt;\hat{d}&lt;/math&gt; = υ. This estimate is accurate at strong edges. However, at weak edges or in regions with noise, its reliability decreases.<br /> <br /> To overcome this drawback, a refined orientation model is defined in which the data term reduces the effect of noise and improves accuracy, while the second penalty term with the L2-norm is a fidelity term that ensures accuracy of the initial coarse estimation.<br /> <br /> This orientation field is introduced into the directional total variation optimization model for CS reconstruction through the equation: &lt;math&gt;min_\Chi\lVert \nabla \Chi \bullet d \rVert _{1} + \frac{\lambda}{2}\ \lVert Y - \Phi\Chi \rVert ^2_{2}&lt;/math&gt;. &lt;math&gt;\Chi&lt;/math&gt; is the objective signal which needs to be recovered. Y is the corresponding measurement vector, d is the iterative refined orientation field and &lt;math&gt;\Phi&lt;/math&gt; is the CS measurement matrix. 
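The coarse structure-tensor orientation estimate described above can be sketched as follows (an illustrative sketch: the separable Gaussian smoothing and the toy vertical-edge image are assumptions; per pixel, the eigenvector of the smaller eigenvalue gives the least-contrast structure orientation υ):

```python
import numpy as np

def gaussian_kernel1d(rho):
    radius = int(3 * rho)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * rho**2))
    return k / k.sum()

def smooth(f, k):
    # separable Gaussian smoothing: convolve each column, then each row
    f = np.apply_along_axis(lambda col: np.convolve(col, k, mode="same"), 0, f)
    return np.apply_along_axis(lambda row: np.convolve(row, k, mode="same"), 1, f)

def structure_tensor_orientation(img, rho=2.0):
    gx = np.gradient(img, axis=1)           # horizontal gradient
    gy = np.gradient(img, axis=0)           # vertical gradient
    k = gaussian_kernel1d(rho)
    # G_rho * (grad(I) outer grad(I)): smooth the outer-product entries
    J11, J12, J22 = smooth(gx * gx, k), smooth(gx * gy, k), smooth(gy * gy, k)
    d = np.zeros(img.shape + (2,))
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            J = np.array([[J11[i, j], J12[i, j]], [J12[i, j], J22[i, j]]])
            vals, vecs = np.linalg.eigh(J)  # eigenvalues in ascending order
            d[i, j] = vecs[:, 0]            # smallest-eigenvalue eigenvector: least contrast
    return d

img = np.zeros((24, 24))
img[:, 12:] = 1.0                           # toy image with one vertical edge
d = structure_tensor_orientation(img)
```

Along the vertical edge the gradient is horizontal, so the estimated structure orientation at those pixels is (close to) vertical, matching the intuition that υ follows the edge rather than crossing it.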
This method undergoes a few iterations, ultimately leading to convergence. &lt;math&gt;\hat{d}&lt;/math&gt; is the orientation field approximate estimation of the reconstructed image &lt;math&gt;X^{k-1}&lt;/math&gt; from the previous iteration (the previous iteration is used in order to check for convergence). For the two vector fields represented by &lt;math&gt;\Chi&lt;/math&gt; and &lt;math&gt;d&lt;/math&gt;, &lt;math&gt;\Chi \bullet d&lt;/math&gt; refers to the multiplication of the respective horizontal and vertical vector elements of &lt;math&gt;\Chi&lt;/math&gt; and &lt;math&gt;d&lt;/math&gt; followed by their subsequent addition. These equations are reduced to a series of convex minimization problems which are then solved with a combination of variable splitting and augmented Lagrangian (FFT-based fast solver with a closed form solution) methods.&lt;ref name = &quot;Orientation and directional refinement&quot; /&gt; The augmented Lagrangian method is considered equivalent to the split Bregman iteration, which ensures convergence of this method. The orientation field d is defined as being equal to &lt;math&gt;(d_{h}, d_{v})&lt;/math&gt;, where &lt;math&gt;d_{h}, d_{v}&lt;/math&gt; define the horizontal and vertical estimates of &lt;math&gt;d&lt;/math&gt;.<br /> <br /> [[File:Augmented Lagrangian.png|thumb|right|Augmented Lagrangian method for orientation field and iterative directional field refinement models]]<br /> <br /> The Augmented Lagrangian method for the orientation field, &lt;math&gt;min_\Chi\lVert \nabla \Chi \bullet d \rVert _{1} + \frac{\lambda}{2}\ \lVert Y - \Phi\Chi \rVert ^2_{2}&lt;/math&gt;, involves initializing &lt;math&gt;d_{h}, d_{v}, H, V&lt;/math&gt; and then finding the approximate minimizer of &lt;math&gt;L_{1}&lt;/math&gt; with respect to these variables. The Lagrangian multipliers are then updated and the iterative process is stopped when convergence is achieved. 
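The alternating minimize-then-update-multipliers pattern of the augmented Lagrangian method can be illustrated on a toy equality-constrained quadratic problem (not the orientation-field model itself; all names and sizes below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 6, 2
B = rng.standard_normal((p, n))   # toy constraint matrix for Bx = b
b = rng.standard_normal(p)
c = rng.standard_normal(n)        # we minimize 0.5*||x - c||^2 subject to Bx = b

gamma = 5.0                       # penalty parameter, also the multiplier step size
lam = np.zeros(p)                 # Lagrangian multipliers
for _ in range(200):
    # x-step: minimize the augmented Lagrangian
    #   0.5*||x - c||^2 + lam'(Bx - b) + (gamma/2)*||Bx - b||^2,
    # which for this quadratic has the closed form below
    x = np.linalg.solve(np.eye(n) + gamma * B.T @ B,
                        c - B.T @ lam + gamma * B.T @ b)
    # multiplier step: ascent on the constraint residual
    lam = lam + gamma * (B @ x - b)
```

As the multipliers converge, the constraint residual is driven to zero and `x` approaches the exact constrained minimizer, mirroring the stopping-at-convergence criterion described in the text.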
For the iterative directional total variation refinement model, the augmented Lagrangian method involves initializing &lt;math&gt;\Chi, P, Q, \lambda_{P}, \lambda_{Q}&lt;/math&gt;.&lt;ref name=&quot;TV&quot;&gt;{{cite journal | last1 = Goldluecke | first1 = B. | last2 = Strekalovskiy | first2 = E. | last3 = Cremers | first3 = D. | last4 = Siims | first4 = P.-T. A. I. | year = 2012 | title = The natural vectorial total variation which arises from geometric measure theory | url = | journal = SIAM J. Imag Sci | volume = 5 | issue = 2| pages = 537–563 | doi=10.1137/110823766}}&lt;/ref&gt;<br /> <br /> Here, &lt;math&gt;H, V, P, Q&lt;/math&gt; are newly introduced variables where &lt;math&gt;H&lt;/math&gt; = &lt;math&gt;\nabla d_{h}&lt;/math&gt;, &lt;math&gt;V&lt;/math&gt; = &lt;math&gt;\nabla d_{v}&lt;/math&gt;, &lt;math&gt;P&lt;/math&gt; = &lt;math&gt;\nabla \Chi&lt;/math&gt;, and &lt;math&gt;Q&lt;/math&gt; = &lt;math&gt;P \bullet d&lt;/math&gt;. &lt;math&gt;\lambda_{H}, \lambda_{V}, \lambda_{P}, \lambda_{Q}&lt;/math&gt; are the Lagrangian multipliers for &lt;math&gt;H, V, P, Q&lt;/math&gt;. For each iteration, the approximate minimizer of &lt;math&gt;L_{2}&lt;/math&gt; with respect to the variables (&lt;math&gt;\Chi, P, Q&lt;/math&gt;) is calculated. 
As in the field refinement model, the Lagrangian multipliers are updated and the iterative process is stopped when convergence is achieved.<br /> <br /> For the orientation field refinement model, the Lagrangian multipliers are updated in the iterative process as follows:<br /> <br /> &lt;math&gt;(\lambda_{H})^k = (\lambda_{H})^{k-1} + \gamma_{H}(H^k - \nabla (d_{h})^k)&lt;/math&gt;<br /> <br /> &lt;math&gt;(\lambda_{V})^k = (\lambda_{V})^{k-1} + \gamma_{V}(V^k - \nabla (d_{v})^k)&lt;/math&gt;<br /> <br /> For the iterative directional total variation refinement model, the Lagrangian multipliers are updated as follows:<br /> <br /> &lt;math&gt;(\lambda_{P})^k = (\lambda_{P})^{k-1} + \gamma_{P}(P^k - \nabla (\Chi)^k)&lt;/math&gt;<br /> <br /> &lt;math&gt;(\lambda_{Q})^k = (\lambda_{Q})^{k-1} + \gamma_{Q}(Q^k - P^{k} \bullet d)&lt;/math&gt;<br /> <br /> Here, &lt;math&gt;\gamma_{H}, \gamma_{V}, \gamma_{P}, \gamma_{Q}&lt;/math&gt; are positive constants.<br /> <br /> =====Advantages and disadvantages=====<br /> <br /> Based on Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) metrics and known ground-truth images for testing performance, it is concluded that iterative directional total variation has better reconstruction performance than the non-iterative methods in preserving edge and texture areas. The orientation field refinement model plays a major role in this improvement in performance, as it increases the number of directionless pixels in flat areas while enhancing the orientation field consistency in the regions with edges.<br /> <br /> ==Applications==<br /> The field of compressive sensing is related to several topics in signal processing and computational mathematics, such as [[underdetermined system|underdetermined linear-system]]s, [[group testing]], heavy hitters, [[sparse coding]], [[multiplexing]], sparse sampling, and finite rate of innovation. 
Its broad scope and generality have enabled several innovative CS-enhanced approaches in signal processing and compression, solution of inverse problems, design of radiating systems, radar and through-the-wall imaging, and antenna characterization.&lt;ref&gt;{{Cite journal|author = Andrea Massa, Paolo Rocca, Giacomo Oliveri|title = Compressive Sensing in Electromagnetics - A Review|journal = IEEE Antennas and Propagation Magazine|volume = 57|number = 1|year = 2015|pp = 224–238|doi = 10.1109/MAP.2015.2397092|url = http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7046378}}&lt;/ref&gt; Imaging techniques having a strong affinity with compressive sensing include [[coded aperture]] and [[computational photography]]. Implementations of compressive sensing in hardware at different [[technology readiness level]]s are available.&lt;ref&gt;Compressive Sensing Hardware, http://sites.google.com/site/igorcarron2/compressedsensinghardware&lt;/ref&gt;<br /> <br /> Conventional CS reconstruction uses sparse signals (usually sampled at a rate less than the Nyquist sampling rate) for reconstruction through constrained &lt;math&gt;l_{1}&lt;/math&gt; minimization. One of the earliest applications of such an approach was in reflection seismology, which used sparse reflected signals from band-limited data for tracking changes between sub-surface layers.&lt;ref name=&quot;Seismic sparse signals&quot;&gt;Taylor, H.L., Banks, S.C., McCoy, J.F. &quot;Deconvolution with the l1 norm&quot;. ''Geophysics'' 44(1), 39–52 (1979)&lt;/ref&gt; When the LASSO model came into prominence in the 1990s as a statistical method for selection of sparse models,&lt;ref name=&quot;LASSO&quot;&gt;Tibshirani, R. &quot;Regression shrinkage and selection via the lasso&quot;. ''J. R. Stat. Soc. B'' 58(1), 267–288 (1996)&lt;/ref&gt; this method was further used in computational harmonic analysis for sparse signal representation from over-complete dictionaries. 
Other applications include incoherent sampling of radar pulses. The work by Boyd et al.&lt;ref name = &quot;Original source for IRLS&quot; /&gt; applied the LASSO model, as a method for selection of sparse models, to analog-to-digital converters (current converters use a sampling rate higher than the Nyquist rate along with the quantized Shannon representation). This would involve a parallel architecture in which the polarity of the analog signal changes at a high rate, followed by digitizing the integral at the end of each time interval to obtain the converted digital signal.<br /> <br /> ===Photography===<br /> Compressed sensing is used in a mobile phone camera sensor. The approach allows a reduction in image acquisition energy per image by as much as a factor of 15 at the cost of complex decompression algorithms; the computation may require an off-device implementation.&lt;ref&gt;{{cite journal|title=New Camera Chip Captures Only What It Needs|author=David Schneider|journal=IEEE Spectrum|date=March 2013|url=http://spectrum.ieee.org/semiconductors/optoelectronics/camera-chip-makes-alreadycompressed-images|accessdate=2013-03-20}}&lt;/ref&gt;<br /> <br /> Compressed sensing is used in single-pixel cameras from [[Rice University]].&lt;ref name=cscamera&gt;{{cite web|url=http://dsp.rice.edu/cscamera |title=Compressive Imaging: A New Single-Pixel Camera &amp;#124; Rice DSP |publisher=Dsp.rice.edu |date= |accessdate=2013-06-04}}&lt;/ref&gt; [[Bell Labs]] employed the technique in a lensless single-pixel camera that takes stills using repeated snapshots of randomly chosen apertures from a grid.
Image quality improves with the number of snapshots, and generally requires a small fraction of the data of conventional imaging, while eliminating lens/focus-related aberrations.&lt;ref&gt;{{cite web|author=The Physics arXiv Blog June 3, 2013 |url=http://www.technologyreview.com/view/515651/bell-labs-invents-lensless-camera/ |title=Bell Labs Invents Lensless Camera &amp;#124; MIT Technology Review |publisher=Technologyreview.com |date=2013-05-25 |accessdate=2013-06-04}}&lt;/ref&gt;&lt;ref&gt;{{cite journal|author1=Gang Huang|author2=Hong Jiang|author3=Kim Matthews|author4=Paul Wilford|title=Lensless Imaging by Compressive Sensing|year=2013|volume=2393|journal=IEEE International Conference on Image Processing, ICIP , Paper #|arxiv=1305.7181}}&lt;/ref&gt;<br /> <br /> ===Holography===<br /> Compressed sensing can be used to improve image reconstruction in [[holography]] by increasing the number of [[voxel]]s one can infer from a single hologram.&lt;ref&gt;{{cite journal | last1 = Brady | first1 = David | last2 = Choi | first2 = Kerkil | last3 = Marks | first3 = Daniel | last4 = Horisaki | first4 = Ryoichi | last5 = Lim | first5 = Sehoon | year = 2009 | title = Compressive holography | url = | journal = Optics Express | volume = 17 | issue = | pages = 13040–13049 | doi=10.1364/oe.17.013040}}&lt;/ref&gt;&lt;ref&gt;{{cite journal | last1 = Rivenson | first1 = Y. | last2 = Stern | first2 = A. | last3 = Javidi | first3 = B. | year = 2010 | title = Compressive fresnel holography | url = | journal = Display Technology, Journal of | volume = 6 | issue = 10| pages = 506–509 | doi=10.1109/jdt.2010.2042276}}&lt;/ref&gt;&lt;ref&gt;{{cite journal | last1 = Denis | first1 = Loic | last2 = Lorenz | first2 = Dirk | last3 = Thibaut | first3 = Eric | last4 = Fournier | first4 = Corinne | last5 = Trede | first5 = Dennis | year = 2009 | title = Inline hologram reconstruction with sparsity constraints | url = | journal = Opt. Lett. 
| volume = 34 | issue = 22| pages = 3475–3477 | doi=10.1364/ol.34.003475}}&lt;/ref&gt; It is also used for image retrieval from undersampled measurements in optical &lt;ref&gt;{{cite journal | last1 = Marim | first1 = M. | last2 = Angelini | first2 = E. | last3 = Olivo-Marin | first3 = J. C. | last4 = Atlan | first4 = M. | year = 2011 | title = Off-axis compressed holographic microscopy in low-light conditions | url = http://arxiv.org/abs/1101.1735 | journal = Optics Letters | volume = 36 | issue = 1| pages = 79–81 | doi=10.1364/ol.36.000079}}&lt;/ref&gt;&lt;ref&gt;{{cite journal | last1 = Marim | first1 = M. M. | last2 = Atlan | first2 = M. | last3 = Angelini | first3 = E. | last4 = Olivo-Marin | first4 = J. C. | year = 2010 | title = Compressed sensing with off-axis frequency-shifting holography | url = http://arxiv.org/abs/1004.5305 | journal = Optics Letters | volume = 35 | issue = 6| pages = 871–873 | doi=10.1364/ol.35.000871}}&lt;/ref&gt; and millimeter-wave &lt;ref&gt;{{cite journal | last1 = Fernandez Cull | first1 = Christy | last2 = Wikner | first2 = David A. | last3 = Mait | first3 = Joseph N. | last4 = Mattheiss | first4 = Michael | last5 = Brady | first5 = David J. | year = 2010 | title = Millimeter-wave compressive holography | url = | journal = Appl. Opt. | volume = 49 | issue = 19| pages = E67–E82 | doi=10.1364/ao.49.000e67}}&lt;/ref&gt; holography.<br /> <br /> ===Facial recognition===<br /> Compressed sensing is being used in facial recognition applications.&lt;ref&gt;[http://www.wired.com/science/discoveries/news/2008/03/new_face_recognition Engineers Test Highly Accurate Face Recognition]&lt;/ref&gt;<br /> <br /> ===Computed Tomography===<br /> Compressed sensing has been proposed for low dose [[Computed Tomography]] acquisition&lt;ref name=&quot;ata&quot;&gt;Barkan, O; Weill, J; Averbuch, A; Dekel, S. 
[http://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/Barkan_Adaptive_Compressed_Tomography_2013_CVPR_paper.pdf &quot;Adaptive Compressed Tomography Sensing&quot;]. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2013 (pp. 2195-2202).&lt;/ref&gt;. The proposed algorithm iterates between selective limited acquisition and improved reconstruction, with the goal of applying only the dose level required for sufficient image quality. The theoretical foundation of the algorithm is nonlinear Ridgelet approximation and a discrete form of Ridgelet analysis is used to compute the selective acquisition steps that best capture the image edges.<br /> <br /> ===Magnetic resonance imaging===<br /> Compressed sensing has been used &lt;ref name=&quot;dx.doi.org&quot;&gt;Sparse MRI: The application of compressed sensing for rapid MR imaging; See Lustig, Michael and Donoho, David and Pauly, John M, Magnetic resonance in medicine, 58(6), 1182-1195 (2007) {{DOI|10.1002/mrm.21391}}&lt;/ref&gt;&lt;ref name=&quot;Compressed Sensing MRI 2008&quot;&gt;{{cite journal | last1 = Lustig | first1 = M. | last2 = Donoho | first2 = D.L. | last3 = Santos | first3 = J.M. | last4 = Pauly | first4 = J.M. | year = 2008 | title = Compressed Sensing MRI; | url = | journal = Signal Processing Magazine, IEEE | volume = 25 | issue = 2| pages = 72–82 | doi = 10.1109/MSP.2007.914728 }}&lt;/ref&gt; to shorten [[magnetic resonance imaging]] scanning sessions on conventional hardware.&lt;ref&gt;{{cite web|author=Jordan EllenbergEmail Author |url=http://www.wired.com/magazine/2010/02/ff_algorithm/all/1 |title=Fill in the Blanks: Using Math to Turn Lo-Res Datasets Into Hi-Res Samples &amp;#124; Wired Magazine |publisher=Wired.com |date=2010-03-04 |accessdate=2013-06-04}}&lt;/ref&gt;&lt;ref&gt;[http://nuit-blanche.blogspot.com/2010/03/why-compressed-sensing-is-not-csi.html Why Compressed Sensing is NOT a CSI &quot;Enhance&quot; technology ... 
yet !]&lt;/ref&gt;&lt;ref&gt;[http://nuit-blanche.blogspot.com/2010/03/surely-you-must-be-joking-mr.html Surely You Must Be Joking Mr. Screenwriter]&lt;/ref&gt; Reconstruction methods include<br /> * ISTA<br /> * FISTA<br /> * SISTA<br /> * ePRESS &lt;ref&gt;{{cite journal|last1=Zhang|first1=Y.|last2=Peterson|first2=B.|title=Energy Preserved Sampling for Compressed Sensing MRI|journal=Computational and Mathematical Methods in Medicine|date=2014|volume=2014|doi=10.1155/2014/546814|url=http://www.hindawi.com/journals/cmmm/2014/546814|pages=1–12}}&lt;/ref&gt;<br /> * EWISTA &lt;ref name=Zhang_2015&gt;{{cite journal|last1=Zhang|first1=Y.|title=Exponential Wavelet Iterative Shrinkage Thresholding Algorithm for Compressed Sensing Magnetic Resonance Imaging|journal=Information Sciences|date=2015|volume=322|pages=115–132|url=http://www.sciencedirect.com/science/article/pii/S0020025515004491|doi=10.1016/j.ins.2015.06.017}}&lt;/ref&gt;<br /> * EWISTARS &lt;ref&gt;{{cite journal|last1=Zhang|first1=Y.|last2=Wang|first2=S.|title=Exponential Wavelet Iterative Shrinkage Thresholding Algorithm with Random Shift for Compressed Sensing Magnetic Resonance Imaging|journal=IEEJ Transactions on Electrical and Electronic Engineering|date=2015|volume=10|issue=1|pages=116–117|url=http://onlinelibrary.wiley.com/doi/10.1002/tee.22059/abstract|doi=10.1002/tee.22059}}&lt;/ref&gt; etc.<br /> <br /> Compressed sensing addresses the issue of high scan time by enabling faster acquisition: fewer Fourier coefficients are measured, producing a high-quality image with a relatively short scan time. Another application (also discussed below) is CT reconstruction with fewer X-ray projections. Compressed sensing, in this case, removes the high-spatial-gradient components, mainly image noise and artifacts.
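Of the reconstruction methods listed above, ISTA is the simplest to state: a gradient step on the data-fidelity term followed by soft-thresholding. The following is a minimal NumPy sketch of the generic algorithm for min&lt;sub&gt;x&lt;/sub&gt; ½||''Ax'' − ''b''||² + λ||''x''||&lt;sub&gt;1&lt;/sub&gt;, not the MRI-specific pipelines of the cited papers; the operator and sizes are illustrative.

```python
import numpy as np

def ista(A, b, lam, steps=300):
    """ISTA for min_x 0.5*||Ax - b||^2 + lam*||x||_1."""
    t = 1.0 / np.linalg.norm(A, 2) ** 2        # step size 1/L, L = ||A||_2^2
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        z = x + t * A.T @ (b - A @ x)          # gradient step on the data term
        x = np.sign(z) * np.maximum(np.abs(z) - t * lam, 0.0)  # soft-threshold
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((15, 40))              # undersampled measurement operator
x_true = np.zeros(40)
x_true[[5, 21]] = [1.0, -2.0]
b = A @ x_true
x_hat = ista(A, b, lam=0.05)
```

With the step size chosen as the reciprocal of the squared spectral norm of ''A'', each iteration is guaranteed not to increase the objective. FISTA adds a momentum term to the same update to accelerate convergence.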
Such an approach holds tremendous potential, as one can obtain high-resolution CT images at low radiation doses (through lower current-mA settings).&lt;ref name=&quot;MRI&quot;&gt;{{cite journal | last1 = Figueiredo | first1 = M. | last2 = Bioucas-Dias | first2 = J.M. | last3 = Nowak | first3 = R.D. | year = 2007 | title = Majorization–minimization algorithms for wavelet-based image restoration | url = | journal = IEEE Trans. Image Process | volume = 16 | issue = 12| pages = 2980–2991 | doi=10.1109/tip.2007.909318}}&lt;/ref&gt;<br /> <br /> ===Network tomography===<br /> Compressed sensing has shown promising results in the application of [[network tomography]] to [[network management]]. [[Network delay]] estimation and [[network congestion]] detection can both be modeled as underdetermined [[System of linear equations|systems of linear equations]] where the coefficient matrix is the network routing matrix. Moreover, in the [[Internet]], network routing matrices usually satisfy the criterion for using compressed sensing.&lt;ref&gt;[http://www.ee.washington.edu/research/funlab/Publications/2010/CS-Tomo.pdf Network tomography via compressed sensing]&lt;/ref&gt;<br /> <br /> ===Shortwave-infrared cameras===<br /> Commercial shortwave-infrared cameras based upon compressed sensing are available.&lt;ref&gt;{{cite web|title=InView web site|url=http://www.inviewcorp.com/products}}&lt;/ref&gt; These cameras have light sensitivity from 0.9&amp;nbsp;[[µm]] to 1.7&amp;nbsp;µm, wavelengths invisible to the human eye.<br /> <br /> ===Aperture synthesis in radio astronomy===<br /> In the field of [[radio astronomy]], compressed sensing has been proposed for deconvolving an interferometric image.&lt;ref&gt;[http://mnras.oxfordjournals.org/content/395/3/1733 Compressed sensing imaging techniques for radio interferometry]&lt;/ref&gt; In fact, the [[CLEAN (algorithm)|Högbom CLEAN algorithm]], which has been in use for the deconvolution of radio images since 1974, is similar to
compressed sensing's matching pursuit algorithm.<br /> <br /> ==See also==<br /> *[[Noiselet]]<br /> *[[Sparse approximation]]<br /> *[[Sparse coding]]<br /> *[[Low-density parity-check code]]<br /> <br /> ==Notes==<br /> {{reflist|group=note}}<br /> <br /> ==References==<br /> {{reflist|30em}}<br /> <br /> ==Further reading==<br /> * &quot;The Fundamentals of Compressive Sensing&quot; [http://www.brainshark.com/brainshark/brainshark.net/portal/title.aspx?pid=zCdz10BfTRz0z0 Part 1], [http://www.brainshark.com/brainshark/brainshark.net/portal/title.aspx?pid=zCgzXgcEKz0z0 Part 2] and [http://www.brainshark.com/brainshark/brainshark.net/portal/title.aspx?pid=zAvz9F41cz0z0 Part 3]: video tutorial by Mark Davenport, Georgia Tech. at [http://www.brainshark.com/sps SigView, the IEEE Signal Processing Society Tutorial Library].<br /> * [http://www.wired.com/magazine/2010/02/ff_algorithm/all/1 Using Math to Turn Lo-Res Datasets Into Hi-Res Samples] Wired Magazine article<br /> * [http://dsp.rice.edu/cs Compressive Sensing Resources] at [[Rice University]].<br /> * [http://igorcarron.googlepages.com/cs Compressed Sensing: The Big Picture]<br /> * [http://igorcarron.googlepages.com/compressedsensinghardware A list of different hardware implementation of Compressive Sensing]<br /> * [http://compressedsensing.googlepages.com/home Compressed Sensing 2.0 ]<br /> * [http://www.ams.org/happening-series/hap7-pixel.pdf Compressed Sensing Makes Every Pixel Count] – article in the AMS ''What's Happening in the Mathematical Sciences'' series<br /> * [http://nuit-blanche.blogspot.com/search/label/CS Nuit Blanche] A blog on Compressive Sensing featuring the most recent information on the subject (preprints, presentations, Q/As)<br /> * [http://igorcarron.googlepages.com/csvideos Online Talks focused on Compressive Sensing]<br /> * [http://ugcs.caltech.edu/~srbecker/wiki/Main_Page Wiki on sparse reconstruction]<br /> * [http://stemblab.github.io/intuitive-cs/ Intuitive Compressive 
Sensing]<br /> <br /> {{DEFAULTSORT:Compressed Sensing}}<br /> [[Category:Information theory]]<br /> [[Category:Signal processing]]<br /> [[Category:Linear algebra]]<br /> [[Category:Regression analysis]]<br /> [[Category:Mathematical optimization]]</div> Deepalgo https://en.wikipedia.org/w/index.php?title=Word_embedding&diff=711498230 Word embedding 2016-03-23T06:58:43Z <p>Deepalgo: </p> <hr /> <div>{{machine learning bar}}<br /> <br /> '''Word embedding''' is the collective name for a set of [[language model]]ing and [[feature learning]] techniques in [[natural language processing]] where words or phrases from the vocabulary are mapped to vectors of real numbers in a low-dimensional space relative to the vocabulary size (&quot;continuous space&quot;).<br /> <br /> Methods to generate this mapping include [[neural net language model|neural networks]],&lt;ref&gt;{{cite arXiv |eprint=1310.4546 |last1=Mikolov |first1=Tomas |title=Distributed Representations of Words and Phrases and their Compositionality |last2=Sutskever |first2=Ilya |last3=Chen |first3=Kai |last4=Corrado |first4=Greg |last5=Dean |first5=Jeffrey |class=cs.CL| year=2013}}&lt;/ref&gt;&lt;ref&gt;{{cite arXiv |eprint=1603.06571 |last1=Barkan |first1=Oren |title=Bayesian Neural Word Embedding |class=cs.CL| year=2015}}&lt;/ref&gt; [[dimensionality reduction]] on the word co-occurrence matrix,&lt;ref&gt;{{cite arXiv |eprint=1312.5542 |last1=Lebret |first1=Rémi |title=Word Emdeddings through Hellinger PCA |last2=Collobert |first2=Ronan |class=cs.CL |year=2013}}&lt;/ref&gt;&lt;ref&gt;{{Cite conference |url=http://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization.pdf |title=Neural Word Embedding as Implicit Matrix Factorization |last=Levy |first=Omer |conference=NIPS |year=2014 |last2=Goldberg |first2=Yoav}}&lt;/ref&gt;&lt;ref&gt;{{Cite conference |url=http://ijcai.org/papers15/Papers/IJCAI15-513.pdf |title=Word Embedding Revisited: A New Representation Learning and Explicit 
Matrix Factorization Perspective |last=Li |first=Yitan |conference=Int'l J. Conf. on Artificial Intelligence (IJCAI) |year=2015 |last2=Xu |first2=Linli}}&lt;/ref&gt; and explicit representation in terms of the context in which words appear.&lt;ref&gt;{{cite conference |last1=Levy |first1=Omer |last2=Goldberg |first2=Yoav |title=Linguistic Regularities in Sparse and Explicit Word Representations |conference=CoNLL |pages=171–180 |year=2014 |url=https://levyomer.files.wordpress.com/2014/04/linguistic-regularities-in-sparse-and-explicit-word-representations-conll-2014.pdf}}&lt;/ref&gt;<br /> <br /> Word and phrase embeddings, when used as the underlying input representation, have been shown to boost the performance in NLP tasks such as [[syntactic parsing]]&lt;ref&gt;{{cite conference |last1=Socher |first1=Richard |last2=Bauer |first2=John |last3=Manning |first3=Christopher |last4=Ng |first4=Andrew |title=Parsing with compositional vector grammars |conference=Proc. ACL Conf. |year=2013 |url=http://www.socher.org/uploads/Main/SocherBauerManningNg_ACL2013.pdf}}&lt;/ref&gt; and [[sentiment analysis]].&lt;ref&gt;{{cite conference |last1=Socher |first1=Richard |last2=Perelygin |first2=Alex |last3=Wu |first3=Jean |last4=Chuang |first4=Jason |last5=Manning |first5=Chris |last6=Ng |first6=Andrew |last7=Potts |first7=Chris |title=Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank |conference=EMNLP |year=2013 |url=http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf}}&lt;/ref&gt;<br /> <br /> == Software ==<br /> Software for training and using word embeddings includes [[Google]]'s [[Word2vec]], Stanford University's GloVe&lt;ref&gt;{{cite web |url=http://nlp.stanford.edu/projects/glove/ |title=GloVe}}&lt;/ref&gt; and [[Deeplearning4j]].<br /> <br /> == See also ==<br /> * [[Brown clustering]]<br /> <br /> == References ==<br /> {{Reflist}}<br /> <br /> [[Category:Language modeling]]<br /> [[Category:Artificial neural networks]]</div> Deepalgo 
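The dimensionality-reduction route mentioned above (factorizing a word co-occurrence matrix) can be sketched in a few lines of NumPy. The toy corpus, window size, and embedding dimension are invented for illustration and do not come from the cited papers.

```python
import numpy as np

corpus = ["the cat sat on the mat".split(),
          "the dog sat on the log".split()]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Count symmetric co-occurrences within a +/-2-word window
C = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 2), min(len(sent), i + 3)):
            if j != i:
                C[idx[w], idx[sent[j]]] += 1

# A truncated SVD of the co-occurrence matrix yields low-dimensional embeddings
U, s, _ = np.linalg.svd(C)
k = 3
embeddings = U[:, :k] * s[:k]        # one k-dimensional vector per word
```

Methods such as GloVe and the matrix-factorization view of word2vec refine this basic idea with reweighted or shifted co-occurrence statistics, but the output is the same kind of object: a dense real-valued vector per vocabulary word.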
https://en.wikipedia.org/w/index.php?title=Gaussian_process&diff=711255220 Gaussian process
2016-03-21T20:09:53Z <p>Deepalgo: Undid revision 711075308 by 174.3.155.181 (talk) It is not appropriate to reference a paper in external links, please check &#039;deep learning&#039; there are many arxiv references that were publi</p> <hr /> <div>In [[probability theory]] and [[statistics]], a '''Gaussian process''' is a [[statistical distribution]] where [[random variate|observations]] occur in a continuous domain, e.g. time or space. In a Gaussian process, every point in some continuous input space is associated with a [[normal distribution|normally distributed]] [[random variable]]. Moreover, every finite collection of those random variables has a [[multivariate normal distribution]]. The distribution of a Gaussian process is the joint distribution of all those (infinitely many) random variables, and as such, it is a distribution over functions with a continuous domain, e.g. time or space.<br /> <br /> The concept of Gaussian processes is named after [[Carl Friedrich Gauss]] because it is based on the notion of the Gaussian distribution ([[normal distribution]]). Gaussian processes can be seen as an infinite-dimensional generalization of multivariate normal distributions.<br /> <br /> Gaussian processes are important in [[statistical model]]ling because of properties inherited from the normal. For example, if a random process is modeled as a Gaussian process, the distributions of various derived quantities can be obtained explicitly. Such quantities include the average value of the process over a range of times and the error in estimating the average using sample values at a small set of times.<br /> <br /> ==Definition==<br /> A '''Gaussian process''' is a [[statistical distribution]] ''X''&lt;sub&gt;''t''&lt;/sub&gt;, ''t'' ∈ ''T'', for which any finite [[linear combination]] of [[Sampling (statistics)|samples]] has a [[multivariate normal distribution|joint Gaussian distribution]]. 
More accurately, any linear [[functional (mathematics)|functional]] applied to the sample function ''X''&lt;sub&gt;''t''&lt;/sub&gt; will give a normally distributed result. Notation-wise, one can write ''X'' ~ GP(''m'',''K''), meaning the [[random function]] ''X'' is distributed as a GP with mean function ''m'' and [[covariance function]] ''K''.&lt;ref&gt;{{Cite book | last1 = Rasmussen | first1 = C. E. | chapter = Gaussian Processes in Machine Learning | doi = 10.1007/978-3-540-28650-9_4 | title = Advanced Lectures on Machine Learning | series = Lecture Notes in Computer Science | volume = 3176 | pages = 63–71 | year = 2004 | isbn = 978-3-540-23122-6 | pmid = | pmc = }}&lt;/ref&gt; When the input vector ''t'' is two- or multi-dimensional, a Gaussian process might be also known as a ''[[Gaussian random field]]''.&lt;ref name=&quot;prml&quot;&gt;{{cite book |last=Bishop |first=C.M. |title= Pattern Recognition and Machine Learning |year=2006 |publisher=[[Springer Science+Business Media|Springer]] |isbn=0-387-31073-8}}&lt;/ref&gt;<br /> <br /> Some authors&lt;ref&gt;{{cite book |last=Simon |first=Barry |title=Functional Integration and Quantum Physics |year=1979 |publisher=Academic Press}}&lt;/ref&gt; assume the [[random variable]]s ''X''&lt;sub&gt;''t''&lt;/sub&gt; have mean zero; this greatly simplifies calculations [[without loss of generality]] and allows the mean square properties of the process to be ''entirely'' determined by the [[covariance function]] ''K''.&lt;ref name=&quot;seegerGPML&quot;&gt;{{cite journal |last1= Seeger| first1= Matthias |year= 2004 |title= Gaussian Processes for Machine Learning|journal= International Journal of Neural Systems|volume= 14|issue= 2|pages= 69–104 |doi=10.1142/s0129065704001899}}&lt;/ref&gt;<br /> <br /> ==Alternative definitions==<br /> Alternatively, a time continuous [[stochastic process]] is Gaussian [[if and only if]] for every [[finite set]] of [[indexed family|indices]] &lt;math&gt;t_1,\ldots,t_k&lt;/math&gt; in the 
index set &lt;math&gt;T&lt;/math&gt;<br /> <br /> :&lt;math&gt;{\mathbf{X}}_{t_1, \ldots, t_k} = (\mathbf{X}_{t_1}, \ldots, \mathbf{X}_{t_k}) &lt;/math&gt;<br /> <br /> is a [[multivariate normal distribution|multivariate Gaussian]] [[random variable]]. Using [[Characteristic function (probability theory)|characteristic functions]] of random variables, the Gaussian property can be formulated as follows: &lt;math&gt;\left\{X_t ; t\in T\right\}&lt;/math&gt; is Gaussian if and only if, for every finite set of indices &lt;math&gt;t_1,\ldots,t_k&lt;/math&gt;, there are real valued &lt;math&gt;\sigma_{\ell j}&lt;/math&gt;, &lt;math&gt;\mu_\ell&lt;/math&gt; with &lt;math&gt;\sigma_{jj} &gt; 0&lt;/math&gt; such that the following equality holds for all &lt;math&gt;s_1,s_2,...s_k\in\mathbb{R}&lt;/math&gt;<br /> <br /> :&lt;math&gt; \operatorname{E}\left(\exp\left(i \ \sum_{\ell=1}^k s_\ell \ \mathbf{X}_{t_\ell}\right)\right) = \exp \left(-\frac{1}{2} \, \sum_{\ell, j} \sigma_{\ell j} s_\ell s_j + i \sum_\ell \mu_\ell s_\ell\right). &lt;/math&gt;<br /> <br /> where &lt;math&gt;i&lt;/math&gt; denotes the imaginary number &lt;math&gt;\sqrt{-1}&lt;/math&gt;.<br /> <br /> The numbers &lt;math&gt;\sigma_{\ell j}&lt;/math&gt; and &lt;math&gt;\mu_\ell&lt;/math&gt; can be shown to be the [[covariance]]s and [[mean (mathematics)|means]] of the variables in the process.&lt;ref&gt;{{cite book |last=Dudley |first=R.M. |title=Real Analysis and Probability |year=1989 |publisher=Wadsworth and Brooks/Cole}}&lt;/ref&gt;<br /> <br /> ==Covariance functions==<br /> A key fact of Gaussian processes is that they can be completely defined by their second-order statistics.&lt;ref name=&quot;prml&quot;/&gt; Thus, if a Gaussian process is assumed to have mean zero, defining the [[covariance function]] completely defines the process' behaviour. Importantly the non-negative definiteness of this function enables its spectral decomposition using the [[Karhunen–Loeve expansion]]. 
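Since a zero-mean Gaussian process is fully specified by its covariance function, drawing a sample path at finitely many inputs reduces to drawing from a multivariate normal whose covariance matrix is the Gram matrix of those inputs. A hedged NumPy sketch with a squared-exponential kernel (the grid and jitter value are illustrative choices):

```python
import numpy as np

def sq_exp(x1, x2, l=1.0):
    """Squared-exponential covariance k(x, x') = exp(-(x - x')^2 / (2 l^2))."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-d ** 2 / (2.0 * l ** 2))

x = np.linspace(0.0, 5.0, 50)
K = sq_exp(x, x)                              # Gram matrix of the 50 inputs
# A small jitter on the diagonal keeps the Cholesky factorization stable
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(x)))
rng = np.random.default_rng(0)
f = L @ rng.standard_normal(len(x))           # one draw from the GP prior
```

Repeating the last line with fresh standard-normal draws gives independent sample paths; changing the length-scale `l` changes how rapidly those paths wiggle.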
Basic aspects that can be defined through the covariance function are the process' [[stationary process|stationarity]], [[isotropy]], [[smoothness]] and [[periodic function|periodicity]].&lt;ref name=&quot;brml&quot;&gt;{{cite book |last=Barber |first=David |title=Bayesian Reasoning and Machine Learning |url=http://web4.cs.ucl.ac.uk/staff/D.Barber/pmwiki/pmwiki.php?n=Brml.HomePage |year=2012 |publisher=[[Cambridge University Press]] |isbn=978-0-521-51814-7}}&lt;/ref&gt;&lt;ref name=&quot;gpml&quot;&gt;{{cite book |last=Rasmussen |first=C.E. |author2=Williams, C.K.I |title=Gaussian Processes for Machine Learning |url=http://www.gaussianprocess.org/gpml/ |year=2006 |publisher=[[MIT Press]] |isbn=0-262-18253-X}}&lt;/ref&gt;<br /> <br /> [[stationary process|Stationarity]] refers to the process' behaviour regarding the separation of any two points ''x'' and ''x' ''. If the process is stationary, the covariance depends only on the separation ''x''&amp;nbsp;&amp;minus;&amp;nbsp;''x''&lt;nowiki&gt;'&lt;/nowiki&gt;, while if non-stationary it depends on the actual positions of the points ''x'' and ''x''&lt;nowiki&gt;'&lt;/nowiki&gt;. For example, the Ornstein&amp;ndash;Uhlenbeck process is stationary, whereas [[Brownian motion]] is not.<br /> <br /> If the process depends only on |''x''&amp;nbsp;&amp;minus;&amp;nbsp;''x''&lt;nowiki&gt;'&lt;/nowiki&gt;|, the Euclidean distance (not the direction) between ''x'' and ''x''&lt;nowiki&gt;'&lt;/nowiki&gt;, then the process is considered isotropic.
A process that is concurrently stationary and isotropic is considered to be [[homogeneous]];&lt;ref name=&quot;PRP&quot;&gt;{{cite book |last=Grimmett |first=Geoffrey |author2=David Stirzaker|title= Probability and Random Processes| year=2001 |publisher=[[Oxford University Press]] |isbn=0198572220}}&lt;/ref&gt; in practice these properties reflect the differences (or rather the lack of them) in the behaviour of the process given the location of the observer.<br /> <br /> Ultimately, Gaussian processes amount to placing priors on functions, and the smoothness of these priors can be induced by the covariance function.&lt;ref name =&quot;brml&quot;/&gt; If we expect the output points ''y'' and ''y' '' at &quot;nearby&quot; input points ''x'' and ''x' '' to also be &quot;nearby&quot;, then the assumption of continuity is present. If we wish to allow for significant displacement, then we might choose a rougher covariance function. Extreme examples of this behaviour are the Ornstein&amp;ndash;Uhlenbeck covariance function and the squared exponential, where the former is nowhere differentiable and the latter is infinitely differentiable.<br /> <br /> Periodicity refers to inducing periodic patterns within the behaviour of the process. Formally, this is achieved by mapping the input ''x'' to a two-dimensional vector ''u''(''x'')&amp;nbsp;=&amp;nbsp;(cos(''x''),&amp;nbsp;sin(''x'')).<br /> <br /> ===Usual covariance functions===<br /> [[File:Gaussian process draws from prior distribution.png|thumbnail|right|The effect of choosing different kernels on the prior function distribution of the Gaussian process. Left is a squared exponential kernel. Middle is Brownian.
Right is quadratic.]]<br /> There are a number of common covariance functions:&lt;ref name=&quot;gpml&quot;/&gt;<br /> *Constant : &lt;math&gt; K_\text{C}(x,x') = C &lt;/math&gt;<br /> *Linear: &lt;math&gt; K_\text{L}(x,x') = x^T x'&lt;/math&gt;<br /> *Gaussian Noise: &lt;math&gt; K_\text{GN}(x,x') = \sigma^2 \delta_{x,x'}&lt;/math&gt;<br /> *Squared Exponential: &lt;math&gt; K_\text{SE}(x,x') = \exp \Big(-\frac{||d||^2}{2l^2} \Big)&lt;/math&gt;<br /> *Ornstein&amp;ndash;Uhlenbeck: &lt;math&gt; K_\text{OU}(x,x') = \exp \Big(-\frac{|d| }{l} \Big)&lt;/math&gt;<br /> *Matérn: &lt;math&gt; K_\text{Matern}(x,x') = \frac{2^{1-\nu}}{\Gamma(\nu)} \Big(\frac{\sqrt{2\nu}|d|}{l} \Big)^\nu K_{\nu}\Big(\frac{\sqrt{2\nu}|d|}{l} \Big)&lt;/math&gt;<br /> *Periodic: &lt;math&gt; K_\text{P}(x,x') = \exp\Big(-\frac{ 2\sin^2(\frac{d}{2})}{ l^2} \Big)&lt;/math&gt;<br /> *Rational Quadratic: &lt;math&gt; K_\text{RQ}(x,x') = (1+|d|^2)^{-\alpha}, \quad \alpha \geq 0&lt;/math&gt;<br /> <br /> Here &lt;math&gt;d = x- x'&lt;/math&gt;. The parameter &lt;math&gt;l&lt;/math&gt; is the characteristic length-scale of the process (practically, &quot;how close&quot; two points &lt;math&gt;x&lt;/math&gt; and &lt;math&gt;x'&lt;/math&gt; have to be to influence each other significantly), δ is the [[Kronecker delta]] and σ the [[standard deviation]] of the noise fluctuations. Moreover, &lt;math&gt;K_\nu&lt;/math&gt; is the [[modified Bessel function]] of order &lt;math&gt;\nu&lt;/math&gt; and &lt;math&gt;\Gamma(\nu)&lt;/math&gt; is the [[gamma function]] evaluated at &lt;math&gt;\nu&lt;/math&gt;. Importantly, a complicated covariance function can be defined as a linear combination of other simpler covariance functions in order to incorporate different insights about the data-set at hand.<br /> <br /> Clearly, the inferential results are dependent on the values of the hyperparameters θ (e.g. &lt;math&gt;l&lt;/math&gt; and ''σ'') defining the model's behaviour. 
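The dependence on the hyperparameters θ can be made concrete through the log marginal likelihood of a zero-mean GP with noisy observations, log p(y | θ) = −½ y&lt;sup&gt;T&lt;/sup&gt;K&lt;sup&gt;−1&lt;/sup&gt;y − ½ log|K| − (n/2) log 2π. The following sketch scores a few candidate length-scales; the kernel choice, noise level, and grid are illustrative assumptions rather than a prescribed procedure.

```python
import numpy as np

def log_marginal_likelihood(x, y, l, sigma_n=0.1):
    """log p(y | x, l) for a zero-mean GP with a squared-exponential kernel."""
    d = x[:, None] - x[None, :]
    K = np.exp(-d ** 2 / (2.0 * l ** 2)) + sigma_n ** 2 * np.eye(len(x))
    L = np.linalg.cholesky(K)                             # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K^{-1} y
    return (-0.5 * y @ alpha
            - np.log(np.diag(L)).sum()                    # -0.5 * log det K
            - 0.5 * len(x) * np.log(2.0 * np.pi))

x = np.linspace(0.0, 5.0, 25)
y = np.sin(x)                                             # toy observations
scores = {l: log_marginal_likelihood(x, y, l) for l in (0.1, 0.5, 1.0, 2.0)}
best_l = max(scores, key=scores.get)                      # empirical-Bayes choice
```

Maximizing this quantity over θ is exactly the ''evidence maximization'' described next; the Cholesky factorization is the standard way to evaluate both the quadratic form and the log-determinant stably.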
A popular choice for θ is to provide ''[[maximum a posteriori]]'' (MAP) estimates of it with some chosen prior. If the prior is very near uniform, this is the same as maximizing the [[marginal likelihood]] of the process; the marginalization being done over the observed process values &lt;math&gt;y&lt;/math&gt;.&lt;ref name= &quot;gpml&quot;/&gt; This approach is also known as ''maximum likelihood II'', ''evidence maximization'', or ''[[Empirical Bayes]]''.&lt;ref name= &quot;seegerGPML&quot;/&gt;<br /> <br /> ==Brownian Motion as the Integral of Gaussian processes==<br /> A [[Wiener process]] (aka brownian motion) is the integral of a white noise Gaussian process. It is not [[stationary process|stationary]], but it has stationary increments.<br /> <br /> The [[Ornstein&amp;ndash;Uhlenbeck process]] is a [[stationary process|stationary]] Gaussian process.<br /> <br /> The [[Brownian bridge]] is the integral of a Gaussian process whose increments are not [[statistical independence|independent]].<br /> <br /> The [[fractional Brownian motion]] is the integral of a Gaussian process whose covariance function is a generalisation of Wiener process.<br /> <br /> ==Applications==<br /> A Gaussian process can be used as a [[prior probability distribution]] over [[Function (mathematics)|functions]] in [[Bayesian inference]].&lt;ref name=&quot;gpml&quot;/&gt;&lt;ref&gt;{{cite book |last=Liu |first=W. |author2=Principe, J.C. |author3=Haykin, S. 
|title=Kernel Adaptive Filtering: A Comprehensive Introduction |url=http://www.cnel.ufl.edu/~weifeng/publication.htm |year=2010 |publisher=[[John Wiley &amp; Sons|John Wiley]] |isbn=0-470-44753-2}}&lt;/ref&gt; Given any set of ''N'' points in the desired domain of your functions, take a [[multivariate Gaussian]] whose covariance [[matrix (mathematics)|matrix]] parameter is the [[Gram matrix]] of your ''N'' points with some desired [[stochastic kernel|kernel]], and [[sampling (mathematics)|sample]] from that Gaussian.<br /> <br /> Inference of continuous values with a Gaussian process prior is known as Gaussian process regression, or [[kriging]]; extending Gaussian process regression to [[Kernel methods for vector output|multiple target variables]] is known as ''cokriging''.&lt;ref&gt;{{cite book |last=Stein |first=M.L. |title=Interpolation of Spatial Data: Some Theory for Kriging |year=1999 |publisher = [[Springer Science+Business Media|Springer]]}}&lt;/ref&gt; Gaussian processes are thus useful as a powerful non-linear multivariate [[interpolation]] and out of sample extension&lt;ref name=&quot;gpr&quot;&gt; Barkan, O., Weill, J., &amp; Averbuch, A. (2016). [http://arxiv.org/abs/1603.02194 &quot;Gaussian Process Regression for Out-of-Sample Extension&quot;]. arXiv preprint arXiv:1603.02194.‏ &lt;/ref&gt; tool. Gaussian process regression can be further extended to address learning tasks in both [[Supervised learning|supervised]] (e.g. probabilistic classification&lt;ref name=&quot;gpml&quot;/&gt;) and [[Unsupervised learning|unsupervised]] (e.g. [[manifold learning]]&lt;ref name= &quot;prml&quot;/&gt;) learning frameworks.<br /> ===Gaussian process prediction, or kriging===<br /> [[File:Gaussian Process Regression.png|thumbnail|right|Gaussian Process Regression (prediction) with a squared exponential kernel. Left plot are draws from the prior function distribution. Middle are draws from the posterior. 
Right is mean prediction with one standard deviation shaded.]]<br /> When concerned with a general Gaussian process regression problem, it is assumed that for a Gaussian process ''f'' observed at coordinates x, the vector of values ''f(x)'' is just one sample from a multivariate Gaussian distribution of dimension equal to number of observed coordinates ''|x|''. Therefore under the assumption of a zero-mean distribution, ''f (x) ∼ N (0, K(θ,x,x'))'', where ''K(θ,x,x')'' is the covariance matrix between all possible pairs ''(x,x')'' for a given set of hyperparameters θ.&lt;ref name= &quot;gpml&quot;/&gt;<br /> As such the log marginal likelihood is:<br /> :&lt;math&gt;\log p(f(x)|\theta,x) = -\frac{1}{2}f(x)^T K(\theta,x,x')^{-1} f(x) -\frac{1}{2} \log \det(K(\theta,x,x')) - \frac{|x|}{2} \log 2\pi &lt;/math&gt;<br /> and maximizing this marginal likelihood towards θ provides the complete specification of the Gaussian process ''f''. One can briefly note at this point that the first term corresponds to a penalty term for a model's failure to fit observed values and the second term to a penalty term that increases proportionally to a model's complexity. Having specified ''θ'' making predictions about unobserved values ''f(x*)'' at coordinates ''x*'' is then only a matter of drawing samples from the predictive distribution ''p(y*|x*,f(x),x) = N(y*|A,B)'' where the posterior mean estimate A is defined as:<br /> :&lt;math&gt;A = K(\theta,x^*,x) K(\theta,x,x')^{-1} f(x)&lt;/math&gt;<br /> and the posterior variance estimate B is defined as:<br /> :&lt;math&gt;B = K(\theta,x^*,x^*) - K(\theta,x^*,x) K(\theta,x,x')^{-1} K(\theta,x^*,x)^T &lt;/math&gt;<br /> where ''K(θ,x*,x)'' is the covariance between the new coordinate of estimation ''x*'' and all other observed coordinates ''x'' for a given hyperparameter vector θ, ''K(θ,x,x')'' and ''f(x)'' are defined as before and ''K(θ,x*,x*)'' is the variance at point ''x*'' as dictated by ''θ''. 
It is important to note that practically the posterior mean estimate ''f(x*)'' (the &quot;point estimate&quot;) is just a linear combination of the observations ''f(x)''; in a similar manner the variance of ''f(x*)'' is actually independent of the observations ''f(x)''. A known bottleneck in Gaussian process prediction is that the computational complexity of prediction is cubic in the number of points ''|x|'' and as such can become unfeasible for larger data sets.&lt;ref name= &quot;brml&quot;/&gt; Works on sparse Gaussian processes, that usually are based on the idea of building a ''representative set'' for the given process ''f'', try to circumvent this issue.&lt;ref name=&quot;smolaSparse&quot;&gt;{{cite journal |last1= Smola| first1= A.J.| last2=Schoellkopf | first2= B. |year= 2000 |title= Sparse greedy matrix approximation for machine learning |journal= Proceedings of the Seventeenth International Conference on Machine Learning| pages=911–918}}&lt;/ref&gt;&lt;ref name=&quot;CsatoSparse&quot;&gt;{{cite journal |last1= Csato| first1=L.| last2=Opper | first2= M. 
|year= 2002 |title= Sparse on-line Gaussian processes |journal= Neural Computation |number=3| volume= 14 | pages=641–668 | doi=10.1162/089976602317250933}}&lt;/ref&gt;<br /> <br /> ==See also==<br /> * [[Bayes linear statistics]]<br /> * [[Bayesian interpretation of regularization]]<br /> <br /> ==Notes==<br /> {{Reflist}}<br /> <br /> ==External links==<br /> * [http://www.GaussianProcess.com www.GaussianProcess.com ]<br /> * [http://www.GaussianProcess.org The Gaussian Processes Web Site, including the text of Rasmussen and Williams' Gaussian Processes for Machine Learning]<br /> * [http://arxiv.org/abs/1505.02965 A gentle introduction to Gaussian processes]<br /> * [http://publications.nr.no/917_Rapport.pdf A Review of Gaussian Random Fields and Correlation Functions]<br /> <br /> ===Software===<br /> * [http://sourceforge.net/projects/kriging STK: a Small (Matlab/Octave) Toolbox for Kriging and GP modeling]<br /> * [http://www.uqlab.com/ Kriging module in UQLab framework (Matlab)]<br /> * [https://github.com/Yelp/MOE Yelp MOE - A black box optimization engine using Gaussian process learning]<br /> * [http://www.sumo.intec.ugent.be/ooDACE ooDACE] - A flexible object-oriented Kriging matlab toolbox.<br /> * [http://becs.aalto.fi/en/research/bayes/gpstuff/ GPstuff - Gaussian process toolbox for Matlab and Octave]<br /> * [https://github.com/SheffieldML/GPy GPy - A Gaussian processes framework in Python]<br /> * [http://www.tmpl.fi/gp/ Interactive Gaussian process regression demo]<br /> * [https://github.com/ChristophJud/GPR Basic Gaussian process library written in C++11]<br /> <br /> ===Video tutorials===<br /> * [http://videolectures.net/gpip06_mackay_gpb Gaussian Process Basics by David MacKay]<br /> * [http://videolectures.net/epsrcws08_rasmussen_lgp Learning with Gaussian Processes by Carl Edward Rasmussen]<br /> * [http://videolectures.net/mlss07_rasmussen_bigp Bayesian inference and Gaussian processes by Carl Edward Rasmussen]<br /> <br /> {{Stochastic 
processes}}<br /> <br /> {{Authority control}}<br /> <br /> {{DEFAULTSORT:Gaussian Process}}<br /> [[Category:Stochastic processes]]<br /> [[Category:Kernel methods for machine learning]]<br /> [[Category:Nonparametric Bayesian statistics]]</div> Deepalgo https://en.wikipedia.org/w/index.php?title=Word_embedding&diff=711241134 Word embedding 2016-03-21T18:33:40Z <p>Deepalgo: </p> <hr /> <div>{{machine learning bar}}<br /> <br /> '''Word embedding''' is the collective name for a set of [[language model]]ing and [[feature learning]] techniques in [[natural language processing]] where words or phrases from the vocabulary are mapped to vectors of real numbers in a low-dimensional space relative to the vocabulary size (&quot;continuous space&quot;).<br /> <br /> Methods to generate this mapping include [[neural net language model|neural networks]]&lt;ref&gt;{{cite arXiv |eprint=1310.4546 |last1=Mikolov |first1=Tomas |title=Distributed Representations of Words and Phrases and their Compositionality |last2=Sutskever |first2=Ilya |last3=Chen |first3=Kai |last4=Corrado |first4=Greg |last5=Dean |first5=Jeffrey |class=cs.CL| year=2013}}&lt;/ref&gt;&lt;ref name=&quot;bsg&quot;&gt; Barkan, Oren (8 August 2015). 
[https://www.researchgate.net/profile/Oren_Barkan/publication/298785900_Bayesian_Neural_Word_Embedding/links/56f039f108ae70bdd6c94644.pdf &quot;Bayesian Neural Word Embedding&quot;].&lt;/ref&gt;, [[dimensionality reduction]] on the word co-occurrence matrix,&lt;ref&gt;{{cite arXiv |eprint=1312.5542 |last1=Lebret |first1=Rémi |title=Word Emdeddings through Hellinger PCA |last2=Collobert |first2=Ronan |class=cs.CL |year=2013}}&lt;/ref&gt;&lt;ref&gt;{{Cite conference |url=http://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization.pdf |title=Neural Word Embedding as Implicit Matrix Factorization |last=Levy |first=Omer |conference=NIPS |year=2014 |last2=Goldberg |first2=Yoav}}&lt;/ref&gt;&lt;ref&gt;{{Cite conference |url=http://ijcai.org/papers15/Papers/IJCAI15-513.pdf |title=Word Embedding Revisited: A New Representation Learning and Explicit Matrix Factorization Perspective |last=Li |first=Yitan |conference=Int'l J. Conf. on Artificial Intelligence (IJCAI) |year=2015 |last2=Xu |first2=Linli}}&lt;/ref&gt; and explicit representation in terms of the context in which words appear.&lt;ref&gt;{{cite conference |last1=Levy |first1=Omer |last2=Goldberg |first2=Yoav |title=Linguistic Regularities in Sparse and Explicit Word Representations |conference=CoNLL |pages=171–180 |year=2014 |url=https://levyomer.files.wordpress.com/2014/04/linguistic-regularities-in-sparse-and-explicit-word-representations-conll-2014.pdf}}&lt;/ref&gt;<br /> <br /> Word and phrase embeddings, when used as the underlying input representation, have been shown to boost the performance in NLP tasks such as [[syntactic parsing]]&lt;ref&gt;{{cite conference |last1=Socher |first1=Richard |last2=Bauer |first2=John |last3=Manning |first3=Christopher |last4=Ng |first4=Andrew |title=Parsing with compositional vector grammars |conference=Proc. ACL Conf. 
|year=2013 |url=http://www.socher.org/uploads/Main/SocherBauerManningNg_ACL2013.pdf}}&lt;/ref&gt; and [[sentiment analysis]].&lt;ref&gt;{{cite conference |last1=Socher |first1=Richard |last2=Perelygin |first2=Alex |last3=Wu |first3=Jean |last4=Chuang |first4=Jason |last5=Manning |first5=Chris |last6=Ng |first6=Andrew |last7=Potts |first7=Chris |title=Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank |conference=EMNLP |year=2013 |url=http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf}}&lt;/ref&gt;<br /> <br /> == Software ==<br /> Software for training and using word embeddings includes [[Google]]'s [[Word2vec]], Stanford University's GloVe&lt;ref&gt;{{cite web |url=http://nlp.stanford.edu/projects/glove/ |title=GloVe}}&lt;/ref&gt; and [[Deeplearning4j]].<br /> <br /> == See also ==<br /> * [[Brown clustering]]<br /> <br /> == References ==<br /> {{Reflist}}<br /> <br /> <br /> <br /> [[Category:Language modeling]]<br /> [[Category:Artificial neural networks]]</div> Deepalgo https://en.wikipedia.org/w/index.php?title=Gaussian_process&diff=711009886 Gaussian process 2016-03-20T12:37:48Z <p>Deepalgo: /* Applications */</p> <hr /> <div>In [[probability theory]] and [[statistics]], a '''Gaussian process''' is a [[statistical distribution]] where [[random variate|observations]] occur in a continuous domain, e.g. time or space. In a Gaussian process, every point in some continuous input space is associated with a [[normal distribution|normally distributed]] [[random variable]]. Moreover, every finite collection of those random variables has a [[multivariate normal distribution]]. The distribution of a Gaussian process is the joint distribution of all those (infinitely many) random variables, and as such, it is a distribution over functions with a continuous domain, e.g. 
time or space.<br /> <br /> The concept of Gaussian processes is named after [[Carl Friedrich Gauss]] because it is based on the notion of the Gaussian distribution ([[normal distribution]]). Gaussian processes can be seen as an infinite-dimensional generalization of multivariate normal distributions.<br /> <br /> Gaussian processes are important in [[statistical model]]ling because of properties inherited from the normal distribution. For example, if a random process is modeled as a Gaussian process, the distributions of various derived quantities can be obtained explicitly. Such quantities include the average value of the process over a range of times and the error in estimating the average using sample values at a small set of times.<br /> <br /> ==Definition==<br /> A '''Gaussian process''' is a collection of [[random variable]]s ''X''&lt;sub&gt;''t''&lt;/sub&gt;, ''t'' ∈ ''T'', for which any finite [[linear combination]] of [[Sampling (statistics)|samples]] has a [[multivariate normal distribution|joint Gaussian distribution]]. More accurately, any linear [[functional (mathematics)|functional]] applied to the sample function ''X''&lt;sub&gt;''t''&lt;/sub&gt; will give a normally distributed result. In notation, one can write ''X'' ~ GP(''m'',''K''), meaning the [[random function]] ''X'' is distributed as a GP with mean function ''m'' and [[covariance function]] ''K''.&lt;ref&gt;{{Cite book | last1 = Rasmussen | first1 = C. E. | chapter = Gaussian Processes in Machine Learning | doi = 10.1007/978-3-540-28650-9_4 | title = Advanced Lectures on Machine Learning | series = Lecture Notes in Computer Science | volume = 3176 | pages = 63–71 | year = 2004 | isbn = 978-3-540-23122-6 | pmid = | pmc = }}&lt;/ref&gt; When the input vector ''t'' is two- or multi-dimensional, a Gaussian process may also be known as a ''[[Gaussian random field]]''.&lt;ref name=&quot;prml&quot;&gt;{{cite book |last=Bishop |first=C.M.
|title= Pattern Recognition and Machine Learning |year=2006 |publisher=[[Springer Science+Business Media|Springer]] |isbn=0-387-31073-8}}&lt;/ref&gt;<br /> <br /> Some authors&lt;ref&gt;{{cite book |last=Simon |first=Barry |title=Functional Integration and Quantum Physics |year=1979 |publisher=Academic Press}}&lt;/ref&gt; assume the [[random variable]]s ''X''&lt;sub&gt;''t''&lt;/sub&gt; have mean zero; this greatly simplifies calculations [[without loss of generality]] and allows the mean square properties of the process to be ''entirely'' determined by the [[covariance function]] ''K''.&lt;ref name=&quot;seegerGPML&quot;&gt;{{cite journal |last1= Seeger| first1= Matthias |year= 2004 |title= Gaussian Processes for Machine Learning|journal= International Journal of Neural Systems|volume= 14|issue= 2|pages= 69–104 |doi=10.1142/s0129065704001899}}&lt;/ref&gt;<br /> <br /> ==Alternative definitions==<br /> Alternatively, a time continuous [[stochastic process]] is Gaussian [[if and only if]] for every [[finite set]] of [[indexed family|indices]] &lt;math&gt;t_1,\ldots,t_k&lt;/math&gt; in the index set &lt;math&gt;T&lt;/math&gt;<br /> <br /> :&lt;math&gt;{\mathbf{X}}_{t_1, \ldots, t_k} = (\mathbf{X}_{t_1}, \ldots, \mathbf{X}_{t_k}) &lt;/math&gt;<br /> <br /> is a [[multivariate normal distribution|multivariate Gaussian]] [[random variable]]. 
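The defining property can be illustrated numerically. The following NumPy sketch (the squared-exponential covariance, the index set and the coefficients are arbitrary illustrative choices, not taken from the text) draws many joint samples of a finite collection (''X''<sub>''t''<sub>1</sub></sub>, …, ''X''<sub>''t''<sub>''k''</sub></sub>) and checks that an arbitrary linear combination behaves like a normal variable with the variance predicted by the covariance matrix:

```python
import numpy as np

# Any finite collection (X_{t_1}, ..., X_{t_k}) of a Gaussian process is jointly
# Gaussian, so every linear combination sum_l s_l * X_{t_l} must be normally
# distributed with mean s^T mu and variance s^T K s.  Here: a zero-mean GP with
# a squared-exponential covariance; t, s and the sample size are arbitrary.
rng = np.random.default_rng(0)

t = np.array([0.0, 0.5, 1.3, 2.0])                 # finite set of indices t_1..t_k
K = np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2)  # covariance matrix K(t_i, t_j)
s = np.array([1.0, -2.0, 0.5, 3.0])                # arbitrary real coefficients

samples = rng.multivariate_normal(np.zeros(len(t)), K, size=200_000)
lin = samples @ s                                  # one linear combination per draw

print("empirical variance:", lin.var())
print("theoretical s^T K s:", s @ K @ s)
```

The two printed variances agree up to Monte Carlo error, consistent with the statement that the numbers σ<sub>ℓj</sub> and μ<sub>ℓ</sub> are exactly the covariances and means of the variables in the process.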
Using [[Characteristic function (probability theory)|characteristic functions]] of random variables, the Gaussian property can be formulated as follows: &lt;math&gt;\left\{X_t ; t\in T\right\}&lt;/math&gt; is Gaussian if and only if, for every finite set of indices &lt;math&gt;t_1,\ldots,t_k&lt;/math&gt;, there are real valued &lt;math&gt;\sigma_{\ell j}&lt;/math&gt;, &lt;math&gt;\mu_\ell&lt;/math&gt; with &lt;math&gt;\sigma_{jj} &gt; 0&lt;/math&gt; such that the following equality holds for all &lt;math&gt;s_1,s_2,...s_k\in\mathbb{R}&lt;/math&gt;<br /> <br /> :&lt;math&gt; \operatorname{E}\left(\exp\left(i \ \sum_{\ell=1}^k s_\ell \ \mathbf{X}_{t_\ell}\right)\right) = \exp \left(-\frac{1}{2} \, \sum_{\ell, j} \sigma_{\ell j} s_\ell s_j + i \sum_\ell \mu_\ell s_\ell\right). &lt;/math&gt;<br /> <br /> where &lt;math&gt;i&lt;/math&gt; denotes the imaginary number &lt;math&gt;\sqrt{-1}&lt;/math&gt;.<br /> <br /> The numbers &lt;math&gt;\sigma_{\ell j}&lt;/math&gt; and &lt;math&gt;\mu_\ell&lt;/math&gt; can be shown to be the [[covariance]]s and [[mean (mathematics)|means]] of the variables in the process.&lt;ref&gt;{{cite book |last=Dudley |first=R.M. |title=Real Analysis and Probability |year=1989 |publisher=Wadsworth and Brooks/Cole}}&lt;/ref&gt;<br /> <br /> ==Covariance functions==<br /> A key fact of Gaussian processes is that they can be completely defined by their second-order statistics.&lt;ref name=&quot;prml&quot;/&gt; Thus, if a Gaussian process is assumed to have mean zero, defining the [[covariance function]] completely defines the process' behaviour. Importantly the non-negative definiteness of this function enables its spectral decomposition using the [[Karhunen–Loeve expansion]]. 
Basic aspects that can be defined through the covariance function are the process' [[stationary process|stationarity]], [[isotropy]], [[smoothness]] and [[periodic function|periodicity]].&lt;ref name=&quot;brml&quot;&gt;{{cite book |last=Barber |first=David |title=Bayesian Reasoning and Machine Learning |url=http://web4.cs.ucl.ac.uk/staff/D.Barber/pmwiki/pmwiki.php?n=Brml.HomePage |year=2012 |publisher=[[Cambridge University Press]] |isbn=978-0-521-51814-7}}&lt;/ref&gt;&lt;ref name=&quot;gpml&quot;&gt;{{cite book |last=Rasmussen |first=C.E. |author2=Williams, C.K.I |title=Gaussian Processes for Machine Learning |url=http://www.gaussianprocess.org/gpml/ |year=2006 |publisher=[[MIT Press]] |isbn=0-262-18253-X}}&lt;/ref&gt;<br /> <br /> [[stationary process|Stationarity]] refers to the process' behaviour regarding the separation of any two points ''x'' and ''x' ''. If the process is stationary, the covariance function depends only on their separation, ''x''&amp;nbsp;&amp;minus;&amp;nbsp;''x''&lt;nowiki&gt;'&lt;/nowiki&gt;, while if it is non-stationary the covariance function depends on the actual positions of the points ''x'' and ''x''&lt;nowiki&gt;'&lt;/nowiki&gt;. For example, the Ornstein&amp;ndash;Uhlenbeck process is stationary, whereas the [[Brownian motion]] process is not.<br /> <br /> If the process depends only on |''x''&amp;nbsp;&amp;minus;&amp;nbsp;''x''&lt;nowiki&gt;'&lt;/nowiki&gt;|, the Euclidean distance (not the direction) between ''x'' and ''x''', then the process is considered isotropic.
A process that is concurrently stationary and isotropic is considered to be [[homogeneous]];&lt;ref name=&quot;PRP&quot;&gt;{{cite book |last=Grimmett |first=Geoffrey |author2=David Stirzaker|title= Probability and Random Processes| year=2001 |publisher=[[Oxford University Press]] |isbn=0198572220}}&lt;/ref&gt; in practice these properties reflect the differences (or rather the lack of them) in the behaviour of the process given the location of the observer.<br /> <br /> Ultimately, Gaussian processes amount to placing priors on functions, and the smoothness of these priors can be induced by the covariance function.&lt;ref name =&quot;brml&quot;/&gt; If we expect that for &quot;near-by&quot; input points ''x'' and ''x' '' the corresponding output points ''y'' and ''y' '' will also be &quot;near-by&quot;, then we are making an assumption of continuity. If we wish to allow for significant displacement then we might choose a rougher covariance function. Extreme examples of this behaviour are the Ornstein&amp;ndash;Uhlenbeck covariance function and the squared exponential: sample paths under the former are nowhere differentiable, while under the latter they are infinitely differentiable.<br /> <br /> Periodicity refers to inducing periodic patterns within the behaviour of the process. Formally, this is achieved by mapping the input ''x'' to a two-dimensional vector ''u''(''x'')&amp;nbsp;=&amp;nbsp;(cos(''x''),&amp;nbsp;sin(''x'')).<br /> <br /> ===Usual covariance functions===<br /> [[File:Gaussian process draws from prior distribution.png|thumbnail|right|The effect of choosing different kernels on the prior function distribution of the Gaussian process. Left is a squared exponential kernel. Middle is Brownian.
Right is quadratic.]]<br /> There are a number of common covariance functions:&lt;ref name=&quot;gpml&quot;/&gt;<br /> *Constant : &lt;math&gt; K_\text{C}(x,x') = C &lt;/math&gt;<br /> *Linear: &lt;math&gt; K_\text{L}(x,x') = x^T x'&lt;/math&gt;<br /> *Gaussian Noise: &lt;math&gt; K_\text{GN}(x,x') = \sigma^2 \delta_{x,x'}&lt;/math&gt;<br /> *Squared Exponential: &lt;math&gt; K_\text{SE}(x,x') = \exp \Big(-\frac{||d||^2}{2l^2} \Big)&lt;/math&gt;<br /> *Ornstein&amp;ndash;Uhlenbeck: &lt;math&gt; K_\text{OU}(x,x') = \exp \Big(-\frac{|d| }{l} \Big)&lt;/math&gt;<br /> *Matérn: &lt;math&gt; K_\text{Matern}(x,x') = \frac{2^{1-\nu}}{\Gamma(\nu)} \Big(\frac{\sqrt{2\nu}|d|}{l} \Big)^\nu K_{\nu}\Big(\frac{\sqrt{2\nu}|d|}{l} \Big)&lt;/math&gt;<br /> *Periodic: &lt;math&gt; K_\text{P}(x,x') = \exp\Big(-\frac{ 2\sin^2(\frac{d}{2})}{ l^2} \Big)&lt;/math&gt;<br /> *Rational Quadratic: &lt;math&gt; K_\text{RQ}(x,x') = (1+|d|^2)^{-\alpha}, \quad \alpha \geq 0&lt;/math&gt;<br /> <br /> Here &lt;math&gt;d = x- x'&lt;/math&gt;. The parameter &lt;math&gt;l&lt;/math&gt; is the characteristic length-scale of the process (practically, &quot;how close&quot; two points &lt;math&gt;x&lt;/math&gt; and &lt;math&gt;x'&lt;/math&gt; have to be to influence each other significantly), δ is the [[Kronecker delta]] and σ the [[standard deviation]] of the noise fluctuations. Moreover, &lt;math&gt;K_\nu&lt;/math&gt; is the [[modified Bessel function]] of order &lt;math&gt;\nu&lt;/math&gt; and &lt;math&gt;\Gamma(\nu)&lt;/math&gt; is the [[gamma function]] evaluated at &lt;math&gt;\nu&lt;/math&gt;. Importantly, a complicated covariance function can be defined as a linear combination of other simpler covariance functions in order to incorporate different insights about the data-set at hand.<br /> <br /> Clearly, the inferential results are dependent on the values of the hyperparameters θ (e.g. &lt;math&gt;l&lt;/math&gt; and ''σ'') defining the model's behaviour. 
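The covariance functions tabulated above translate directly into code. A minimal NumPy sketch (the grid, the length-scale ''l'' = 1 and the diagonal "jitter" term are illustrative choices, not taken from the article) builds the squared-exponential and Ornstein–Uhlenbeck Gram matrices and draws zero-mean prior functions from each:

```python
import numpy as np

def k_se(x, xp, l=1.0):
    """Squared-exponential covariance, with d = x - x'."""
    d = x[:, None] - xp[None, :]
    return np.exp(-d ** 2 / (2 * l ** 2))

def k_ou(x, xp, l=1.0):
    """Ornstein-Uhlenbeck covariance, with d = x - x'."""
    d = x[:, None] - xp[None, :]
    return np.exp(-np.abs(d) / l)

x = np.linspace(0.0, 5.0, 100)
rng = np.random.default_rng(1)

# Draw three zero-mean prior functions f ~ N(0, K) for each kernel; a small
# jitter on the diagonal keeps the Gram matrix numerically positive definite.
for K in (k_se(x, x), k_ou(x, x)):
    f = rng.multivariate_normal(np.zeros(len(x)), K + 1e-8 * np.eye(len(x)), size=3)
    # f has shape (3, 100): three sample paths evaluated on the grid.
```

Because a linear combination of covariance functions with non-negative weights is again a covariance function, a sum such as `k_se(x, x) + k_ou(x, x)` is also a valid Gram matrix, matching the remark above about combining simpler covariance functions.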
A popular choice for θ is to provide ''[[maximum a posteriori]]'' (MAP) estimates of it with some chosen prior. If the prior is very near uniform, this is the same as maximizing the [[marginal likelihood]] of the process; the marginalization being done over the observed process values &lt;math&gt;y&lt;/math&gt;.&lt;ref name= &quot;gpml&quot;/&gt; This approach is also known as ''maximum likelihood II'', ''evidence maximization'', or ''[[Empirical Bayes]]''.&lt;ref name= &quot;seegerGPML&quot;/&gt;<br /> <br /> ==Brownian Motion as the Integral of Gaussian processes==<br /> A [[Wiener process]] (also known as Brownian motion) is the integral of a white noise Gaussian process. It is not [[stationary process|stationary]], but it has stationary increments.<br /> <br /> The [[Ornstein&amp;ndash;Uhlenbeck process]] is a [[stationary process|stationary]] Gaussian process.<br /> <br /> The [[Brownian bridge]] is the integral of a Gaussian process whose increments are not [[statistical independence|independent]].<br /> <br /> The [[fractional Brownian motion]] is the integral of a Gaussian process whose covariance function is a generalisation of that of the Wiener process.<br /> <br /> ==Applications==<br /> A Gaussian process can be used as a [[prior probability distribution]] over [[Function (mathematics)|functions]] in [[Bayesian inference]].&lt;ref name=&quot;gpml&quot;/&gt;&lt;ref&gt;{{cite book |last=Liu |first=W. |author2=Principe, J.C. |author3=Haykin, S.
|title=Kernel Adaptive Filtering: A Comprehensive Introduction |url=http://www.cnel.ufl.edu/~weifeng/publication.htm |year=2010 |publisher=[[John Wiley &amp; Sons|John Wiley]] |isbn=0-470-44753-2}}&lt;/ref&gt; Given any set of ''N'' points in the desired domain of your functions, take a [[multivariate Gaussian]] whose covariance [[matrix (mathematics)|matrix]] parameter is the [[Gram matrix]] of your ''N'' points with some desired [[stochastic kernel|kernel]], and [[sampling (mathematics)|sample]] from that Gaussian.<br /> <br /> Inference of continuous values with a Gaussian process prior is known as Gaussian process regression, or [[kriging]]; extending Gaussian process regression to [[Kernel methods for vector output|multiple target variables]] is known as ''cokriging''.&lt;ref&gt;{{cite book |last=Stein |first=M.L. |title=Interpolation of Spatial Data: Some Theory for Kriging |year=1999 |publisher = [[Springer Science+Business Media|Springer]]}}&lt;/ref&gt; Gaussian processes are thus useful as a powerful non-linear multivariate [[interpolation]] and out-of-sample extension&lt;ref name=&quot;gpr&quot;&gt; Barkan, O., Weill, J., &amp; Averbuch, A. (2016). [http://arxiv.org/abs/1603.02194 &quot;Gaussian Process Regression for Out-of-Sample Extension&quot;]. arXiv preprint arXiv:1603.02194. &lt;/ref&gt; tool. Gaussian process regression can be further extended to address learning tasks in both [[Supervised learning|supervised]] (e.g. probabilistic classification&lt;ref name=&quot;gpml&quot;/&gt;) and [[Unsupervised learning|unsupervised]] (e.g. [[manifold learning]]&lt;ref name= &quot;prml&quot;/&gt;) learning frameworks.<br /> ===Gaussian process prediction, or kriging===<br /> [[File:Gaussian Process Regression.png|thumbnail|right|Gaussian Process Regression (prediction) with a squared exponential kernel. Left plot shows draws from the prior function distribution. Middle shows draws from the posterior.
Right is mean prediction with one standard deviation shaded.]]<br /> When concerned with a general Gaussian process regression problem, it is assumed that for a Gaussian process ''f'' observed at coordinates x, the vector of values ''f(x)'' is just one sample from a multivariate Gaussian distribution of dimension equal to the number of observed coordinates ''|x|''. Therefore, under the assumption of a zero-mean distribution, ''f (x) ∼ N (0, K(θ,x,x'))'', where ''K(θ,x,x')'' is the covariance matrix between all possible pairs ''(x,x')'' for a given set of hyperparameters θ.&lt;ref name= &quot;gpml&quot;/&gt;<br /> As such, the log marginal likelihood is:<br /> :&lt;math&gt;\log p(f(x)|\theta,x) = -\frac{1}{2}f(x)^T K(\theta,x,x')^{-1} f(x) -\frac{1}{2} \log \det(K(\theta,x,x')) - \frac{|x|}{2} \log 2\pi &lt;/math&gt;<br /> and maximizing this marginal likelihood with respect to θ provides the complete specification of the Gaussian process ''f''. Note that the first term is a penalty for the model's failure to fit the observed values, while the second term is a penalty that grows with the model's complexity. Having specified ''θ'', making predictions about unobserved values ''f(x*)'' at coordinates ''x*'' is then only a matter of drawing samples from the predictive distribution ''p(y*|x*,f(x),x) = N(y*|A,B)'' where the posterior mean estimate A is defined as:<br /> :&lt;math&gt;A = K(\theta,x^*,x) K(\theta,x,x')^{-1} f(x)&lt;/math&gt;<br /> and the posterior variance estimate B is defined as:<br /> :&lt;math&gt;B = K(\theta,x^*,x^*) - K(\theta,x^*,x) K(\theta,x,x')^{-1} K(\theta,x^*,x)^T &lt;/math&gt;<br /> where ''K(θ,x*,x)'' is the covariance between the new coordinate of estimation ''x*'' and all other observed coordinates ''x'' for a given hyperparameter vector θ, ''K(θ,x,x')'' and ''f(x)'' are defined as before and ''K(θ,x*,x*)'' is the variance at point ''x*'' as dictated by ''θ''.
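The expressions for A and B follow from conditioning the joint Gaussian of ''f(x)'' and ''f(x*)''. A minimal noise-free NumPy sketch (the toy observations, the squared-exponential kernel and the jitter term are illustrative assumptions; the explicit matrix inverse is used only for clarity, a Cholesky solve being preferable in practice):

```python
import numpy as np

def k(a, b, l=1.0):
    # Squared-exponential covariance on scalar inputs, d = a - b.
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * l ** 2))

# Toy noise-free observations f(x) at coordinates x; x_star are the new points.
x = np.array([-4.0, -2.0, 0.0, 1.0, 3.0])
fx = np.sin(x)
x_star = np.linspace(-5.0, 5.0, 50)

Kxx = k(x, x) + 1e-8 * np.eye(len(x))   # K(theta, x, x') plus numerical jitter
Ksx = k(x_star, x)                      # K(theta, x*, x)
Kss = k(x_star, x_star)                 # K(theta, x*, x*)

Kinv = np.linalg.inv(Kxx)               # explicit inverse for clarity only
A = Ksx @ Kinv @ fx                     # posterior mean estimate, as in the text
B = Kss - Ksx @ Kinv @ Ksx.T            # posterior covariance estimate

# Log marginal likelihood of the observed values under these hyperparameters:
sign, logdet = np.linalg.slogdet(Kxx)
log_ml = -0.5 * fx @ Kinv @ fx - 0.5 * logdet - 0.5 * len(x) * np.log(2 * np.pi)
```

As the text notes, A is a linear combination of the observations ''f(x)'' (here it reproduces them exactly at the observed coordinates, since the data are noise-free), while B does not involve ''f(x)'' at all; maximizing `log_ml` over θ (here just the length-scale `l`) would complete the specification of the process.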
Note that, in practice, the posterior mean estimate ''f(x*)'' (the &quot;point estimate&quot;) is just a linear combination of the observations ''f(x)''; similarly, the variance of ''f(x*)'' is actually independent of the observations ''f(x)''. A known bottleneck in Gaussian process prediction is that the computational complexity of prediction is cubic in the number of points ''|x|'' and as such can become infeasible for larger data sets.&lt;ref name= &quot;brml&quot;/&gt; Sparse Gaussian process methods, which are usually based on the idea of building a ''representative set'' for the given process ''f'', try to circumvent this issue.&lt;ref name=&quot;smolaSparse&quot;&gt;{{cite journal |last1= Smola| first1= A.J.| last2=Schoellkopf | first2= B. |year= 2000 |title= Sparse greedy matrix approximation for machine learning |journal= Proceedings of the Seventeenth International Conference on Machine Learning| pages=911–918}}&lt;/ref&gt;&lt;ref name=&quot;CsatoSparse&quot;&gt;{{cite journal |last1= Csato| first1=L.| last2=Opper | first2= M.
|year= 2002 |title= Sparse on-line Gaussian processes |journal= Neural Computation |number=3| volume= 14 | pages=641–668 | doi=10.1162/089976602317250933}}&lt;/ref&gt;<br /> <br /> ==See also==<br /> * [[Bayes linear statistics]]<br /> * [[Bayesian interpretation of regularization]]<br /> <br /> ==Notes==<br /> {{Reflist}}<br /> <br /> ==External links==<br /> * [http://www.GaussianProcess.com www.GaussianProcess.com ]<br /> * [http://www.GaussianProcess.org The Gaussian Processes Web Site, including the text of Rasmussen and Williams' Gaussian Processes for Machine Learning]<br /> * [http://arxiv.org/abs/1505.02965 A gentle introduction to Gaussian processes]<br /> * [http://publications.nr.no/917_Rapport.pdf A Review of Gaussian Random Fields and Correlation Functions]<br /> <br /> ===Software===<br /> * [http://sourceforge.net/projects/kriging STK: a Small (Matlab/Octave) Toolbox for Kriging and GP modeling]<br /> * [http://www.uqlab.com/ Kriging module in UQLab framework (Matlab)]<br /> * [https://github.com/Yelp/MOE Yelp MOE - A black box optimization engine using Gaussian process learning]<br /> * [http://www.sumo.intec.ugent.be/ooDACE ooDACE] - A flexible object-oriented Kriging matlab toolbox.<br /> * [http://becs.aalto.fi/en/research/bayes/gpstuff/ GPstuff - Gaussian process toolbox for Matlab and Octave]<br /> * [https://github.com/SheffieldML/GPy GPy - A Gaussian processes framework in Python]<br /> * [http://www.tmpl.fi/gp/ Interactive Gaussian process regression demo]<br /> * [https://github.com/ChristophJud/GPR Basic Gaussian process library written in C++11]<br /> <br /> ===Video tutorials===<br /> * [http://videolectures.net/gpip06_mackay_gpb Gaussian Process Basics by David MacKay]<br /> * [http://videolectures.net/epsrcws08_rasmussen_lgp Learning with Gaussian Processes by Carl Edward Rasmussen]<br /> * [http://videolectures.net/mlss07_rasmussen_bigp Bayesian inference and Gaussian processes by Carl Edward Rasmussen]<br /> <br /> {{Stochastic 
processes}}<br /> <br /> {{Authority control}}<br /> <br /> {{DEFAULTSORT:Gaussian Process}}<br /> [[Category:Stochastic processes]]<br /> [[Category:Kernel methods for machine learning]]<br /> [[Category:Nonparametric Bayesian statistics]]</div> Deepalgo https://en.wikipedia.org/w/index.php?title=Nonlinear_dimensionality_reduction&diff=711007488 Nonlinear dimensionality reduction 2016-03-20T12:18:36Z <p>Deepalgo: /* Manifold learning algorithms */</p> <hr /> <div>[[High-dimensional]] data, meaning data that requires more than two or three dimensions to represent, can be [[Curse of Dimensionality|difficult to interpret]]. One approach to simplification is to assume that the data of interest lie on an [[Embedding|embedded]] non-linear [[manifold]] within the higher-dimensional space. If the manifold is of low enough dimension, the data can be visualised in the low-dimensional space.<br /> <br /> [[File:Lle hlle swissroll.png|thumb|right|300px|Top-left: a 3D dataset of 1000 points in a spiraling band (a.k.a. the [[Swiss roll]]) with a rectangular hole in the middle. Top-right: the original 2D manifold used to generate the 3D dataset. Bottom left and right: 2D recoveries of the manifold respectively using the [[Nonlinear dimensionality reduction#Locally-linear embedding|LLE]] and [[Nonlinear dimensionality reduction#Hessian Locally-Linear Embedding (Hessian LLE)|Hessian LLE]] algorithms as implemented by the Modular Data Processing toolkit.]]<br /> <br /> Below is a summary of some of the important algorithms from the history of manifold learning and '''nonlinear dimensionality reduction''' (NLDR).&lt;ref&gt;John A. Lee, Michel Verleysen, Nonlinear Dimensionality Reduction, Springer, 2007.&lt;/ref&gt; Many of these non-linear [[dimensionality reduction]] methods are related to the linear methods listed below. 
Non-linear methods can be broadly classified into two groups: those that provide a mapping (either from the high-dimensional space to the low-dimensional embedding or vice versa), and those that just give a visualisation. In the context of [[machine learning]], mapping methods may be viewed as a preliminary [[feature extraction]] step, after which [[Pattern recognition#Algorithms | pattern recognition algorithm]]s are applied. Typically those that just give a visualisation are based on proximity data – that is, [[distance]] measurements.<br /> <br /> ==Related Linear Decomposition Methods==<br /> <br /> * [[Independent component analysis]] (ICA).<br /> * [[Principal component analysis]] (PCA) (also called [[Karhunen&amp;ndash;Loève transform]] &amp;mdash; KLT).<br /> * [[Singular value decomposition]] (SVD).<br /> * [[Factor analysis]].<br /> <br /> == Applications of NLDR ==<br /> <br /> Consider a dataset represented as a matrix (or a database table), such that each row represents a set of attributes (or features or dimensions) that describe a particular instance of something. If the number of attributes is large, then the space of unique possible rows is exponentially large. Thus, the larger the dimensionality, the more difficult it becomes to sample the space. This causes many problems: algorithms that operate on high-dimensional data tend to have a very high time complexity, and many machine learning algorithms struggle with high-dimensional data. This has become known as the [[curse of dimensionality]]. Reducing data into fewer dimensions often makes analysis algorithms more efficient, and can help machine learning algorithms make more accurate predictions.<br /> <br /> Humans often have difficulty comprehending data in many dimensions. Thus, reducing data to a small number of dimensions is useful for visualization purposes.<br /> <br /> [[File:nldr.jpg|thumb|right|500px| Plot of the two-dimensional points that result from using an NLDR algorithm. 
In this case, Manifold Sculpting was used to reduce the data into just two dimensions (rotation and scale).]]<br /> <br /> The reduced-dimensional representations of data are often referred to as &quot;intrinsic variables&quot;. This description implies that these are the values from which the data was produced. For example, consider a dataset that contains images of a letter 'A', which has been scaled and rotated by varying amounts. Each image has 32x32 pixels. Each image can be represented as a vector of 1024 pixel values. Each row is a sample on a two-dimensional manifold in 1024-dimensional space (a [[Hamming space]]). The intrinsic dimensionality is two, because two variables (rotation and scale) were varied in order to produce the data. Information about the shape or look of a letter 'A' is not part of the intrinsic variables, because it is the same in every instance. Nonlinear dimensionality reduction will discard the correlated information (the letter 'A') and recover only the varying information (rotation and scale). The image to the right shows sample images from this dataset (to save space, not all input images are shown), and a plot of the two-dimensional points that result from using an NLDR algorithm (in this case, Manifold Sculpting) to reduce the data into just two dimensions.<br /> <br /> [[File:Letters pca.png|thumb|right|500px|When PCA (a linear dimensionality reduction algorithm) is used to reduce this same dataset into two dimensions, the resulting values are not as well organized.]]<br /> <br /> By comparison, if PCA (a linear dimensionality reduction algorithm) is used to reduce this same dataset into two dimensions, the resulting values are not as well organized. This demonstrates that the high-dimensional vectors (each representing a letter 'A') that sample this manifold vary in a non-linear manner.<br /> <br /> It should be apparent, therefore, that NLDR has several applications in the field of computer vision. 
For example, consider a robot that uses a camera to navigate in a closed static environment. The images obtained by that camera can be considered to be samples on a manifold in high-dimensional space, and the intrinsic variables of that manifold will represent the robot's position and orientation. This utility is not limited to robots. [[Dynamical systems]], a more general class of systems that includes robots, are defined in terms of a manifold. Active research in NLDR seeks to unfold the observation manifolds associated with dynamical systems in order to develop techniques for modeling such systems and enable them to operate autonomously.&lt;ref&gt;Gashler, M. and Martinez, T., [http://axon.cs.byu.edu/papers/gashler2011ijcnn2.pdf Temporal Nonlinear Dimensionality Reduction], In ''Proceedings of the International Joint Conference on Neural Networks IJCNN'11'', pp. 1959–1966, 2011&lt;/ref&gt;<br /> <br /> ==Manifold learning algorithms==<br /> <br /> Some of the more prominent manifold learning algorithms are listed below (in approximately chronological order). An algorithm may learn an ''internal model'' of the data, which can be used to map points unavailable at training time into the embedding, a process often called out-of-sample extension.&lt;ref name=&quot;gpr&quot;&gt;Barkan, O., Weill, J., &amp; Averbuch, A. (2016). [http://arxiv.org/abs/1603.02194 &quot;Gaussian Process Regression for Out-of-Sample Extension&quot;]. arXiv preprint arXiv:1603.02194.&lt;/ref&gt;<br /> <br /> === Sammon's mapping ===<br /> <br /> [[Sammon's mapping]] is one of the first and most popular NLDR techniques.<br /> <br /> [[File:SOMsPCA.PNG|thumb|200px|left|Approximation of a principal curve by one-dimensional [[Self-organizing map|SOM]] (a [[broken line]] with red squares, 20 nodes). The first [[Principal component analysis|principal component]] is shown as a blue straight line. Data points are the small grey circles. 
For PCA, the [[Fraction of variance unexplained]] in this example is 23.23%, for SOM it is 6.86%.&lt;ref&gt;The illustration is prepared using free software: E.M. Mirkes, [http://www.math.le.ac.uk/people/ag153/homepage/PCA_SOM/PCA_SOM.html Principal Component Analysis and Self-Organizing Maps: applet]. University of Leicester, 2011&lt;/ref&gt;]]<br /> <br /> === Self-organizing map ===<br /> <br /> The [[self-organizing map]] (SOM, also called ''Kohonen map'') and its probabilistic variant [[generative topographic mapping]] (GTM) use a point representation in the embedded space to form a [[latent variable model]] based on a non-linear mapping from the embedded space to the high-dimensional space.&lt;ref&gt;Yin, Hujun; [http://pca.narod.ru/contentsgkwz.htm ''Learning Nonlinear Principal Manifolds by Self-Organising Maps''], in A.N. Gorban, B. Kégl, D.C. Wunsch, and A. Zinovyev (Eds.), ''Principal Manifolds for Data Visualization and Dimension Reduction'', Lecture Notes in Computer Science and Engineering (LNCSE), vol. 58, Berlin, Germany: Springer, 2007, Ch. 3, pp. 68-95. ISBN 978-3-540-73749-0&lt;/ref&gt; These techniques are related to work on [[density networks]], which also are based around the same probabilistic model.<br /> <br /> === Principal curves and manifolds ===<br /> <br /> [[File:SlideQualityLife.png|thumb|300px| Application of principal curves: Nonlinear quality of life index.&lt;ref&gt;A. N. Gorban, A. Zinovyev, [http://arxiv.org/abs/1001.1122 Principal manifolds and graphs in practice: from molecular biology to dynamical systems], [[International Journal of Neural Systems]], Vol. 20, No. 3 (2010) 219–232.&lt;/ref&gt; Points represent data of the [[United Nations|UN]] 171 countries in 4-dimensional space formed by the values of 4 indicators: [[Gross domestic product|gross product per capita]], [[life expectancy]], [[infant mortality]], [[tuberculosis]] incidence. Different forms and colors correspond to various geographical locations. 
The red bold line represents the '''principal curve''', approximating the dataset. This principal curve was produced by the method of [[elastic map]]. Software is available for free non-commercial use.&lt;ref&gt;A. Zinovyev, [http://bioinfo-out.curie.fr/projects/vidaexpert/ ViDaExpert] - Multidimensional Data Visualization Tool (free for non-commercial use). [[Curie Institute (Paris)|Institut Curie]], Paris.&lt;/ref&gt;&lt;ref&gt;A. Zinovyev, [http://www.ihes.fr/~zinovyev/vida/ViDaExpert/ViDaOverView.pdf ViDaExpert overview], [http://www.ihes.fr IHES] ([[Institut des Hautes Études Scientifiques]]), Bures-Sur-Yvette, Île-de-France.&lt;/ref&gt;]]<br /> <br /> '''[[Principal curve]]s and manifolds''' give a natural geometric framework for nonlinear dimensionality reduction and extend the geometric interpretation of PCA by explicitly constructing an embedded manifold, and by encoding using standard geometric projection onto the manifold. This approach was proposed by [[Trevor Hastie]] in his thesis (1984)&lt;ref&gt;T. Hastie, Principal Curves and Surfaces, Ph.D Dissertation, Stanford Linear Accelerator Center, Stanford University, Stanford, California, US, November 1984.&lt;/ref&gt; and developed further by many authors.&lt;ref&gt;[[Alexander Nikolaevich Gorban|A.N. Gorban]], B. Kégl, D.C. Wunsch, A. Zinovyev (Eds.), [http://pca.narod.ru/contentsgkwz.htm Principal Manifolds for Data Visualisation and Dimension Reduction], Lecture Notes in Computer Science and Engineering (LNCSE), Vol. 58, Springer, Berlin &amp;ndash; Heidelberg &amp;ndash; New York, 2007. ISBN 978-3-540-73749-0&lt;/ref&gt;<br /> How to define the &quot;simplicity&quot; of the manifold is problem-dependent; however, it is commonly measured by the intrinsic dimensionality and/or the smoothness of the manifold. Usually, the principal manifold is defined as a solution to an optimization problem. 
The objective function includes a measure of data approximation quality and penalty terms for the bending of the manifold. Popular initial approximations are generated by linear PCA, Kohonen's SOM, or autoencoders. The [[elastic map]] method provides the [[expectation-maximization algorithm]] for principal [[manifold learning]] with minimization of a quadratic energy functional at the &quot;maximization&quot; step.<br /> <br /> === Autoencoders ===<br /> <br /> An [[autoencoder]] is a feed-forward [[neural network]] which is trained to approximate the identity function; that is, it is trained to map from a vector of values to the same vector. When used for dimensionality reduction, one of the hidden layers in the network is limited to contain only a small number of network units. Thus, the network must learn to encode the vector into a small number of dimensions and then decode it back into the original space: the first half of the network is a model which maps from high- to low-dimensional space, and the second half maps from low- to high-dimensional space. Although the idea of autoencoders is quite old, training of deep autoencoders has only recently become possible through the use of [[restricted Boltzmann machine]]s and stacked denoising autoencoders. Related to autoencoders is the [[NeuroScale]] algorithm, which uses stress functions inspired by [[multidimensional scaling]] and [[Sammon mapping]]s (see above) to learn a non-linear mapping from the high-dimensional to the embedded space. The mappings in NeuroScale are based on [[radial basis function network]]s.<br /> <br /> === Gaussian process latent variable models ===<br /> <br /> [[Gaussian process latent variable model]]s (GPLVM)&lt;ref&gt;N. 
Lawrence, [http://jmlr.csail.mit.edu/papers/v6/lawrence05a.html Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models], Journal of Machine Learning Research 6(Nov):1783–1816, 2005.&lt;/ref&gt; are probabilistic dimensionality reduction methods that use Gaussian processes (GPs) to find a lower-dimensional non-linear embedding of high-dimensional data. They are an extension of the probabilistic formulation of PCA. The model is defined probabilistically, the latent variables are then marginalized, and parameters are obtained by maximizing the likelihood. Like kernel PCA, they use a kernel function to form a non-linear mapping (in the form of a [[Gaussian process]]). However, in the GPLVM the mapping is from the embedded (latent) space to the data space (like density networks and GTM), whereas in kernel PCA it is in the opposite direction. It was originally proposed for visualization of high-dimensional data but has been extended to construct a shared manifold model between two observation spaces.<br /> <br /> === Curvilinear component analysis ===<br /> <br /> [[Curvilinear component analysis]] (CCA)&lt;ref name=&quot;Demart&quot;&gt;P. Demartines and J. Hérault, Curvilinear Component Analysis: A Self-Organizing Neural Network for Nonlinear Mapping of Data Sets, IEEE Transactions on Neural Networks, Vol. 8(1), 1997, pp. 148–154&lt;/ref&gt; looks for the configuration of points in the output space that preserves original distances as much as possible while focusing on small distances in the output space (conversely to [[Sammon's mapping]], which focuses on small distances in the original space).<br /> <br /> Note that CCA, as an iterative learning algorithm, actually starts with a focus on large distances (like the Sammon algorithm), then gradually changes its focus to small distances. 
If compromises between the two have to be made, the small-distance information will overwrite the large-distance information.<br /> <br /> The stress function of CCA is related to a sum of right Bregman divergences.&lt;ref name=&quot;Jigang&quot;&gt;Jigang Sun, Malcolm Crowe, and Colin Fyfe, [http://www.dice.ucl.ac.be/Proceedings/esann/esannpdf/es2010-107.pdf Curvilinear component analysis and Bregman divergences], In European Symposium on Artificial Neural Networks (Esann), pages 81–86. d-side publications, 2010&lt;/ref&gt;<br /> <br /> === Curvilinear distance analysis ===<br /> <br /> CDA&lt;ref name=&quot;Demart&quot;/&gt; trains a self-organizing neural network to fit the manifold and seeks to preserve [[geodesic distance]]s in its embedding. It is based on Curvilinear Component Analysis (which extended Sammon's mapping), but uses geodesic distances instead.<br /> <br /> === Diffeomorphic dimensionality reduction ===<br /> <br /> Diffeomorphic Dimensionality Reduction or ''Diffeomap''&lt;ref&gt;Christian Walder and Bernhard Schölkopf, Diffeomorphic Dimensionality Reduction, Advances in Neural Information Processing Systems 22, 2009, pp. 1713–1720, MIT Press&lt;/ref&gt; learns a smooth diffeomorphic mapping which transports the data onto a lower-dimensional linear subspace. The method solves for a smooth, time-indexed vector field such that flows along the field which start at the data points will end at a lower-dimensional linear subspace, thereby attempting to preserve pairwise differences under both the forward and inverse mapping.<br /> <br /> === Kernel principal component analysis ===<br /> <br /> Perhaps the most widely used algorithm for manifold learning is [[kernel principal component analysis|kernel PCA]].&lt;ref&gt;B. Schölkopf, A. Smola, K.-R. Müller, Nonlinear Component Analysis as a Kernel Eigenvalue Problem. 
''Neural Computation ''10(5):1299-1319, 1998, [[MIT Press]] Cambridge, MA, USA, [[doi:10.1162/089976698300017467]]&lt;/ref&gt; It is a combination of [[Principal component analysis]] and the [[kernel trick]]. PCA begins by computing the covariance matrix of the &lt;math&gt;m \times n&lt;/math&gt; matrix &lt;math&gt;\mathbf{X}&lt;/math&gt;<br /> <br /> : &lt;math&gt;C = \frac{1}{m}\sum_{i=1}^m{\mathbf{x}_i\mathbf{x}_i^\mathsf{T}}.&lt;/math&gt;<br /> <br /> It then projects the data onto the first ''k'' eigenvectors of that matrix. By comparison, KPCA begins by computing the covariance matrix of the data after being transformed into a higher-dimensional space,<br /> <br /> : &lt;math&gt;C = \frac{1}{m}\sum_{i=1}^m{\Phi(\mathbf{x}_i)\Phi(\mathbf{x}_i)^\mathsf{T}}.&lt;/math&gt;<br /> <br /> It then projects the transformed data onto the first ''k'' eigenvectors of that matrix, just like PCA. It uses the kernel trick to factor away much of the computation, such that the entire process can be performed without actually computing &lt;math&gt;\Phi(\mathbf{x})&lt;/math&gt;. Of course &lt;math&gt;\Phi&lt;/math&gt; must be chosen such that it has a known corresponding kernel. Unfortunately, it is not trivial to find a good kernel for a given problem, so KPCA does not yield good results with some problems when using standard kernels. For example, it is known to perform poorly with these kernels on the [[Swiss roll]] manifold. However, one can view certain other methods that perform well in such settings (e.g., Laplacian Eigenmaps, LLE) as special cases of kernel PCA by constructing a data-dependent kernel matrix.&lt;ref&gt;Jihun Ham, Daniel D. Lee, Sebastian Mika, Bernhard Schölkopf. A kernel view of the dimensionality reduction of manifolds. Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004. 
[[doi:10.1145/1015330.1015417]]&lt;/ref&gt;<br /> <br /> KPCA has an internal model, so it can be used to map points onto its embedding that were not available at training time.<br /> <br /> === Isomap ===<br /> <br /> [[Isomap]]&lt;ref&gt;J. B. Tenenbaum, V. de Silva, J. C. Langford, A Global Geometric Framework for Nonlinear Dimensionality Reduction, Science 290, (2000), 2319&amp;ndash;2323.&lt;/ref&gt; is a combination of the [[Floyd–Warshall algorithm]] with classic [[Multidimensional Scaling]]. Classic Multidimensional Scaling (MDS) takes a matrix of pair-wise distances between all points, and computes a position for each point. Isomap assumes that the pair-wise distances are only known between neighboring points, and uses the Floyd–Warshall algorithm to compute the pair-wise distances between all other points. This effectively estimates the full matrix of pair-wise [[geodesic distance]]s between all of the points. Isomap then uses classic MDS to compute the reduced-dimensional positions of all the points.<br /> <br /> Landmark-Isomap is a variant of this algorithm that uses landmarks to increase speed, at the cost of some accuracy.<br /> <br /> === Locally-linear embedding ===<br /> <br /> [[Locally-Linear Embedding]] (LLE)&lt;ref&gt;S. T. Roweis and L. K. Saul, Nonlinear Dimensionality Reduction by Locally Linear Embedding, Science Vol 290, 22 December 2000, 2323&amp;ndash;2326.&lt;/ref&gt; was presented at approximately the same time as Isomap. It has several advantages over Isomap, including faster optimization when implemented to take advantage of [[sparse matrix]] algorithms, and better results with many problems. LLE also begins by finding a set of the nearest neighbors of each point. It then computes a set of weights for each point that best describe the point as a linear combination of its neighbors. 
Finally, it uses an eigenvector-based optimization technique to find the low-dimensional embedding of points, such that each point is still described with the same linear combination of its neighbors. LLE tends to handle non-uniform sample densities poorly because there is no fixed unit to prevent the weights from drifting as regions differ in sample density. LLE has no internal model.<br /> <br /> LLE computes the barycentric coordinates of a point ''X''&lt;sub&gt;''i''&lt;/sub&gt; based on its neighbors ''X''&lt;sub&gt;''j''&lt;/sub&gt;. The original point is reconstructed by a linear combination, given by the weight matrix ''W''&lt;sub&gt;''ij''&lt;/sub&gt;, of its neighbors. The reconstruction error is given by the cost function ''E''(''W'').<br /> <br /> : &lt;math&gt; E(W) = \sum_i |{\mathbf{X}_i - \sum_j {\mathbf{W}_{ij}\mathbf{X}_j}|}^\mathsf{2} &lt;/math&gt;<br /> <br /> The weights ''W''&lt;sub&gt;''ij''&lt;/sub&gt; refer to the amount of contribution the point ''X''&lt;sub&gt;''j''&lt;/sub&gt; has while reconstructing the point ''X''&lt;sub&gt;''i''&lt;/sub&gt;. The cost function is minimized under two constraints:<br /> (a) Each data point ''X''&lt;sub&gt;''i''&lt;/sub&gt; is reconstructed only from its neighbors, thus enforcing ''W''&lt;sub&gt;''ij''&lt;/sub&gt; to be zero if point ''X''&lt;sub&gt;''j''&lt;/sub&gt; is not a neighbor of the point ''X''&lt;sub&gt;''i''&lt;/sub&gt;; and <br /> (b) The sum of every row of the weight matrix equals 1.<br /> <br /> : &lt;math&gt; \sum_j {\mathbf{W}_{ij}} = 1 &lt;/math&gt;<br /> <br /> The original data points are collected in a ''D'' dimensional space and the goal of the algorithm is to reduce the dimensionality to ''d'' such that ''D'' &gt;&gt; ''d''. The same weights ''W''&lt;sub&gt;''ij''&lt;/sub&gt; that reconstruct the ''i''th data point in the ''D'' dimensional space will be used to reconstruct the same point in the lower ''d'' dimensional space. 
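The weight-computation step above, minimizing ''E''(''W'') subject to the sum-to-one constraint, has a standard closed form via the local Gram matrix. The following is a minimal sketch (the function name and the ridge-regularization constant are illustrative, not part of the original algorithm description):

```python
import numpy as np

def lle_weights(X, i, neighbors):
    """Reconstruction weights of X[i] from its neighbors, minimizing
    |X_i - sum_j W_ij X_j|^2 subject to sum_j W_ij = 1: solve the
    local Gram system G w = 1 and normalize."""
    Z = X[neighbors] - X[i]                # shift neighbors to the origin
    G = Z @ Z.T                            # local Gram matrix
    # Small ridge term keeps G invertible when there are more
    # neighbors than input dimensions (an assumed regularization).
    G += 1e-3 * np.trace(G) * np.eye(len(neighbors))
    w = np.linalg.solve(G, np.ones(len(neighbors)))
    return w / w.sum()                     # enforce the sum-to-one constraint
```

Repeating this for every point and assembling the rows into a sparse matrix ''W'' gives the input to the eigenvector step described next.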
A neighborhood preserving map is created based on this idea. Each point X&lt;sub&gt;i&lt;/sub&gt; in the ''D'' dimensional space is mapped onto a point Y&lt;sub&gt;i&lt;/sub&gt; in the ''d'' dimensional space by minimizing the cost function<br /> <br /> : &lt;math&gt; C(Y) = \sum_i |{\mathbf{Y}_i - \sum_j {\mathbf{W}_{ij}\mathbf{Y}_j}|}^\mathsf{2} &lt;/math&gt;<br /> <br /> In this cost function, unlike the previous one, the weights W&lt;sub&gt;ij&lt;/sub&gt; are kept fixed and the minimization is done on the points Y&lt;sub&gt;i&lt;/sub&gt; to optimize the coordinates. This minimization problem can be solved by solving a sparse ''N'' × ''N'' [[Eigendecomposition of a matrix|eigenvalue problem]] (''N'' being the number of data points), whose bottom ''d'' nonzero eigenvectors provide an orthogonal set of coordinates. Generally the data points are reconstructed from ''K'' nearest neighbors, as measured by [[Euclidean distance]]. For such an implementation the algorithm has only one free parameter ''K'', which can be chosen by cross-validation.<br /> <br /> === Laplacian eigenmaps ===<br /> <br /> {{see also|Manifold regularization}}<br /> <br /> Laplacian Eigenmaps&lt;ref&gt;Mikhail Belkin and [[Partha Niyogi]], Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering, Advances in Neural Information Processing Systems 14, 2001, p. 586–691, MIT Press&lt;/ref&gt; uses spectral techniques to perform dimensionality reduction. This technique relies on the basic assumption that the data lies in a low-dimensional manifold in a high-dimensional space.&lt;ref&gt;Mikhail Belkin, Problems of Learning on Manifolds, PhD Thesis, Department of Mathematics, The University Of Chicago, August 2003&lt;/ref&gt; This algorithm cannot embed out-of-sample points, but techniques based on [[Reproducing kernel Hilbert space]] regularization exist for adding this capability.&lt;ref&gt;Bengio et al. 
&quot;Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering&quot; in Advances in Neural Information Processing Systems (2004)&lt;/ref&gt; Such techniques can be applied to other nonlinear dimensionality reduction algorithms as well.<br /> <br /> Traditional techniques like principal component analysis do not consider the intrinsic geometry of the data. Laplacian eigenmaps builds a graph from neighborhood information of the data set. Each data point serves as a node on the graph and connectivity between nodes is governed by the proximity of neighboring points (using e.g. the [[k-nearest neighbor algorithm]]). The graph thus generated can be considered as a discrete approximation of the low-dimensional manifold in the high-dimensional space. Minimization of a cost function based on the graph ensures that points close to each other on the manifold are mapped close to each other in the low-dimensional space, preserving local distances. The eigenfunctions of the [[Laplace–Beltrami operator]] on the manifold serve as the embedding dimensions, since under mild conditions this operator has a countable spectrum that is a basis for square integrable functions on the manifold (compare to [[Fourier series]] on the unit circle manifold). 
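The graph construction and eigenproblem described above can be sketched in a few lines. This is a deliberately simplified illustration with binary edge weights and the unnormalized Laplacian L = D − W (the function and parameter names are ours, and practical implementations use heat-kernel weights and sparse solvers):

```python
import numpy as np

def laplacian_eigenmap(X, n_neighbors=3, dim=2):
    """Simplified Laplacian eigenmaps: symmetric kNN graph with binary
    weights, unnormalized graph Laplacian, bottom nonzero eigenvectors."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
    n = len(X)
    W = np.zeros((n, n))
    for i in range(n):
        # Connect each point to its nearest neighbors (position 0 of the
        # argsort is the point itself); symmetrize for an undirected graph.
        for j in np.argsort(d2[i])[1:n_neighbors + 1]:
            W[i, j] = W[j, i] = 1.0
    L = np.diag(W.sum(axis=1)) - W        # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)        # eigenvalues in ascending order
    # Skip the constant eigenvector (eigenvalue 0, assuming a connected
    # graph); the next `dim` eigenvectors give the embedding coordinates.
    return vecs[:, 1:dim + 1]
```

Minimizing the graph cost function y&#x1D40;Ly under the usual normalization constraints is exactly this eigenproblem, which is why the bottom nonconstant eigenvectors serve as the embedding.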
Attempts to place Laplacian eigenmaps on solid theoretical ground have met with some success, as under certain nonrestrictive assumptions, the graph Laplacian matrix has been shown to converge to the Laplace–Beltrami operator as the number of points goes to infinity.&lt;ref&gt;Mikhail Belkin, Problems of Learning on Manifolds, PhD Thesis, Department of Mathematics, The [[University Of Chicago]], August 2003&lt;/ref&gt; Matlab code for Laplacian Eigenmaps is available online,&lt;ref&gt;[http://www.cse.ohio-state.edu/~mbelkin/algorithms/algorithms.html Ohio-state.edu]&lt;/ref&gt; as is the PhD thesis of Belkin.&lt;ref&gt;[http://www.cse.ohio-state.edu/~mbelkin/papers/papers.html#thesis Ohio-state.edu]&lt;/ref&gt;<br /> <br /> In classification applications, low-dimension manifolds can be used to model data classes which can be defined from sets of observed instances. Each observed instance can be described by two independent factors termed 'content' and 'style', where 'content' is the invariant factor related to the essence of the class and 'style' expresses variations in that class between instances.&lt;ref&gt;J. Tenenbaum and W. Freeman, Separating style and content with bilinear models, Neural Computation, vol. 12, 2000.&lt;/ref&gt; Unfortunately, Laplacian Eigenmaps may fail to produce a coherent representation of a class of interest when training data consist of instances varying significantly in terms of style.&lt;ref&gt;M. Lewandowski, J. Martinez-del Rincon, D. Makris, and J.-C. 
Nebel, Temporal extension of laplacian eigenmaps for unsupervised dimensionality reduction of time series, Proceedings of the International Conference on Pattern Recognition (ICPR), 2010&lt;/ref&gt; In the case of classes which are represented by multivariate sequences, Structural Laplacian Eigenmaps has been proposed to overcome this issue by adding additional constraints within the Laplacian Eigenmaps neighborhood information graph to better reflect the intrinsic structure of the class.&lt;ref name=&quot;ReferenceB&quot;&gt;M. Lewandowski, D. Makris, S.A. Velastin and J.-C. Nebel, Structural Laplacian Eigenmaps for Modeling Sets of Multivariate Sequences, IEEE Transactions on Cybernetics, 44(6): 936-949, 2014&lt;/ref&gt; More specifically, the graph is used to encode both the sequential structure of the multivariate sequences and, to minimise stylistic variations, the proximity between data points of different sequences, or even within a sequence if it contains repetitions. Using [[dynamic time warping]], proximity is detected by finding correspondences between and within sections of the multivariate sequences that exhibit high similarity. Experiments conducted on [[vision-based activity recognition]], object orientation classification and human 3D pose recovery applications have demonstrated the added value of Structural Laplacian Eigenmaps when dealing with multivariate sequence data.&lt;ref name=&quot;ReferenceB&quot;/&gt; An extension of Structural Laplacian Eigenmaps, Generalized Laplacian Eigenmaps, led to the generation of manifolds where one of the dimensions specifically represents variations in style. This has proved particularly valuable in applications such as tracking of the human articulated body and silhouette extraction.&lt;ref&gt;J. Martinez-del-Rincon, M. Lewandowski, J.-C. Nebel and D. 
Makris, Generalized Laplacian Eigenmaps for Modeling and Tracking Human Motions, IEEE Transactions on Cybernetics, 44(9), pp 1646-1660, 2014&lt;/ref&gt;<br /> <br /> === Manifold alignment ===<br /> [[Manifold alignment]] takes advantage of the assumption that disparate data sets produced by similar generating processes will share a similar underlying manifold representation. By learning projections from each original space to the shared manifold, correspondences are recovered and knowledge from one domain can be transferred to another. Most manifold alignment techniques consider only two data sets, but the concept extends to arbitrarily many initial data sets.&lt;ref&gt;{{cite conference|last=Wang|first=Chang|author2=Mahadevan, Sridhar |title=Manifold Alignment using Procrustes Analysis|conference=The 25th International Conference on Machine Learning|date=July 2008|pages=1120–1127|url=http://people.cs.umass.edu/~chwang/papers/ICML-2008.pdf}}&lt;/ref&gt;<br /> <br /> === Diffusion maps ===<br /> [[Diffusion map]]s leverage the relationship between heat diffusion and a random walk ([[Markov Chain]]); an analogy is drawn between the diffusion operator on a manifold and a Markov transition matrix operating on functions defined on the graph whose nodes were sampled from the manifold.&lt;ref&gt;Diffusion Maps and Geometric Harmonics, Stephane Lafon, PhD Thesis, [[Yale University]], May 2004&lt;/ref&gt; In particular, let a data set be represented by &lt;math&gt; \mathbf{X} = [x_1,x_2,\ldots,x_n] \in \Omega \subset \mathbf {R^D}&lt;/math&gt;. The underlying assumption of diffusion maps is that the data, although high-dimensional, lie on a low-dimensional manifold of dimension &lt;math&gt; \mathbf{d} &lt;/math&gt;. Let '''X''' represent the data set and &lt;math&gt; \mu &lt;/math&gt; represent the distribution of the data points on '''X'''. In addition, let us define a '''kernel''' which represents some notion of affinity between the points in '''X'''. 
The kernel &lt;math&gt; \mathit{k} &lt;/math&gt; has the following properties&lt;ref name=&quot;ReferenceA&quot;&gt;Diffusion Maps, Ronald R. Coifman and Stephane Lafon: Science, 19 June 2006&lt;/ref&gt;<br /> <br /> : &lt;math&gt;k(x,y) = k(y,x), \, &lt;/math&gt;<br /> <br /> ''k'' is symmetric<br /> <br /> : &lt;math&gt; k(x,y) \geq 0\qquad \forall x,y &lt;/math&gt;<br /> <br /> ''k'' is positivity preserving<br /> <br /> Thus one can think of the individual data points as the nodes of a graph, with the kernel ''k'' defining some sort of affinity on that graph. The graph is symmetric by construction since the kernel is symmetric. It is easy to see here that from the tuple {'''X''','''k'''} one can construct a reversible [[Markov Chain]]. This technique is fairly popular in a variety of fields and is known as the graph Laplacian.<br /> <br /> The graph '''K''' = (''X'',''E'') can be constructed, for example, using a Gaussian kernel:<br /> <br /> : &lt;math&gt; K_{ij} = \begin{cases}<br /> e^{-\|x_i -x_j\|^2_2/\sigma ^2} &amp; \text{if } x_i \sim x_j \\<br /> 0 &amp; \text{otherwise}<br /> \end{cases}<br /> &lt;/math&gt;<br /> <br /> In the above equation, &lt;math&gt; x_i \sim x_j &lt;/math&gt; denotes that &lt;math&gt; x_i &lt;/math&gt; is a nearest neighbor of &lt;math&gt;x_j &lt;/math&gt;. In principle, [[geodesic]] distance should be used to measure distances on the [[manifold]]; since the exact structure of the manifold is not available, however, the geodesic distance is approximated by Euclidean distances to nearest neighbors only. The choice of &lt;math&gt; \sigma &lt;/math&gt; modulates our notion of proximity, in the sense that if &lt;math&gt; \|x_i - x_j\|_2 \gg \sigma &lt;/math&gt; then &lt;math&gt; K_{ij} = 0 &lt;/math&gt; and if &lt;math&gt; \|x_i - x_j\|_2 \ll \sigma &lt;/math&gt; then &lt;math&gt; K_{ij} = 1 &lt;/math&gt;. The former means that very little diffusion has taken place, while the latter implies that the diffusion process is nearly complete. 
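The truncated Gaussian kernel above can be sketched directly (the function name and the neighbor-selection details are illustrative assumptions):

```python
import numpy as np

def affinity_matrix(X, sigma=1.0, n_neighbors=5):
    """Truncated Gaussian kernel: K_ij = exp(-|x_i - x_j|^2 / sigma^2)
    when x_j is among the nearest neighbors of x_i (symmetrized so the
    kernel stays symmetric), and 0 otherwise."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    K = np.zeros((n, n))
    for i in range(n):
        # Position 0 of the argsort is the point itself; skip it.
        for j in np.argsort(d2[i])[1:n_neighbors + 1]:
            K[i, j] = K[j, i] = np.exp(-d2[i, j] / sigma**2)
    return K
```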
Different strategies for choosing &lt;math&gt; \sigma &lt;/math&gt; have been studied.&lt;ref&gt;B. Bah, &quot;Diffusion Maps: Applications and Analysis&quot;, Masters Thesis, University of Oxford&lt;/ref&gt; For &lt;math&gt; K &lt;/math&gt; to faithfully represent a Markov matrix, it must be normalized by the corresponding [[degree matrix]] &lt;math&gt; D &lt;/math&gt;:<br /> <br /> : &lt;math&gt; P = D^{-1}K. \, &lt;/math&gt;<br /> <br /> &lt;math&gt; P &lt;/math&gt; now represents a Markov chain. &lt;math&gt; P(x_i,x_j) &lt;/math&gt; is the probability of transitioning from &lt;math&gt; x_i &lt;/math&gt; to &lt;math&gt; x_j &lt;/math&gt; in one time step. Similarly, the probability of transitioning from &lt;math&gt; x_i &lt;/math&gt; to &lt;math&gt; x_j &lt;/math&gt; in '''t''' time steps is given by &lt;math&gt; P^t (x_i,x_j) &lt;/math&gt;. Here &lt;math&gt; P^t &lt;/math&gt; is the matrix &lt;math&gt; P &lt;/math&gt; multiplied by itself t times. The Markov matrix &lt;math&gt; P &lt;/math&gt; thus constitutes some notion of the local geometry of the data set '''X'''. The major difference between diffusion maps and [[principal component analysis]] is that only local features of the data are considered in diffusion maps, as opposed to taking correlations of the entire data set.<br /> <br /> &lt;math&gt; K &lt;/math&gt; defines a random walk on the data set, which means that the kernel captures some local geometry of the data set. 
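The normalization P = D&#8315;&#185;K and the t-step transition matrix are one-liners in code (a sketch, assuming every point has at least one neighbor so that no row of K sums to zero):

```python
import numpy as np

def markov_matrix(K):
    """Row-normalize the affinity matrix: P = D^{-1} K, where D is the
    degree matrix holding the row sums of K on its diagonal. Each row
    of P then sums to one, so P is a Markov transition matrix."""
    return K / K.sum(axis=1, keepdims=True)

def t_step_transitions(P, t):
    """P^t: probabilities of transitioning in t time steps."""
    return np.linalg.matrix_power(P, t)
```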
The Markov chain defines fast and slow directions of propagation, based on the values taken by the kernel, and as one propagates the walk forward in time, the local geometry information aggregates in the same way as local transitions (defined by differential equations) of the dynamical system.&lt;ref name=&quot;ReferenceA&quot;/&gt; The concept of diffusion arises from the definition of a family of diffusion distances {&lt;math&gt; D_t &lt;/math&gt;}&lt;math&gt;_{t \in N} &lt;/math&gt;<br /> <br /> : &lt;math&gt; D_t^2(x,y) = ||p_t(x,\cdot) - p_t(y,\cdot)||^2 &lt;/math&gt;<br /> <br /> For a given value of ''t'', &lt;math&gt; D_t &lt;/math&gt; defines a distance between any two points of the data set. This means that the value of &lt;math&gt; D_t(x,y) &lt;/math&gt; will be small if there are many paths that connect '''x''' to '''y''', and large otherwise. The quantity &lt;math&gt; D_t(x,y) &lt;/math&gt; involves summing over all paths of length t, as a result of which &lt;math&gt; D_t &lt;/math&gt; is far more robust to noise in the data than geodesic distance. &lt;math&gt; D_t &lt;/math&gt; takes into account all the relations between points x and y while calculating the distance, and so serves as a better notion of proximity than plain [[Euclidean distance]] or even geodesic distance.<br /> <br /> === Hessian Locally-Linear Embedding (Hessian LLE) ===<br /> <br /> Like LLE, [[Hessian LLE]]&lt;ref&gt;D. Donoho and C. Grimes, &quot;Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data&quot; Proc Natl Acad Sci U S A. 2003 May 13; 100(10): 5591–5596&lt;/ref&gt; is also based on sparse matrix techniques. It tends to yield results of a much higher quality than LLE. Unfortunately, its computational complexity is very high, so it is not well-suited for heavily sampled manifolds. It has no internal model.<br /> <br /> === Modified Locally-Linear Embedding (MLLE) ===<br /> <br /> Modified LLE (MLLE)&lt;ref&gt;Z. Zhang and J. 
Wang, &quot;MLLE: Modified Locally Linear Embedding Using Multiple Weights&quot; http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.70.382&lt;/ref&gt; is another LLE variant which uses multiple weights in each neighborhood to address the local weight matrix conditioning problem that leads to distortions in LLE maps. MLLE produces robust projections similar to Hessian LLE, but without the significant additional computational cost.<br /> <br /> === Relational perspective map ===<br /> Relational perspective map is a [[multidimensional scaling]] algorithm. The algorithm finds a configuration of data points on a manifold by simulating a multi-particle dynamic system on a closed manifold, where data points are mapped to particles and distances (or dissimilarities) between data points represent repulsive forces. As the manifold gradually grows in size, the multi-particle system cools down and converges to a configuration that reflects the distance information of the data points.<br /> <br /> Relational perspective map was inspired by a physical model in which positively charged particles move freely on the surface of a ball. Guided by the [[Charles-Augustin de Coulomb|Coulomb]] [[Coulomb's law|force]] between particles, the minimal-energy configuration of the particles reflects the strength of the repulsive forces between them.<br /> <br /> The Relational perspective map was introduced in.&lt;ref&gt;James X. 
Li, [http://www.palgrave-journals.com/ivs/journal/v3/n1/pdf/9500051a.pdf Visualizing high-dimensional data with relational perspective map], Information Visualization (2004) 3, 49–59&lt;/ref&gt;<br /> The algorithm first used the flat [[torus]] as the image manifold; it has since been extended (in the software [http://www.VisuMap.com VisuMap]) to use other types of closed manifolds, like the [[sphere]], [[projective space]], and [[Klein bottle]], as image manifolds.<br /> <br /> === Local tangent space alignment ===<br /> {{Main|Local tangent space alignment}}<br /> [[local tangent space alignment|LTSA]]&lt;ref&gt;{{Cite journal |last=Zhang |first=Zhenyue |author2=Hongyuan Zha |title=Principal Manifolds and Nonlinear Dimension Reduction via Local Tangent Space Alignment |journal=SIAM Journal on Scientific Computing |volume=26 |issue=1 |year=2005 |pages=313–338 |doi=10.1137/s1064827502419154}}&lt;/ref&gt; is based on the intuition that when a manifold is correctly unfolded, all of the tangent hyperplanes to the manifold will become aligned. It begins by computing the ''k''-nearest neighbors of every point. It computes the tangent space at every point by computing the first ''d'' principal components in each local neighborhood. It then optimizes to find an embedding that aligns the tangent spaces.<br /> <br /> === Local multidimensional scaling ===<br /> <br /> Local Multidimensional Scaling&lt;ref&gt;J Venna and S Kaski, Local multidimensional scaling, Neural Networks, 2006&lt;/ref&gt; performs [[multidimensional scaling]] in local regions, and then uses convex optimization to fit all the pieces together.<br /> <br /> === Maximum variance unfolding ===<br /> <br /> [[Maximum Variance Unfolding]] was formerly known as Semidefinite Embedding. The intuition for this algorithm is that when a manifold is properly unfolded, the variance over the points is maximized. This algorithm also begins by finding the ''k''-nearest neighbors of every point. 
It then seeks to solve the problem of maximizing the distance between all non-neighboring points, constrained such that the distances between neighboring points are preserved. The primary contribution of this algorithm is a technique for casting this problem as a semidefinite programming problem. Unfortunately, semidefinite programming solvers have a high computational cost. The Landmark–MVU variant of this algorithm uses landmarks to increase speed with some cost to accuracy. It has no model.<br /> <br /> === Nonlinear PCA ===<br /> <br /> Nonlinear PCA&lt;ref&gt;Scholz, M. Kaplan, F. Guy, C. L. Kopka, J. Selbig, J., Non-linear PCA: a missing data approach, In ''Bioinformatics'', Vol. 21, Number 20, pp. 3887–3895, Oxford University Press, 2005&lt;/ref&gt; (NLPCA) uses [[backpropagation]] to train a multi-layer perceptron to fit a manifold. Unlike typical MLP training, which only updates the weights, NLPCA updates both the weights and the inputs. That is, both the weights and inputs are treated as latent values. After training, the latent inputs are a low-dimensional representation of the observed vectors, and the MLP maps from that low-dimensional representation to the high-dimensional observation space.<br /> <br /> === Data-driven high-dimensional scaling ===<br /> <br /> Data-Driven High Dimensional Scaling (DD-HDS)&lt;ref&gt;S. Lespinats, M. Verleysen, A. Giron, B. Fertil, DD-HDS: a tool for visualization and exploration of high-dimensional data, IEEE Transactions on Neural Networks 18 (5) (2007) 1265–1279.&lt;/ref&gt; is closely related to [[Sammon's mapping]] and curvilinear component analysis except that (1) it simultaneously penalizes false neighborhoods and tears by focusing on small distances in both original and output space, and that (2) it accounts for the [[concentration of measure]] phenomenon by adapting the weighting function to the distance distribution.<br /> <br /> === Manifold sculpting ===<br /> <br /> Manifold Sculpting&lt;ref&gt;Gashler, M. 
and Ventura, D. and Martinez, T., ''[http://axon.cs.byu.edu/papers/gashler2007nips.pdf Iterative Non-linear Dimensionality Reduction with Manifold Sculpting]'', In Platt, J.C. and Koller, D. and Singer, Y. and Roweis, S., editor, Advances in Neural Information Processing Systems 20, pp. 513–520, MIT Press, Cambridge, MA, 2008&lt;/ref&gt; uses [[graduated optimization]] to find an embedding. Like other algorithms, it computes the ''k''-nearest neighbors and seeks an embedding that preserves relationships in local neighborhoods. It slowly scales variance out of higher dimensions, while simultaneously adjusting points in lower dimensions to preserve those relationships. If the rate of scaling is small, it can find very precise embeddings. It boasts higher empirical accuracy than other algorithms on several problems. It can also be used to refine the results from other manifold learning algorithms. It struggles to unfold some manifolds, however, unless a very slow scaling rate is used. It has no model.<br /> <br /> === t-distributed stochastic neighbor embedding ===<br /> <br /> [[t-distributed stochastic neighbor embedding]] (t-SNE)&lt;ref&gt;{{cite journal|last=van der Maaten|first=L.J.P.|author2=Hinton, G.E. |title=Visualizing High-Dimensional Data Using t-SNE|journal=Journal of Machine Learning Research 9|date=Nov 2008|pages=2579–2605|url=http://jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf}}&lt;/ref&gt; is widely used. It is one of a family of stochastic neighbor embedding methods.<br /> <br /> === RankVisu ===<br /> <br /> RankVisu&lt;ref&gt;Lespinats S., Fertil B., Villemain P. and Herault J., Rankvisu: Mapping from the neighbourhood network, Neurocomputing, vol. 72 (13–15), pp. 2964–2978, 2009.&lt;/ref&gt; is designed to preserve the rank of neighbors rather than distance. RankVisu is especially useful on difficult tasks (when the preservation of distance cannot be achieved satisfactorily). 
Indeed, ranks are less informative than distances (ranks can be deduced from distances but distances cannot be deduced from ranks), so their preservation is easier.<br /> <br /> === Topologically constrained isometric embedding ===<br /> <br /> [[Topologically Constrained Isometric Embedding]] (TCIE)&lt;ref&gt;Rosman G., Bronstein M. M., Bronstein A. M. and Kimmel R., Nonlinear Dimensionality Reduction by Topologically Constrained Isometric Embedding, International Journal of Computer Vision, Volume 89, Number 1, 56–68, 2010&lt;/ref&gt; is an algorithm based on approximating geodesic distances after filtering geodesics inconsistent with the Euclidean metric. Aimed at correcting the distortions caused when Isomap is used to map intrinsically non-convex data, TCIE uses weighted least-squares MDS in order to obtain a more accurate mapping. The TCIE algorithm first detects possible boundary points in the data, and during computation of the geodesic lengths marks inconsistent geodesics, to be given a small weight in the weighted [[Stress majorization]] that follows.<br /> <br /> ==Methods based on proximity matrices==<br /> <br /> A method based on proximity matrices is one where the data is presented to the algorithm in the form of a [[similarity matrix]] or a [[distance matrix]]. These methods all fall under the broader class of [[Multidimensional scaling#Types|metric multidimensional scaling]]. 
The variations tend to be differences in how the proximity data is computed; for example, [[Isomap]], [[locally linear embeddings]], [[maximum variance unfolding]], and [[Sammon's projection|Sammon mapping]] (which is not in fact a mapping) are examples of metric multidimensional scaling methods.<br /> <br /> ==See also==<br /> * [[Discriminant analysis]]<br /> * [[Elastic map]]&lt;ref&gt;[http://bioinfo-out.curie.fr/projects/elmap/ ELastic MAPs]&lt;/ref&gt;<br /> * [[Feature learning]]<br /> * [[Growing self-organizing map]] (GSOM)<br /> * [[Pairwise distance methods]]<br /> * [[Self-organizing map]] (SOM)<br /> <br /> ==References==<br /> {{reflist}}<br /> <br /> ==External links==<br /> * [http://isomap.stanford.edu/ Isomap]<br /> * [http://www.ncrg.aston.ac.uk/GTM/ Generative Topographic Mapping]<br /> * [http://www.miketipping.com/thesis.htm Mike Tipping's Thesis]<br /> * [http://www.dcs.shef.ac.uk/~neil/gplvm/ Gaussian Process Latent Variable Model]<br /> * [http://www.cs.toronto.edu/~roweis/lle/ Locally Linear Embedding]<br /> * [http://www.visumap.net/index.aspx?p=Resources/RpmOverview Relational Perspective Map]<br /> * [http://waffles.sourceforge.net/ Waffles] is an open source C++ library containing implementations of LLE, Manifold Sculpting, and some other manifold learning algorithms.<br /> * [http://shogun-toolbox.org/edrt/ Efficient Dimensionality Reduction Toolkit homepage]<br /> * [http://sy.lespi.free.fr/DD-HDS-homepage.html DD-HDS homepage]<br /> * [http://sy.lespi.free.fr/RankVisu-homepage.html RankVisu homepage]<br /> * [http://tx.technion.ac.il/~rc/diffusion_maps.pdf Short review of Diffusion Maps]<br /> * [http://www.nlpca.org/ Nonlinear PCA by autoencoder neural networks]<br /> <br /> {{DEFAULTSORT:Nonlinear Dimensionality Reduction}}<br /> [[Category:Multivariate statistics]]<br /> [[Category:Dimension]]<br /> [[Category:Dimension reduction]]</div> Deepalgo 
https://en.wikipedia.org/w/index.php?title=Item-item_collaborative_filtering&diff=711006793 Item-item collaborative filtering 2016-03-20T12:11:12Z <p>Deepalgo: </p> <hr /> <div>{{recommender systems}}<br /> '''Item-item collaborative filtering''', or '''item-based''', or '''item-to-item''', is a form of [[collaborative filtering]] based on the similarity between items calculated using people's ratings of those items. Item-item collaborative filtering was first published in 2001, and in 2003 the e-commerce website [[Amazon.com|Amazon]] stated this algorithm powered its recommender system.<br /> <br /> Earlier collaborative filtering systems based on [[Star (classification)|rating]] similarity between users (known as [[user-user collaborative filtering]]) had several problems:<br /> * systems performed poorly when they had many items but comparatively few ratings<br /> * computing similarities between all pairs of users was expensive<br /> * user profiles changed quickly and the entire system model had to be recomputed<br /> <br /> Item-item models resolve these problems in systems that have more users than items. Item-item models use rating distributions ''per item'', not ''per user''. With more users than items, each item tends to have more ratings than each user, so an item's average rating usually doesn't change quickly. This leads to more stable rating distributions in the model, so the model doesn't have to be rebuilt as often. When users consume and then rate an item, that item's similar items are picked from the existing system model and added to the user's recommendations.<br /> <br /> Recently, a method named Item2Vec&lt;ref name=&quot;item2vec&quot;&gt;Barkan, O; Koenigstein, N (14 March 2016). [http://arxiv.org/abs/1603.04259 &quot;Item2Vec: Neural Item Embedding for Collaborative Filtering&quot;]. arXiv:1603.04259.&lt;/ref&gt; was introduced for scalable item-item collaborative filtering. 
Item2Vec produces low-dimensional representations for items, where the affinity between items can be measured by cosine similarity. The method is based on the Word2Vec method that was successfully applied to natural language processing applications.<br /> <br /> ==Method==<br /> First, the system executes a model-building stage by finding the similarity between all pairs of items. This [[Similarity measure|similarity function]] can take many forms, such as the correlation between ratings or the cosine of those rating vectors. As in user-user systems, similarity functions can use [[Normalization (statistics)|normalized]] ratings (correcting, for instance, for each user's average rating).<br /> <br /> Second, the system executes a [[recommender system|recommendation]] stage. It uses the most similar items to a user's already-rated items to generate a list of recommendations. Usually this calculation is a [[Weight function|weighted sum]] or [[linear regression]]. This form of recommendation is analogous to &quot;people who rate item X highly, like you, also tend to rate item Y highly, and you haven't rated item Y yet, so you should try it&quot;.<br /> <br /> ==Results==<br /> Item-item collaborative filtering had less error than user-user collaborative filtering. 
In addition, its less-dynamic model was computed less often and stored in a smaller matrix, so item-item system performance was better than user-user systems.<br /> <br /> ==See also==<br /> * [[Slope One]], a family of item-item collaborative filtering algorithms designed to reduce model [[overfitting]] problems<br /> <br /> ==Bibliography==<br /> * {{cite journal|url=http://dl.acm.org/citation.cfm?id=372071|title=Item-based collaborative filtering recommendation algorithms|journal=Proceedings of the 10th international conference on the World Wide Web|pages=285-295 |date=2001 |isbn=1-58113-348-0 |doi=10.1145/371920.372071|first1=Badrul |last1= Sarwar |first2= George |last2= Karypis |first3= Joseph |last3=Konstan|first4= John |last4=Riedl |authorlink4=John Riedl|publisher=[[Association for Computing Machinery|ACM]]}}<br /> * {{cite journal|url=http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1167344|title=Amazon.com recommendations: item-to-item collaborative filtering|journal=IEEE Internet Computing|pages=76-80 |date=22 January 2003 |issn=1089-7801 |publisher=[[IEEE]] |volume=7 |issue=1 |doi=10.1109/MIC.2003.1167344|first1=G |last1= Linden |first2= B |last2= Smith |first3= J |last3=York}}<br /> * Barkan, O; Koenigstein, N (14 March 2016). [[arxiv:1603.04259|&quot;Item2Vec: Neural Item Embedding for Collaborative Filtering&quot;]]. arXiv:1603.04259.<br /> {{reflist}}<br /> <br /> [[Category:Recommender systems]]</div> Deepalgo https://en.wikipedia.org/w/index.php?title=Collaborative_filtering&diff=711006589 Collaborative filtering 2016-03-20T12:08:55Z <p>Deepalgo: </p> <hr /> <div>{{external links|date=November 2013}}<br /> {{Use dmy dates|date=June 2013}}<br /> {{Recommender systems}}<br /> [[File:Collaborative filtering.gif|300px|thumb|<br /> <br /> This image shows an example of predicting of the user's rating using [[Collaborative software|collaborative]] filtering. At first, people rate different items (like videos, images, games). After that, the system is making [[prediction]]s about user's rating for an item, which the user hasn't rated yet. These predictions are built upon the existing ratings of other users, who have similar ratings with the active user. For instance, in our case the system has made a prediction, that the active user won't like the video.]]<br /> <br /> '''Collaborative filtering''' ('''CF''') is a technique used by some [[recommender system]]s.&lt;ref name=&quot;handbook&quot;&gt;Francesco Ricci and Lior Rokach and Bracha Shapira, [http://www.inf.unibz.it/~ricci/papers/intro-rec-sys-handbook.pdf Introduction to Recommender Systems Handbook], Recommender Systems Handbook, Springer, 2011, pp. 
1-35&lt;/ref&gt; [[Collaborative software|Collaborative]] filtering has two senses, a narrow one and a more general one.&lt;ref name=recommender&gt;{{cite web|title=Beyond Recommender Systems: Helping People Help Each Other|url=http://www.grouplens.org/papers/pdf/rec-sys-overview.pdf|publisher=Addison-Wesley|accessdate=16 January 2012|page=6|year=2001|last1=Terveen|first1=Loren|last2=Hill|first2=Will|authorlink1=Loren Terveen}}&lt;/ref&gt; In general, collaborative filtering is the process of filtering for information or patterns using techniques involving collaboration among multiple agents, viewpoints, data sources, etc.&lt;ref name=&quot;recommender&quot; /&gt; Applications of collaborative filtering typically involve very large data sets. Collaborative filtering methods have been applied to many different kinds of data including: sensing and monitoring data, such as in mineral exploration, environmental sensing over large areas or multiple sensors; financial data, such as financial service institutions that integrate many financial sources; or in electronic commerce and web applications where the focus is on user data, etc. The remainder of this discussion focuses on collaborative filtering for user data, although some of the methods and approaches may apply to the other major applications as well.<br /> <br /> In the newer, narrower sense, collaborative filtering is a method of making automatic [[prediction]]s (filtering) about the interests of a user by collecting preferences or [[taste (sociology)|taste]] information from [[crowdsourcing|many users]] (collaborating). The underlying assumption of the collaborative filtering approach is that if a person ''A'' has the same opinion as a person ''B'' on an issue, A is more likely to have B's opinion on a different issue ''x'' than to have the opinion on x of a person chosen randomly. 
For example, a collaborative filtering recommendation system for [[television]] tastes could make predictions about which television show a user should like given a partial list of that user's tastes (likes or dislikes).&lt;ref&gt;[http://www.redbeemedia.com/insights/integrated-approach-tv-vod-recommendations An integrated approach to TV &amp; VOD Recommendations] {{wayback|url=http://www.redbeemedia.com/insights/integrated-approach-tv-vod-recommendations |date=20120606225352 |df=y }}&lt;/ref&gt; Note that these predictions are specific to the user, but use information gleaned from many users. This differs from the simpler approach of giving an [[average]] (non-specific) score for each item of interest, for example based on its number of [[vote]]s.<br /> <br /> ==Introduction==<br /> The [[internet growth|growth]] of the [[Internet]] has made it much more difficult to effectively [[information extraction|extract useful information]] from all the available [[online information]]. The overwhelming amount of data necessitates mechanisms for efficient [[information filtering]]. One of the techniques used for dealing with this problem is called collaborative filtering.<br /> <br /> The motivation for collaborative filtering comes from the idea that people often get the best recommendations from someone with [[similarity|similar]] tastes to themselves. Collaborative filtering explores techniques for matching people with similar interests and making [[recommendation]]s on this basis.<br /> <br /> Collaborative filtering algorithms often require (1) users’ active participation, (2) an easy way to represent users’ interests to the system, and (3) algorithms that are able to match people with similar interests.<br /> <br /> Typically, the workflow of a collaborative filtering system is:<br /> # A user expresses his or her preferences by rating items (e.g. books, movies or CDs) of the system. 
These ratings can be viewed as an approximate representation of the user's interest in the corresponding domain.<br /> # The system matches this user’s ratings against other users’ and finds the people with the most &quot;similar&quot; tastes.<br /> # With similar users identified, the system recommends items that those users have rated highly but that this user has not yet rated (the absence of a rating is often taken to indicate unfamiliarity with an item)<br /> A key problem of collaborative filtering is how to combine and weight the preferences of user neighbors. Sometimes, users can immediately rate the recommended items. As a result, the system gains an increasingly accurate representation of user preferences over time.<br /> <br /> ==Methodology==<br /> <br /> [[File:Collaborative Filtering in Recommender Systems.jpg|thumb|Collaborative Filtering in Recommender Systems]]<br /> <br /> Collaborative filtering systems have many forms, but many common systems can be reduced to two steps:<br /> # Look for users who share the same rating patterns with the active user (the user whom the prediction is for).<br /> # Use the ratings from those like-minded users found in step 1 to calculate a prediction for the active user.<br /> This falls under the category of user-based collaborative filtering. 
A specific application of this is the user-based [[K-nearest neighbor algorithm|Nearest Neighbor algorithm]].<br /> <br /> Alternatively, [[item-item collaborative filtering|item-based collaborative filtering]] (users who bought x also bought y), proceeds in an item-centric manner:<br /> # Build an item-item matrix determining relationships between pairs of items<br /> # Infer the tastes of the current user by examining the matrix and matching that user's data<br /> See, for example, the [[Slope One]] item-based collaborative filtering family.<br /> <br /> Another form of collaborative filtering can be based on implicit observations of normal user behavior (as opposed to the artificial behavior imposed by a rating task). These systems observe what a user has done together with what all users have done (what music they have listened to, what items they have bought) and use that data to predict the user's behavior in the future, or to predict how a user might like to behave given the chance. These predictions then have to be filtered through [[business logic]] to determine how they might affect the actions of a business system. For example, it is not useful to offer to sell somebody a particular album of music if they already have demonstrated that they own that music.<br /> <br /> Relying on a scoring or rating system which is averaged across all users ignores specific demands of a user, and is particularly poor in tasks where there is large variation in interest (as in the recommendation of music). However, there are other methods to combat information explosion, such as [[WWW|web]] search and [[data clustering]].<br /> <br /> ==Types==<br /> <br /> ===Memory-based===<br /> This approach uses user rating data to compute the similarity between users or items. This is used for making recommendations. This was an early approach used in many commercial systems. It's effective and easy to implement. 
Typical examples of this approach are neighbourhood-based CF and item-based/user-based top-N recommendations. For example, in user-based approaches, the rating that user 'u' gives to item 'i' is calculated as an [[aggregation]] of some similar users' ratings of the item:<br /> :&lt;math&gt;r_{u,i} = \operatorname{aggr}_{u^\prime \in U} r_{u^\prime, i}&lt;/math&gt;<br /> <br /> where 'U' denotes the set of top 'N' users that are most similar to user 'u' who rated item 'i'. Some examples of the aggregation function include:<br /> :&lt;math&gt;r_{u,i} = \frac{1}{N}\sum\limits_{u^\prime \in U}r_{u^\prime, i}&lt;/math&gt;<br /> :&lt;math&gt;r_{u,i} = k\sum\limits_{u^\prime \in U}\operatorname{simil}(u,u^\prime)r_{u^\prime, i}&lt;/math&gt;<br /> :&lt;math&gt;r_{u,i} = \bar{r_u} + k\sum\limits_{u^\prime \in U}\operatorname{simil}(u,u^\prime)(r_{u^\prime, i}-\bar{r_{u^\prime}} )&lt;/math&gt;<br /> <br /> where k is a normalizing factor defined as &lt;math&gt;k =1/\sum_{u^\prime \in U}|\operatorname{simil}(u,u^\prime)| &lt;/math&gt;, and &lt;math&gt;\bar{r_u}&lt;/math&gt; is the average rating of user u for all the items rated by u.<br /> <br /> The neighborhood-based algorithm calculates the similarity between two users or items, and produces a prediction for the user by taking the [[weighted average]] of all the ratings. Similarity computation between items or users is an important part of this approach. 
Multiple measures, such as [[Pearson product-moment correlation coefficient|Pearson correlation]] and [[Cosine similarity|vector cosine]] based similarity are used for this.<br /> <br /> The Pearson correlation similarity of two users x, y is defined as <br /> :&lt;math&gt; \operatorname{simil}(x,y) = \frac{\sum\limits_{i \in I_{xy}}(r_{x,i}-\bar{r_x})(r_{y,i}-\bar{r_y})}{\sqrt{\sum\limits_{i \in I_{xy}}(r_{x,i}-\bar{r_x})^2\sum\limits_{i \in I_{xy}}(r_{y,i}-\bar{r_y})^2}} &lt;/math&gt;<br /> <br /> where I&lt;sub&gt;xy&lt;/sub&gt; is the set of items rated by both user x and user y.<br /> <br /> The cosine-based approach defines the cosine-similarity between two users x and y as:&lt;ref name=&quot;Breese1999&quot;&gt;John S. Breese, David Heckerman, and Carl Kadie, [http://uai.sis.pitt.edu/displayArticleDetails.jsp?mmnu=1&amp;smnu=2&amp;article_id=231&amp;proceeding_id=14 Empirical Analysis of Predictive Algorithms for Collaborative Filtering], 1998 {{wayback|url=http://uai.sis.pitt.edu/displayArticleDetails.jsp?mmnu=1&amp;smnu=2&amp;article_id=231&amp;proceeding_id=14 |date=20131019134152 |df=y }}&lt;/ref&gt;<br /> :&lt;math&gt;\operatorname{simil}(x,y) = \cos(\vec x,\vec y) = \frac{\vec x \cdot \vec y}{||\vec x|| \times ||\vec y||} = \frac{\sum\limits_{i \in I_{xy}}r_{x,i}r_{y,i}}{\sqrt{\sum\limits_{i \in I_{x}}r_{x,i}^2}\sqrt{\sum\limits_{i \in I_{y}}r_{y,i}^2}}&lt;/math&gt;<br /> <br /> The user based top-N recommendation algorithm uses a similarity-based vector model to identify the k most similar users to an active user. After the k most similar users are found, their corresponding user-item matrices are aggregated to identify the set of items to be recommended. 
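As an illustrative sketch of the user-based formulas above (with the assumption, not from the cited sources, that each user's ratings are stored as a plain dict mapping item ids to scores):

```python
import math

def pearson(rx, ry):
    """Pearson correlation over the items rated by both users (I_xy)."""
    common = set(rx) & set(ry)
    if not common:
        return 0.0
    # Per the text, the means are taken over all items each user rated.
    mean_x = sum(rx.values()) / len(rx)
    mean_y = sum(ry.values()) / len(ry)
    num = sum((rx[i] - mean_x) * (ry[i] - mean_y) for i in common)
    den = math.sqrt(sum((rx[i] - mean_x) ** 2 for i in common)) * \
          math.sqrt(sum((ry[i] - mean_y) ** 2 for i in common))
    return num / den if den else 0.0

def predict(u, item, ratings):
    """Mean-centered weighted aggregation:
    r_{u,i} = mean(r_u) + k * sum simil(u,u') * (r_{u',i} - mean(r_{u'}))."""
    mean_u = sum(ratings[u].values()) / len(ratings[u])
    sims = [(pearson(ratings[u], ratings[v]), v)
            for v in ratings if v != u and item in ratings[v]]
    norm = sum(abs(s) for s, _ in sims)   # k = 1 / sum |simil(u, u')|
    if norm == 0:
        return mean_u
    return mean_u + sum(
        s * (ratings[v][item] - sum(ratings[v].values()) / len(ratings[v]))
        for s, v in sims) / norm
```

Here every other user who rated the item stands in for the top-N neighborhood 'U'; a real system would keep only the N most similar users.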
A popular method for finding similar users is [[Locality-sensitive hashing|locality-sensitive hashing]], which implements the [[Nearest neighbor search|nearest neighbor mechanism]] in linear time.<br /> <br /> The advantages of this approach include: the explainability of the results, which is an important aspect of recommendation systems; easy creation and use; easy integration of new data; content-independence of the items being recommended; and good scaling with co-rated items.<br /> <br /> There are also several disadvantages to this approach. Its performance decreases when [[sparsity|data gets sparse]], which occurs frequently with web-related items. This hinders the [[scalability]] of this approach and creates problems with large datasets. Although it can efficiently handle new users because it relies on a [[data structure]], adding new items becomes more complicated, since that representation usually relies on a specific [[vector space]]. Adding a new item requires its inclusion and the re-insertion of all the elements in the structure.<br /> <br /> ===Model-based===<br /> Models are developed using [[data mining]] and [[machine learning]] algorithms to find patterns based on training data. These are then used to make predictions for real data. There are many model-based CF algorithms. These include [[Bayesian networks]], neural embedding models,&lt;ref name=&quot;item2vec&quot;&gt;Barkan, O; Koenigstein, N (14 March 2016). [http://arxiv.org/abs/1603.04259 &quot;Item2Vec: Neural Item Embedding for Collaborative Filtering&quot;]. arXiv:1603.04259.&lt;/ref&gt; [[Cluster Analysis|clustering models]], [[Latent Semantic Indexing|latent semantic models]] such as [[singular value decomposition]], [[probabilistic latent semantic analysis]], multiple multiplicative factor, [[latent Dirichlet allocation]] and [[Markov decision process]] based models.&lt;ref name=&quot;Suetal2009&quot;&gt;Xiaoyuan Su, Taghi M.
Khoshgoftaar, [http://www.hindawi.com/journals/aai/2009/421425/ A survey of collaborative filtering techniques], Advances in Artificial Intelligence archive, 2009.&lt;/ref&gt;<br /> <br /> This approach has a more holistic goal: to uncover the latent factors that explain observed ratings.&lt;ref&gt;[http://research.yahoo.com/pub/2435 Factor in the Neighbors: Scalable and Accurate Collaborative Filtering] {{wayback|url=http://research.yahoo.com/pub/2435 |date=20101023032716 |df=y }}&lt;/ref&gt; Most of the models are based on creating a classification or clustering technique to identify the user based on the training set. The number of parameters can be reduced using types of [[Principal Component Analysis|principal component analysis]].<br /> <br /> There are several advantages to this paradigm. It handles sparsity better than memory-based methods, which helps with scalability on large data sets. It improves prediction performance and gives an intuitive rationale for the recommendations.<br /> <br /> The main disadvantage of this approach is the expense of model building. One must trade off prediction performance against scalability, useful information can be lost in reduction models, and a number of models have difficulty explaining their predictions.<br /> <br /> ===Hybrid===<br /> A number of applications combine the memory-based and the model-based CF algorithms. These hybrids overcome the limitations of pure CF approaches and improve prediction performance. Importantly, they overcome CF problems such as sparsity and loss of information.
However, they have increased complexity and are expensive to implement.&lt;ref&gt;{{cite journal | url = http://www.sciencedirect.com/science/article/pii/S0020025512002587 | doi=10.1016/j.ins.2012.04.012 | volume=208 | title=Kernel-Mapping Recommender system algorithms | journal=Information Sciences | pages=81–104}}<br /> &lt;/ref&gt; Most commercial recommender systems are hybrid; the Google News recommender system is one example.&lt;ref&gt;{{cite web|url=http://dl.acm.org/citation.cfm?id=1242610|title=Google news personalization|publisher=}}&lt;/ref&gt;<br /> <br /> ==Application on social web==<br /> Unlike the traditional model of mainstream media, in which a few editors set guidelines, collaboratively filtered social media can have a very large number of editors, and content improves as the number of participants increases. Services like [[Reddit]], [[YouTube]], and [[Last.fm]] are typical examples of collaborative filtering based media.&lt;ref&gt;[http://www.readwriteweb.com/archives/collaborative_filtering_social_web.php Collaborative Filtering: Lifeblood of The Social Web]&lt;/ref&gt;<br /> <br /> One application of collaborative filtering is to recommend interesting or popular information as judged by the community. As a typical example, stories appear on the front page of [[Digg]] as they are &quot;voted up&quot; (rated positively) by the community. As the community becomes larger and more diverse, the promoted stories can better reflect the average interest of the community members.<br /> <br /> Another aspect of collaborative filtering systems is the ability to generate more personalized recommendations by analyzing information from the past activity of a specific user, or the history of other users deemed to be of similar taste to a given user. These resources are used for user profiling and help the site recommend content on a user-by-user basis.
The more a given user makes use of the system, the better the recommendations become, as the system gains data to improve its model of that user.<br /> <br /> ===Problems===<br /> A collaborative filtering system does not necessarily succeed in automatically matching content to one's preferences. Unless the platform achieves unusually good diversity and independence of opinions, one point of view will always dominate another in a particular community. As in the personalized recommendation scenario, the introduction of new users or new items can cause the [[cold start]] problem, as there will be insufficient data on these new entries for the collaborative filtering to work accurately. In order to make appropriate recommendations for a new user, the system must first learn the user's preferences by analysing past voting or rating activities. The collaborative filtering system requires a substantial number of users to rate a new item before that item can be recommended.<br /> <br /> ==Challenges of collaborative filtering==<br /> <br /> ===Data sparsity===<br /> In practice, many commercial recommender systems are based on large datasets. As a result, the user-item matrix used for collaborative filtering can be extremely large and sparse, which poses challenges for recommendation performance.<br /> <br /> One typical problem caused by data sparsity is the [[cold start]] problem. As collaborative filtering methods recommend items based on users’ past preferences, new users need to rate a sufficient number of items to enable the system to capture their preferences accurately and thus provide reliable recommendations.<br /> <br /> Similarly, new items have the same problem. When new items are added to the system, they need to be rated by a substantial number of users before they can be recommended to users with similar tastes to those who rated them.
The new item problem does not affect [[Content-based filtering|content-based recommendation]], because the recommendation of an item is based on its discrete set of descriptive qualities rather than its ratings.<br /> <br /> ===Scalability===<br /> As the numbers of users and items grow, traditional CF algorithms will suffer serious scalability problems{{Citation needed|date=April 2013}}. For example, with tens of millions of customers &lt;math&gt;O(M)&lt;/math&gt; and millions of items &lt;math&gt;O(N)&lt;/math&gt;, a CF algorithm with complexity &lt;math&gt;O(MN)&lt;/math&gt; is already too expensive. As well, many systems need to react immediately to online requests and make recommendations for all users regardless of their purchase and rating history, which demands high scalability of a CF system. Large web companies such as Twitter use clusters of machines to scale recommendations for their millions of users, with most computations happening in very large memory machines.&lt;ref name=&quot;twitterwtf&quot;&gt;Pankaj Gupta, Ashish Goel, Jimmy Lin, Aneesh Sharma, Dong Wang, and Reza Bosagh Zadeh [http://dl.acm.org/citation.cfm?id=2488433 WTF: The who-to-follow system at Twitter], Proceedings of the 22nd international conference on World Wide Web&lt;/ref&gt;<br /> <br /> Recently, a method named [[arxiv:1603.04259|Item2Vec]]&lt;ref name=item2vec /&gt; was introduced for scalable item-based collaborative filtering. Item2Vec produces embeddings for items in a latent space and is capable of inferring item-to-item relations even when user information is not available.<br /> <br /> ===Synonyms===<br /> [[Synonyms|Synonymy]] refers to the tendency of the same or very similar items to have different names or entries.
Most recommender systems are unable to discover this latent association and thus treat these products differently.<br /> <br /> For example, the seemingly different items &quot;children movie&quot; and &quot;children film&quot; actually refer to the same item. Indeed, the degree of variability in descriptive term usage is greater than commonly suspected.{{citation needed|date=September 2013}} The prevalence of synonyms decreases the recommendation performance of CF systems. Topic modeling (such as the [[Latent Dirichlet Allocation]] technique) could solve this by grouping different words belonging to the same topic.{{citation needed|date=September 2013}}<br /> <br /> ===Gray sheep===<br /> Gray sheep refers to users whose opinions do not consistently agree or disagree with any group of people and who thus do not benefit from collaborative filtering. [[Black sheep]] are the opposite group, whose idiosyncratic tastes make recommendations nearly impossible. Although this is a failure of the recommender system, non-electronic recommenders also have great problems in these cases, so black sheep is considered an acceptable failure.<br /> <br /> ===Shilling attacks===<br /> In a recommendation system where everyone can give ratings, people may give many positive ratings for their own items and negative ratings for their competitors'. It is often necessary for collaborative filtering systems to introduce precautions to discourage such manipulation.<br /> <br /> ===Diversity and the long tail===<br /> Collaborative filters are expected to increase diversity because they help us discover new products. Some algorithms, however, may unintentionally do the opposite. Because collaborative filters recommend products based on past sales or ratings, they cannot usually recommend products with limited historical data. This can create a rich-get-richer effect for popular products, akin to [[positive feedback]].
This bias toward popularity can prevent what would otherwise be better consumer-product matches. A [[Wharton School of the University of Pennsylvania|Wharton]] study details this phenomenon along with several ideas that may promote diversity and the &quot;[[long tail]].&quot;&lt;ref&gt;{{cite journal| last1= Fleder | first1= Daniel | first2= Kartik |last2= Hosanagar | title=Blockbuster Culture's Next Rise or Fall: The Impact of Recommender Systems on Sales Diversity|journal=Management Science |date=May 2009|url=http://papers.ssrn.com/sol3/papers.cfm?abstract_id=955984 | doi = 10.1287/mnsc.1080.0974 }}&lt;/ref&gt; Several collaborative filtering algorithms have been developed to promote diversity and the &quot;[[long tail]]&quot; by recommending novel, unexpected,&lt;ref&gt;{{cite journal| last1= Adamopoulos | first1= Panagiotis | first2= Alexander |last2= Tuzhilin | title=On Unexpectedness in Recommender Systems: Or How to Better Expect the Unexpected|journal=ACM Transactions on Intelligent Systems and Technology |date=January 2015|url=http://dl.acm.org/citation.cfm?id=2559952 | doi = 10.1145/2559952}}&lt;/ref&gt; and serendipitous items.&lt;ref&gt;{{cite journal| last1= Adamopoulos | first1= Panagiotis | title=Beyond rating prediction accuracy: on new perspectives in recommender systems|journal=Proceedings of the 7th ACM conference on Recommender systems |date=October 2013|url=http://dl.acm.org/citation.cfm?id=2508073| doi = 10.1145/2507157.2508073}}&lt;/ref&gt;<br /> <br /> ==Innovations==<br /> {{Prose|date=May 2012}}<br /> * New algorithms have been developed for CF as a result of the [[Netflix prize]].<br /> * Cross-system collaborative filtering, in which user profiles across multiple [[recommender systems]] are combined in a privacy-preserving manner.<br /> * Robust collaborative filtering, in which recommendation is stable against efforts of manipulation.
This research area is still active and not completely solved.&lt;ref&gt;{{cite web|url=http://dl.acm.org/citation.cfm?id=1297240 |title=Robust collaborative filtering |doi=10.1145/1297231.1297240 |publisher=Portal.acm.org |date=19 October 2007 |accessdate=2012-05-15}}&lt;/ref&gt;<br /> <br /> ==See also==<br /> * [[Attention Profiling Mark-up Language|Attention Profiling Mark-up Language (APML)]]<br /> * [[Cold start]]<br /> * [[Collaborative model]]<br /> * [[Collaborative search engine]]<br /> * [[Collective intelligence]]<br /> * [[Customer engagement]]<br /> * [[Delegative Democracy]], the same principle applied to voting rather than filtering<br /> * [[Enterprise bookmarking]]<br /> * [[Firefly (website)]], a defunct website which was based on collaborative filtering<br /> * [[Long tail]]<br /> * [[Preference elicitation]]<br /> * [[Recommendation system]]<br /> * [[Relevance (information retrieval)]]<br /> * [[Reputation system]]<br /> * [[Robust collaborative filtering]]<br /> * [[Similarity search]]<br /> * [[Slope One]]<br /> * [[Social translucence]]<br /> <br /> ==References==<br /> {{Reflist|30em}}<br /> <br /> ==External links==<br /> *[http://arxiv.org/abs/1603.04259 Item2Vec: Neural Item Embedding for Collaborative Filtering], Barkan, O; Koenigstein, N (14 March 2016) arXiv:1603.04259.<br /> *[http://www.grouplens.org/papers/pdf/rec-sys-overview.pdf ''Beyond Recommender Systems: Helping People Help Each Other''], page 12, 2001<br /> *[http://www.prem-melville.com/publications/recommender-systems-eml2010.pdf Recommender Systems.] Prem Melville and Vikas Sindhwani. 
In Encyclopedia of Machine Learning, Claude Sammut and Geoffrey Webb (Eds), Springer, 2010.<br /> *[http://arxiv.org/abs/1203.4487 Recommender Systems in industrial contexts - PhD thesis (2012) including a comprehensive overview of many collaborative recommender systems]<br /> *[http://web.archive.org/web/20080602151647/http://ieeexplore.ieee.org:80/xpls/abs_all.jsp?arnumber=1423975 Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions]. Adomavicius, G. and Tuzhilin, A. IEEE Transactions on Knowledge and Data Engineering 06.2005<br /> *[https://web.archive.org/web/20060527214435/http://ectrl.itc.it/home/laboratory/meeting/download/p5-l_herlocker.pdf Evaluating collaborative filtering recommender systems] ([http://www.doi.org/ DOI]: [http://dx.doi.org/10.1145/963770.963772 10.1145/963770.963772])<br /> *[http://www.grouplens.org/publications.html GroupLens research papers].<br /> *[http://www.cs.utexas.edu/users/ml/papers/cbcf-aaai-02.pdf Content-Boosted Collaborative Filtering for Improved Recommendations.] Prem Melville, Raymond J. Mooney, and Ramadass Nagarajan. Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI-2002), pp.&amp;nbsp;187–192, Edmonton, Canada, July 2002.<br /> *[http://agents.media.mit.edu/projects.html A collection of past and present &quot;information filtering&quot; projects (including collaborative filtering) at MIT Media Lab]<br /> *[http://www.ieor.berkeley.edu/~goldberg/pubs/eigentaste.pdf Eigentaste: A Constant Time Collaborative Filtering Algorithm. Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Information Retrieval, 4(2), 133-151. July 2001.]<br /> *[http://downloads.hindawi.com/journals/aai/2009/421425.pdf A Survey of Collaborative Filtering Techniques] Su, Xiaoyuan and Khoshgoftaar, Taghi M.<br /> *[http://dl.acm.org/citation.cfm?id=1242610 Google News Personalization: Scalable Online Collaborative Filtering] Abhinandan Das, Mayur Datar, Ashutosh Garg, and Shyam Rajaram. International World Wide Web Conference, Proceedings of the 16th international conference on World Wide Web<br /> *[http://web.archive.org/web/20101023032716/http://research.yahoo.com:80/pub/2435 Factor in the Neighbors: Scalable and Accurate Collaborative Filtering] Yehuda Koren, Transactions on Knowledge Discovery from Data (TKDD) (2009)<br /> *[http://webpages.uncc.edu/~asaric/ISMIS09.pdf Rating Prediction Using Collaborative Filtering]<br /> *[http://www.cis.upenn.edu/~ungar/CF/ Recommender Systems]<br /> *[http://www2.sims.berkeley.edu/resources/collab/ Berkeley Collaborative Filtering]<br /> <br /> {{Authority control}}<br /> <br /> {{DEFAULTSORT:Collaborative Filtering}}<br /> [[Category:Collaboration]]<br /> [[Category:Collaborative software]]<br /> [[Category:Collective intelligence]]<br /> [[Category:Information retrieval techniques]]<br /> [[Category:Recommender systems]]<br /> [[Category:Social information processing]]<br /> [[Category:Behavioral and social facets of systemic risk]]</div> Deepalgo https://en.wikipedia.org/w/index.php?title=Similarity_learning&diff=711006432 Similarity learning 2016-03-20T12:07:07Z <p>Deepalgo: </p> <hr /> <div>'''Similarity learning''' is an area of supervised [[machine learning]] in [[artificial intelligence]]. It is closely related to [[regression (machine learning)|regression]] and [[classification in machine learning|classification]], but the goal is to learn from examples a similarity function that measures how similar or related two objects are. It has applications in [[ranking]], in [[recommendation systems]]&lt;ref name=&quot;item2vec&quot;&gt;Barkan, O; Koenigstein, N (14 March 2016). [http://arxiv.org/abs/1603.04259 &quot;Item2Vec: Neural Item Embedding for Collaborative Filtering&quot;].
arXiv:1603.04259.&lt;/ref&gt; and face verification.&lt;ref name=&quot;vmrs&quot;&gt; Barkan O, Weill J, Wolf L, Aronowitz H. [http://www.cv-foundation.org/openaccess/content_iccv_2013/papers/Barkan_Fast_High_Dimensional_2013_ICCV_paper.pdf &quot;Fast high dimensional vector multiplication face recognition&quot;]. In Proceedings of the IEEE International Conference on Computer Vision 2013 (pp. 1960-1967).&lt;/ref&gt;<br /> <br /> == Learning setup ==<br /> <br /> There are four common setups for similarity and metric distance learning.<br /> <br /> * ''[[Regression (machine learning)|Regression]] similarity learning''. In this setup, pairs of objects are given &lt;math&gt; (x_i^1, x_i^2) &lt;/math&gt; together with a measure of their similarity &lt;math&gt; y_i \in R &lt;/math&gt;. The goal is to learn a function that approximates &lt;math&gt; f(x_i^1, x_i^2) \sim y_i &lt;/math&gt; for every new labeled triplet example &lt;math&gt;(x_i^1, x_i^2, y_i)&lt;/math&gt;. This is typically achieved by minimizing a regularized loss &lt;math&gt; \min_w \sum_i \operatorname{loss}(w;x_i^1, x_i^2,y_i) + \operatorname{reg}(w)&lt;/math&gt;.<br /> * ''[[Classification in machine learning|Classification]] similarity learning''. Given are pairs of similar objects &lt;math&gt;(x_i, x_i^+) &lt;/math&gt; and dissimilar objects &lt;math&gt;(x_i, x_i^-)&lt;/math&gt;. An equivalent formulation is that every pair &lt;math&gt;(x_i^1, x_i^2)&lt;/math&gt; is given together with a binary label &lt;math&gt;y_i \in \{0,1\}&lt;/math&gt; that determines if the two objects are similar or not. The goal is again to learn a classifier that can decide if a new pair of objects is similar or not.<br /> * ''Ranking similarity learning''. Given are triplets of objects &lt;math&gt;(x_i, x_i^+, x_i^-)&lt;/math&gt; whose relative similarity obeys a predefined order: &lt;math&gt;x_i&lt;/math&gt; is known to be more similar to &lt;math&gt;x_i^+&lt;/math&gt; than to &lt;math&gt;x_i^-&lt;/math&gt;.
The goal is to learn a function &lt;math&gt;f&lt;/math&gt; such that for any new triplet of objects &lt;math&gt;(x, x^+, x^-)&lt;/math&gt;, it obeys &lt;math&gt;f(x, x^+) &gt; f(x, x^-)&lt;/math&gt;. This setup assumes a weaker form of supervision than in regression, because instead of providing an exact measure of similarity, one only has to provide the relative order of similarity. For this reason, ranking-based similarity learning is easier to apply in large-scale real-world applications.&lt;ref&gt;{{cite journal| last1 = Chechik | first1 = G. | last2 = Sharma | first2 = V. | last3 = Shalit | first3 = U. | last4 = Bengio | first4 = S. | title=Large Scale Online Learning of Image Similarity Through Ranking|journal=Journal of Machine Learning research|year=2010|volume=11|pages=1109–1135|url=http://www.jmlr.org/papers/volume11/chechik10a/chechik10a.pdf}}&lt;/ref&gt;<br /> * [[Locality sensitive hashing]] (LSH)&lt;ref&gt;Gionis, Aristides, Piotr Indyk, and Rajeev Motwani. &quot;Similarity search in high dimensions via hashing.&quot; VLDB. Vol. 99. No. 6. 1999.&lt;/ref&gt; [[Hash Function|hashes]] input items so that similar items map to the same “buckets” in memory with high probability (the number of buckets being much smaller than the universe of possible input items). It is often applied in nearest neighbor search on large scale high-dimensional data, e.g., image databases, document collections, time-series databases, and genome databases.&lt;ref&gt;{{cite web<br /> | first1 = A.|last1=Rajaraman |first2= J.|last2=Ullman|author2-link=Jeffrey Ullman<br /> | url = http://infolab.stanford.edu/~ullman/mmds.html<br /> | title=Mining of Massive Datasets, Ch. 3.<br /> | year = 2010<br /> }}&lt;/ref&gt;<br /> <br /> A common approach for learning similarity is to model the similarity function as a [[bilinear form]].
For example, in the case of ranking similarity learning, one aims to learn a matrix W that parametrizes the similarity function &lt;math&gt; f_W(x, z) = x^T W z &lt;/math&gt;.<br /> <br /> == Metric learning ==<br /> <br /> Similarity learning is closely related to ''distance metric learning''. Metric learning is the task of learning a distance function over objects. A [[Metric (mathematics)|metric]] or [[distance function]] has to obey four axioms: [[non-negative|non-negativity]], [[Identity of indiscernibles]], [[symmetry]] and [[subadditivity]] / triangle inequality. In practice, metric learning algorithms ignore the condition of identity of indiscernibles and learn a pseudo-metric.<br /> <br /> When the objects &lt;math&gt;x_i&lt;/math&gt; are vectors in &lt;math&gt;R^d&lt;/math&gt;, then any matrix &lt;math&gt;W&lt;/math&gt; in the symmetric positive semi-definite cone &lt;math&gt;S_+^d&lt;/math&gt; defines a distance pseudo-metric on the space of x via the form &lt;math&gt;D_W(x_1, x_2)^2 = (x_1-x_2)^{\top} W (x_1-x_2)&lt;/math&gt;. When &lt;math&gt;W&lt;/math&gt; is a symmetric positive definite matrix, &lt;math&gt;D_W&lt;/math&gt; is a metric. Moreover, as any symmetric positive semi-definite matrix &lt;math&gt;W \in S_+^d&lt;/math&gt; can be decomposed as &lt;math&gt;W = L^{\top}L&lt;/math&gt; where &lt;math&gt;L \in R^{e \times d}&lt;/math&gt; and &lt;math&gt;e \geq \operatorname{rank}(W)&lt;/math&gt;, the distance function &lt;math&gt;D_W&lt;/math&gt; can be rewritten equivalently as &lt;math&gt;D_W(x_1, x_2)^2 = (x_1-x_2)^{\top} L^{\top}L (x_1-x_2) = \| L (x_1-x_2) \|_2^2&lt;/math&gt;. The distance &lt;math&gt;D_W(x_1, x_2)^2=\| x_1' - x_2' \|_2^2&lt;/math&gt; corresponds to the Euclidean distance between the projected feature vectors &lt;math&gt;x_1'= Lx_1&lt;/math&gt; and &lt;math&gt;x_2'= Lx_2&lt;/math&gt;.
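The bilinear similarity f_W(x, z) = xᵀWz and the decomposition W = LᵀL described above can be checked numerically. The following is a minimal NumPy sketch (the dimensions, the random data, and the function names are chosen purely for illustration); it verifies that the pseudo-metric D_W equals the Euclidean distance between the projections Lx₁ and Lx₂.

```python
import numpy as np

rng = np.random.default_rng(0)
d, e = 4, 3                      # input dimension d, projection dimension e
L = rng.normal(size=(e, d))
W = L.T @ L                      # symmetric positive semi-definite by construction

def bilinear_simil(x, z, W):
    """Bilinear similarity model f_W(x, z) = xᵀ W z."""
    return x @ W @ z

def dist_sq(x1, x2, W):
    """Squared pseudo-metric D_W(x1, x2)² = (x1 − x2)ᵀ W (x1 − x2)."""
    diff = x1 - x2
    return diff @ W @ diff

x1, x2 = rng.normal(size=d), rng.normal(size=d)
# D_W(x1, x2)² equals the squared Euclidean distance between Lx1 and Lx2:
assert np.isclose(dist_sq(x1, x2, W), np.sum((L @ x1 - L @ x2) ** 2))
# f_W is symmetric here because W is symmetric:
assert np.isclose(bilinear_simil(x1, x2, W), bilinear_simil(x2, x1, W))
```

In an actual metric-learning algorithm, W (or L) would be fitted to labeled pairs or triplets rather than drawn at random; the identities verified here are what make learning L directly attractive when e &lt; d.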
<br /> Some well-known approaches for metric learning include [[Large margin nearest neighbor]]&lt;ref name=LMNN&gt;{{cite journal| last1 = Weinberger | first1 = K. Q. | last2 = Blitzer | first2 = J. C. | last3 = Saul | first3 = L. K. | title=Distance Metric Learning for Large Margin Nearest Neighbor Classification|journal=Advances in Neural Information Processing Systems |volume=18|year=2006|pages=1473–1480|url=http://books.nips.cc/papers/files/nips18/NIPS2005_0265.pdf}}&lt;/ref&gt; and information-theoretic metric learning (ITML).&lt;ref name=ITML&gt;{{cite journal | last1 = Davis | first1 = J. V. | last2 = Kulis | first2 = B. | last3 = Jain | first3 = P. | last4 = Sra | first4 = S. | last5 = Dhillon | first5 = I. S. | title=Information-theoretic metric learning | journal=International conference in machine learning (ICML) | year=2007 | pages=209–216 | url=http://www.cs.utexas.edu/users/pjain/itml/}}&lt;/ref&gt;<br /> <br /> In [[statistics]], the [[covariance]] matrix of the data is sometimes used to define a distance metric called [[Mahalanobis distance]].<br /> <br /> == Applications ==<br /> Similarity learning is used in information retrieval for learning to rank, in face verification or face identification,&lt;ref name=GUILLAUMIN&gt;{{cite journal| last1 = Guillaumin | first1 = M. | last2 = Verbeek | first2 = J. | last3 = Schmid | first3 = C. | title=Is that you? Metric learning approaches for face identification|url=http://hal.inria.fr/docs/00/58/50/36/PDF/verbeek09iccv2.pdf|journal=IEEE International Conference on Computer Vision (ICCV)|year=2009}}&lt;/ref&gt;&lt;ref name=MIGNON&gt;{{cite journal| last1 = Mignon | first1 = A. | last2 = Jurie | first2 = F. | title=PCCA: A new approach for distance learning from sparse pairwise constraints|journal=IEEE Conference on Computer Vision and Pattern Recognition (CVPR)|year=2012|url=http://hal.archives-ouvertes.fr/docs/00/80/60/07/PDF/12_cvpr_ldca.pdf}}&lt;/ref&gt; and in [[recommendation systems]].
Also, many machine learning approaches rely on some metric. This includes unsupervised learning such as [[clustering (machine learning)|clustering]], which groups together close or similar objects. It also includes supervised approaches like the [[K-nearest neighbor algorithm]], which relies on the labels of nearby objects to decide on the label of a new object. Metric learning has been proposed as a preprocessing step for many of these approaches.&lt;ref name=XING&gt;{{cite journal| last1 = Xing | first1 = E. P. | last2 = Ng | first2 = A. Y. | last3 = Jordan | first3 = M. I. | last4 = Russell | first4 = S. | title=Distance Metric Learning, with Application to Clustering with Side-information | journal=Advances in Neural Information Processing Systems |volume=15 | year=2002| pages = 505–512 | publisher = MIT Press}}&lt;/ref&gt;<br /> <br /> == Scalability ==<br /> <br /> Metric and similarity learning naively scale quadratically with the dimension of the input space, as can easily be seen when the learned metric has a bilinear form &lt;math&gt; f_W(x, z) = x^T W z &lt;/math&gt;. Scaling to higher dimensions can be achieved by enforcing a sparseness structure over the matrix model, as done with HDSL&lt;ref name=Liu&gt;{{Cite journal| last1=Liu | last2=Bellet | last3=Sha| title=Similarity Learning for High-Dimensional Sparse Data|year=2015|journal=International Conference on Artificial Intelligence and Statistics (AISTATS)|url=http://jmlr.org/proceedings/papers/v38/liu15.pdf}}&lt;/ref&gt; and with COMET.&lt;ref&gt;{{Cite journal | last1=Atzmon | last2=Shalit | last3=Chechik | title=Learning Sparse Metrics, One Feature at a Time | journal=J. Mach. Learn. Research (JMLR)|year=2015|url=http://jmlr.org/proceedings/papers/v44/atzmon2015.pdf}}&lt;/ref&gt;<br /> <br /> == Further reading ==<br /> For further information on this topic, see the surveys on metric and similarity learning by Bellet et al.&lt;ref name=survey&gt;{{cite arXiv | last1 = Bellet | first1 = A.
| last2 = Habrard | first2 = A. | last3 = Sebban | first3 = M. |eprint=1306.6709 |class=cs.LG |title=A Survey on Metric Learning for Feature Vectors and Structured Data |year=2013}}&lt;/ref&gt; and Kulis.&lt;ref name=survey2&gt;{{cite journal| last = Kulis | first = B.| title=Metric Learning: A Survey | journal=Foundations and Trends in Machine Learning | year=2012 | url=http://web.cse.ohio-state.edu/~kulis/pubs/ftml_metric_learning.pdf}}&lt;/ref&gt;<br /> <br /> ==See also==<br /> [[Latent semantic analysis]]<br /> <br /> == References ==<br /> {{reflist}}<br /> <br /> [[Category:Machine learning]]</div> Deepalgo https://en.wikipedia.org/w/index.php?title=Similarity_learning&diff=711006190 Similarity learning 2016-03-20T12:04:41Z <p>Deepalgo: </p> <hr /> <div>'''Similarity learning''' is an area of supervised [[machine learning]] in [[artificial intelligence]]. It is closely related to [[regression (machine learning)|regression]] and [[classification in machine learning|classification]], but the goal is to learn from examples a similarity function that measures how similar or related two objects are. It has applications in [[ranking]], in [[recommendation systems]] &lt;ref name=&quot;item2vec&quot;&gt;Barkan, O; Koenigstein, N (14 March 2016).[http://arxiv.org/abs/1603.04259 &quot;Item2Vec: Neural Item Embedding for Collaborative Filtering&quot;]. arXiv:1603.04259.&lt;/ref&gt; and face verification &lt;ref name=&quot;vmrs&quot;&gt; Barkan O, Weill J, Wolf L, Aronowitz H. Fast high dimensional vector multiplication face recognition. In Proceedings of the IEEE International Conference on Computer Vision 2013 (pp. 1960-1967).&lt;/ref&gt;.<br /> <br /> == Learning setup ==<br /> <br /> There are four common setups for similarity and metric distance learning.<br /> <br /> * ''[[Regression (machine learning)|Regression]] similarity learning''.
In this setup, pairs of objects are given &lt;math&gt; (x_i^1, x_i^2) &lt;/math&gt; together with a measure of their similarity &lt;math&gt; y_i \in R &lt;/math&gt;. The goal is to learn a function that approximates &lt;math&gt; f(x_i^1, x_i^2) \sim y_i &lt;/math&gt; for every new labeled triplet example &lt;math&gt;(x_i^1, x_i^2, y_i)&lt;/math&gt;. This is typically achieved by minimizing a regularized loss &lt;math&gt; \min_w \sum_i \operatorname{loss}(w;x_i^1, x_i^2,y_i) + \operatorname{reg}(w)&lt;/math&gt;.<br /> * ''[[Classification in machine learning|Classification]] similarity learning''. Given are pairs of similar objects &lt;math&gt;(x_i, x_i^+) &lt;/math&gt; and dissimilar objects &lt;math&gt;(x_i, x_i^-)&lt;/math&gt;. An equivalent formulation is that every pair &lt;math&gt;(x_i^1, x_i^2)&lt;/math&gt; is given together with a binary label &lt;math&gt;y_i \in \{0,1\}&lt;/math&gt; that determines if the two objects are similar or not. The goal is again to learn a classifier that can decide if a new pair of objects is similar or not.<br /> * ''Ranking similarity learning''. Given are triplets of objects &lt;math&gt;(x_i, x_i^+, x_i^-)&lt;/math&gt; whose relative similarity obeys a predefined order: &lt;math&gt;x_i&lt;/math&gt; is known to be more similar to &lt;math&gt;x_i^+&lt;/math&gt; than to &lt;math&gt;x_i^-&lt;/math&gt;. The goal is to learn a function &lt;math&gt;f&lt;/math&gt; such that for any new triplet of objects &lt;math&gt;(x, x^+, x^-)&lt;/math&gt;, it obeys &lt;math&gt;f(x, x^+) &gt; f(x, x^-)&lt;/math&gt;. This setup assumes a weaker form of supervision than in regression, because instead of providing an exact measure of similarity, one only has to provide the relative order of similarity. For this reason, ranking-based similarity learning is easier to apply in large-scale real-world applications.&lt;ref&gt;{{cite journal| last1 = Chechik | first1 = G. | last2 = Sharma | first2 = V. | last3 = Shalit | first3 = U. | last4 = Bengio | first4 = S.
| title=Large Scale Online Learning of Image Similarity Through Ranking|journal=Journal of Machine Learning research|year=2010|volume=11|pages=1109–1135|url=http://www.jmlr.org/papers/volume11/chechik10a/chechik10a.pdf}}&lt;/ref&gt;<br /> * [[Locality sensitive hashing]] - LSH &lt;ref&gt;Gionis, Aristides, Piotr Indyk, and Rajeev Motwani. &quot;Similarity search in high dimensions via hashing.&quot; VLDB. Vol. 99. No. 6. 1999.&lt;/ref&gt; [[Hash Function|hashes]] input items so that similar items map to the same “buckets” in memory with high probability (the number of buckets being much smaller than the universe of possible input items). It is often applied in nearest neighbor search on large scale high-dimensional data, e.g., image databases, document collections, time-series databases, and genome databases &lt;ref&gt;{{cite web<br /> | first1 = A.|last1=Rajaraman |first2= J.|last2=Ullman|author2-link=Jeffrey Ullman<br /> | url = http://infolab.stanford.edu/~ullman/mmds.html<br /> | title=Mining of Massive Datasets, Ch. 3.<br /> | year = 2010<br /> }}&lt;/ref&gt;<br /> <br /> A common approach for learning similarity, is to model the similarity function as a [[bilinear form]]. For example, in the case of ranking similarity learning, one aims to learn a matrix W that parametrizes the similarity function &lt;math&gt; f_W(x, z) = x^T W z &lt;/math&gt;.<br /> <br /> == Metric learning ==<br /> <br /> Similarity learning is closely related to ''distance metric learning''. Metric learning is the task of learning a distance function over objects. A [[Metric (mathematics)|metric]] or [[distance function]] has to obey four axioms: [[non-negative|non-negativity]], [[Identity of indiscernibles]], [[symmetry]] and [[subadditivity]] / triangle inequality. 
In practice, metric learning algorithms ignore the condition of identity of indiscernibles and learn a pseudo-metric.<br /> <br /> When the objects &lt;math&gt;x_i&lt;/math&gt; are vectors in &lt;math&gt;R^d&lt;/math&gt;, then any matrix &lt;math&gt;W&lt;/math&gt; in the symmetric positive semi-definite cone &lt;math&gt;S_+^d&lt;/math&gt; defines a distance pseudo-metric on the space of x through the form &lt;math&gt;D_W(x_1, x_2)^2 = (x_1-x_2)^{\top} W (x_1-x_2)&lt;/math&gt;. When &lt;math&gt;W&lt;/math&gt; is a symmetric positive definite matrix, &lt;math&gt;D_W&lt;/math&gt; is a metric. Moreover, as any symmetric positive semi-definite matrix &lt;math&gt;W \in S_+^d&lt;/math&gt; can be decomposed as &lt;math&gt;W = L^{\top}L&lt;/math&gt; where &lt;math&gt;L \in R^{e \times d}&lt;/math&gt; and &lt;math&gt;e \geq \operatorname{rank}(W)&lt;/math&gt;, the distance function &lt;math&gt;D_W&lt;/math&gt; can be rewritten equivalently as &lt;math&gt;D_W(x_1, x_2)^2 = (x_1-x_2)^{\top} L^{\top}L (x_1-x_2) = \| L (x_1-x_2) \|_2^2&lt;/math&gt;. The distance &lt;math&gt;D_W(x_1, x_2)^2=\| x_1' - x_2' \|_2^2&lt;/math&gt; thus corresponds to the Euclidean distance between the projected feature vectors &lt;math&gt;x_1'= Lx_1&lt;/math&gt; and &lt;math&gt;x_2'= Lx_2&lt;/math&gt;. <br /> Some well-known approaches for metric learning include [[Large margin nearest neighbor]]&lt;ref name=LMNN&gt;{{cite journal| last1 = Weinberger | first1 = K. Q. | last2 = Blitzer | first2 = J. C. | last3 = Saul | first3 = L. K. | title=Distance Metric Learning for Large Margin Nearest Neighbor Classification|journal=Advances in Neural Information Processing Systems |volume=18|year=2006|pages=1473–1480|url=http://books.nips.cc/papers/files/nips18/NIPS2005_0265.pdf}}&lt;/ref&gt; and information-theoretic metric learning (ITML).&lt;ref name=ITML&gt;{{cite journal | last1 = Davis | first1 = J. V. | last2 = Kulis | first2 = B. | last3 = Jain | first3 = P. | last4 = Sra | first4 = S. | last5 = Dhillon | first5 = I. S.
| title=Information-theoretic metric learning | journal=International conference in machine learning (ICML) | year=2007 | pages=209–216 | url=http://www.cs.utexas.edu/users/pjain/itml/}}&lt;/ref&gt;<br /> <br /> In [[statistics]], the [[covariance]] matrix of the data is sometimes used to define a distance metric called [[Mahalanobis distance]].<br /> <br /> == Applications ==<br /> Similarity learning is used in information retrieval for learning to rank, in face verification or face identification,&lt;ref name=GUILLAUMIN&gt;{{cite journal| last1 = Guillaumin | first1 = M. | last2 = Verbeek | first2 = J. | last3 = Schmid | first3 = C. | title=Is that you? Metric learning approaches for face identification|url=http://hal.inria.fr/docs/00/58/50/36/PDF/verbeek09iccv2.pdf|journal=IEEE International Conference on Computer Vision (ICCV)|year=2009}}&lt;/ref&gt;&lt;ref name=MIGNON&gt;{{cite journal| last1 = Mignon | first1 = A. | last2 = Jurie | first2 = F. | title=PCCA: A new approach for distance learning from sparse pairwise constraints|journal=IEEE Conference on Computer Vision and Pattern Recognition (CVPR)|year=2012|url=http://hal.archives-ouvertes.fr/docs/00/80/60/07/PDF/12_cvpr_ldca.pdf}}&lt;/ref&gt; and in [[recommendation systems]]. Also, many machine learning approaches rely on some metric. This includes unsupervised learning such as [[clustering (machine learning)|clustering]], which groups together close or similar objects. It also includes supervised approaches like the [[K-nearest neighbor algorithm]], which relies on labels of nearby objects to decide on the label of a new object. Metric learning has been proposed as a preprocessing step for many of these approaches.&lt;ref name=XING&gt;{{cite journal| last1 = Xing | first1 = E. P. | last2 = Ng | first2 = A. Y. | last3 = Jordan | first3 = M. I. | last4 = Russell | first4 = S.
| title=Distance Metric Learning, with Application to Clustering with Side-information | journal=Advances in Neural Information Processing Systems |volume=15 | year=2002| pages = 505–512 | publisher = MIT Press}}&lt;/ref&gt;<br /> <br /> == Scalability ==<br /> <br /> Metric and similarity learning naively scale quadratically with the dimension of the input space, as can easily be seen when the learned metric has a bilinear form &lt;math&gt; f_W(x, z) = x^T W z &lt;/math&gt;. Scaling to higher dimensions can be achieved by enforcing a sparseness structure over the matrix model, as done with HDSL&lt;ref name=Liu&gt;{{Cite journal| last1=Liu | last2=Bellet | last3=Sha| title=Similarity Learning for High-Dimensional Sparse Data|year=2015|journal=International Conference on Artificial Intelligence and Statistics (AISTATS)|url=http://jmlr.org/proceedings/papers/v38/liu15.pdf}}&lt;/ref&gt; and with COMET.&lt;ref&gt;{{Cite journal | last1=Atzmon | last2=Shalit | last3=Chechik | title=Learning Sparse Metrics, One Feature at a Time | journal=J. Mach. Learn. Research (JMLR)|year=2015|url=http://jmlr.org/proceedings/papers/v44/atzmon2015.pdf}}&lt;/ref&gt;<br /> <br /> == Further reading ==<br /> For further information on this topic, see the surveys on metric and similarity learning by Bellet et al.&lt;ref name=survey&gt;{{cite arXiv | last1 = Bellet | first1 = A. | last2 = Habrard | first2 = A. | last3 = Sebban | first3 = M.
|eprint=1306.6709 |class=cs.LG |title=A Survey on Metric Learning for Feature Vectors and Structured Data |year=2013}}&lt;/ref&gt; and Kulis.&lt;ref name=survey2&gt;{{cite journal| last = Kulis | first = B.| title=Metric Learning: A Survey | journal=Foundations and Trends in Machine Learning | year=2012 | url=http://web.cse.ohio-state.edu/~kulis/pubs/ftml_metric_learning.pdf}}&lt;/ref&gt;<br /> <br /> == See also ==<br /> [[Latent semantic analysis]]<br /> <br /> == References ==<br /> {{reflist}}<br /> <br /> [[Category:Machine learning]]</div> Deepalgo https://en.wikipedia.org/w/index.php?title=Collaborative_filtering&diff=710893644 Collaborative filtering 2016-03-19T18:00:29Z <p>Deepalgo: </p> <hr /> <div>{{external links|date=November 2013}}<br /> {{Use dmy dates|date=June 2013}}<br /> {{Recommender systems}}<br /> [[File:Collaborative filtering.gif|300px|thumb|<br /> <br /> This image shows an example of predicting a user's rating using [[Collaborative software|collaborative]] filtering. First, people rate different items (such as videos, images, or games). The system then makes [[prediction]]s about a user's rating for an item the user has not yet rated. These predictions are built upon the existing ratings of other users whose ratings are similar to those of the active user. For instance, here the system has predicted that the active user won't like the video.]]<br /> <br /> '''Collaborative filtering''' ('''CF''') is a technique used by some [[recommender system]]s.&lt;ref name=&quot;handbook&quot;&gt;Francesco Ricci and Lior Rokach and Bracha Shapira, [http://www.inf.unibz.it/~ricci/papers/intro-rec-sys-handbook.pdf Introduction to Recommender Systems Handbook], Recommender Systems Handbook, Springer, 2011, pp.
1-35&lt;/ref&gt; [[Collaborative software|Collaborative]] filtering has two senses, a narrow one and a more general one.&lt;ref name=recommender&gt;{{cite web|title=Beyond Recommender Systems: Helping People Help Each Other|url=http://www.grouplens.org/papers/pdf/rec-sys-overview.pdf|publisher=Addison-Wesley|accessdate=16 January 2012|page=6|year=2001|last1=Terveen|first1=Loren|last2=Hill|first2=Will|authorlink1=Loren Terveen}}&lt;/ref&gt; In general, collaborative filtering is the process of filtering for information or patterns using techniques involving collaboration among multiple agents, viewpoints, data sources, etc.&lt;ref name=&quot;recommender&quot; /&gt; Applications of collaborative filtering typically involve very large data sets. Collaborative filtering methods have been applied to many different kinds of data including: sensing and monitoring data, such as in mineral exploration, environmental sensing over large areas or multiple sensors; financial data, such as financial service institutions that integrate many financial sources; or in electronic commerce and web applications where the focus is on user data, etc. The remainder of this discussion focuses on collaborative filtering for user data, although some of the methods and approaches may apply to the other major applications as well.<br /> <br /> In the newer, narrower sense, collaborative filtering is a method of making automatic [[prediction]]s (filtering) about the interests of a user by collecting preferences or [[taste (sociology)|taste]] information from [[crowdsourcing|many users]] (collaborating). The underlying assumption of the collaborative filtering approach is that if a person ''A'' has the same opinion as a person ''B'' on an issue, A is more likely to have B's opinion on a different issue ''x'' than to have the opinion on x of a person chosen randomly. 
For example, a collaborative filtering recommendation system for [[television]] tastes could make predictions about which television show a user should like given a partial list of that user's tastes (likes or dislikes).&lt;ref&gt;[http://www.redbeemedia.com/insights/integrated-approach-tv-vod-recommendations An integrated approach to TV &amp; VOD Recommendations] {{wayback|url=http://www.redbeemedia.com/insights/integrated-approach-tv-vod-recommendations |date=20120606225352 |df=y }}&lt;/ref&gt; Note that these predictions are specific to the user, but use information gleaned from many users. This differs from the simpler approach of giving an [[average]] (non-specific) score for each item of interest, for example based on its number of [[vote]]s.<br /> <br /> ==Introduction==<br /> The [[internet growth|growth]] of the [[Internet]] has made it much more difficult to effectively [[information extraction|extract useful information]] from all the available [[online information]]. The overwhelming amount of data necessitates mechanisms for efficient [[information filtering]]. One of the techniques used for dealing with this problem is called collaborative filtering.<br /> <br /> The motivation for collaborative filtering comes from the idea that people often get the best recommendations from someone with [[similarity|similar]] tastes to themselves. Collaborative filtering explores techniques for matching people with similar interests and making [[recommendation]]s on this basis.<br /> <br /> Collaborative filtering algorithms often require (1) users’ active participation, (2) an easy way to represent users’ interests to the system, and (3) algorithms that are able to match people with similar interests.<br /> <br /> Typically, the workflow of a collaborative filtering system is:<br /> # A user expresses his or her preferences by rating items (e.g. books, movies or CDs) of the system. 
These ratings can be viewed as an approximate representation of the user's interest in the corresponding domain.<br /> # The system matches this user’s ratings against other users’ and finds the people with the most &quot;similar&quot; tastes.<br /> # Using these similar users, the system recommends items that they have rated highly but that this user has not yet rated (the absence of a rating is often taken to mean that the user is unfamiliar with the item).<br /> A key problem of collaborative filtering is how to combine and weight the preferences of user neighbors. Sometimes, users can immediately rate the recommended items. As a result, the system gains an increasingly accurate representation of user preferences over time.<br /> <br /> ==Methodology==<br /> <br /> [[File:Collaborative Filtering in Recommender Systems.jpg|thumb|Collaborative Filtering in Recommender Systems]]<br /> <br /> Collaborative filtering systems have many forms, but many common systems can be reduced to two steps:<br /> # Look for users who share the same rating patterns with the active user (the user whom the prediction is for).<br /> # Use the ratings from those like-minded users found in step 1 to calculate a prediction for the active user.<br /> This falls under the category of user-based collaborative filtering.
A specific application of this is the user-based [[K-nearest neighbor algorithm|Nearest Neighbor algorithm]].<br /> <br /> Alternatively, [[item-item collaborative filtering|item-based collaborative filtering]] (users who bought x also bought y), proceeds in an item-centric manner:<br /> # Build an item-item matrix determining relationships between pairs of items<br /> # Infer the tastes of the current user by examining the matrix and matching that user's data<br /> See, for example, the [[Slope One]] item-based collaborative filtering family.<br /> <br /> Another form of collaborative filtering can be based on implicit observations of normal user behavior (as opposed to the artificial behavior imposed by a rating task). These systems observe what a user has done together with what all users have done (what music they have listened to, what items they have bought) and use that data to predict the user's behavior in the future, or to predict how a user might like to behave given the chance. These predictions then have to be filtered through [[business logic]] to determine how they might affect the actions of a business system. For example, it is not useful to offer to sell somebody a particular album of music if they already have demonstrated that they own that music.<br /> <br /> Relying on a scoring or rating system which is averaged across all users ignores specific demands of a user, and is particularly poor in tasks where there is large variation in interest (as in the recommendation of music). However, there are other methods to combat information explosion, such as [[WWW|web]] search and [[data clustering]].<br /> <br /> ==Types==<br /> <br /> ===Memory-based===<br /> This approach uses user rating data to compute the similarity between users or items. This is used for making recommendations. This was an early approach used in many commercial systems. It's effective and easy to implement. 
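The item-centric procedure described above (&quot;users who bought x also bought y&quot;) can be sketched with a toy co-occurrence count standing in for the item-item matrix; the basket data and function names here are illustrative only:

```python
from collections import defaultdict

# Toy purchase histories (illustrative only).
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"butter", "jam"},
]

# Step 1: build an item-item matrix of co-occurrence counts.
cooc = defaultdict(int)
for basket in baskets:
    for a in basket:
        for b in basket:
            if a != b:
                cooc[(a, b)] += 1

def also_bought(item):
    """Step 2: rank other items by how often they co-occur with `item`."""
    scores = {b: n for (a, b), n in cooc.items() if a == item}
    return sorted(scores, key=scores.get, reverse=True)

print(also_bought("bread"))  # ['butter', 'jam']
```

Real item-based systems replace the raw counts with a similarity measure (e.g. cosine over item rating vectors), but the two-step structure is the same.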
Typical examples of this approach are neighbourhood-based CF and item-based/user-based top-N recommendations. For example, in user-based approaches, the rating that user 'u' gives to item 'i' is calculated as an [[aggregation]] of similar users' ratings of the item:<br /> :&lt;math&gt;r_{u,i} = \operatorname{aggr}_{u^\prime \in U} r_{u^\prime, i}&lt;/math&gt;<br /> <br /> where 'U' denotes the set of top 'N' users that are most similar to user 'u' who rated item 'i'. Some examples of the aggregation function include:<br /> :&lt;math&gt;r_{u,i} = \frac{1}{N}\sum\limits_{u^\prime \in U}r_{u^\prime, i}&lt;/math&gt;<br /> :&lt;math&gt;r_{u,i} = k\sum\limits_{u^\prime \in U}\operatorname{simil}(u,u^\prime)r_{u^\prime, i}&lt;/math&gt;<br /> :&lt;math&gt;r_{u,i} = \bar{r_u} + k\sum\limits_{u^\prime \in U}\operatorname{simil}(u,u^\prime)(r_{u^\prime, i}-\bar{r_{u^\prime}} )&lt;/math&gt;<br /> <br /> where k is a normalizing factor defined as &lt;math&gt;k =1/\sum_{u^\prime \in U}|\operatorname{simil}(u,u^\prime)| &lt;/math&gt;, and &lt;math&gt;\bar{r_u}&lt;/math&gt; is the average rating of user u over all the items rated by u.<br /> <br /> The neighborhood-based algorithm calculates the similarity between two users or items and produces a prediction for the user by taking the [[weighted average]] of all the ratings. Similarity computation between items or users is an important part of this approach.
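The third (mean-centered) aggregation above can be sketched directly; in this minimal example the neighbour similarities are supplied as inputs rather than computed:

```python
def predict_rating(user_mean, neighbours):
    """Mean-centered weighted aggregation:
    r_ui = mean_u + k * sum(simil * (r - neighbour_mean)),
    with k = 1 / sum(|simil|).
    `neighbours` is a list of (similarity, rating_of_item, neighbour_mean) tuples.
    """
    denom = sum(abs(sim) for sim, _, _ in neighbours)
    if denom == 0:
        return user_mean  # no informative neighbours: fall back to the user's mean
    k = 1.0 / denom       # normalizing factor
    return user_mean + k * sum(sim * (r - mean) for sim, r, mean in neighbours)

# One perfectly similar neighbour who rated the item one point above their own
# mean pulls the prediction one point above the active user's mean.
print(predict_rating(3.0, [(1.0, 5.0, 4.0)]))  # 4.0
```

Centering each neighbour's rating on that neighbour's own mean compensates for users who rate systematically high or low.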
Multiple measures, such as [[Pearson product-moment correlation coefficient|Pearson correlation]] and [[Cosine similarity|vector cosine]] based similarity are used for this.<br /> <br /> The Pearson correlation similarity of two users x, y is defined as <br /> :&lt;math&gt; \operatorname{simil}(x,y) = \frac{\sum\limits_{i \in I_{xy}}(r_{x,i}-\bar{r_x})(r_{y,i}-\bar{r_y})}{\sqrt{\sum\limits_{i \in I_{xy}}(r_{x,i}-\bar{r_x})^2\sum\limits_{i \in I_{xy}}(r_{y,i}-\bar{r_y})^2}} &lt;/math&gt;<br /> <br /> where I&lt;sub&gt;xy&lt;/sub&gt; is the set of items rated by both user x and user y.<br /> <br /> The cosine-based approach defines the cosine-similarity between two users x and y as:&lt;ref name=&quot;Breese1999&quot;&gt;John S. Breese, David Heckerman, and Carl Kadie, [http://uai.sis.pitt.edu/displayArticleDetails.jsp?mmnu=1&amp;smnu=2&amp;article_id=231&amp;proceeding_id=14 Empirical Analysis of Predictive Algorithms for Collaborative Filtering], 1998 {{wayback|url=http://uai.sis.pitt.edu/displayArticleDetails.jsp?mmnu=1&amp;smnu=2&amp;article_id=231&amp;proceeding_id=14 |date=20131019134152 |df=y }}&lt;/ref&gt;<br /> :&lt;math&gt;\operatorname{simil}(x,y) = \cos(\vec x,\vec y) = \frac{\vec x \cdot \vec y}{||\vec x|| \times ||\vec y||} = \frac{\sum\limits_{i \in I_{xy}}r_{x,i}r_{y,i}}{\sqrt{\sum\limits_{i \in I_{x}}r_{x,i}^2}\sqrt{\sum\limits_{i \in I_{y}}r_{y,i}^2}}&lt;/math&gt;<br /> <br /> The user based top-N recommendation algorithm uses a similarity-based vector model to identify the k most similar users to an active user. After the k most similar users are found, their corresponding user-item matrices are aggregated to identify the set of items to be recommended. 
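Both similarity measures defined above can be sketched in plain Python; the toy ratings are illustrative, and the user means are taken over each user's own rated items, matching the definition of the average rating given earlier:

```python
from math import sqrt

def pearson_similarity(rx, ry):
    """Pearson correlation over I_xy, the items rated by both users."""
    common = set(rx) & set(ry)
    mx = sum(rx.values()) / len(rx)   # each user's mean over their own ratings
    my = sum(ry.values()) / len(ry)
    num = sum((rx[i] - mx) * (ry[i] - my) for i in common)
    den = sqrt(sum((rx[i] - mx) ** 2 for i in common)) * \
          sqrt(sum((ry[i] - my) ** 2 for i in common))
    return num / den if den else 0.0

def cosine_similarity(rx, ry):
    """Cosine of the two rating vectors (co-rated items in the numerator)."""
    common = set(rx) & set(ry)
    num = sum(rx[i] * ry[i] for i in common)
    den = sqrt(sum(v * v for v in rx.values())) * \
          sqrt(sum(v * v for v in ry.values()))
    return num / den if den else 0.0

x = {"matrix": 5, "titanic": 3, "up": 4}
y = {"matrix": 4, "titanic": 2, "up": 5}
print(pearson_similarity(x, y), cosine_similarity(x, y))  # both positive here
```

Note that cosine similarity on raw ratings is dominated by rating magnitude, while Pearson correlation measures how two users' deviations from their own means co-vary.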
A popular method for finding similar users is [[Locality-sensitive hashing]], which implements an approximate [[Nearest neighbor search|nearest neighbor mechanism]] in linear time.<br /> <br /> The advantages with this approach include: the explainability of the results, which is an important aspect of recommendation systems; easy creation and use; easy facilitation of new data; content-independence of the items being recommended; good scaling with co-rated items.<br /> <br /> There are also several disadvantages with this approach. Its performance decreases when [[sparsity|data gets sparse]], which occurs frequently with web-related items. This hinders the [[scalability]] of this approach and creates problems with large datasets. Although it can efficiently handle new users because it relies on a [[data structure]], adding new items becomes more complicated since that representation usually relies on a specific [[vector space]]. Adding new items requires inclusion of the new item and the re-insertion of all the elements in the structure.<br /> <br /> ===Model-based===<br /> Models are developed using [[data mining]] and [[machine learning]] algorithms to find patterns in training data. These models are then used to make predictions on real data. There are many model-based CF algorithms, including [[Bayesian networks]], neural embedding models,&lt;ref name=&quot;item2vec&quot;&gt;Barkan, O; Koenigstein, N (14 March 2016). &quot;Item2Vec: Neural Item Embedding for Collaborative Filtering&quot;. arXiv:1603.04259.&lt;/ref&gt; [[Cluster Analysis|clustering models]], [[Latent Semantic Indexing|latent semantic models]] such as [[singular value decomposition]], [[probabilistic latent semantic analysis]], multiple multiplicative factor, [[latent Dirichlet allocation]] and [[Markov decision process]] based models.&lt;ref name=&quot;Suetal2009&quot;&gt;Xiaoyuan Su, Taghi M.
Khoshgoftaar, [http://www.hindawi.com/journals/aai/2009/421425/ A survey of collaborative filtering techniques], Advances in Artificial Intelligence archive, 2009.&lt;/ref&gt;<br /> <br /> This approach has the more holistic goal of uncovering latent factors that explain the observed ratings.&lt;ref&gt;[http://research.yahoo.com/pub/2435 Factor in the Neighbors: Scalable and Accurate Collaborative Filtering] {{wayback|url=http://research.yahoo.com/pub/2435 |date=20101023032716 |df=y }}&lt;/ref&gt; Most of the models are based on creating a classification or clustering technique to identify the user based on the training set. The number of parameters can be reduced using techniques such as [[Principal Component Analysis|principal component analysis]].<br /> <br /> There are several advantages with this paradigm. It handles sparsity better than memory-based approaches, which helps with scalability on large data sets. It improves prediction performance and gives an intuitive rationale for the recommendations.<br /> <br /> The main disadvantage of this approach is the expense of model building. One must trade off prediction performance against scalability; useful information can be lost through dimensionality reduction; and a number of models have difficulty explaining their predictions.<br /> <br /> ===Hybrid===<br /> A number of applications combine the memory-based and the model-based CF algorithms. These hybrids overcome the limitations of native CF approaches and improve prediction performance. Importantly, they overcome CF problems such as sparsity and loss of information.
However, they have increased complexity and are expensive to implement.&lt;ref&gt;{{cite journal | url = http://www.sciencedirect.com/science/article/pii/S0020025512002587 | doi=10.1016/j.ins.2012.04.012 | volume=208 | title=Kernel-Mapping Recommender system algorithms | journal=Information Sciences | pages=81–104}}<br /> &lt;/ref&gt; Most commercial recommender systems are hybrid, for example, the Google News recommender system.&lt;ref&gt;{{cite web|url=http://dl.acm.org/citation.cfm?id=1242610|title=Google news personalization|publisher=}}&lt;/ref&gt;<br /> <br /> ==Application on social web==<br /> Unlike the traditional model of mainstream media, in which there are few editors who set guidelines, collaboratively filtered social media can have a very large number of editors, and content improves as the number of participants increases. Services like [[Reddit]], [[YouTube]], and [[Last.fm]] are typical examples of collaborative filtering-based media.&lt;ref&gt;[http://www.readwriteweb.com/archives/collaborative_filtering_social_web.php Collaborative Filtering: Lifeblood of The Social Web]&lt;/ref&gt;<br /> <br /> One scenario of collaborative filtering application is to recommend interesting or popular information as judged by the community. As a typical example, stories appear on the front page of [[Digg]] as they are &quot;voted up&quot; (rated positively) by the community. As the community becomes larger and more diverse, the promoted stories can better reflect the average interest of the community members.<br /> <br /> Another aspect of collaborative filtering systems is the ability to generate more personalized recommendations by analyzing information from the past activity of a specific user, or the history of other users deemed to be of similar taste to a given user. These resources are used for user profiling and help the site recommend content on a user-by-user basis.
The more a given user makes use of the system, the better the recommendations become, as the system gains data to improve its model of that user.<br /> <br /> ===Problems===<br /> A collaborative filtering system does not necessarily succeed in automatically matching content to one's preferences. Unless the platform achieves unusually good diversity and independence of opinions, one point of view will always dominate another in a particular community. As in the personalized recommendation scenario, the introduction of new users or new items can cause the [[cold start]] problem, as there will be insufficient data on these new entries for the collaborative filtering to work accurately. In order to make appropriate recommendations for a new user, the system must first learn the user's preferences by analysing past voting or rating activities. The collaborative filtering system requires a substantial number of users to rate a new item before that item can be recommended.<br /> <br /> ==Challenges of collaborative filtering==<br /> <br /> ===Data sparsity===<br /> In practice, many commercial recommender systems are based on large datasets. As a result, the user-item matrix used for collaborative filtering can be extremely large and sparse, which poses challenges for recommendation performance.<br /> <br /> One typical problem caused by data sparsity is the [[cold start]] problem. As collaborative filtering methods recommend items based on users’ past preferences, new users need to rate a sufficient number of items to enable the system to capture their preferences accurately and thus provide reliable recommendations.<br /> <br /> New items face the same problem. When new items are added to the system, they need to be rated by a substantial number of users before they can be recommended to users with tastes similar to those of the users who rated them.
The new item problem does not affect [[Content-based filtering|content-based recommendation]], because the recommendation of an item is based on its discrete set of descriptive qualities rather than on its ratings.<br /> <br /> ===Scalability===<br /> As the numbers of users and items grow, traditional CF algorithms will suffer serious scalability problems{{Citation needed|date=April 2013}}. For example, with tens of millions of customers &lt;math&gt;O(M)&lt;/math&gt; and millions of items &lt;math&gt;O(N)&lt;/math&gt;, even a CF algorithm with complexity &lt;math&gt;O(n)&lt;/math&gt; is already too large. In addition, many systems need to react immediately to online requirements and make recommendations for all users regardless of their purchase and rating history, which demands even higher scalability of a CF system. Large web companies such as Twitter use clusters of machines to scale recommendations for their millions of users, with most computations happening in very large memory machines.&lt;ref name=&quot;twitterwtf&quot;&gt;Pankaj Gupta, Ashish Goel, Jimmy Lin, Aneesh Sharma, Dong Wang, and Reza Bosagh Zadeh [http://dl.acm.org/citation.cfm?id=2488433 WTF: The who-to-follow system at Twitter], Proceedings of the 22nd international conference on World Wide Web&lt;/ref&gt;<br /> <br /> Recently, a method named [[arxiv:1603.04259|Item2Vec]]&lt;ref name=item2vec /&gt; was introduced for scalable item-based collaborative filtering. Item2Vec produces embeddings for items in a latent space and is capable of inferring item-to-item relations even when user information is not available.<br /> <br /> ===Synonyms===<br /> [[Synonyms]] refers to the tendency of a number of the same or very similar items to have different names or entries.
Most recommender systems are unable to discover this latent association and thus treat these products differently.<br /> <br /> For example, the seemingly different items &quot;children movie&quot; and &quot;children film&quot; are actually referring to the same item. Indeed, the degree of variability in descriptive term usage is greater than commonly suspected.{{citation needed|date=September 2013}} The prevalence of synonyms decreases the recommendation performance of CF systems. Topic Modeling (like the [[Latent Dirichlet Allocation]] technique) could solve this by grouping different words belonging to the same topic.{{citation needed|date=September 2013}}<br /> <br /> ===Gray sheep===<br /> Gray sheep refers to the users whose opinions do not consistently agree or disagree with any group of people and thus do not benefit from collaborative filtering. [[Black sheep]] are the opposite group whose idiosyncratic tastes make recommendations nearly impossible. Although this is a failure of the recommender system, non-electronic recommenders also have great problems in these cases, so black sheep is an acceptable failure.<br /> <br /> ===Shilling attacks===<br /> In a recommendation system where everyone can give the ratings, people may give lots of positive ratings for their own items and negative ratings for their competitors. It is often necessary for the collaborative filtering systems to introduce precautions to discourage such kind of manipulations.<br /> <br /> ===Diversity and the Long Tail===<br /> Collaborative filters are expected to increase diversity because they help us discover new products. Some algorithms, however, may unintentionally do the opposite. Because collaborative filters recommend products based on past sales or ratings, they cannot usually recommend products with limited historical data. This can create a rich-get-richer effect for popular products, akin to [[positive feedback]]. 
This bias toward popularity can prevent what are otherwise better consumer-product matches. A [[Wharton School of the University of Pennsylvania|Wharton]] study details this phenomenon along with several ideas that may promote diversity and the &quot;[[long tail]].&quot;&lt;ref&gt;{{cite journal| last1= Fleder | first1= Daniel | first2= Kartik |last2= Hosanagar | title=Blockbuster Culture's Next Rise or Fall: The Impact of Recommender Systems on Sales Diversity|journal=Management Science |date=May 2009|url=http://papers.ssrn.com/sol3/papers.cfm?abstract_id=955984 | doi = 10.1287/mnsc.1080.0974 }}&lt;/ref&gt; Several collaborative filtering algorithms have been developed to promote diversity and the &quot;[[long tail]]&quot; by recommending novel, unexpected,&lt;ref&gt;{{cite journal| last1= Adamopoulos | first1= Panagiotis | first2= Alexander |last2= Tuzhilin | title=On Unexpectedness in Recommender Systems: Or How to Better Expect the Unexpected|journal=ACM Transactions on Intelligent Systems and Technology |date=January 2015|url=http://dl.acm.org/citation.cfm?id=2559952 | doi = 10.1145/2559952}}&lt;/ref&gt; and serendipitous items.&lt;ref&gt;{{cite journal| last1= Adamopoulos | first1= Panagiotis | title=Beyond rating prediction accuracy: on new perspectives in recommender systems|journal=Proceedings of the 7th ACM conference on Recommender systems |date=October 2013|url=http://dl.acm.org/citation.cfm?id=2508073| doi = 10.1145/2507157.2508073}}&lt;/ref&gt;<br /> <br /> ==Innovations==<br /> {{Prose|date=May 2012}}<br /> * New algorithms have been developed for CF as a result of the [[Netflix prize]].<br /> * Cross-System Collaborative Filtering where user profiles across multiple [[recommender systems]] are combined in a privacy preserving manner.<br /> * Robust Collaborative Filtering, where recommendation is stable towards efforts of manipulation. 
</div> Deepalgo https://en.wikipedia.org/w/index.php?title=Collaborative_filtering&diff=710695406 Collaborative filtering 2016-03-18T14:13:34Z <p>Deepalgo: Added a reference for a method for a scalable item-based CF</p> <hr /> <div>{{external links|date=November 2013}}<br /> {{Use dmy dates|date=June 2013}}<br /> {{Recommender systems}}<br /> [[File:Collaborative filtering.gif|300px|thumb|<br /> <br /> This image shows an example of predicting a user's rating using [[Collaborative software|collaborative]] filtering. First, people rate different items (such as videos, images, and games). The system then makes [[prediction]]s about a user's rating for an item that the user has not yet rated.
These predictions are built upon the existing ratings of other users who have similar ratings to those of the active user. In this example, the system has predicted that the active user will not like the video.]]<br /> <br /> '''Collaborative filtering''' ('''CF''') is a technique used by some [[recommender system]]s.&lt;ref name=&quot;handbook&quot;&gt;Francesco Ricci and Lior Rokach and Bracha Shapira, [http://www.inf.unibz.it/~ricci/papers/intro-rec-sys-handbook.pdf Introduction to Recommender Systems Handbook], Recommender Systems Handbook, Springer, 2011, pp. 1-35&lt;/ref&gt; [[Collaborative software|Collaborative]] filtering has two senses, a narrow one and a more general one.&lt;ref name=recommender&gt;{{cite web|title=Beyond Recommender Systems: Helping People Help Each Other|url=http://www.grouplens.org/papers/pdf/rec-sys-overview.pdf|publisher=Addison-Wesley|accessdate=16 January 2012|page=6|year=2001|last1=Terveen|first1=Loren|last2=Hill|first2=Will|authorlink1=Loren Terveen}}&lt;/ref&gt; In general, collaborative filtering is the process of filtering for information or patterns using techniques involving collaboration among multiple agents, viewpoints, and data sources.&lt;ref name=&quot;recommender&quot; /&gt; Applications of collaborative filtering typically involve very large data sets. Collaborative filtering methods have been applied to many different kinds of data, including sensing and monitoring data (as in mineral exploration or environmental sensing over large areas or with multiple sensors), financial data (as when financial service institutions integrate many financial sources), and user data in electronic commerce and web applications.
The remainder of this discussion focuses on collaborative filtering for user data, although some of the methods and approaches may apply to the other major applications as well.<br /> <br /> In the newer, narrower sense, collaborative filtering is a method of making automatic [[prediction]]s (filtering) about the interests of a user by collecting preferences or [[taste (sociology)|taste]] information from [[crowdsourcing|many users]] (collaborating). The underlying assumption of the collaborative filtering approach is that if a person ''A'' has the same opinion as a person ''B'' on an issue, A is more likely to have B's opinion on a different issue ''x'' than to have the opinion on x of a person chosen randomly. For example, a collaborative filtering recommendation system for [[television]] tastes could make predictions about which television show a user should like given a partial list of that user's tastes (likes or dislikes).&lt;ref&gt;[http://www.redbeemedia.com/insights/integrated-approach-tv-vod-recommendations An integrated approach to TV &amp; VOD Recommendations] {{wayback|url=http://www.redbeemedia.com/insights/integrated-approach-tv-vod-recommendations |date=20120606225352 |df=y }}&lt;/ref&gt; Note that these predictions are specific to the user, but use information gleaned from many users. This differs from the simpler approach of giving an [[average]] (non-specific) score for each item of interest, for example based on its number of [[vote]]s.<br /> <br /> ==Introduction==<br /> The [[internet growth|growth]] of the [[Internet]] has made it much more difficult to effectively [[information extraction|extract useful information]] from all the available [[online information]]. The overwhelming amount of data necessitates mechanisms for efficient [[information filtering]]. 
One of the techniques used for dealing with this problem is called collaborative filtering.<br /> <br /> The motivation for collaborative filtering comes from the idea that people often get the best recommendations from someone with [[similarity|similar]] tastes to themselves. Collaborative filtering explores techniques for matching people with similar interests and making [[recommendation]]s on this basis.<br /> <br /> Collaborative filtering algorithms often require (1) users’ active participation, (2) an easy way to represent users’ interests to the system, and (3) algorithms that are able to match people with similar interests.<br /> <br /> Typically, the workflow of a collaborative filtering system is:<br /> # A user expresses his or her preferences by rating items (e.g. books, movies or CDs) in the system. These ratings can be viewed as an approximate representation of the user's interest in the corresponding domain.<br /> # The system matches this user’s ratings against other users’ and finds the people with the most &quot;similar&quot; tastes.<br /> # Using those similar users, the system recommends items that they have rated highly but that this user has not yet rated (the absence of a rating is often taken to indicate unfamiliarity with an item).<br /> A key problem of collaborative filtering is how to combine and weight the preferences of user neighbors. Sometimes, users can immediately rate the recommended items.
As a result, the system gains an increasingly accurate representation of user preferences over time.<br /> <br /> ==Methodology==<br /> <br /> [[File:Collaborative Filtering in Recommender Systems.jpg|thumb|Collaborative Filtering in Recommender Systems]]<br /> <br /> Collaborative filtering systems have many forms, but many common systems can be reduced to two steps:<br /> # Look for users who share the same rating patterns with the active user (the user whom the prediction is for).<br /> # Use the ratings from those like-minded users found in step 1 to calculate a prediction for the active user.<br /> This falls under the category of user-based collaborative filtering. A specific application of this is the user-based [[K-nearest neighbor algorithm|Nearest Neighbor algorithm]].<br /> <br /> Alternatively, [[item-item collaborative filtering|item-based collaborative filtering]] (users who bought ''x'' also bought ''y'') proceeds in an item-centric manner:<br /> # Build an item-item matrix determining relationships between pairs of items.<br /> # Infer the tastes of the current user by examining the matrix and matching that user's data.<br /> See, for example, the [[Slope One]] item-based collaborative filtering family.<br /> <br /> Another form of collaborative filtering can be based on implicit observations of normal user behavior (as opposed to the artificial behavior imposed by a rating task). These systems observe what a user has done together with what all users have done (what music they have listened to, what items they have bought) and use that data to predict the user's behavior in the future, or to predict how a user might like to behave given the chance. These predictions then have to be filtered through [[business logic]] to determine how they might affect the actions of a business system.
For example, it is not useful to offer to sell somebody a particular album of music if they have already demonstrated that they own that music.<br /> <br /> Relying on a scoring or rating system which is averaged across all users ignores specific demands of a user, and is particularly poor in tasks where there is large variation in interest (as in the recommendation of music). However, there are other methods to combat information explosion, such as [[WWW|web]] search and [[data clustering]].<br /> <br /> ==Types==<br /> <br /> ===Memory-based===<br /> This approach uses user rating data to compute the similarity between users or items, which is then used to make recommendations. This was an early approach used in many commercial systems. It is effective and easy to implement. Typical examples of this approach are neighbourhood-based CF and item-based/user-based top-N recommendations. For example, in user-based approaches, the rating that user 'u' gives to item 'i' is calculated as an [[aggregation]] of some similar users' ratings of the item:<br /> :&lt;math&gt;r_{u,i} = \operatorname{aggr}_{u^\prime \in U} r_{u^\prime, i}&lt;/math&gt;<br /> <br /> where 'U' denotes the set of top 'N' users that are most similar to user 'u' who rated item 'i'. Some examples of the aggregation function include:<br /> :&lt;math&gt;r_{u,i} = \frac{1}{N}\sum\limits_{u^\prime \in U}r_{u^\prime, i}&lt;/math&gt;<br /> :&lt;math&gt;r_{u,i} = k\sum\limits_{u^\prime \in U}\operatorname{simil}(u,u^\prime)r_{u^\prime, i}&lt;/math&gt;<br /> :&lt;math&gt;r_{u,i} = \bar{r_u} + k\sum\limits_{u^\prime \in U}\operatorname{simil}(u,u^\prime)(r_{u^\prime, i}-\bar{r_{u^\prime}} )&lt;/math&gt;<br /> <br /> where k is a normalizing factor defined as &lt;math&gt;k = 1/\sum_{u^\prime \in U}|\operatorname{simil}(u,u^\prime)|&lt;/math&gt;,
and &lt;math&gt;\bar{r_u}&lt;/math&gt; is the average rating of user u for all the items rated by u.<br /> <br /> The neighborhood-based algorithm calculates the similarity between two users or items and then produces a prediction for the active user by taking the [[weighted average]] of all the ratings. Similarity computation between items or users is an important part of this approach. Multiple measures, such as [[Pearson product-moment correlation coefficient|Pearson correlation]] and [[Cosine similarity|vector cosine]]-based similarity, are used for this.<br /> <br /> The Pearson correlation similarity of two users x and y is defined as <br /> :&lt;math&gt; \operatorname{simil}(x,y) = \frac{\sum\limits_{i \in I_{xy}}(r_{x,i}-\bar{r_x})(r_{y,i}-\bar{r_y})}{\sqrt{\sum\limits_{i \in I_{xy}}(r_{x,i}-\bar{r_x})^2\sum\limits_{i \in I_{xy}}(r_{y,i}-\bar{r_y})^2}} &lt;/math&gt;<br /> <br /> where I&lt;sub&gt;xy&lt;/sub&gt; is the set of items rated by both user x and user y.<br /> <br /> The cosine-based approach defines the cosine-similarity between two users x and y as:&lt;ref name=&quot;Breese1999&quot;&gt;John S. Breese, David Heckerman, and Carl Kadie, [http://uai.sis.pitt.edu/displayArticleDetails.jsp?mmnu=1&amp;smnu=2&amp;article_id=231&amp;proceeding_id=14 Empirical Analysis of Predictive Algorithms for Collaborative Filtering], 1998 {{wayback|url=http://uai.sis.pitt.edu/displayArticleDetails.jsp?mmnu=1&amp;smnu=2&amp;article_id=231&amp;proceeding_id=14 |date=20131019134152 |df=y }}&lt;/ref&gt;<br /> :&lt;math&gt;\operatorname{simil}(x,y) = \cos(\vec x,\vec y) = \frac{\vec x \cdot \vec y}{||\vec x|| \times ||\vec y||} = \frac{\sum\limits_{i \in I_{xy}}r_{x,i}r_{y,i}}{\sqrt{\sum\limits_{i \in I_{x}}r_{x,i}^2}\sqrt{\sum\limits_{i \in I_{y}}r_{y,i}^2}}&lt;/math&gt;<br /> <br /> The user-based top-N recommendation algorithm uses a similarity-based vector model to identify the k most similar users to an active user.
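The Pearson similarity and the mean-centered aggregation above can be sketched in plain Python. This is a minimal illustration, not the implementation of any particular system; the nested `ratings` dictionary and all names are hypothetical, and, as defined above, each user's mean is taken over all items that user rated.

```python
from math import sqrt

def mean(r):
    """Average rating of a user over all items they rated."""
    return sum(r.values()) / len(r)

def pearson(ratings, x, y):
    """Pearson correlation over the items rated by both x and y (I_xy)."""
    common = set(ratings[x]) & set(ratings[y])
    if not common:
        return 0.0
    mx, my = mean(ratings[x]), mean(ratings[y])
    num = sum((ratings[x][i] - mx) * (ratings[y][i] - my) for i in common)
    den = sqrt(sum((ratings[x][i] - mx) ** 2 for i in common)) * \
          sqrt(sum((ratings[y][i] - my) ** 2 for i in common))
    return num / den if den else 0.0

def predict(ratings, u, item, k=2):
    """Mean-centered, similarity-weighted aggregation (the third
    aggregation formula) over the k most similar users who rated the item."""
    neighbours = sorted((v for v in ratings if v != u and item in ratings[v]),
                        key=lambda v: pearson(ratings, u, v), reverse=True)[:k]
    norm = sum(abs(pearson(ratings, u, v)) for v in neighbours)
    if not norm:
        return mean(ratings[u])
    return mean(ratings[u]) + sum(
        pearson(ratings, u, v) * (ratings[v][item] - mean(ratings[v]))
        for v in neighbours) / norm

# Hypothetical toy data: predict user u's rating for item "c".
ratings = {"u": {"a": 4, "b": 2}, "v": {"a": 4, "b": 2, "c": 5}}
prediction = predict(ratings, "u", "c")  # 3 + 4/3, about 4.33
```

In practice the similarity values would be precomputed rather than recomputed per call, and ''k'' would be much larger than in this toy example.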
After the k most similar users are found, their corresponding user-item matrices are aggregated to identify the set of items to be recommended. A popular method to find similar users is [[Locality-sensitive hashing]], which implements the [[Nearest neighbor search|nearest neighbor mechanism]] in linear time.<br /> <br /> The advantages of this approach include the explainability of the results, which is an important aspect of recommendation systems; easy creation and use; easy incorporation of new data; content-independence of the items being recommended; and good scaling with co-rated items.<br /> <br /> There are also several disadvantages of this approach. Its performance decreases when [[sparsity|data gets sparse]], which occurs frequently with web-related items. This hinders the [[scalability]] of this approach and creates problems with large datasets. Although it can efficiently handle new users because it relies on a [[data structure]], adding new items becomes more complicated since that representation usually relies on a specific [[vector space]]. Adding new items requires inclusion of the new item and the re-insertion of all the elements in the structure.<br /> <br /> ===Model-based===<br /> Models are developed using [[data mining]] and [[machine learning]] algorithms to find patterns in training data, which are then used to make predictions on real data. There are many model-based CF algorithms. These include [[Bayesian networks]], [[Cluster Analysis|clustering models]], [[Latent Semantic Indexing|latent semantic models]] such as [[singular value decomposition]], [[probabilistic latent semantic analysis]], multiple multiplicative factor, [[latent Dirichlet allocation]] and [[Markov decision process]] based models.&lt;ref name=&quot;Suetal2009&quot;&gt;Xiaoyuan Su, Taghi M.
Khoshgoftaar, [http://www.hindawi.com/journals/aai/2009/421425/ A survey of collaborative filtering techniques], Advances in Artificial Intelligence archive, 2009.&lt;/ref&gt;<br /> <br /> This approach has the more holistic goal of uncovering the latent factors that explain observed ratings.&lt;ref&gt;[http://research.yahoo.com/pub/2435 Factor in the Neighbors: Scalable and Accurate Collaborative Filtering] {{wayback|url=http://research.yahoo.com/pub/2435 |date=20101023032716 |df=y }}&lt;/ref&gt; Most of the models are based on a classification or clustering technique that characterizes the user from the training set. The number of parameters can be reduced using techniques such as [[Principal Component Analysis|principal component analysis]].<br /> <br /> This paradigm has several advantages: it handles sparsity better than memory-based approaches, which also helps with scalability on large data sets; it improves prediction performance; and it gives an intuitive rationale for the recommendations. Its disadvantages lie in the expense of model building: there is a tradeoff between prediction performance and scalability, reduction models can lose useful information, and a number of models have difficulty explaining their predictions.<br /> <br /> ===Hybrid===<br /> A number of applications combine the memory-based and the model-based CF algorithms. These hybrids overcome the limitations of native CF approaches, improve prediction performance, and, importantly, overcome CF problems such as sparsity and loss of information.
However, they have increased complexity and are expensive to implement.&lt;ref&gt;{{cite journal | url = http://www.sciencedirect.com/science/article/pii/S0020025512002587 | doi=10.1016/j.ins.2012.04.012 | volume=208 | title=Kernel-Mapping Recommender system algorithms | journal=Information Sciences | pages=81–104}}<br /> &lt;/ref&gt; Most commercial recommender systems are in fact hybrid; the Google News recommender system is one example.&lt;ref&gt;{{cite web|url=http://dl.acm.org/citation.cfm?id=1242610|title=Google news personalization|publisher=}}&lt;/ref&gt;<br /> <br /> ==Application on social web==<br /> Unlike the traditional model of mainstream media, in which there are few editors who set guidelines, collaboratively filtered social media can have a very large number of editors, and content improves as the number of participants increases. Services like [[Reddit]], [[YouTube]], and [[Last.fm]] are typical examples of collaborative-filtering-based media.&lt;ref&gt;[http://www.readwriteweb.com/archives/collaborative_filtering_social_web.php Collaborative Filtering: Lifeblood of The Social Web]&lt;/ref&gt;<br /> <br /> One common application of collaborative filtering is to recommend interesting or popular information as judged by the community. As a typical example, stories appear on the front page of [[Digg]] as they are &quot;voted up&quot; (rated positively) by the community. As the community becomes larger and more diverse, the promoted stories can better reflect the average interest of the community members.<br /> <br /> Another aspect of collaborative filtering systems is the ability to generate more personalized recommendations by analyzing information from the past activity of a specific user, or the history of other users deemed to be of similar taste to a given user. These resources are used for user profiling and help the site recommend content on a user-by-user basis.
The more a given user makes use of the system, the better the recommendations become, as the system gains data to improve its model of that user.<br /> <br /> ===Problems===<br /> A collaborative filtering system does not necessarily succeed in automatically matching content to one's preferences. Unless the platform achieves unusually good diversity and independence of opinions, one point of view will always dominate another in a particular community. As in the personalized recommendation scenario, the introduction of new users or new items can cause the [[cold start]] problem, as there will be insufficient data on these new entries for the collaborative filtering to work accurately. In order to make appropriate recommendations for a new user, the system must first learn the user's preferences by analysing past voting or rating activities. The collaborative filtering system requires a substantial number of users to rate a new item before that item can be recommended.<br /> <br /> ==Challenges of collaborative filtering==<br /> <br /> ===Data sparsity===<br /> In practice, many commercial recommender systems are based on large datasets. As a result, the user-item matrix used for collaborative filtering can be extremely large and sparse, which poses challenges for the performance of the recommendations.<br /> <br /> One typical problem caused by data sparsity is the [[cold start]] problem. As collaborative filtering methods recommend items based on users’ past preferences, new users need to rate a sufficient number of items to enable the system to capture their preferences accurately and thus provide reliable recommendations.<br /> <br /> New items have the same problem: when they are added to the system, they need to be rated by a substantial number of users before they can be recommended to users whose tastes are similar to those of the users who rated them.
The new item problem does not affect [[Content-based filtering|content-based recommendation]], because the recommendation of an item is based on its discrete set of descriptive qualities rather than its ratings.<br /> <br /> ===Scalability===<br /> As the numbers of users and items grow, traditional CF algorithms will suffer serious scalability problems{{Citation needed|date=April 2013}}. For example, with tens of millions of customers (&lt;math&gt;O(M)&lt;/math&gt;) and millions of items (&lt;math&gt;O(N)&lt;/math&gt;), a CF algorithm with complexity even linear in these quantities is already too expensive. In addition, many systems need to react immediately to online requests and make recommendations for all users regardless of their purchase and rating history, which demands high scalability of a CF system. Large web companies such as Twitter use clusters of machines to scale recommendations for their millions of users, with most computations happening in very large memory machines.&lt;ref name=&quot;twitterwtf&quot;&gt;Pankaj Gupta, Ashish Goel, Jimmy Lin, Aneesh Sharma, Dong Wang, and Reza Bosagh Zadeh [http://dl.acm.org/citation.cfm?id=2488433 WTF: The who-to-follow system at Twitter], Proceedings of the 22nd international conference on World Wide Web&lt;/ref&gt;<br /> <br /> Recently, a method named [[arxiv:1603.04259|Item2Vec]] was introduced for scalable item-based collaborative filtering. Item2Vec produces embeddings for items in a latent space and is capable of inferring item-to-item relations even when user information is not available.<br /> <br /> ===Synonyms===<br /> [[Synonyms|Synonymy]] refers to the tendency of the same or very similar items to have different names or entries. Most recommender systems are unable to discover this latent association and thus treat these products differently.<br /> <br /> For example, the seemingly different items &quot;children movie&quot; and &quot;children film&quot; actually refer to the same item.
Indeed, the degree of variability in descriptive term usage is greater than commonly suspected.{{citation needed|date=September 2013}} The prevalence of synonyms decreases the recommendation performance of CF systems. Topic modeling (such as the [[Latent Dirichlet Allocation]] technique) could address this by grouping different words that belong to the same topic.{{citation needed|date=September 2013}}<br /> <br /> ===Gray sheep===<br /> Gray sheep refers to users whose opinions do not consistently agree or disagree with any group of people and who thus do not benefit from collaborative filtering. [[Black sheep]] are the opposite group, whose idiosyncratic tastes make recommendations nearly impossible. Although this is a failure of the recommender system, non-electronic recommenders also have great problems in these cases, so black sheep are an acceptable failure.<br /> <br /> ===Shilling attacks===<br /> In a recommendation system where everyone can give ratings, people may give many positive ratings to their own items and negative ratings to their competitors'. It is often necessary for collaborative filtering systems to introduce precautions to discourage such manipulation.<br /> <br /> ===Diversity and the Long Tail===<br /> Collaborative filters are expected to increase diversity because they help us discover new products. Some algorithms, however, may unintentionally do the opposite. Because collaborative filters recommend products based on past sales or ratings, they cannot usually recommend products with limited historical data. This can create a rich-get-richer effect for popular products, akin to [[positive feedback]]. This bias toward popularity can prevent what are otherwise better consumer-product matches.
A [[Wharton School of the University of Pennsylvania|Wharton]] study details this phenomenon along with several ideas that may promote diversity and the &quot;[[long tail]].&quot;&lt;ref&gt;{{cite journal| last1= Fleder | first1= Daniel | first2= Kartik |last2= Hosanagar | title=Blockbuster Culture's Next Rise or Fall: The Impact of Recommender Systems on Sales Diversity|journal=Management Science |date=May 2009|url=http://papers.ssrn.com/sol3/papers.cfm?abstract_id=955984 | doi = 10.1287/mnsc.1080.0974 }}&lt;/ref&gt; Several collaborative filtering algorithms have been developed to promote diversity and the &quot;[[long tail]]&quot; by recommending novel, unexpected,&lt;ref&gt;{{cite journal| last1= Adamopoulos | first1= Panagiotis | first2= Alexander |last2= Tuzhilin | title=On Unexpectedness in Recommender Systems: Or How to Better Expect the Unexpected|journal=ACM Transactions on Intelligent Systems and Technology |date=January 2015|url=http://dl.acm.org/citation.cfm?id=2559952 | doi = 10.1145/2559952}}&lt;/ref&gt; and serendipitous items.&lt;ref&gt;{{cite journal| last1= Adamopoulos | first1= Panagiotis | title=Beyond rating prediction accuracy: on new perspectives in recommender systems|journal=Proceedings of the 7th ACM conference on Recommender systems |date=October 2013|url=http://dl.acm.org/citation.cfm?id=2508073| doi = 10.1145/2507157.2508073}}&lt;/ref&gt;<br /> <br /> ==Innovations==<br /> {{Prose|date=May 2012}}<br /> * New algorithms have been developed for CF as a result of the [[Netflix prize]].<br /> * Cross-System Collaborative Filtering where user profiles across multiple [[recommender systems]] are combined in a privacy preserving manner.<br /> * Robust Collaborative Filtering, where recommendation is stable towards efforts of manipulation. 
This research area is still active and not completely solved.&lt;ref&gt;{{cite web|url=http://dl.acm.org/citation.cfm?id=1297240 |title=Robust collaborative filtering |doi=10.1145/1297231.1297240 |publisher=Portal.acm.org |date=19 October 2007 |accessdate=2012-05-15}}&lt;/ref&gt;<br /> <br /> ==See also==<br /> * [[Attention Profiling Mark-up Language|Attention Profiling Mark-up Language (APML)]]<br /> * [[Cold start]]<br /> * [[Collaborative model]]<br /> * [[Collaborative search engine]]<br /> * [[Collective intelligence]]<br /> * [[Customer engagement]]<br /> * [[Delegative Democracy]], the same principle applied to voting rather than filtering<br /> * [[Enterprise bookmarking]]<br /> * [[Firefly (website)]], a defunct website which was based on collaborative filtering<br /> * [[Long tail]]<br /> * [[Preference elicitation]]<br /> * [[Recommendation system]]<br /> * [[Relevance (information retrieval)]]<br /> * [[Reputation system]]<br /> * [[Robust collaborative filtering]]<br /> * [[Similarity search]]<br /> * [[Slope One]]<br /> * [[Social translucence]]<br /> <br /> ==References==<br /> {{Reflist|30em}}<br /> <br /> ==External links==<br /> *[http://www.grouplens.org/papers/pdf/rec-sys-overview.pdf ''Beyond Recommender Systems: Helping People Help Each Other''], page 12, 2001<br /> *[http://www.prem-melville.com/publications/recommender-systems-eml2010.pdf Recommender Systems.] Prem Melville and Vikas Sindhwani. In Encyclopedia of Machine Learning, Claude Sammut and Geoffrey Webb (Eds), Springer, 2010.<br /> *[http://arxiv.org/abs/1203.4487 Recommender Systems in industrial contexts - PHD thesis (2012) including a comprehensive overview of many collaborative recommender systems]<br /> *[http://web.archive.org/web/20080602151647/http://ieeexplore.ieee.org:80/xpls/abs_all.jsp?arnumber=1423975 Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions]. Adomavicius, G. and Tuzhilin, A. 
IEEE Transactions on Knowledge and Data Engineering 06.2005<br /> *[https://web.archive.org/web/20060527214435/http://ectrl.itc.it/home/laboratory/meeting/download/p5-l_herlocker.pdf Evaluating collaborative filtering recommender systems] ([http://www.doi.org/ DOI]: [http://dx.doi.org/10.1145/963770.963772 10.1145/963770.963772])<br /> *[http://www.grouplens.org/publications.html GroupLens research papers].<br /> *[http://www.cs.utexas.edu/users/ml/papers/cbcf-aaai-02.pdf Content-Boosted Collaborative Filtering for Improved Recommendations.] Prem Melville, Raymond J. Mooney, and Ramadass Nagarajan. Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI-2002), pp.&amp;nbsp;187–192, Edmonton, Canada, July 2002.<br /> *[http://agents.media.mit.edu/projects.html A collection of past and present &quot;information filtering&quot; projects (including collaborative filtering) at MIT Media Lab]<br /> *[http://www.ieor.berkeley.edu/~goldberg/pubs/eigentaste.pdf Eigentaste: A Constant Time Collaborative Filtering Algorithm. Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Information Retrieval, 4(2), 133-151. July 2001.]<br /> *[http://downloads.hindawi.com/journals/aai/2009/421425.pdf A Survey of Collaborative Filtering Techniques] Su, Xiaoyuan and Khoshgoftaar, Taghi M.<br /> *[http://dl.acm.org/citation.cfm?id=1242610 Google News Personalization: Scalable Online Collaborative Filtering] Abhinandan Das, Mayur Datar, Ashutosh Garg, and Shyam Rajaram.
International World Wide Web Conference, Proceedings of the 16th international conference on World Wide Web<br /> *[http://web.archive.org/web/20101023032716/http://research.yahoo.com:80/pub/2435 Factor in the Neighbors: Scalable and Accurate Collaborative Filtering] Yehuda Koren, Transactions on Knowledge Discovery from Data (TKDD) (2009)<br /> *[http://webpages.uncc.edu/~asaric/ISMIS09.pdf Rating Prediction Using Collaborative Filtering]<br /> *[http://www.cis.upenn.edu/~ungar/CF/ Recommender Systems]<br /> *[http://www2.sims.berkeley.edu/resources/collab/ Berkeley Collaborative Filtering]<br /> <br /> {{Authority control}}<br /> <br /> {{DEFAULTSORT:Collaborative Filtering}}<br /> [[Category:Collaboration]]<br /> [[Category:Collaborative software]]<br /> [[Category:Collective intelligence]]<br /> [[Category:Information retrieval techniques]]<br /> [[Category:Recommender systems]]<br /> [[Category:Social information processing]]<br /> [[Category:Behavioral and social facets of systemic risk]]</div> Deepalgo https://en.wikipedia.org/w/index.php?title=Item-item_collaborative_filtering&diff=710693752 Item-item collaborative filtering 2016-03-18T13:58:08Z <p>Deepalgo: Added reference to a new method for item-item collaborative filtering</p> <hr /> <div>{{recommender systems}}<br /> '''Item-item collaborative filtering''', or '''item-based''', or '''item-to-item''', is a form of [[collaborative filtering]] based on the similarity between items calculated using people's ratings of those items. 
Item-item collaborative filtering was first published in 2001, and in 2003 the e-commerce website [[Amazon.com|Amazon]] stated this algorithm powered its recommender system.<br /> <br /> Earlier collaborative filtering systems based on [[Star (classification)|rating]] similarity between users (known as [[user-user collaborative filtering]]) had several problems:<br /> * systems performed poorly when they had many items but comparatively few ratings <br /> * computing similarities between all pairs of users was expensive<br /> * user profiles changed quickly and the entire system model had to be recomputed<br /> <br /> Item-item models resolve these problems in systems that have more users than items. Item-item models use rating distributions ''per item'', not ''per user''. With more users than items, each item tends to have more ratings than each user, so an item's average rating usually doesn't change quickly. This leads to more stable rating distributions in the model, so the model doesn't have to be rebuilt as often. When users consume and then rate an item, that item's similar items are picked from the existing system model and added to the user's recommendations.<br /> <br /> Recently, a method named [[arxiv:1603.04259|Item2Vec]] was proposed for scalable item-item collaborative filtering. Item2Vec produces low-dimensional representations for items, where the affinity between items can be measured by cosine similarity. The method is based on the Word2Vec method, which was successfully applied to natural language processing applications.<br /> <br /> ==Method==<br /> First, the system executes a model-building stage by finding the similarity between all pairs of items. This [[Similarity measure|similarity function]] can take many forms, such as correlation between ratings or cosine of those rating vectors.
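The model-building stage just described, together with a similarity-weighted recommendation step, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Amazon's implementation: the nested `ratings` dictionary and all names are hypothetical, and the cosine is taken over item rating vectors with users as dimensions (missing ratings treated as zero).

```python
from math import sqrt

def item_cosine(ratings, i, j):
    """Cosine between the rating vectors of items i and j,
    with users as dimensions (missing ratings count as zero)."""
    num = sum(r[i] * r[j] for r in ratings.values() if i in r and j in r)
    den = sqrt(sum(r[i] ** 2 for r in ratings.values() if i in r)) * \
          sqrt(sum(r[j] ** 2 for r in ratings.values() if j in r))
    return num / den if den else 0.0

def recommend(ratings, user, top_n=3):
    """Score each item the user has not rated by a similarity-weighted
    sum of the user's own ratings, then return the best-scoring items."""
    all_items = {i for r in ratings.values() for i in r}
    scores = {}
    for i in all_items - set(ratings[user]):
        sims = [(item_cosine(ratings, i, j), rating)
                for j, rating in ratings[user].items()]
        norm = sum(abs(s) for s, _ in sims)
        if norm:
            scores[i] = sum(s * rating for s, rating in sims) / norm
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical toy data: "u1" has not rated "z" yet.
ratings = {"u1": {"x": 5, "y": 5}, "u2": {"x": 3, "y": 3, "z": 4}}
recommend(ratings, "u1")  # ["z"]
```

Computing similarities lazily, as here, is only for clarity; a real system would precompute and store the item-item similarity model offline, which is exactly what makes this approach scale.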
As in user-user systems, similarity functions can use [[Normalization (statistics)|normalized]] ratings (correcting, for instance, for each user's average rating).<br /> <br /> Second, the system executes a [[recommender system|recommendation]] stage. It uses the most similar items to a user's already-rated items to generate a list of recommendations. Usually this calculation is a [[Weight function|weighted sum]] or [[linear regression]]. This form of recommendation is analogous to &quot;people who rate item X highly, like you, also tend to rate item Y highly, and you haven't rated item Y yet, so you should try it&quot;.<br /> <br /> ==Results==<br /> Item-item collaborative filtering produced lower error than user-user collaborative filtering. In addition, its less-dynamic model could be computed less often and stored in a smaller matrix, so item-item systems performed better than user-user systems.<br /> <br /> ==See also==<br /> * [[Slope One]], a family of item-item collaborative filtering algorithms designed to reduce model [[overfitting]] problems<br /> <br /> ==Bibliography==<br /> * {{cite journal|url=http://dl.acm.org/citation.cfm?id=372071|title=Item-based collaborative filtering recommendation algorithms|journal=Proceedings of the 10th international conference on the World Wide Web|pages=285–295 |date=2001 |isbn=1-58113-348-0 |doi=10.1145/371920.372071|first1=Badrul |last1=Sarwar |first2=George |last2=Karypis |first3=Joseph |last3=Konstan|first4=John |last4=Riedl |authorlink4=John Riedl|publisher=[[Association for Computing Machinery|ACM]]}}<br /> * {{cite journal|url=http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1167344|title=Amazon.com recommendations: item-to-item collaborative filtering|journal=IEEE Internet Computing|pages=76–80 |date=22 January 2003 |issn=1089-7801 |publisher=[[IEEE]] |volume=7 |issue=1 |doi=10.1109/MIC.2003.1167344|first1=G. |last1=Linden |first2=B. |last2=Smith |first3=J. |last3=York}}<br /> * Barkan, O.; Koenigstein, N. (14
March 2016). [[arxiv:1603.04259|&quot;Item2Vec: Neural Item Embedding for Collaborative Filtering&quot;]]. arXiv:1603.04259.<br /> {{reflist}}<br /> <br /> [[Category:Recommender systems]]</div> Deepalgo