PLS Path Modelling
What is PLS Path Modeling?
Partial Least Squares Path Modeling (PLS-PM) is a statistical approach for modeling complex multivariable relationships (structural equation models) among observed and latent variables. In recent years, this approach has become increasingly popular in several sciences (Esposito Vinzi et al., 2007). Structural Equation Models include a number of statistical methodologies that allow the estimation of a causal theoretical network of relationships linking complex latent concepts, each measured by means of a number of observable indicators.
The first presentation of the finalized PLS approach to path models with latent variables was published by Wold in 1979; the main references on the PLS algorithm are Wold (1982, 1985).
Herman Wold contrasted LISREL (Jöreskog, 1970) "hard modeling" (heavy distributional assumptions, several hundred cases necessary) with PLS "soft modeling" (very few distributional assumptions, few cases can suffice). These two approaches to Structural Equation Modeling were compared in Jöreskog and Wold (1982).
From the standpoint of structural equation modeling, PLS-PM is a component-based approach where the concept of causality is formulated in terms of linear conditional expectation. PLS-PM seeks optimal linear predictive relationships rather than causal mechanisms, thus favoring a prediction-oriented discovery process over the statistical testing of causal hypotheses. Two very important review papers on the PLS approach to Structural Equation Modeling are Chin (1998, more application oriented) and Tenenhaus et al. (2005, more theory oriented).
Furthermore, PLS Path Modeling can be used for analyzing multiple tables and is directly related to more classical data analysis methods used in this field. In fact, PLS-PM may also be viewed as a very flexible approach to multi-block (or multiple table) analysis by means of both the hierarchical PLS path model and the confirmatory PLS path model (Tenenhaus and Hanafi, 2007). This approach clearly shows how the "data-driven" tradition of multiple table analysis can be merged with the "theory-driven" tradition of structural equation modeling, so that multi-block data can be analyzed in light of current knowledge on the conceptual relationships between tables.
The PLS Path Modeling algorithm
A PLS Path model is described by two models: (1) a measurement model relating the manifest variables to their own latent variable and (2) a structural model relating some endogenous latent variables to other latent variables. The measurement model is also called the outer model and the structural model the inner model.
1. Manifest variables standardization
There are four options for the standardization of the manifest variables, depending on three conditions that may hold in the data:
- Condition 1: The scales of the manifest variables are comparable. For instance, in the ECSI example the item values (between 0 and 100) are comparable. On the other hand, for instance, weight in tons and speed in km/h would not be comparable.
- Condition 2: The means of the manifest variables are interpretable. For instance, if the difference between two manifest variables is not interpretable, the location parameters are meaningless.
- Condition 3: The variances of the manifest variables reflect their importance.
If condition 1 does not hold, then the manifest variables have to be standardized (mean 0 and variance 1).
If condition 1 holds, it is useful to get the results based on the raw data. But the calculation of the model parameters depends upon the validity of the other conditions:
- Conditions 2 and 3 do not hold: The manifest variables are standardized (mean 0 and variance 1) for the parameter estimation phase. Then the manifest variables are rescaled to their original means and variances for the final expression of the weights and loadings (to be defined later).
- Condition 2 holds, but not condition 3: The manifest variables are not centered, but are standardized to unit variance for the parameter estimation phase. Then the manifest variables are rescaled to their original variances for the final expression of the weights and loadings.
- Conditions 2 and 3 hold: The original variables are used.
Lohmöller (1989) introduced a standardization parameter to select one of these four options:
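| Variable scales are comparable | Means are interpretable | Variance is related to variable importance | Mean | Variance | Rescaling | METRIC |
|---|---|---|---|---|---|---|
| No | - | - | 0 | 1 | No | 1 |
| Yes | No | No | 0 | 1 | Yes | 2 |
| Yes | Yes | No | Original | 1 | Yes | 3 |
| Yes | Yes | Yes | Original | Original | - | 4 |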
With METRIC=1 being "Standardized, weights on standardized MV", METRIC=2 being "Standardized, weights on raw MV", METRIC=3 being "Reduced, weights on raw MV" and METRIC=4 being "Raw MV".
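The four options can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `prepare_manifest_variables`, the use of NumPy and the choice of the sample standard deviation (`ddof=1`) are not taken from the text, and the function only prepares the data for the estimation phase without performing the final rescaling of weights and loadings.

```python
import numpy as np

def prepare_manifest_variables(X, metric):
    """Transform a block of manifest variables for the estimation phase.

    X is an (n_cases, n_variables) array and metric is 1, 2, 3 or 4.
    Returns the transformed data and a flag telling whether the weights
    and loadings should later be re-expressed on the raw variables.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)  # assumption: sample standard deviation
    if metric in (1, 2):
        Z = (X - mean) / std      # mean 0, variance 1
    elif metric == 3:
        Z = X / std               # not centered, variance 1
    elif metric == 4:
        Z = X.copy()              # raw manifest variables
    else:
        raise ValueError("metric must be 1, 2, 3 or 4")
    rescale = metric in (2, 3)    # results expressed on raw MVs afterwards
    return Z, rescale

# Hypothetical data: two manifest variables on very different scales.
X = np.array([[10.0, 200.0], [20.0, 400.0], [30.0, 900.0]])
Z, rescale = prepare_manifest_variables(X, metric=3)
print(Z.std(axis=0, ddof=1), rescale)
```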
2. The measurement model
A latent variable (LV) ξ is an unobservable variable (or construct) indirectly described by a block of observable variables xh which are called manifest variables (MV) or indicators. There are three ways to relate the manifest variables to their latent variables, respectively called the reflective way, the formative one, and the MIMIC (Multiple effect Indicators for Multiple Causes) way.
2.1. The reflective way
In this model each manifest variable reflects its latent variable. Each manifest variable is related to its latent variable by a simple regression:
xh = πh0 + πhξ + εh,   (1)
where ξ has mean m and standard deviation 1. It is a reflective scheme: each manifest variable xh reflects its latent variable ξ. The only hypothesis made on model (1) is called by H. Wold the predictor specification condition:
E(xh | ξ) = πh0 + πhξ.
This hypothesis implies that the residual εh has a zero mean and is uncorrelated with the latent variable ξ.
2.1.2. Check for unidimensionality
In the reflective way, the block of manifest variables is unidimensional in the sense of factor analysis. With real data, this condition has to be checked. Three main tools are available to check the unidimensionality of a block: principal component analysis of each block of manifest variables, Cronbach's α and Dillon-Goldstein's r.
- Principal component analysis of a block
A block is essentially unidimensional if the first eigenvalue of the correlation matrix of the block MVs is larger than 1 and the second one smaller than 1, or at least very far from the first one. The first principal component can be built in such a way that it is positively correlated with all (or at least a majority of) the MVs. A problem arises when a MV is negatively correlated with the first principal component.
- Cronbach's α
Cronbach's α can be used to check the unidimensionality of a block of p variables xh when they are all positively correlated. For standardized variables, Cronbach proposed the following statistic:
α = [p / (p-1)] × [Ʃh≠h’ cor(xh, xh’) / (p + Ʃh≠h’ cor(xh, xh’))]
Cronbach's alpha is also defined for original (raw) variables as:
α = [p / (p-1)] × [Ʃh≠h’ cov(xh, xh’) / var(Ʃh xh)]
A block is considered unidimensional when Cronbach's alpha is larger than 0.7.
- Dillon-Goldstein's r
The sign of the correlation between each MV xh and its LV ξ is known by construction of the item and is supposed here to be positive. In equation (1) this hypothesis means that all the loadings πh are positive. A block is unidimensional if all these loadings are large.
Dillon-Goldstein's r is defined by:
r = (Ʃh=1..p πh)² Var(ξ) / [(Ʃh=1..p πh)² Var(ξ) + Ʃh=1..p Var(εh)]
Let us now suppose that all the MVs xh and the latent variable ξ are standardized. An approximation of the latent variable ξ is obtained by standardizing the first principal component t1 of the block MVs. Then πh is estimated by cor(xh, t1) and, using equation (1), Var(εh) is estimated by 1 - cor²(xh, t1). So we get an estimate of Dillon-Goldstein's r:
ȓ = (Ʃh=1..p cor(xh, t1))² / [(Ʃh=1..p cor(xh, t1))² + Ʃh=1..p (1 - cor²(xh, t1))]
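The three checks can be computed together for one block. The following is a minimal sketch, assuming NumPy; the function name `unidimensionality_checks` is hypothetical, Cronbach's α is computed in its standardized form, and the sign of the first principal component is fixed by a simple rule (positive sum of correlations) rather than by a majority vote.

```python
import numpy as np

def unidimensionality_checks(X):
    """Return (eigenvalues, Cronbach's alpha, Dillon-Goldstein's r) for a block.

    X is an (n_cases, p) array holding the manifest variables of one block.
    """
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    R = np.corrcoef(X, rowvar=False)

    # 1. Eigenvalues of the correlation matrix, largest first.
    eigenvalues, eigenvectors = np.linalg.eigh(R)  # ascending order
    eigenvalues = eigenvalues[::-1]

    # 2. Cronbach's alpha for standardized variables.
    off_diagonal = R.sum() - p  # sum of cor(xh, xh') over h != h'
    alpha = (p / (p - 1)) * off_diagonal / (p + off_diagonal)

    # 3. Dillon-Goldstein's r from the first principal component t1.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    t1 = Z @ eigenvectors[:, -1]
    cor = np.array([np.corrcoef(X[:, h], t1)[0, 1] for h in range(p)])
    if cor.sum() < 0:  # simple sign convention (an assumption)
        cor = -cor
    r = cor.sum() ** 2 / (cor.sum() ** 2 + (1 - cor ** 2).sum())
    return eigenvalues, alpha, r

# Simulated block: four indicators sharing one common factor.
rng = np.random.default_rng(0)
factor = rng.normal(size=200)
X = np.column_stack([factor + 0.5 * rng.normal(size=200) for _ in range(4)])
eigenvalues, alpha, r = unidimensionality_checks(X)
print(eigenvalues.round(2), round(alpha, 3), round(r, 3))
```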
PLS Path Modeling is a mixture of a priori knowledge and data analysis. In the reflective way, the a priori knowledge concerns the unidimensionality of the block and the signs of the loadings. The data have to fit this model. If they do not, they can be modified by removing some manifest variables that are far from the model. Another solution is to change the model and use the formative way that will now be described.
2.2. The formative way
In the formative way, it is supposed that the latent variable ξ is generated by its own manifest variables. The latent variable ξ is a linear function of its manifest variables plus a residual term:
ξ = Ʃhwhxh + δ
In the formative model the block of manifest variables can be multidimensional. The predictor specification condition is supposed to hold as:
E(ξ | x1, ..., xp) = Ʃh whxh.
This hypothesis implies that the residual vector δ has a zero mean and is uncorrelated with the MVs xh.
2.3. The MIMIC way
The MIMIC way is a mixture of the reflective and formative ways.
The measurement model for a block is the following:
xh = πh0+ πhξ + εh, for h = 1 to p1
where the latent variable is defined by:
ξ = Ʃh=p1+1..p whxh + δ
The p1 first manifest variables follow a reflective way and the (p – p1) last ones a formative way. The predictor specification hypotheses still hold and lead to the same consequences as before on the residuals.
3. The structural model
The causality model leads to linear equations relating the latent variables between them (the structural or inner model):
ξj = βj0 + Ʃi βji ξi + νj
The predictor specification hypothesis is still applied.
A latent variable, which never appears as a dependent variable, is called an exogenous variable. Otherwise it is called an endogenous variable.
4. The Estimation Algorithm
4.1. Latent variables Estimation
The latent variables ξj are estimated according to the following procedure.
4.1.1. Outer estimate yj of the standardized latent variable (ξj – mj)
The standardized latent variables (mean = 0 and standard deviation = 1) are estimated as linear combinations of their centered manifest variables:
yj ∝ ± [Ʃ wjh (xjh - ẋjh)]
where the symbol "∝" means that the left-hand variable is the standardized version of the right-hand one and the "±" sign shows the sign ambiguity. This ambiguity is resolved by choosing the sign that makes yj positively correlated with a majority of the xjh.
The standardized latent variable is finally written as:
yj = Ʃ ŵjh (xjh - ẋjh)
The coefficients wjh and ŵjh are both called the outer weights.
The mean mj is estimated by:
ṁj = Ʃ ŵjh ẋjh
and the latent variable ξj by
approx(ξj) = Ʃ ŵjh xjh = yj + ṁj
When all manifest variables are observed on the same measurement scale, it is convenient to express the latent variable estimates in the original scale (Fornell, 1992) as:
approx(ξj)* = Ʃ ŵjh xjh / Ʃ ŵjh.
This equation is feasible when all outer weights are positive. Finally, most often in real applications, latent variables estimates are required on a 0-100 scale so as to have a reference scale to compare individual scores. For the i-th observed case, this is easily obtained by the following transformation:
approx(ξj)0-100 = 100 * (approx(ξj)* - xmin) / (xmax - xmin)
where xmin and xmax are, respectively, the minimum and the maximum value of the measurement scale common to all manifest variables.
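As a small worked example of the two formulas above, the sketch below computes the 0-100 score of one observed case. The function name `latent_score_0_100` and the numbers used are purely illustrative assumptions.

```python
def latent_score_0_100(x_row, weights, x_min, x_max):
    """Score of one case on a 0-100 scale.

    x_row holds the manifest variable values of the case, weights the
    final outer weights, and x_min / x_max the bounds of the common
    measurement scale.
    """
    if any(w <= 0 for w in weights):
        raise ValueError("all outer weights must be positive")
    # Weighted average of the manifest variables (original scale).
    score = sum(w * x for w, x in zip(weights, x_row)) / sum(weights)
    # Linear mapping of [x_min, x_max] onto [0, 100].
    return 100.0 * (score - x_min) / (x_max - x_min)

# Hypothetical case: three items rated on a 1-10 scale.
print(latent_score_0_100([7, 8, 9], [0.2, 0.3, 0.5], x_min=1, x_max=10))
```

Here the weighted average is 8.3 on the 1-10 scale, which maps to about 81.1 on the 0-100 scale.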
4.1.2. Inner estimate zj of the standardized latent variable (ξj – mj)
The inner estimate zj of the standardized latent variable (ξj – mj) is defined by:
zj ∝ Ʃj': ξj' is connected with ξj  ejj' yj'
where the inner weights ejj’ are equal to the signs of the correlations between yj and the yj’'s connected with yj. Two latent variables are connected if there exists a link between the two variables: an arrow goes from one variable to the other in the arrow diagram describing the causality model. This choice of inner weights is called the centroid scheme.
- Centroid scheme:
This choice shows a drawback in case the correlation is approximately zero as its sign may change for very small fluctuations. But it does not seem to be a problem in practical applications.
In the original algorithm, the inner estimate is the right term and there is no standardization. We prefer to standardize because it does not change anything for the final inner estimate of the latent variables and it simplifies the writing of some equations.
Two other schemes for choosing the inner weights exist: the factorial scheme and the path weighting (or structural) scheme. These two new schemes are defined as follows:
- Factorial scheme:
The inner weights eji are equal to the correlation between yi and yj. This is an answer to the drawbacks of the centroid scheme described above.
- Path weighting scheme (structural):
The latent variables connected to ξj are divided into two groups: the predecessors of ξj, which are latent variables explaining ξj, and the followers, which are latent variables explained by ξj.
For a predecessor ξj’ of the latent variable ξj, the inner weight ejj’ is equal to the regression coefficient of yj’ in the multiple regression of yj on all the yj’’s related to the predecessors of ξj. If ξj’ is a follower of ξj, then the inner weight ejj’ is equal to the correlation between yj’ and yj.
These new schemes do not significantly influence the results but are very important for theoretical reasons: they make it possible to relate PLS Path Modeling to the usual multiple table analysis methods.
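The three inner weighting schemes can be compared side by side. The sketch below is an illustration under assumptions: the function name `inner_weights`, the NumPy implementation and the encoding of the arrow diagram as a 0/1 matrix `path` (with `path[j, i] = 1` when ξi is a predecessor of ξj) are not taken from the text. The columns of `Y` are assumed to be standardized outer estimates.

```python
import numpy as np

def inner_weights(Y, path, scheme="centroid"):
    """Inner weight matrix E, where E[j, k] is the weight of y_k in z_j.

    Y is an (n_cases, J) array of standardized outer estimates and
    path[j, i] = 1 means that latent variable i is a predecessor of j.
    """
    Y = np.asarray(Y, dtype=float)
    path = np.asarray(path, dtype=bool)
    J = Y.shape[1]
    R = np.corrcoef(Y, rowvar=False)
    connected = path | path.T  # link in either direction
    if scheme == "centroid":
        return np.sign(R) * connected
    if scheme == "factorial":
        return R * connected
    if scheme == "path":
        E = np.zeros((J, J))
        for j in range(J):
            predecessors = np.flatnonzero(path[j])
            followers = np.flatnonzero(path[:, j])
            if predecessors.size:
                # Multiple regression of y_j on its predecessors
                # (no intercept needed: the columns are centered).
                coef, *_ = np.linalg.lstsq(Y[:, predecessors], Y[:, j], rcond=None)
                E[j, predecessors] = coef
            E[j, followers] = R[j, followers]
        return E
    raise ValueError("scheme must be 'centroid', 'factorial' or 'path'")

# Simulated scores for a chain xi_0 -> xi_1 -> xi_2.
rng = np.random.default_rng(3)
Y = rng.normal(size=(50, 3))
Y[:, 1] += Y[:, 0]
Y[:, 2] += Y[:, 1]
Y = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)
path = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0]])
print(inner_weights(Y, path, "path").round(3))
```

The unstandardized inner estimates are then the columns of `Y @ E.T`.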
4.2. The PLS algorithm for estimating the weights
4.2.1. Estimation modes for the weights wjh
There are three classical ways to estimate the weights wjh: Mode A, Mode B and Mode C.
In mode A the weight wjh is the regression coefficient of zj in the simple regression of xjh on the inner estimate zj:
wjh = cov(xjh, zj),
as zj is standardized.
In mode B the vector wj of weights wjh is the regression coefficient vector in the multiple regression of zj on the manifest centered variables (xjh - ẋjh) related to the same latent variable ξj:
wj = (Xj'Xj)⁻¹Xj'zj,
where Xj is the matrix with columns defined by the centered manifest variables xjh - ẋjh related to the j-th latent variable ξj.
Mode A is appropriate for a block with a reflective measurement model and Mode B for a formative one. Mode A is often used for an endogenous latent variable and mode B for an exogenous one. Modes A and B can be used simultaneously when the measurement model is the MIMIC one. Mode A is used for the reflective part of the model and Mode B for the formative part.
In practical situations, mode B is not so easy to use because there is often strong multicollinearity inside each block. When this is the case, PLS regression may be used instead of OLS multiple regression. In fact, mode A amounts to taking the first component of a PLS regression, while mode B takes all PLS regression components (and thus coincides with OLS multiple regression). Therefore, running a PLS regression and retaining a certain number of significant components may be regarded as a new intermediate mode between mode A and mode B.
Mode C (centroid):
In mode C the weights are all equal in absolute value and reflect the signs of the correlations between the manifest variables and their latent variables:
wjh = sign(cor(xjh, zj)).
These weights are then normalized so that the resulting latent variable has unitary variance. Mode C actually refers to a formative way of linking manifest variables to their latent variables and represents a specific case of Mode B whose comprehension is very intuitive to practitioners.
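The three estimation modes differ only in how the weights are derived from the inner estimate. The following sketch, assuming NumPy and a hypothetical function name `outer_weights`, computes the weights of one block and normalizes them so that the resulting latent variable has unit variance; the sample variance (`ddof=1`) is an assumption.

```python
import numpy as np

def outer_weights(Xc, z, mode="A"):
    """Outer weights of one block, normalized to give a unit-variance LV.

    Xc is the (n_cases, p) matrix of centered manifest variables and z
    the standardized inner estimate of the block's latent variable.
    """
    Xc = np.asarray(Xc, dtype=float)
    z = np.asarray(z, dtype=float)
    n = Xc.shape[0]
    if mode == "A":
        w = Xc.T @ z / (n - 1)                      # cov(xjh, zj)
    elif mode == "B":
        w, *_ = np.linalg.lstsq(Xc, z, rcond=None)  # multiple regression of zj on Xj
    elif mode == "C":
        w = np.sign(Xc.T @ z)                       # sign(cor(xjh, zj))
    else:
        raise ValueError("mode must be 'A', 'B' or 'C'")
    return w / (Xc @ w).std(ddof=1)

# Simulated block with two correlated manifest variables out of three.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
X[:, 1] += X[:, 0]
Xc = X - X.mean(axis=0)
z = Xc.sum(axis=1) + rng.normal(size=100)
z = (z - z.mean()) / z.std(ddof=1)
for mode in "ABC":
    print(mode, outer_weights(Xc, z, mode).round(3))
```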
4.2.2. Estimating the weights
The starting step of the PLS algorithm consists in beginning with an arbitrary vector of weights wjh. These weights are then standardized in order to obtain latent variables with unitary variance.
A good choice for the initial weight values is to take wjh = sign(cor(xjh, ξj)) or, more simply, wjh = sign(cor(xjh, ξj)) for h = 1 and 0 otherwise; alternatively, they may be the elements of the first eigenvector from a PCA of each block.
Then the steps for the outer and the inner estimates, depending on the selected mode, are iterated until convergence (guaranteed only for the two-block case, but almost always observed in practice even with more than two blocks).
After the last step, final results are obtained for the outer weights ŵjh, the standardized latent variable yj = Ʃ ŵjh (xjh - ẋjh), the estimated mean ṁj = Ʃ ŵjh ẋjh of the latent variable ξj, and the final estimate approx(ξj) = Ʃ ŵjh xjh = yj + ṁj of ξj. The latter estimate can be rescaled.
The latent variable estimates are sensitive to the scaling of the manifest variables in Mode A, but not in mode B. In the latter case, the outer LV estimate is the projection of the inner LV estimate on the space generated by its manifest variables.
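The whole iteration can be condensed into a short sketch. It is only one possible instance of the procedure and rests on several assumptions: standardized manifest variables (METRIC 1), the centroid scheme, Mode A for every block, equal initial weights, and every latent variable connected to at least one other. The function names `standardize` and `pls_pm` are hypothetical, and the sign ambiguity is not handled explicitly.

```python
import numpy as np

def standardize(v):
    """Center and scale to unit sample variance, column by column."""
    return (v - v.mean(axis=0)) / v.std(axis=0, ddof=1)

def pls_pm(blocks, path, max_iter=300, tol=1e-7):
    """Iterate outer and inner estimation (centroid scheme, Mode A).

    blocks is a list of (n_cases, p_j) arrays, one per latent variable;
    path[j, i] = 1 means that latent variable i is a predecessor of j.
    Returns the outer weights and the standardized LV scores.
    """
    blocks = [standardize(np.asarray(X, dtype=float)) for X in blocks]
    path = np.asarray(path, dtype=bool)
    connected = path | path.T
    n = blocks[0].shape[0]
    # Starting weights (arbitrary), scaled so that each LV has unit variance.
    weights = [np.ones(X.shape[1]) for X in blocks]
    weights = [w / (X @ w).std(ddof=1) for X, w in zip(blocks, weights)]
    for _ in range(max_iter):
        # Outer estimates y_j.
        Y = np.column_stack([X @ w for X, w in zip(blocks, weights)])
        # Inner estimates z_j with centroid inner weights.
        E = np.sign(np.corrcoef(Y, rowvar=False)) * connected
        Z = standardize(Y @ E.T)
        # Mode A update: w_jh = cov(x_jh, z_j), then normalization.
        updated = [X.T @ Z[:, j] / (n - 1) for j, X in enumerate(blocks)]
        updated = [w / (X @ w).std(ddof=1) for X, w in zip(blocks, updated)]
        change = max(np.abs(u - w).max() for u, w in zip(updated, weights))
        weights = updated
        if change < tol:
            break
    Y = np.column_stack([X @ w for X, w in zip(blocks, weights)])
    return weights, Y

# Simulated example with three latent variables and three indicators each.
rng = np.random.default_rng(1)
n = 300
xi1 = rng.normal(size=n)
xi2 = 0.7 * xi1 + rng.normal(scale=0.7, size=n)
xi3 = 0.5 * xi1 + 0.4 * xi2 + rng.normal(scale=0.6, size=n)
blocks = [np.column_stack([xi + 0.5 * rng.normal(size=n) for _ in range(3)])
          for xi in (xi1, xi2, xi3)]
path = [[0, 0, 0], [1, 0, 0], [1, 1, 0]]
weights, scores = pls_pm(blocks, path)
print([w.round(3) for w in weights])
```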
4.3. Estimation of the structural equations
The structural equations are estimated by individual OLS multiple regressions where the latent variables ξj are replaced by their estimates approx(ξj). As usual, OLS multiple regression may be disturbed by strong multicollinearity between the estimated latent variables. In such a case, PLS regression may be applied instead.
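A minimal sketch of this last step, assuming NumPy and the same hypothetical 0/1 encoding of the arrow diagram (`path[j, i] = 1` when ξi is a predecessor of ξj), fits one OLS regression per endogenous latent variable. The function name `structural_coefficients` and the returned dictionary layout are illustrative choices.

```python
import numpy as np

def structural_coefficients(scores, path):
    """OLS regression of each endogenous LV on its predecessors.

    scores is an (n_cases, J) array of latent variable estimates.
    Returns {j: {"intercept": ..., "paths": {i: beta_ji}, "r2": ...}}.
    """
    scores = np.asarray(scores, dtype=float)
    path = np.asarray(path, dtype=bool)
    n, J = scores.shape
    results = {}
    for j in range(J):
        predecessors = np.flatnonzero(path[j])
        if predecessors.size == 0:
            continue  # exogenous latent variable: nothing to estimate
        design = np.column_stack([np.ones(n), scores[:, predecessors]])
        coef, *_ = np.linalg.lstsq(design, scores[:, j], rcond=None)
        residuals = scores[:, j] - design @ coef
        total = scores[:, j] - scores[:, j].mean()
        results[j] = {
            "intercept": coef[0],
            "paths": dict(zip(predecessors.tolist(), coef[1:])),
            "r2": 1.0 - (residuals ** 2).sum() / (total ** 2).sum(),
        }
    return results

# Simulated scores: xi_1 depends on xi_0.
rng = np.random.default_rng(4)
s0 = rng.normal(size=100)
s1 = 2.0 + 0.5 * s0 + 0.3 * rng.normal(size=100)
print(structural_coefficients(np.column_stack([s0, s1]), [[0, 0], [1, 0]]))
```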
Regularized Generalized Canonical Correlation Analysis (RGCCA)
This method, introduced by Tenenhaus et al. (2011), optimizes a global function using an algorithm very similar to the PLSPM algorithm.
Unlike the PLS approach, the results of the RGCCA are correlations between latent variables and between manifest variables and their associated latent variables (there is no regression at the end of the algorithm).
The RGCCA is based on a simple iterative algorithm similar to that of the PLS approach. Once the algorithm has converged, we obtain results that optimize specific functions depending on the choice of the tau parameter.
Tau is a parameter that has to be set for each latent variable. It adjusts the “mode” associated with the latent variable. If tau = 0, we are in the case of mode B, and the results of PLSPM and RGCCA are similar. When tau = 1, we obtain the new mode A (as named by M. Tenenhaus), which is close to mode A of PLSPM while optimizing a given function. When tau lies between 0 and 1, the latent variable mode lies between mode A and mode B. XLSTAT-PLSPM offers a special mode called Ridge RGCCA that automatically adjusts the tau parameter. For more details on RGCCA see Tenenhaus et al. (2011).
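The role of tau can be illustrated on the weight update of a single block. The sketch below follows the update usually given for RGCCA, in which the weights are proportional to [tau I + (1 - tau) X'X / n]⁻¹ X'z; this formula, the function name `rgcca_outer_weights` and the normalization used are stated here as assumptions and are not a description of the XLSTAT implementation.

```python
import numpy as np

def rgcca_outer_weights(Xc, z, tau):
    """Outer weights of one block for a given tau in [0, 1].

    Xc is the (n_cases, p) matrix of centered manifest variables and z
    the inner estimate of the block's latent variable.
    """
    Xc = np.asarray(Xc, dtype=float)
    n, p = Xc.shape
    # Shrinkage between the identity (tau = 1) and the covariance matrix (tau = 0).
    M = tau * np.eye(p) + (1.0 - tau) * (Xc.T @ Xc) / n
    w = np.linalg.solve(M, Xc.T @ z)
    # Normalization: tau * ||w||^2 + (1 - tau) * var(Xc w) = 1.
    return w / np.sqrt(w @ M @ w)

# Simulated block.
rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))
X[:, 1] += X[:, 0]
Xc = X - X.mean(axis=0)
z = Xc.sum(axis=1) + rng.normal(size=100)
for tau in (0.0, 0.5, 1.0):
    print(tau, rgcca_outer_weights(Xc, z, tau).round(3))
```

With tau = 0 the weights are proportional to the mode B (multiple regression) weights, and with tau = 1 they are proportional to X'z with unit norm, which corresponds to the new mode A.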
Tenenhaus, M. and Tenenhaus, A. (2011). Regularized Generalized Canonical Correlation Analysis, Psychometrika, 76(2), 257–284.