
9
Least-Squares Joint Diagonalization of a Matrix Set by a Congruence
Transformation1
Marco Congedo
GIPSA-lab (Grenoble Images Parole Signal Automatique) : CNRS (Centre
National de la Recherche Scientifique) - Université Joseph Fourier,
Université Pierre Mendès-France, Université Stendhal, Grenoble INP
(Institut National Polytechnique). 961 rue de la Houille Blanche -
38402 Saint Martin d'Hères, FRANCE
Dinh-Tuan Pham
LJK (Laboratoire Jean Kuntzmann) - CNRS (Centre National de la
Recherche Scientifique) - Grenoble INP (Institut National
Polytechnique) - Université Joseph Fourier, 51 rue des Mathématiques -
38402 Saint Martin d'Hères, FRANCE
Approximate joint diagonalization (AJD) is an important analytic tool
at the base of numerous independent component analysis (ICA) and other
blind source separation (BSS) methods, and it is finding more and more
applications in medical imaging analysis. In this work we present a
new AJD algorithm named SDIAG (Spheric Diagonalization). It imposes no
constraint either on the input matrices or on the joint diagonalizer
to be estimated, and is therefore very general. While it is well
grounded on the classical least-squares criterion, a new normalization
reveals a very simple form of the solution matrix. Numerical
simulations show that SDIAG compares well with state-of-the-art AJD
algorithms.
1. Introduction
===============
In the following we will indicate matrices by upper case italic
letters (A), matrix sets by bold upper case italic letters (A),
vectors and indexes by lower-case italic letters (a), scalars by
lower-case letters (a) and integers by upper-case letters (A).
Notations $(\cdot)^T$, $(\cdot)^{-1}$ and $\|\cdot\|$ indicate
transpose, inverse and Frobenius norm, respectively. diag(·) and
off(·) return a matrix comprising only the diagonal and the
off-diagonal elements of the argument, respectively. $\lambda(\cdot)$
indicates an eigenvalue of the argument. $e_j$ indicates a vector
whose $j$th element equals one and whose other elements equal zero,
and $E_{ji}$ the matrix whose $ji$ element equals one and whose other
elements equal zero. Given a set of input matrices
$$\boldsymbol{C} = \{C_1, C_2, \dots, C_K\}, \qquad (1)$$
the joint diagonalization (JD) problem consists in finding a matrix
$B$ such that all $K$ congruence transformations $B^T C_k B$ [11, p.
324] result in diagonal matrices [1-9]; that is, we are given
$\boldsymbol{C}$ and we seek a $B$ such that
$$B^T C_k B = \Lambda_k, \quad k = 1,\dots,K, \qquad (2)$$
where all the $\Lambda_k$ are diagonal. Note that if (2) is true for a
matrix $B$, then it is also true for any matrix of the form
$B\Upsilon P$, where $\Upsilon$ is a diagonal matrix and $P$ a
permutation matrix. This simply shows that AJD is solved up to sign,
scaling and permutation indeterminacies, just as in ICA and BSS. In
the approximate joint diagonalization (AJD)
problem, which is of practical interest in engineering, the input
matrices are statistical estimates, thus they are perturbed by finite
sampling error plus noise. Then we may state the AJD problem as
$$B^T C_k B = \Lambda_k + N_k, \quad k = 1,\dots,K, \qquad (3)$$
where the $N_k$ are sampling error plus noise matrices. Hereafter we
assume that $B$, the $C_k$ and the $\Lambda_k$ are all $N$-dimensional
square matrices and that the $C_k$ are symmetric (real case) or
Hermitian (complex case). In the sequel we will treat the real case
for simplicity, as the extension to the complex domain is
straightforward. In this paper
we present a new least-squares (LS) iterative algorithm for finding B
in the exact case and an approximation of B in the approximate case.
The basic idea involves the minimization of
$$\Gamma_{\mathrm{off}}(B) = \sum_{k=1}^{K}\big\|\mathrm{off}\big(B^T C_k B\big)\big\|^2, \qquad (4)$$
while avoiding the trivial solution B=0. Several constraints may serve
this purpose. For example:
$$B^T B = I \qquad (5) \qquad\text{and}\qquad \mathrm{diag}\big(B^T C_0 B\big) = I. \qquad (6)$$
Constraint (5) assumes that B is orthogonal and has been considered in
this fashion in [1]. The authors proposed a solution by Givens
rotations. We are rather interested in the general case, i.e., finding
non-orthogonal Bs. For this purpose (6) has been proposed almost
simultaneously in [6] and [8]. $C_0$ is any positive definite matrix,
such as the data covariance matrix or $I$, boiling down to fixing the
filter gain or the norm of the filter vectors to unity, respectively.
The algorithm starts by finding a matrix $H$ such that
$H^T C_0 H = I$ and then iteratively applies transformations that
minimize the criterion through successive eigenvector decompositions
satisfying (6). In [7] a similar LS idea has been used to perform
simultaneous joint diagonalization and zero-diagonalization of two
matrix sets, an approach that suits time-frequency data expansions.
Contrary to (5), constraint (6) by itself cannot prevent degenerate
solutions, thus a penalty term proportional to $-\log|\det(B)|$ has
been added to (4) in [9]. However, the resulting algorithm is slow,
even more so than the (already slow) algorithms in [6-8].
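To make the criterion concrete, here is a minimal numpy sketch of the least-squares off-criterion (4); the function name and the representation of the matrix set as a Python list are illustrative assumptions, not part of the original text.

```python
import numpy as np

def off_criterion(B, C_set):
    """Least-squares AJD criterion (4): the sum over k of the squared
    Frobenius norm of the off-diagonal part of B^T C_k B."""
    total = 0.0
    for C in C_set:
        D = B.T @ C @ B                  # congruence transformation
        off = D - np.diag(np.diag(D))    # keep only the off-diagonal entries
        total += np.sum(off ** 2)        # squared Frobenius norm
    return total
```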
2. Method
=========
Denoting by $b_n^T$ the $n$th row vector of $B^T$ and by $b_n$ its
transpose (the $n$th column vector of $B$), let us define
$$M_n = \sum_{k=1}^{K} C_k\, b_n b_n^T\, C_k \qquad (7)$$
and
$$M = \sum_{n=1}^{N} M_n, \qquad (8)$$
with n=1…N and k=1…K. Note for future use that multiplying the
left-hand side of (2) by its transpose and summing across the K
matrices we obtain
$$\sum_{k=1}^{K}\big(B^T C_k B\big)\big(B^T C_k B\big)^T = B^T M B = \sum_{k=1}^{K}\Lambda_k \Lambda_k^T. \qquad (9)$$
From (9) it results that the diagonality of $B^T M B$, with $M$ as
defined in (8), is a necessary condition for (2) to be true. Notice
that such a condition is meaningful regardless of the sign of the
entries of the $\Lambda_k$ and of the symmetry of the matrices in
$\boldsymbol{C}$. We have seen that the scaling of the columns of B is
arbitrary. Let us fix the scale such that
$$b_n^T M_n b_n = 1, \quad n = 1,\dots,N, \qquad (10)$$
and consider the minimization of (4) under constraint (10). Since
$$\sum_{k=1}^{K}\big\|B^T C_k B\big\|^2 = \sum_{n=1}^{N} b_n^T M b_n \quad\text{and}\quad \sum_{k=1}^{K}\big\|\mathrm{diag}\big(B^T C_k B\big)\big\|^2 = \sum_{n=1}^{N} b_n^T M_n b_n,$$
we may write the off-criterion as
$$\Gamma_{\mathrm{off}}(B) = \sum_{n=1}^{N} b_n^T\big(M - M_n\big)\,b_n. \qquad (11)$$
The method of Lagrange multipliers leads us to minimize
$$\tilde{\Gamma}_{\mathrm{off}}(B) = \sum_{n=1}^{N}\Big[\,b_n^T\big(M - M_n\big)\,b_n - \delta_n\big(b_n^T M_n b_n - 1\big)\Big],$$
where the $\delta_n$ values are adjusted in order to satisfy
constraint (10). The first derivative of
$\tilde{\Gamma}_{\mathrm{off}}(B)$ with respect to the element $ji$ of
$B$ is
$$2\Big[\big(M - (1+\delta_i)\,M_i\big)\,b_i\Big]_j.$$
This is the $ij$ element of matrix
$$2\Big(B^T M - (I+\Delta)\big[M_1 b_1 \,\cdots\, M_N b_N\big]^T\Big),$$
where $\Delta$ is a diagonal matrix having diagonal elements
$\delta_n$. The solution to our optimization problem (11) must
therefore satisfy (equating the above to zero)
$$B^T M = (I+\Delta)\big[M_1 b_1 \,\cdots\, M_N b_N\big]^T.$$
But this is
$$M_n b_n = \lambda_n\, M b_n, \quad \lambda_n = (1+\delta_n)^{-1}, \quad n = 1,\dots,N, \qquad (12)$$
which is readily recognized as a system of nested generalized
eigenvalue-eigenvector decompositions for the matrix pencils
$(M_n, M)$. For each pencil, $b_n$ is the eigenvector associated with
the largest eigenvalue and it is normalized so as to satisfy (10),
which using (7) implies that each $b_n$ is normalized so that
$$\sum_{k=1}^{K}\big(b_n^T C_k b_n\big)^2 = 1.$$
Eq. (12) shows that the solutions of the system of generalized
eigenvalue-eigenvector decompositions are stationary points of
criterion (11), or equivalently of criterion (4). This result matches
our intuition
noting that
$$b_n^T M_n b_n = \sum_{k=1}^{K}\big(b_n^T C_k b_n\big)^2$$
and
$$b_n^T M b_n = \sum_{k=1}^{K}\sum_{m=1}^{N}\big(b_n^T C_k b_m\big)^2,$$
thus the sought eigenvectors are directions maximizing the sum of the
squares of the diagonal elements of the transformed input matrices
with respect to the sum of the squares of all their elements.
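For a single index n, the stationary condition (12) is an ordinary generalized symmetric eigenvalue problem and can be solved with an off-the-shelf routine. The sketch below, under the definitions (7)-(8) above and assuming M positive definite, uses scipy.linalg.eigh on the pencil (M_n, M); the helper name is an assumption.

```python
import numpy as np
from scipy.linalg import eigh

def principal_pencil_direction(Mn, M):
    """Generalized eigenvector of the pencil (Mn, M) with the largest
    eigenvalue, i.e. the direction maximizing (b^T Mn b) / (b^T M b)."""
    w, V = eigh(Mn, M)      # solves Mn v = w M v, eigenvalues in ascending order
    return w[-1], V[:, -1]  # largest eigenvalue and its eigenvector
```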
The above maximization problem has no known closed-form solution, as
we may expect for an AJD problem; however, following [7], we can
proceed iteratively row-by-row with mutual restriction. The general
scheme for the optimization of (11), which we name SDIAG (Spheric
Diagonalization), is reported below.
SDIAG Iteration Scheme
Initialize B by a non-singular clever guess, or by I if no guess is
available.
While not Convergence do
   Obtain the $M_n$, n = 1…N, by (7) and their sum $M$ by (8)
   (Sphering) Find a matrix $H$ such that $H^T M H = I$
   (Optimal Directions) For n = 1 to N do: find the principal eigenvector $u_n$ of $H^T M_n H$
   Update $B$ as $B \leftarrow H U$, where $U = [u_1 \cdots u_N]$
   Normalize $b_n$ so that (10) holds, i.e., $\sum_{k}(b_n^T C_k b_n)^2 = 1$, for n = 1 to N
End While
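The scheme translates almost line by line into numpy. The following sketch is only an illustration under the definitions (7)-(8) and the normalization (10) as written above: the stopping rule based on the decrease of criterion (4), the tolerance and the iteration cap are arbitrary choices, and M is assumed positive definite (see Remark 1 below for the rank-deficient case).

```python
import numpy as np

def sdiag(C_set, B0=None, tol=1e-12, max_iter=1000):
    """Illustrative SDIAG iteration: sphering of M, principal eigenvectors
    of H^T M_n H, update B = H U, column normalization as in (10)."""
    N = C_set[0].shape[0]
    B = np.eye(N) if B0 is None else B0.copy()
    last = np.inf
    for _ in range(max_iter):
        # (7)-(8): M_n = sum_k C_k b_n b_n^T C_k  and  M = sum_n M_n
        M_list = [sum(C @ np.outer(B[:, n], B[:, n]) @ C for C in C_set)
                  for n in range(N)]
        M = sum(M_list)
        # Sphering: H such that H^T M H = I, via the eigendecomposition of M
        w, V = np.linalg.eigh(M)
        H = V / np.sqrt(w)
        # Optimal directions: principal eigenvector u_n of each H^T M_n H
        U = np.column_stack([np.linalg.eigh(H.T @ Mn @ H)[1][:, -1]
                             for Mn in M_list])
        B = H @ U
        # Normalization (10): sum_k (b_n^T C_k b_n)^2 = 1 for every column
        for n in range(N):
            b = B[:, n]
            B[:, n] = b / sum((b @ C @ b) ** 2 for C in C_set) ** 0.25
        # Naive stopping rule: monitor the off-criterion (4)
        crit = sum(np.sum((B.T @ C @ B - np.diag(np.diag(B.T @ C @ B))) ** 2)
                   for C in C_set)
        if abs(last - crit) < tol:
            break
        last = crit
    return B
```

In this form each iteration costs one eigendecomposition of M for the sphering plus N smaller eigendecompositions for the optimal directions; the power passes discussed in Remark 3 below can replace the inner eigendecompositions.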
SDIAG has a number of properties, to which we now turn. The following
theorem applies in general:
Theorem 2.1
If $H^T M H = I$ (after the sphering stage), then
$\sum_{n=1}^{N}\mathrm{tr}\big(H^T M_n H\big) = N$, that is, the
eigenvalues of the N matrices $H^T M_n H$ sum to N.
Proof: since the trace of a square matrix equals the sum of its
eigenvalues, using (8) we have
$$\sum_{n=1}^{N}\mathrm{tr}\big(H^T M_n H\big) = \mathrm{tr}\Big(H^T \Big(\sum_{n=1}^{N} M_n\Big) H\Big) = \mathrm{tr}\big(H^T M H\big) = \mathrm{tr}(I) = N. \;\square$$
The following two theorems apply only at the limit of B in the exact
JD case, that is, when the stationary point has been reached and (2)
is true:
Theorem 2.2
If $H^T M H = I$, let $u_n$ be the unit $L_2$ norm eigenvector
associated with the largest eigenvalue of $H^T M_n H$, i.e.,
$H^T M_n H\, u_n = \lambda_{n1} u_n$; then the square matrix
$U = [u_1 \cdots u_N]$ is orthogonal.
Proof: Let us write $b_n = H u_n$ (up to sign). Eq. (9) states that
$B^T M B$ is diagonal, thus for all i, j=1…N, i≠j, we have
$b_i^T M b_j = 0$, but since $H^T M H = I$ we have
$u_i^T u_j = u_i^T H^T M H\, u_j = b_i^T M b_j = 0$, implying
orthogonality of U since its vectors have unit $L_2$ norm. $\square$
Theorem 2.3
If $H^T M H = I$, then for each n=1…N,
$\lambda_1\big(H^T M_n H\big) = 1$ and
$\lambda_2 = \dots = \lambda_N = 0$, that is, the largest eigenvalue
of each of the N matrices $H^T M_n H$ equals 1.0 and all the other N-1
eigenvalues are null.
Proof: Let $B = HU$, so that $b_n = H u_n$. For the nth matrix $M_n$
use (7) and (2) to write
$$H^T M_n H = \sum_{k=1}^{K}\big(H^T C_k H\, u_n\big)\big(H^T C_k H\, u_n\big)^T = \Big(\sum_{k=1}^{K}\lambda_{kn}^2\Big)\, u_n u_n^T = u_n u_n^T,$$
where $\lambda_{kn}$ is the nth diagonal element of $\Lambda_k$ and
the last equality follows from the normalization (10). Such a matrix
is a rank-1 projection matrix [11, p. 156], thus it has one non-null
eigenvalue. Furthermore, the eigenvalues of a projection matrix are
either 1.0 or 0 [11, p. 238], which proves the theorem for any given
n. $\square$
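The three theorems can be checked numerically on an exactly diagonalizable set; the toy dimensions, the random seed and the construction of the set in the sketch below are arbitrary choices made only for this verification.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 5, 12
A = rng.standard_normal((N, N))                      # invertible mixing matrix
C_set = [A @ np.diag(rng.standard_normal(N) ** 2) @ A.T for _ in range(K)]
B = np.linalg.inv(A).T                               # exact joint diagonalizer
for n in range(N):                                   # impose the normalization (10)
    b = B[:, n]
    B[:, n] = b / sum((b @ C @ b) ** 2 for C in C_set) ** 0.25
M_list = [sum(C @ np.outer(B[:, n], B[:, n]) @ C for C in C_set)
          for n in range(N)]
M = sum(M_list)
w, V = np.linalg.eigh(M)
H = V / np.sqrt(w)                                   # sphering: H^T M H = I
# Theorem 2.1: the eigenvalues of the N matrices H^T M_n H sum to N
print(sum(np.trace(H.T @ Mn @ H) for Mn in M_list))
# Theorems 2.2-2.3: each H^T M_n H has one unit eigenvalue, the others are null
print(np.round(np.linalg.eigvalsh(H.T @ M_list[0] @ H), 8))
```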
We end this section with three remarks:
1. Theorems 2.1 to 2.3 above assume the existence of a matrix H such
that $H^T M H = I$. Being M symmetric regardless of the symmetry of
the matrices in $\boldsymbol{C}$, such a matrix always exists if M is
positive definite. If this is not the case, it suffices to reduce
appropriately the number of columns of H, say, keeping only $R < N$
of them, so that the R-dimensional relation $H^T M H = I_R$ is
satisfied. Then U will still be square, but R-dimensional as well.
Ideally, R should match the rank of M. In practice, we shall define R
as the number of eigenvalues of M larger than $\lambda_{\max}(M)/f$,
where f is an estimate of the ratio between the variance of the
signal and the variance of the noise (by default we may set f=100;
see the sketch following these remarks).
2. The SDIAG algorithm naturally avoids degenerate solutions (see
[9]). Referring to the previous point, H has rank $R \le N$. Since at
the stationary point U is orthogonal (Theorem 2.2) and multiplication
by a nonsingular matrix does not alter rank [12, p. 197], $B^T$ will
have row rank R as well. Its pseudo-inverse, the estimate of the
"mixing" matrix in ICA/BSS applications, will have column rank R and
it will be a right inverse of $B^T$ [12, p. 199].
3. The N eigenvectors $u_n$ (optimal directions) sought at each
iteration of SDIAG can be found in parallel, running N threads.
Furthermore, they can be found by power iterations (difference
equations) of the form
$u_n^{(p+1)} = \big(M^{-1} M_n\big)\, u_n^{(p)}$, where p is the power
iteration index [11, p. 359]. To avoid confusion with SDIAG
iterations, hereafter we call a power iteration a "pass". Note that
Theorem 2.3 ensures the stability of SDIAG, for the eigenvectors
sought in the optimal directions step have eigenvalues bounded above
by 1.0 (neutrally stable as per [11, p. 259]). Now, if $\lambda_{n1}$
is the largest eigenvalue of $H^T M_n H$ associated with $u_n$ and
$\lambda_{n2}$ is the next largest eigenvalue, power passes have
convergence factor $\lambda_{n2}/\lambda_{n1}$ [11, p. 360]. From
Theorem 2.3 we see that the convergence factor approaches zero as the
SDIAG algorithm converges, thus in the proximity of the solution the
difference equations will converge in a single pass (see the sketch
following these remarks).
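Remarks 1 to 3 lend themselves to short sketches. Below, the rank-reduction threshold follows the default f = 100 mentioned above, while the function names, and the use of numpy.linalg.solve instead of forming $M^{-1}$ explicitly, are implementation assumptions.

```python
import numpy as np

def sphering_matrix(M, f=100.0):
    """Remark 1: H with R columns such that H^T M H = I_R, keeping only
    the eigenvalues of M larger than lambda_max(M) / f."""
    w, V = np.linalg.eigh(M)        # ascending eigenvalues of the symmetric M
    keep = w > w[-1] / f            # retain the R dominant eigenvalues
    return V[:, keep] / np.sqrt(w[keep])

def mixing_estimate(B):
    """Remark 2: the estimated 'mixing' matrix, a right inverse of B^T."""
    return np.linalg.pinv(B.T)

def power_pass(Mn, M, u):
    """Remark 3: one pass of u <- (M^{-1} M_n) u, rescaled to unit length;
    near the solution a single pass suffices (convergence factor -> 0)."""
    u_new = np.linalg.solve(M, Mn @ u)
    return u_new / np.linalg.norm(u_new)
```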
3. Results
==========
We compare our SDIAG algorithm to the well-established FFDIAG
algorithm of [5] and QDIAG of [6]. Referring to (6), we use the sum of
the input matrices as $C_0$ for QDIAG. We generate square diagonal
matrices with each diagonal entry distributed as a chi-squared random
variable with one degree of freedom. Each of these matrices, named
$D_k$, may represent the error-free covariance matrix of independent
standard Gaussian processes (zero mean and unit variance). The noisy
input matrices are obtained following the noisy model in (3) as
$$C_k = A\big(D_k + N_k\big)A^T, \quad k = 1,\dots,K. \qquad (13)$$
In (13), each symmetric noise matrix $N_k$ has entries randomly
distributed as a Gaussian with zero mean and standard deviation
$\sigma$. The parameter $\sigma$ controls the signal-to-noise ratio of
the input matrices. Two different values will be considered in the
simulations, of which one ($\sigma$ = 0.01) represents a small amount
of noise, closely simulating the exact JD case, and the other
($\sigma$ = 0.03) simulates the approximate (realistic) JD case. Two
kinds of mixing matrix A are considered in (13). In the
general case mixing matrix A is obtained as the pseudo-inverse of a
matrix with unit-norm row vectors whose entries are randomly
distributed as a standard Gaussian; in this case (A non-orthogonal)
the mixing matrix may be badly conditioned and we can evaluate the
robustness of the AJD algorithms with respect to the conditioning of
the mixing matrix. We also consider the case in which A is a random
orthogonal matrix; in this case the conditioning does not jeopardize
the performance of the algorithms and we can evaluate their robustness
with respect to noise. As is well known, given the true mixing matrix
A, each AJD algorithm estimates a demixing matrix $B^T$, which should
approximate the inverse of the actual A up to row scaling (including
sign) and permutation. Then, matrix $G = B^T A$ should equal a scaled
permutation matrix. At each repetition we compute the performance
index
$$\mathrm{PI}(G) = \frac{1}{2N}\left[\sum_{i=1}^{N}\frac{\max_j g_{ij}^2}{\sum_{j=1}^{N} g_{ij}^2} + \sum_{j=1}^{N}\frac{\max_i g_{ij}^2}{\sum_{i=1}^{N} g_{ij}^2}\right], \qquad (14)$$
where i and j are the row and column index, respectively. Performance
index (14) is positive and reaches its maximum 1.0 iff G has only one
non-null element in each row and column, i.e., if $B^T$ has been
estimated exactly up to the usual row scaling and permutation
arbitrariness. The means and standard deviations obtained across 250
repetitions for 30 input matrices of dimension 10x10 are reported in
Table 1.
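A sketch of the simulation just described, using the noise model (13) and the index (14), could read as follows; the function names, the default dimensions and the symmetrization of the noise matrices are illustrative assumptions.

```python
import numpy as np

def make_input_matrices(N=10, K=30, sigma=0.01, orthogonal=False, rng=None):
    """Generate K noisy input matrices C_k = A (D_k + N_k) A^T as in (13)."""
    rng = np.random.default_rng() if rng is None else rng
    if orthogonal:
        A, _ = np.linalg.qr(rng.standard_normal((N, N)))    # random orthogonal mixing
    else:
        W = rng.standard_normal((N, N))
        W /= np.linalg.norm(W, axis=1, keepdims=True)        # unit-norm rows
        A = np.linalg.pinv(W)                                # possibly ill-conditioned mixing
    C_set = []
    for _ in range(K):
        D = np.diag(rng.standard_normal(N) ** 2)             # chi-squared(1) diagonal entries
        E = sigma * rng.standard_normal((N, N))
        C_set.append(A @ (D + (E + E.T) / 2) @ A.T)          # symmetrized Gaussian noise
    return C_set, A

def performance_index(G):
    """Index (14): equals 1.0 iff G = B^T A is a scaled permutation matrix."""
    G2 = G ** 2
    rows = (G2.max(axis=1) / G2.sum(axis=1)).sum()
    cols = (G2.max(axis=0) / G2.sum(axis=0)).sum()
    return (rows + cols) / (2 * G.shape[0])
```

With these helpers, one repetition of the simulation amounts to generating a set, running an AJD algorithm to obtain B, and evaluating performance_index(B.T @ A).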
All pair-wise statistical tests between the mean performance of the
three methods (two-tailed unpaired Student's t-tests with 248 degrees
of freedom) reveal that the performance of the three algorithms is
statistically equivalent in all conditions except in the case of
non-orthogonal mixing and high noise ($\sigma$ = 0.03). In this
condition the performance of FFDIAG is statistically lower than that
of both QDIAG and SDIAG. This indicates that in noisy conditions
FFDIAG occasionally fails to estimate correctly the demixing matrix B
due to the
ill-conditioning of the mixing matrix. In [10] we have performed
simulations accounting for other kinds of perturbations of the exact
JD model in (2). Those simulations showed that SDIAG is more robust
than QDIAG to such violations. We note by the way that in [10] the
correct criterion and consequent correct normalization for vectors bn
(10) were not identified. Further work is currently in progress to
study the convergence properties of SDIAG and an efficient
implementation.
Table 1. Mean and standard deviation (in parentheses) of the
performance index (14) attained by QDIAG [6], FFDIAG [5] and our SDIAG
algorithm across 250 repetitions of the simulation with N=10 and K=30.
The higher the mean, the better the performance. See text for details.

              Orthogonal Mixing               Non-Orthogonal Mixing
              (good conditioning)             (variable conditioning)
              σ=0.01          σ=0.03          σ=0.01          σ=0.03
QDIAG         0.99978693      0.99459649      0.99978692      0.99429150
              (0.00014084)    (0.00804553)    (0.00014071)    (0.00885019)
FFDIAG        0.99977825      0.99555599      0.99568611      0.98832193
              (0.00015045)    (0.00628480)    (0.02840985)    (0.03840436)
SDIAG         0.99978186      0.99539675      0.99978183      0.99521559
              (0.00014960)    (0.00656474)    (0.00014947)    (0.00704235)
Acknowledgments
The authors wish to express their gratitude to Prof. Christian Jutten,
Dr. Reza Sameni, Dr. Fabian Theis and Cédric Gouy-Pailler for the
discussions about the ideas behind SDIAG.
References
1. J.-F. Cardoso, A. Souloumiac, SIAM J. Matrix Anal. Appl., 17(1),
161 (1996).
2. D. T. Pham, SIAM J. Matrix Anal. Appl., 22(4), 1136 (2001).
3. E. Moreau, IEEE Trans. Signal Process., 49(3), 530 (2001).
4. A. Yeredor, IEEE Trans. Signal Process., 50(7), 1545 (2002).
5. A. Ziehe, P. Laskov, G. Nolte, K.-R. Müller, J. Machine Learning
Research, 5, 501 (2004).
6. R. Vollgraf, K. Obermayer, IEEE Trans. Signal Process., 54(9), 3270
(2006).
7. E.-M. Fadaili, N. Thirion-Moreau, E. Moreau, IEEE Trans. Signal
Process., 55(5), 1673 (2007).
8. S. Degerine, E. Kane, IEEE Trans. Signal Process., 55(6), 3022
(2007).
9. X.-L. Li, X.D. Zhang, IEEE Trans. Signal Process., 55(5), 1803
(2007).
10. M. Congedo, C. Jutten, R. Sameni, C. Gouy-Pailler, Proc. 4th Int.
BCI Workshop, 98-103 (2008).
11. G. Strang, Linear Algebra and its Applications, 4th Ed., Thomson
Brooks/Cole (2006).
12. S. R. Searle, Matrix Algebra Useful for Statistics, John Wiley &
Sons (1982).
1* This research has been partially funded by the French National
Research Agency (ANR) within the National Network for Software
Technologies (RNTL), project Open-ViBE (Open Platform for Virtual
Brain Environments).
