Contents¶
In this notebook, we will learn
- How to parallelize Gaussian elimination
- How to fix static load imbalance
Gaussian elimination¶
System of linear algebraic equations
$$ \left[ \begin{matrix} 1 & 3 & 1 \\ 1 & 2 & -1 \\ 3 & 11 & 5 \\ \end{matrix} \right] \left[ \begin{matrix} x \\ y \\ z \\ \end{matrix} \right] = \left[ \begin{matrix} 9 \\ 1 \\ 35 \\ \end{matrix} \right] $$
Elimination steps
$$ \left[ \begin{matrix} 1 & 3 & 1 & 9 \\ 1 & 2 & -1 & 1 \\ 3 & 11 & 5 & 35 \\ \end{matrix} \right] \rightarrow \left[ \begin{matrix} 1 & 3 & 1 & 9 \\ 0 & -1 & -2 & -8 \\ 0 & 2 & 2 & 8 \\ \end{matrix} \right] \rightarrow \left[ \begin{matrix} 1 & 3 & 1 & 9 \\ 0 & 1 & 2 & 8 \\ 0 & 2 & 2 & 8 \\ \end{matrix} \right] \rightarrow \left[ \begin{matrix} 1 & 3 & 1 & 9 \\ 0 & 1 & 2 & 8 \\ 0 & 0 & -2 & -8 \\ \end{matrix} \right] \rightarrow \left[ \begin{matrix} 1 & 3 & 1 & 9 \\ 0 & 1 & 2 & 8 \\ 0 & 0 & 1 & 4 \\ \end{matrix} \right] $$
Serial implementation¶
function gaussian_elimination!(B)
n,m = size(B)
@inbounds for k in 1:n
for t in (k+1):m
B[k,t] = B[k,t]/B[k,k]
end
B[k,k] = 1
for i in (k+1):n
for j in (k+1):m
B[i,j] = B[i,j] - B[i,k]*B[k,j]
end
B[i,k] = 0
end
end
B
end
A = Float64[1 3 1; 1 2 -1; 3 11 5]
b = Float64[9,1,35]
B = [A b]
gaussian_elimination!(B)
Parallelization¶
Where can we extract parallelism?¶
n,m = size(B)
for k in 1:n
for t in (k+1):m
B[k,t] = B[k,t]/B[k,k]
end
B[k,k] = 1
for i in (k+1):n
for j in (k+1):m
B[i,j] = B[i,j] - B[i,k]*B[k,j]
end
B[i,k] = 0
end
end
Two possible data partitions¶
Which is the work per process at iteration k ?¶
Load imbalance¶
- CPUs with rows <k are idle during iteration k
- Bad load balance means bad speedups, as some CPUs are waiting instead of doing useful work
- Solution: cyclic partition
Cyclic partition¶
- Less load imbalance
- Same data dependencies as 1d block partition
- Useful for some problems with predictable load imbalance
- A form of static load balancing
- Not suitable for all communication patterns (e.g. Jacobi)
Exercise¶
- The actual parallel implementation is let as an exercise
- Implement both 1d block and 1d cyclic partitions and compare performance
- Closely related with Floyd's algorithm
- Generate input matrix with function below (a random matrix is not enough, we need a non singular matrix that does not require pivoting)
function tridiagonal_matrix(n)
C = zeros(n,n)
stencil = [(-1,2,-1),(-1,0,1)]
for i in 1:n
for (coeff,o) in zip((-1,2,-1),(-1,0,1))
j = i+o
if j in 1:n
C[i,j] = coeff
end
end
end
C
end
n = 5
C = tridiagonal_matrix(n)
b = ones(n)
gaussian_elimination!(C)