Programming large-scale parallel systems¶
Jacobi method¶
Contents¶
In this notebook, we will learn
- How to paralleize a Jacobi method
- How the data partition can impact the performance of a distributed algorithm
- How to use latency hiding
using Printf
function answer_checker(answer,solution)
if answer == solution
"🥳 Well done! "
else
"It's not correct. Keep trying! 💪"
end |> println
end
gauss_seidel_1_check(answer) = answer_checker(answer,"c")
jacobi_1_check(answer) = answer_checker(answer, "d")
jacobi_2_check(answer) = answer_checker(answer, "b")
jacobi_2_check (generic function with 1 method)
The Jacobi method for the Laplace equation¶
The Jacobi method is a numerical tool to solve systems of linear algebraic equations. One of the main applications of the Jacobi method is to solve boundary value problems (BVPs). I.e., given the values at the boundary (of a grid), the Jacoby method will find the interior values that fulfill a certain equation.
When solving a Laplace equation in 1D, the Jacobi method leads to this iterative scheme. The entry $i$ of vector $u$ at iteration $t+1$ is computed as:
$u^{t+1}_i = \dfrac{u^t_{i-1}+u^t_{i+1}}{2}$
Serial implementation¶
The following code implements the iterative scheme above for boundary conditions -1 and 1.
function jacobi(n,niters)
u = zeros(n+2)
u[1] = -1
u[end] = 1
u_new = copy(u)
for t in 1:niters
for i in 2:(n+1)
u_new[i] = 0.5*(u[i-1]+u[i+1])
end
u, u_new = u_new, u
end
u
end
jacobi (generic function with 1 method)
jacobi(5,0)
7-element Vector{Float64}:
-1.0
0.0
0.0
0.0
0.0
0.0
1.0
Where do we can exploit parallelism?¶
Look at the two nested loops in the sequential implementation:
for t in 1:nsteps
for i in 2:(n+1)
u_new[i] = 0.5*(u[i-1]+u[i+1])
end
u, u_new = u_new, u
end
- The outer loop cannot be parallelized. The value of
uat stept+1depends on the value at the previous stept. - The inner loop can be parallelized
The Gauss-Seidel method¶
The usage of u_new seems a bit unnecessary at first sight, right?. If we remove it, we get another method called Gauss-Seidel.
function gauss_seidel(n,niters)
u = zeros(n+2)
u[1] = -1
u[end] = 1
for t in 1:niters
for i in 2:(n+1)
u[i] = 0.5*(u[i-1]+u[i+1])
end
end
u
end
Note that the final solution is the same (up to machine precision).
gauss_seidel(5,1000)
for t in 1:niters
for i in 2:(n+1)
u[i] = 0.5*(u[i-1]+u[i+1])
end
end
a) Both of them
b) The outer, but not the inner
c) None of them
d) The inner, but not the outer
answer = "x" # replace x with a, b, c or d
gauss_seidel_1_check(answer)
It's not correct. Keep trying! 💪
Parallelization of the Jacobi method¶
Parallelization strategy¶
- Each worker updates a consecutive section of the array
u_new
Data dependencies¶
Recall:
u_new[i] = 0.5*(u[i-1]+u[i+1])
Thus, each process will need values from the neighboring processes to perform the update of its boundary values.
Ghost (aka halo) cells¶
A usual way of handling this type of data dependencies is using so-called ghost cells. Ghost cells represent the missing data dependencies in the data owned by each process. After importing the appropriate values from the neighbor processes one can perform the usual sequential Jacobi update locally in the processes.
a) Communication: O(P), computation: O(N/P)
b) Communication: O(1), computation: O(N)
c) Communication: O(P), computation: O(N)
d) Communication: O(1), computation: O(N/P)
answer = "x" # replace x with a, b, c or d
jacobi_1_check(answer)
It's not correct. Keep trying! 💪
Implementation¶
We consider the implementation using MPI. The programming model of MPI is generally better suited for data-parallel algorithms like this one than the task-based model provided by Distributed.jl. In any case, one can also implement it using Distributed, but it requires some extra effort to setup remote channels right for the communication between neighbor processes.
Take a look at the implementation below and try to understand it. Note that we have used MPIClustermanagers and Distributed just to run the MPI code on the notebook. When running it on a cluster MPIClustermanagers and Distributed are not needed.
] add MPI MPIClusterManagers
using MPIClusterManagers
using Distributed
if procs() == workers()
nw = 3
manager = MPIWorkerManager(nw)
addprocs(manager)
end
@everywhere workers() begin
using MPI
MPI.Initialized() || MPI.Init()
comm = MPI.Comm_dup(MPI.COMM_WORLD)
nw = MPI.Comm_size(comm)
iw = MPI.Comm_rank(comm)+1
function jacobi_mpi(n,niters)
if mod(n,nw) != 0
println("n must be a multiple of nw")
MPI.Abort(comm,1)
end
n_own = div(n,nw)
u = zeros(n_own+2)
u[1] = -1
u[end] = 1
u_new = copy(u)
for t in 1:niters
reqs = MPI.Request[]
if iw != 1
neig_rank = (iw-1)-1
req = MPI.Isend(view(u,2:2),comm,dest=neig_rank,tag=0)
push!(reqs,req)
req = MPI.Irecv!(view(u,1:1),comm,source=neig_rank,tag=0)
push!(reqs,req)
end
if iw != nw
neig_rank = (iw+1)-1
s = n_own-1
r = n_own
req = MPI.Isend(view(u,s:s),comm,dest=neig_rank,tag=0)
push!(reqs,req)
req = MPI.Irecv!(view(u,r:r),comm,source=neig_rank,tag=0)
push!(reqs,req)
end
MPI.Waitall(reqs)
for i in 2:(n_own+1)
u_new[i] = 0.5*(u[i-1]+u[i+1])
end
u, u_new = u_new, u
end
u
@show u
end
niters = 100
load = 4
n = load*nw
jacobi_mpi(n,niters)
end
a) 1
b) 2
c) 3
d) 4
answer = "x" # replace x with a, b, c or d
jacobi_2_check(answer)
# TODO: think of more questions
Latency hiding¶
Note that we only need communications to update the values at the boundary of the portion owned by each process. The other values (the one in green in the figure below) can be updated without communications. This provides the opportunity of overlapping the computation of the interior values (green cells in the figure) with the communication of the ghost values. This technique is called latency hiding, since we are hiding communication latency by overlapping it with communications that we need to do anyway.
The modification of the implementation above to include latency hiding is leaved as an exercise (see below).
Extension to 2D¶
Now, let us study the method for a 2D example.
For the Laplace equation in 2D, the interior values in the computational grid (represented by a matrix $u$) are computed with this iterative scheme. The entry $(i,j)$ of matrix $u$ at iteration $t+1$ is computed as:
$u^{t+1}_{(i,j)} = \dfrac{u^t_{(i-1,j)}+u^t_{(i+1,j)}+u^t_{(i,j-1)}+u^t_{(i,j+1)}}{4}$
Serial implementation¶
The next code implements a simple example, where the boundary values are equal to 1.
function jacobi_2d(n,niters)
u = zeros(n+2,n+2)
u[1,:] = u[end,:] = u[:,1] = u[:,end] .= 1
u_new = copy(u)
for t in 1:niters
for j in 2:(n+1)
for i in 2:(n+1)
north = u[i,j+1]
south = u[i,j-1]
east = u[i+1,j]
west = u[i-1,j]
u_new[i,j] = 0.25*(north+south+east+west)
end
end
u, u_new = u_new, u
end
u
end
jacobi_2d (generic function with 1 method)
u = jacobi_2d(10,0)
12×12 Matrix{Float64}:
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
Where do we can exploit parallelism?¶
for t in 1:niters
for j in 2:(n+1)
for i in 2:(n+1)
north = u[i,j+1]
south = u[i,j-1]
east = u[i+1,j]
west = u[i-1,j]
u_new[i,j] = 0.25*(north+south+east+west)
end
end
u, u_new = u_new, u
end
- The outer loop cannot be parallelized (like in the 1d case).
- The two inner loops can be parallelized
Parallelization strategies¶
Data dependencies¶
Complexity analysis¶
The complexity analysis is let as an exercise (see below).
Exercises¶
Exercise 1¶
Transform the following parallel implementation of the 1d Jacobi method (it is copied from above) to use latency hiding (overlap between computation of interior values and communication)
@everywhere workers() begin
using MPI
MPI.Initialized() || MPI.Init()
comm = MPI.Comm_dup(MPI.COMM_WORLD)
nw = MPI.Comm_size(comm)
iw = MPI.Comm_rank(comm)+1
function jacobi_mpi(n,niters)
if mod(n,nw) != 0
println("n must be a multiple of nw")
MPI.Abort(comm,1)
end
n_own = div(n,nw)
u = zeros(n_own+2)
u[1] = -1
u[end] = 1
u_new = copy(u)
for t in 1:niters
reqs = MPI.Request[]
if iw != 1
neig_rank = (iw-1)-1
req = MPI.Isend(view(u,2:2),comm,dest=neig_rank,tag=0)
push!(reqs,req)
req = MPI.Irecv!(view(u,1:1),comm,source=neig_rank,tag=0)
push!(reqs,req)
end
if iw != nw
neig_rank = (iw+1)-1
s = n_own-1
r = n_own
req = MPI.Isend(view(u,s:s),comm,dest=neig_rank,tag=0)
push!(reqs,req)
req = MPI.Irecv!(view(u,r:r),comm,source=neig_rank,tag=0)
push!(reqs,req)
end
MPI.Waitall(reqs)
for i in 2:(n_own+1)
u_new[i] = 0.5*(u[i-1]+u[i+1])
end
u, u_new = u_new, u
end
u
@show u
end
niters = 100
load = 4
n = load*nw
jacobi_mpi(n,niters)
end
## TODO move the following solution to its appropiate place:
@everywhere workers() begin
using MPI
MPI.Initialized() || MPI.Init()
comm = MPI.Comm_dup(MPI.COMM_WORLD)
nw = MPI.Comm_size(comm)
iw = MPI.Comm_rank(comm)+1
function jacobi_mpi(n,niters)
if mod(n,nw) != 0
println("n must be a multiple of nw")
MPI.Abort(comm,1)
end
n_own = div(n,nw)
u = zeros(n_own+2)
u[1] = -1
u[end] = 1
u_new = copy(u)
for t in 1:niters
reqs_snd = MPI.Request[]
reqs_rcv = MPI.Request[]
if iw != 1
neig_rank = (iw-1)-1
req = MPI.Isend(view(u,2:2),comm,dest=neig_rank,tag=0)
push!(reqs_snd,req)
req = MPI.Irecv!(view(u,1:1),comm,source=neig_rank,tag=0)
push!(reqs_rcv,req)
end
if iw != nw
neig_rank = (iw+1)-1
s = n_own-1
r = n_own
req = MPI.Isend(view(u,s:s),comm,dest=neig_rank,tag=0)
push!(reqs_snd,req)
req = MPI.Irecv!(view(u,r:r),comm,source=neig_rank,tag=0)
push!(reqs_rcv,req)
end
for i in 3:n_own
u_new[i] = 0.5*(u[i-1]+u[i+1])
end
MPI.Waitall(reqs_rcv)
for i in (2,n_own+1)
u_new[i] = 0.5*(u[i-1]+u[i+1])
end
MPI.Waitall(reqs_snd)
u, u_new = u_new, u
end
u
end
niters = 100
load = 4
n = load*nw
jacobi_mpi(n,niters)
end
Exercise 2¶
Compute the complexity of the communication and computation of the three data partition strategies (1d block partition, 2d block partition, and 2d cyclic partition) when computing a single iteration of the Jacobi method in 2D. Assume that the grid is of size $N \times N$ and the number of processes $P$ is a perfect square number, i.e. $\sqrt{P}$ is an integer. Hint: For the complexity analysis, you can ignore the effect of the boundary conditions.
# TODO