{ "cells": [ { "cell_type": "markdown", "id": "bea76753", "metadata": {}, "source": [ "\n", "\n", "### Programming large-scale parallel systems" ] }, { "cell_type": "markdown", "id": "038e5442", "metadata": {}, "source": [ "# Matrix-matrix multiplication" ] }, { "cell_type": "markdown", "id": "f70e2f35", "metadata": {}, "source": [ "## Contents\n", "\n", "In this notebook, we will:\n", "\n", "- Parallelize a simple algorithm\n", "- Study the performance of different parallelization strategies\n", "- Implement them using Julia" ] }, { "cell_type": "markdown", "id": "480af594", "metadata": {}, "source": [ "
\n", "Note: Do not forget to execute the cells below before starting this notebook! \n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "b4d5f1a1", "metadata": {}, "outputs": [], "source": [ "] add BenchmarkTools" ] }, { "cell_type": "code", "execution_count": null, "id": "2f8ba040", "metadata": {}, "outputs": [], "source": [ "using Distributed\n", "using BenchmarkTools\n", "using Printf\n", "if procs() == workers()\n", " addprocs(4)\n", "end\n", "function answer_checker(answer,solution)\n", " if answer == solution\n", " \"🥳 Well done! \"\n", " else\n", " \"It's not correct. Keep trying! 💪\"\n", " end |> println\n", "end\n", "alg_1_deps_check(answer) = answer_checker(answer,\"b\")\n", "alg_1_comm_overhead_check(answer) = answer_checker(answer, \"c\")\n", "alg_1_comp_check(answer) = answer_checker(answer, \"a\")\n", "alg_2_complex_check(answer) = answer_checker(answer, \"b\")\n", "alg_2_deps_check(answer) = answer_checker(answer,\"d\")\n", "alg_3_deps_check(answer) = answer_checker(answer, \"c\")\n", "alg_3_complex_check(answer) = answer_checker(answer, \"d\")" ] }, { "cell_type": "markdown", "id": "96d2693d", "metadata": {}, "source": [ "## Problem Statement\n", "\n", "Let us consider the (dense) matrix-matrix product `C=A*B`." ] }, { "cell_type": "markdown", "id": "88bc2633", "metadata": {}, "source": [ "
\n", "\n", "
\n" ] }, { "cell_type": "markdown", "id": "a358ee60", "metadata": {}, "source": [ "### Goals\n", "\n", "We want to\n", "\n", "- compute the product in parallel using more than one process (distributed implementation)\n", "- study the performance of different parallelization alternatives\n", "- implement the algorithms using Julia\n" ] }, { "cell_type": "markdown", "id": "495ef679", "metadata": {}, "source": [ "### Assumptions\n", "\n", "- All matrices `A`,`B`, and `C` are initially stored in the master process\n", "- The result will be overwritten in `C`" ] }, { "cell_type": "markdown", "id": "5828e243", "metadata": {}, "source": [ "
\n", "\n", "
\n" ] }, { "cell_type": "markdown", "id": "c57f340c", "metadata": {}, "source": [ "### Steps\n", "\n", "To develop and study the parallel implementation, we will follow these steps:\n", "\n", "- Identify the parts of the sequential algorithm that can be parallelized\n", "- Consider different parallelization strategies\n", "- Discuss the (theoretical) performance of these implementations\n" ] }, { "cell_type": "markdown", "id": "ca56a7fe", "metadata": {}, "source": [ "## Serial implementation\n", "\n", "We start by considering the (naive) sequential algorithm:" ] }, { "cell_type": "code", "execution_count": null, "id": "af8dfb37", "metadata": {}, "outputs": [], "source": [ "@everywhere function matmul_seq!(C,A,B)\n", " m = size(C,1)\n", " n = size(C,2)\n", " l = size(A,2)\n", " @assert size(A,1) == m\n", " @assert size(B,2) == n\n", " @assert size(B,1) == l\n", " z = zero(eltype(C))\n", " for j in 1:n\n", " for i in 1:m\n", " Cij = z\n", " for k in 1:l\n", " @inbounds Cij += A[i,k]*B[k,j]\n", " end\n", " C[i,j] = Cij\n", " end\n", " end\n", " C\n", "end" ] }, { "cell_type": "markdown", "id": "f967d2ea", "metadata": {}, "source": [ "
\n", "Note: The matrix-matrix multiplication naively implemented with 3 nested loops as above is known to be very inefficient (memory bound). Libraries such as BLAS provide much more efficient implementations, which are the ones used in practice (e.g., by the `*` operator in Julia). We consider, our hand-written implementation as a simple way of expressing the algorithm we are interested in.\n", "
\n", "\n", "Run the following cell to compare the performance of our hand-written function with respect to the built in function `mul!`\n" ] }, { "cell_type": "code", "execution_count": null, "id": "725387f6", "metadata": {}, "outputs": [], "source": [ "using LinearAlgebra\n", "N = 1000\n", "A = rand(N,N)\n", "B = rand(N,N)\n", "C = rand(N,N)\n", "@btime matmul_seq!(C,A,B)\n", "@btime mul!(C,A,B);" ] }, { "cell_type": "markdown", "id": "0eedd28a", "metadata": {}, "source": [ "### Where do we can exploit parallelism?\n", "\n", "Look at the three nested loops in the sequential implementation:\n", "\n", "```julia\n", "for j in 1:n\n", " for i in 1:m\n", " Cij = z\n", " for k in 1:l\n", " @inbounds Cij += A[i,k]*B[k,j]\n", " end\n", " C[i,j] = Cij\n", " end\n", "end\n", "```\n", "- Loops over `i` and `j` are trivially parallelizable.\n", "- The loop over `k` can be parallelized but it requires a reduction." ] }, { "cell_type": "markdown", "id": "b50aecff", "metadata": {}, "source": [ "### Parallel algorithms\n", "\n", "All the entries of matrix C can be potentially computed in parallel, but *is it the most efficient solution to solve all these entries in parallel in a distributed system?* To find this we will consider different parallelization strategies:\n", "\n", "- Algorithm 1: each worker computes a single entry of C\n", "- Algorithm 2: each worker computes a single row of C\n", "- Algorithm 3: each worker computes a block rows of C" ] }, { "cell_type": "markdown", "id": "6a706283", "metadata": {}, "source": [ "
\n", "\n", "
\n" ] }, { "cell_type": "markdown", "id": "bd759cb2", "metadata": {}, "source": [ "## Parallel algorithm 1" ] }, { "cell_type": "markdown", "id": "bdae0a02", "metadata": {}, "source": [ "### Data dependencies\n", "\n", "Moving data through the network is expensive and reducing data movement is one of the key points in a distributed algorithm. To this end, we determine which is the minimum data needed by a worker to perform its computations.\n", "\n", "In algorithm 1, each worker computes only an entry of the result matrix C." ] }, { "cell_type": "markdown", "id": "acfb354b", "metadata": {}, "source": [ "
\n", "\n", "
" ] }, { "cell_type": "markdown", "id": "be3c4a01", "metadata": {}, "source": [ "
\n", "Question: Which are the data dependencies of the computations done by the worker in charge of computing entry C[i,j] ? \n", "
\n", "\n", " a) column A[:,i] and row B[j,:]\n", " b) row A[i,:] and column B[:,j]\n", " c) the whole matrices A and B\n", " d) row A[i,:] and the whole matrix B" ] }, { "cell_type": "code", "execution_count": null, "id": "a8b7d1e1", "metadata": {}, "outputs": [], "source": [ "answer = \"x\" # replace x with a, b, c, or d \n", "alg_1_deps_check(answer)" ] }, { "cell_type": "markdown", "id": "06e1977a", "metadata": {}, "source": [ "### Implementation\n", "\n", "Taking into account the data dependencies, the parallel algorithm 1 can be efficiently implemented following these steps from the worker perspective:\n", "\n", "1. The worker receives the corresponding row A[i,:] and column B[:,j] from the master process\n", "2. The worker computes the dot product of A[i,:] and B[:,j]\n", "3. The worker sends back the result of C[i,j] to the master process" ] }, { "cell_type": "markdown", "id": "70087bce", "metadata": {}, "source": [ "
\n", "\n", "
\n", "\n" ] }, { "cell_type": "markdown", "id": "9d22ccea", "metadata": {}, "source": [ "A possible implementation of this algorithm in Julia is as follows:" ] }, { "cell_type": "code", "execution_count": null, "id": "e4697fda", "metadata": {}, "outputs": [], "source": [ "function matmul_dist_1!(C, A, B)\n", " m = size(C,1)\n", " n = size(C,2)\n", " l = size(A,2)\n", " @assert size(A,1) == m\n", " @assert size(B,2) == n\n", " @assert size(B,1) == l\n", " z = zero(eltype(C))\n", " @assert nworkers() == m*n\n", " iw = 0 \n", " @sync for j in 1:n\n", " for i in 1:m\n", " Ai = A[i,:]\n", " Bj = B[:,j]\n", " iw += 1\n", " w = workers()[iw]\n", " ftr = @spawnat w begin\n", " Cij = z\n", " for k in 1:l\n", " @inbounds Cij += Ai[k]*Bj[k]\n", " end\n", " Cij\n", " end\n", " @async C[i,j] = fetch(ftr)\n", " end\n", " end\n", " C\n", "end" ] }, { "cell_type": "markdown", "id": "f0e5a38b", "metadata": {}, "source": [ "You can execute the following cells to test this implementation." ] }, { "cell_type": "code", "execution_count": null, "id": "13920a31", "metadata": {}, "outputs": [], "source": [ "using Test\n", "N = 2\n", "A = rand(N,N)\n", "B = rand(N,N)\n", "C = similar(A)\n", "@test matmul_dist_1!(C,A,B) ≈ A*B" ] }, { "cell_type": "markdown", "id": "f69d3333", "metadata": {}, "source": [ "### Performance\n", "\n", "Let us study the performance of this algorithm. To this end, we will analyze if algorithm 1 is able to achieve the optimal parallel *speedup*. The parallel speedup on $P$ processes is defined as \n", "\n", "$$\n", "S_P = \\frac{T_1}{T_P},\n", "$$\n", "\n", "where $T_1$ denotes the runtime of the sequential algorithm on one node and $T_P$ denotes the runtime of the parallel algorithm on $P$ processes. If we run an optimal parallel algorithm with $P$ processes we expect it to run $p$ times faster than the sequential implementation. I.e., the *optimal* speedup of a parallel algorithm on $p$ processes is equal to $P$:\n", "\n", "$$\n", "S^{*}_p = P.\n", "$$\n", "\n", "The ratio of the actual speedup over the optimal one is called the parallel efficiency\n", "\n", "$$\n", "E_p = \\frac{S_p}{S^{*}_p} = \\frac{T_1/T_P}{P}.\n", "$$\n", "\n" ] }, { "cell_type": "markdown", "id": "b8eb224d", "metadata": {}, "source": [ "### Experimental speedup\n", "\n", "The following cell measures the speedup of parallel algorithm 1. Do we achieve the optimal speedup?" ] }, { "cell_type": "code", "execution_count": null, "id": "cc698aa8", "metadata": {}, "outputs": [], "source": [ "N = 2\n", "A = rand(N,N)\n", "B = rand(N,N)\n", "C = similar(A)\n", "T1 = @belapsed matmul_seq!(C,A,B)\n", "C = similar(A)\n", "TP = @belapsed matmul_dist_1!(C,A,B)\n", "P = nworkers()\n", "println(\"Speedup = \", T1/TP)\n", "println(\"Optimal speedup = \", P)\n", "println(\"Efficiency = \", 100*(T1/TP)/P, \"%\")" ] }, { "cell_type": "markdown", "id": "dac6a50b", "metadata": {}, "source": [ "### Communication overhead\n", "\n", "Since communication is usually the main bottleneck in a distributed algorithm, we want to reduce the amount of communication per unit of computation in a worker. Let us compute the (theoretical) communication overhead for algorithm 1. This will help us understand why the speedup of this algorithm was so bad.\n", "\n", "Remember, algorithm 1 consisted of these main steps:\n", "\n", "1. The worker receives the corresponding row A[i,:] and column B[:,j] from the master process\n", "2. The worker computes the dot product of A[i,:] and B[:,j]\n", "3. The worker sends back the result of C[i,j] to the master process\n", "\n", "
\n", "Question: How many scalars are communicated from and to a worker? Assume that matrices A, B, and C are N by N matrices.\n", "
\n", "\n", " a) 3N\n", " b) 2N + 2\n", " c) 2N + 1\n", " d) N² + 1" ] }, { "cell_type": "code", "execution_count": null, "id": "e78cbc7b", "metadata": {}, "outputs": [], "source": [ "answer = \"x\" # replace x with a, b, c, or d \n", "alg_1_comm_overhead_check(answer)" ] }, { "cell_type": "markdown", "id": "b27a4d3f", "metadata": {}, "source": [ "
\n", "Question: How many operations are done in a worker? \n", "
\n", "\n", " a) O(N)\n", " b) O(N²)\n", " c) O(N³)" ] }, { "cell_type": "code", "execution_count": null, "id": "fcc9b903", "metadata": {}, "outputs": [], "source": [ "answer = \"x\" # replace x with a, b, or c\n", "alg_1_comp_check(answer)" ] }, { "cell_type": "markdown", "id": "d4c301de", "metadata": {}, "source": [ "From these results we can conclude:\n", "\n", "- The communication complexity is O(N)\n", "- The computation complexity is O(N)\n", "- The ratio communication over computation (the communication overhead) is O(1)\n", "\n", "In other words, the communication cost is of the same order of magnitude as the computation cost. Since, communication is orders of magnitude slower in real systems, the runtime in the worker will be dominated by communication. This explains why we obtained such a bad speedup.\n" ] }, { "cell_type": "markdown", "id": "b15cbaf4", "metadata": {}, "source": [ "## Parallel algorithm 2\n", "\n", "Let us study the next algorithm to see if we can improve the efficiency by augmenting the granularity (i.e. the amount of work) in each parallel task. In parallel algorithm 2, each worker computes an entire row of C.\n", "\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "62e5c637", "metadata": {}, "source": [ "
\n", "\n", "
" ] }, { "cell_type": "markdown", "id": "fa6cad0e", "metadata": {}, "source": [ "### Data dependencies" ] }, { "cell_type": "markdown", "id": "d4312f2c", "metadata": {}, "source": [ "
\n", "Question: Which are the data dependencies of the computations done by the worker in charge of computing row C[i,:] ? \n", "
\n", "\n", " a) column A[:,i] and row B[j,:]\n", " b) row A[i,:] and column B[:,j]\n", " c) the whole matrices A and B\n", " d) row A[i,:] and the whole matrix B" ] }, { "cell_type": "code", "execution_count": null, "id": "cdb46cd8", "metadata": {}, "outputs": [], "source": [ "answer = \"x\" # replace x with a, b, c, or d \n", "alg_2_deps_check(answer)" ] }, { "cell_type": "markdown", "id": "a9d84ac2", "metadata": {}, "source": [ "### Implementation\n", "\n", "These are the main steps of the implementation of algorithm 2:\n", "\n", "1. The worker receives the corresponding row A[i,:] and matrix B from the master process\n", "2. The worker computes the product of row A[i,:] times B\n", "3. The worker sends back the result of row C[i,:] to the master process\n", "\n" ] }, { "cell_type": "markdown", "id": "fb6b572b", "metadata": {}, "source": [ "\n", "
\n", "\n", "
\n" ] }, { "cell_type": "markdown", "id": "a662afa4", "metadata": {}, "source": [ "A possible implementation of this algorithm in Julia is as follows:" ] }, { "cell_type": "code", "execution_count": null, "id": "365dc58e", "metadata": {}, "outputs": [], "source": [ "function matmul_dist_2!(C, A, B)\n", " m = size(C,1)\n", " n = size(C,2)\n", " l = size(A,2)\n", " @assert size(A,1) == m\n", " @assert size(B,2) == n\n", " @assert size(B,1) == l\n", " z = zero(eltype(C))\n", " @assert nworkers() == m\n", " iw = 0\n", " @sync for i in 1:m\n", " Ai = A[i,:]\n", " iw += 1\n", " w = workers()[iw]\n", " ftr = @spawnat w begin\n", " Ci = fill(z,l)\n", " for j in 1:n\n", " for k in 1:l\n", " @inbounds Ci[j] += Ai[k]*B[k,j]\n", " end\n", " end\n", " Ci\n", " end\n", " @async C[i,:] = fetch(ftr)\n", " end\n", " C\n", " end" ] }, { "cell_type": "markdown", "id": "8de835b9", "metadata": {}, "source": [ "Test it using next cell" ] }, { "cell_type": "code", "execution_count": null, "id": "267ac8b2", "metadata": {}, "outputs": [], "source": [ "using Test\n", "N = 4\n", "A = rand(N,N)\n", "B = rand(N,N)\n", "C = similar(A)\n", "@test matmul_dist_2!(C,A,B) ≈ A*B" ] }, { "cell_type": "markdown", "id": "f1f30faf", "metadata": {}, "source": [ "### Experimental speedup" ] }, { "cell_type": "code", "execution_count": null, "id": "fe42f069", "metadata": {}, "outputs": [], "source": [ "N = 4\n", "A = rand(N,N)\n", "B = rand(N,N)\n", "C = similar(A)\n", "T1 = @belapsed matmul_seq!(C,A,B)\n", "C = similar(A)\n", "TP = @belapsed matmul_dist_2!(C,A,B)\n", "P = nworkers()\n", "println(\"Speedup = \", T1/TP)\n", "println(\"Optimal speedup = \", P)\n", "println(\"Efficiency = \", 100*(T1/TP)/P, \"%\")" ] }, { "cell_type": "markdown", "id": "e2c6f60a", "metadata": {}, "source": [ "### Complexity\n", "\n", "The speedup is still far from the optimal one. Let us study the communication overhead for this algorithm. Remember, algorithm 2 consists in these main steps:\n", "\n", "1. The worker receives the corresponding row A[i,:] and matrix B from the master process\n", "2. The worker computes the product of row A[i,:] times B\n", "3. The worker sends back the result of row C[i,:] to the master process\n", "\n", "
\n", "Question: Which is the complexity of the communication and computations done by a worker in algorithm 2?\n", "
\n", "\n", " a) O(N) communication and O(N^2) computation\n", " b) O(N^2) communication and O(N^2) computation\n", " c) O(N^3) communication and O(N^3) computation\n", " d) O(N) communication and O(N) computation" ] }, { "cell_type": "code", "execution_count": null, "id": "1bf8feff", "metadata": {}, "outputs": [], "source": [ "answer = \"x\" # replace x with a, b, c, or d \n", "alg_2_complex_check(answer)" ] }, { "cell_type": "markdown", "id": "a2038e04", "metadata": {}, "source": [ "The communication and computation cost are still of the same order of magnitude even though we have increased the grain size. " ] }, { "cell_type": "markdown", "id": "71088fb9", "metadata": {}, "source": [ "## Parallel algorithm 3\n", "\n", "Let us increase even more the granularity of the parallel tasks by computing several rows of C in a worker. Each worker computes N/P consecutive rows of C, where P is the number of workers.\n" ] }, { "cell_type": "markdown", "id": "f1b8c712", "metadata": {}, "source": [ "
\n", "\n", "
" ] }, { "cell_type": "markdown", "id": "4d456fed", "metadata": {}, "source": [ "### Data dependencies" ] }, { "cell_type": "markdown", "id": "67b65ea6", "metadata": {}, "source": [ "
\n", "Question: Which are the data dependencies of the computations done by the worker in charge of computing the range of rows C[rows,:] ? \n", "
\n", "\n", "\n", " a) A[rows,:] and B[:,rows] \n", " b) the whole matrix A and B[:,rows]\n", " c) A[rows,:] and the whole matrix B\n", " d) the whole matrices A and B" ] }, { "cell_type": "code", "execution_count": null, "id": "2825385a", "metadata": {}, "outputs": [], "source": [ "answer = \"x\" # replace x with a, b, c, or d \n", "alg_3_deps_check(answer)" ] }, { "cell_type": "markdown", "id": "429faa32", "metadata": {}, "source": [ "### Implementation\n", "\n", "These are the main steps of the implementation of algorithm 3:\n", "\n", "1. The worker receives the corresponding rows A[rows,:] and matrix B from the master process\n", "2. The worker computes the product of A[rows,:] times B\n", "3. The worker sends back the result of C[rows,:] to the master process" ] }, { "cell_type": "markdown", "id": "c14ebcb3", "metadata": {}, "source": [ "
\n", "\n", "
" ] }, { "cell_type": "markdown", "id": "7d7952c1", "metadata": {}, "source": [ "The implementation of this variant is let as an exercise (see below)." ] }, { "cell_type": "markdown", "id": "0323d6d8", "metadata": {}, "source": [ "### Communication overhead\n", "\n", "Let us analyze the (theoretical) communication overhead for algorithm 3.\n", "\n", "
\n", "Question: Which is the complexity of the communication and computations done by a worker in algorithm 3?\n", "
\n", "\n", " a) O(N²) communication and O(N³) computation\n", " b) O(N²) communication and O(N³/P) computation\n", " c) O(N²) communication and O(N²) computation\n", " d) O(N²/P) communication and O(N³/P) computation" ] }, { "cell_type": "code", "execution_count": null, "id": "50b8bf53", "metadata": {}, "outputs": [], "source": [ "answer = \"x\" # replace x with a, b, c, or d \n", "alg_3_complex_check(answer)" ] }, { "cell_type": "markdown", "id": "3f0d99e6", "metadata": {}, "source": [ "In this case, the ratio between communication and computation is O(P/N). If the matrix size N is much larger than the number of workers P, then the communication overhead O(P/N) would be negligible. This opens the door to an scalable implementation." ] }, { "cell_type": "markdown", "id": "7bb65f2e", "metadata": {}, "source": [ "## Summary\n", "\n", "The table below compares the three parallel algorithms. \n", "\n", "
\n", "\n", "| Algorithm | Parallelism
(#workers) | Communication
per worker | Computation
per worker | Ratio communication/
computation |\n", "|---|---|---|---|---|\n", "| 1 | N² | 2N + 1 | N | O(1) |\n", "| 2 | N | 2N + N² | N² | O(1) |\n", "| 3 | P | N² + 2N²/P | N³/P | O(P/N) |\n", "\n", "\n", "- Matrix-matrix multiplication is trivially parallelizable (all entries in the result matrix can be computed in parallel, at least in theory)\n", "- However, we cannot exploit all the potential parallelism in a distributed system due to communication overhead\n", "- We need a sufficiently large grain size to obtain a near optimal speedup\n" ] }, { "cell_type": "markdown", "id": "8b83e744", "metadata": {}, "source": [ "## Exercises" ] }, { "cell_type": "markdown", "id": "a628a1df", "metadata": {}, "source": [ "### Exercise 1\n", "\n", "Implement algorithm 3 in the function below. For simplicity, assume that the number of rows of C is a multiple of the number of workers.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "8e50b923", "metadata": {}, "outputs": [], "source": [ "function matmul_dist_3!(C,A,B)\n", " m = size(C,1)\n", " n = size(C,2)\n", " l = size(A,2)\n", " @assert size(A,1) == m\n", " @assert size(B,2) == n\n", " @assert size(B,1) == l\n", " @assert mod(m,nworkers()) == 0\n", " # Implement here\n", " \n", " C\n", "end" ] }, { "cell_type": "markdown", "id": "4506dcfb", "metadata": {}, "source": [ "Use test-driven development to implement the algorithm. Use this test:" ] }, { "cell_type": "code", "execution_count": null, "id": "28cde36a", "metadata": {}, "outputs": [], "source": [ "using Test\n", "P = nworkers()\n", "load = 100\n", "N = load*P\n", "A = rand(N,N)\n", "B = rand(N,N)\n", "C = similar(A)\n", "@test matmul_dist_3!(C,A,B) ≈ A*B" ] }, { "cell_type": "markdown", "id": "03952b0b", "metadata": {}, "source": [ "Measure the performance of your implementation by running next cell. Do you get close to the optimal speedup?" ] }, { "cell_type": "code", "execution_count": null, "id": "b3aa2b7c", "metadata": {}, "outputs": [], "source": [ "P = nworkers()\n", "load = 100\n", "N = load*P\n", "A = rand(N,N)\n", "B = rand(N,N)\n", "C = similar(A)\n", "T1 = @belapsed matmul_seq!(C,A,B)\n", "C = similar(A)\n", "TP = @belapsed matmul_dist_3!(C,A,B)\n", "println(\"Speedup = \", T1/TP)\n", "println(\"Optimal speedup = \", P)\n", "println(\"Efficiency = \", 100*(T1/TP)/P, \"%\")" ] }, { "cell_type": "markdown", "id": "fa8d7f40", "metadata": {}, "source": [ "### Exercise 2" ] }, { "cell_type": "markdown", "id": "0e7c607e", "metadata": {}, "source": [ "The implementation of algorithm 1 is very impractical. One needs as many processors as entries in the result matrix C. For 1000 times 1000 matrix one would need a supercomputer with one million processes! We can easily fix this problem by using less processors and spawning the computation of an entry in any of the available processes.\n", "See the following code:" ] }, { "cell_type": "code", "execution_count": null, "id": "023b20d1", "metadata": {}, "outputs": [], "source": [ "function matmul_dist_1_v2!(C, A, B)\n", " m = size(C,1)\n", " n = size(C,2)\n", " l = size(A,2)\n", " @assert size(A,1) == m\n", " @assert size(B,2) == n\n", " @assert size(B,1) == l\n", " z = zero(eltype(C))\n", " @sync for j in 1:n\n", " for i in 1:m\n", " Ai = A[i,:]\n", " Bj = B[:,j]\n", " ftr = @spawnat :any begin\n", " Cij = z\n", " for k in 1:l\n", " @inbounds Cij += Ai[k]*Bj[k]\n", " end\n", " Cij\n", " end\n", " @async C[i,j] = fetch(ftr)\n", " end\n", " end\n", " C\n", "end" ] }, { "cell_type": "markdown", "id": "52005ca1", "metadata": {}, "source": [ "With this new implementation, we can multiply matrices of arbitrary size with a fixed number of workers. Test it:" ] }, { "cell_type": "code", "execution_count": null, "id": "c1d3595b", "metadata": {}, "outputs": [], "source": [ "using Test\n", "N = 50\n", "A = rand(N,N)\n", "B = rand(N,N)\n", "C = similar(A)\n", "@test matmul_dist_1_v2!(C,A,B) ≈ A*B" ] }, { "cell_type": "markdown", "id": "ab609c18", "metadata": {}, "source": [ "Run the next cell to check the performance of this implementation. Note that we are far away from the optimal speed up. Why? To answer this question compute the theoretical communication over computation ratio for this implementation and reason about the obtained result. Hint: the number of times a worker is spawned in this implementation is N^3/P on average." ] }, { "cell_type": "code", "execution_count": null, "id": "d7d31710", "metadata": {}, "outputs": [], "source": [ "N = 100\n", "A = rand(N,N)\n", "B = rand(N,N)\n", "C = similar(A)\n", "P = nworkers()\n", "T1 = @belapsed matmul_seq!(C,A,B)\n", "C = similar(A)\n", "TP = @belapsed matmul_dist_1_v2!(C,A,B)\n", "println(\"Speedup = \", T1/TP)\n", "println(\"Optimal speedup = \", P)\n", "println(\"Efficiency = \", 100*(T1/TP)/P, \"%\")" ] } ], "metadata": { "kernelspec": { "display_name": "Julia 1.9.1", "language": "julia", "name": "julia-1.9" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.9.1" } }, "nbformat": 4, "nbformat_minor": 5 }