mirror of https://github.com/fverdugo/XM_40017.git (synced 2025-11-09 00:34:24 +01:00)

commit 316c89f32f
parent 69e0571369

    Enhancements in matrix-matrix multiplication.
@@ -29,7 +29,8 @@
     "\n",
     "- Parallelize a simple algorithm\n",
     "- Study the performance of different parallelization strategies\n",
-    "- Implement them using Julia"
+    "- Learn the importance of \"grain size\" in a parallel algorithm\n",
+    "- Implement and measure the performance of parallel algorithms"
    ]
   },
   {
@@ -54,10 +55,18 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 8,
    "id": "2f8ba040",
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "🥳 Well done! \n"
+     ]
+    }
+   ],
    "source": [
     "using Distributed\n",
     "using BenchmarkTools\n",
@@ -79,7 +88,8 @@
     "alg_2_complex_check(answer) = answer_checker(answer, \"b\")\n",
     "alg_2_deps_check(answer) = answer_checker(answer,\"d\")\n",
     "alg_3_deps_check(answer) = answer_checker(answer, \"c\")\n",
-    "alg_3_complex_check(answer) = answer_checker(answer, \"b\")"
+    "alg_3_complex_check(answer) = answer_checker(answer, \"b\")\n",
+    "println(\"🥳 Well done! \")"
    ]
   },
   {
@@ -89,6 +99,8 @@
    "source": [
     "## Problem Statement\n",
     "\n",
+    "We consider matrix-matrix multiplication as our first algorithm to parallelize. The problem we want to solve is defined as follows.\n",
+    "\n",
     "Given $A$ and $B$ two $N$-by-$N$ matrices, compute the matrix-matrix product $C=AB$. Compute it in parallel and efficiently."
    ]
   },
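For context, the problem statement added in this hunk amounts to the following minimal Julia sketch (the size `N = 4` is an arbitrary illustrative choice, not taken from the notebook):

```julia
# Minimal sketch of the problem setup: two N-by-N matrices whose
# product C = A*B we want to compute in parallel and efficiently.
N = 4  # arbitrary illustrative size
A = rand(N, N)
B = rand(N, N)
C = A * B  # reference result via Julia's built-in (BLAS-backed) product
```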
@@ -113,7 +125,7 @@
     "\n",
     "- compute the product in parallel using more than one process (distributed implementation)\n",
     "- study the performance of different parallelization alternatives\n",
-    "- implement the algorithms using Julia\n"
+    "- implement the algorithms using Julia's task-based programming model\n"
    ]
   },
   {
@@ -124,7 +136,7 @@
     "### Assumptions\n",
     "\n",
     "- All matrices `A`,`B`, and `C` are initially stored in the master process\n",
-    "- The result will be overwritten in `C`"
+    "- The result will be overwritten in `C` (in the master process)"
    ]
   },
   {
@@ -148,7 +160,8 @@
     "\n",
     "- Identify the parts of the sequential algorithm that can be parallelized\n",
     "- Consider different parallelization strategies\n",
-    "- Discuss the (theoretical) performance of these implementations\n"
+    "- Discuss the (theoretical) performance of these implementations\n",
+    "- Measure the actual performance of these implementations\n"
    ]
   },
   {
@@ -222,7 +235,7 @@
     "<b>Note:</b> The matrix-matrix multiplication naively implemented with 3 nested loops as above is known to be very inefficient (memory bound). Libraries such as BLAS provide much more efficient implementations, which are the ones used in practice (e.g., by the `*` operator in Julia). We consider our hand-written implementation as a simple way of expressing the algorithm we are interested in.\n",
     "</div>\n",
     "\n",
-    "Run the following cell to compare the performance of our hand-written function with respect to the built in function `mul!`.\n"
+    "Just to satisfy your curiosity, run the following cell to compare the performance of our hand-written function with respect to the built-in function `mul!`.\n"
    ]
   },
   {
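The comparison cell itself lies outside this hunk. A minimal sketch of such a comparison, with `matmul_seq!` as a hypothetical name for the hand-written triple-loop function (the notebook's actual function name may differ), could look like this:

```julia
using BenchmarkTools
using LinearAlgebra  # provides mul!

# Hypothetical hand-written triple-loop product; the notebook's actual
# implementation may be named or structured differently.
function matmul_seq!(C, A, B)
    m, l = size(A)
    n = size(B, 2)
    for j in 1:n
        for i in 1:m
            Cij = zero(eltype(C))
            for k in 1:l
                Cij += A[i, k] * B[k, j]
            end
            C[i, j] = Cij
        end
    end
    C
end

N = 500
A = rand(N, N); B = rand(N, N); C = similar(A)
@btime matmul_seq!($C, $A, $B)  # memory-bound hand-written version
@btime mul!($C, $A, $B)         # BLAS-backed built-in; much faster
```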
@@ -287,8 +300,14 @@
     "    end\n",
     "end\n",
     "```\n",
-    "- Loops over `i` and `j` are trivially parallelizable.\n",
-    "- The loop over `k` can be parallelized but it requires a reduction."
+    "\n",
+    "To find out which parts of an algorithm can be parallelized, it is useful to start by looking at the for loops. We can run the iterations of a for loop in parallel if the iterations are independent of each other and do not cause any side effects. An easy check for independence is to see what happens if we change the iteration order (for instance, changing `for j in 1:n` to `for j in n:-1:1`, i.e., doing the loop in reverse). Is the result independent of the loop order? Then one says that the iteration order is *overspecified* and the iterations are parallelizable (if there are no side effects).\n",
+    "\n",
+    "In our case:\n",
+    "\n",
+    "- Loops over `i` and `j` are parallelizable.\n",
+    "- The loop over `k` can be parallelized, but it requires a reduction. Note that this loop causes a side effect on the outer variable `Cij`. This is why parallelizing this loop is not as easy as the other cases. We are not going to parallelize this loop in this notebook.\n",
+    "\n"
    ]
   },
   {
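The reversal check described in the new text can be tried directly. The sketch below (not part of the commit) runs the `j` loop forward and in reverse and confirms the results coincide, suggesting the iteration order is overspecified:

```julia
# Run the j loop in a given order; identical results for different
# orders suggest the iterations are independent (no side effects
# across iterations) and hence candidates for parallelization.
function matmul_jorder(A, B, jrange)
    m, l = size(A)
    n = size(B, 2)
    C = zeros(eltype(A), m, n)
    for j in jrange
        for i in 1:m
            Cij = zero(eltype(A))
            for k in 1:l
                Cij += A[i, k] * B[k, j]
            end
            C[i, j] = Cij
        end
    end
    C
end

A = rand(4, 4); B = rand(4, 4)
C_fwd = matmul_jorder(A, B, 1:4)     # for j in 1:n
C_rev = matmul_jorder(A, B, 4:-1:1)  # for j in n:-1:1 (reversed)
@assert C_fwd == C_rev  # same result: the j iteration order does not matter
```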
@@ -298,7 +317,7 @@
    "source": [
     "### Parallel algorithms\n",
     "\n",
-    "All the entries of matrix C can be potentially computed in parallel, but *is it the most efficient solution to solve all these entries in parallel in a distributed system?* To find this we will consider different parallelization strategies:\n",
+    "Parallelizing the loops over `i` and `j` means that all the entries of matrix C can potentially be computed in parallel. However, *which is the most efficient way to compute all these entries in parallel in a distributed system?* To find out, we will consider different parallelization strategies:\n",
     "\n",
     "- Algorithm 1: each worker computes a single entry of C\n",
     "- Algorithm 2: each worker computes a single row of C\n",
@@ -330,7 +349,7 @@
    "source": [
     "### Data dependencies\n",
     "\n",
-    "Moving data through the network is expensive and reducing data movement is one of the key points in a distributed algorithm. To this end, we determine which is the minimum data needed by a worker to perform its computations.\n",
+    "Moving data through the network is expensive and reducing data movement is one of the key points in a distributed algorithm. To this end, we need to determine the minimum data needed by a worker to perform its computations. These are called the *data dependencies*. They will later give us information about the performance of the parallel algorithm.\n",
     "\n",
     "In algorithm 1, each worker computes only an entry of the result matrix C."
    ]
@@ -380,7 +399,7 @@
     "\n",
     "Taking into account the data dependencies, the parallel algorithm 1 can be efficiently implemented following these steps from the worker perspective:\n",
     "\n",
-    "1. The worker receives the corresponding row A[i,:] and column B[:,j] from the master process\n",
+    "1. The worker receives its data dependencies, i.e., the corresponding row A[i,:] and column B[:,j], from the master process\n",
     "2. The worker computes the dot product of A[i,:] and B[:,j]\n",
     "3. The worker sends back the result of C[i,j] to the master process"
    ]
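A minimal sketch of these three steps with Distributed.jl follows; it is not from the commit, and the helper name `work` plus the single-entry setup are illustrative assumptions (the notebook's actual cells may structure this differently):

```julia
using Distributed
nprocs() == 1 && addprocs(1)  # ensure at least one worker is available

# Step 2, executed on the worker: dot product of the received row and column.
# The name `work` is hypothetical, chosen for this sketch.
@everywhere work(arow, bcol) = sum(arow[k] * bcol[k] for k in eachindex(arow))

N = 4
A = rand(N, N); B = rand(N, N); C = zeros(N, N)
i, j = 2, 3                 # arbitrary entry of C for illustration
w = first(workers())

# Step 1: remotecall ships A[i,:] and B[:,j] to worker w.
# Step 3: fetch brings the computed entry back to the master process.
C[i, j] = fetch(remotecall(work, w, A[i, :], B[:, j]))
```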
@@ -471,19 +490,19 @@
    "source": [
     "### Performance\n",
     "\n",
-    "Let us study the performance of this algorithm. To this end, we will analyze if algorithm 1 is able to achieve the optimal parallel *speedup*. The parallel speedup on $P$ processes is defined as \n",
+    "We have a first parallel algorithm, but how efficient is it? Let us study its performance. To this end, we need a performance baseline as reference. In this case, we will use the so-called optimal parallel *speedup*. The parallel speedup on $P$ processes is defined as\n",
     "\n",
     "$$\n",
     "S_P = \\frac{T_1}{T_P},\n",
     "$$\n",
     "\n",
-    "where $T_1$ denotes the runtime of the sequential algorithm on one node and $T_P$ denotes the runtime of the parallel algorithm on $P$ processes. If we run an optimal parallel algorithm with $P$ processes we expect it to run $p$ times faster than the sequential implementation. I.e., the *optimal* speedup of a parallel algorithm on $p$ processes is equal to $P$:\n",
+    "where $T_1$ denotes the runtime of the sequential algorithm and $T_P$ denotes the runtime of the parallel algorithm on $P$ processes. If we run an optimal parallel algorithm with $P$ processes, we expect it to run $P$ times faster than the sequential implementation. That is, the *optimal* speedup of a parallel algorithm on $P$ processes is equal to $P$:\n",
     "\n",
     "$$\n",
     "S^{*}_P = P.\n",
     "$$\n",
     "\n",
-    "The ratio of the actual speedup over the optimal one is called the parallel efficiency\n",
+    "The ratio of the actual speedup to the optimal one is called the parallel efficiency (the closer to one, the better):\n",
     "\n",
     "$$\n",
     "E_P = \\frac{S_P}{S^{*}_P} = \\frac{T_1/T_P}{P}.\n",
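In code, the two measures defined above reduce to simple ratios. A sketch with made-up example timings (not measurements from the notebook):

```julia
# Speedup S_P = T1/TP and efficiency E_P = S_P/P, using illustrative,
# made-up timings in seconds (not measured values).
T1 = 12.0  # runtime of the sequential algorithm
TP = 3.5   # runtime of the parallel algorithm on P processes
P  = 4

S_P = T1 / TP  # parallel speedup (optimal value: P)
E_P = S_P / P  # parallel efficiency (the closer to 1, the better)
println("S_P = $S_P, E_P = $E_P")
```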
@@ -754,7 +773,9 @@
    "id": "f1f30faf",
    "metadata": {},
    "source": [
-    "### Experimental speedup"
+    "### Experimental speedup\n",
+    "\n",
+    "Measure the speedup with the following cell. Is it better this time?"
    ]
   },
   {
@@ -964,7 +985,9 @@
     "\n",
     "- Matrix-matrix multiplication is trivially parallelizable (all entries in the result matrix can be computed in parallel, at least in theory)\n",
     "- However, we cannot exploit all the potential parallelism in a distributed system due to communication overhead\n",
-    "- We need a sufficiently large grain size to obtain a near optimal speedup\n"
+    "- We need a sufficiently large grain size to obtain a near-optimal speedup\n",
+    "- We estimated the theoretical parallel performance by comparing the complexity of communication with that of computation\n",
+    "- We measured the actual performance using the parallel speedup and parallel efficiency\n"
    ]
   },
   {