Mirror of https://github.com/fverdugo/XM_40017.git
Synced 2025-12-29 10:18:31 +01:00
Build based on 567534f
@@ -29,7 +29,8 @@
    "\n",
    "- Parallelize a simple algorithm\n",
    "- Study the performance of different parallelization strategies\n",
    "- Implement them using Julia"
    "- Learn the importance of \"grain size\" in a parallel algorithm\n",
    "- Implement and measure the performance of parallel algorithms"
   ]
  },
  {
@@ -54,10 +55,18 @@
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "execution_count": 8,
   "id": "2f8ba040",
   "metadata": {},
   "outputs": [],
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "🥳 Well done! \n"
     ]
    }
   ],
   "source": [
    "using Distributed\n",
    "using BenchmarkTools\n",
@@ -79,7 +88,8 @@
    "alg_2_complex_check(answer) = answer_checker(answer, \"b\")\n",
    "alg_2_deps_check(answer) = answer_checker(answer,\"d\")\n",
    "alg_3_deps_check(answer) = answer_checker(answer, \"c\")\n",
    "alg_3_complex_check(answer) = answer_checker(answer, \"b\")"
    "alg_3_complex_check(answer) = answer_checker(answer, \"b\")\n",
    "println(\"🥳 Well done! \")"
   ]
  },
  {
@@ -89,6 +99,8 @@
   "source": [
    "## Problem Statement\n",
    "\n",
    "We consider matrix-matrix multiplication as our first algorithm to parallelize. The problem we want to solve is defined as follows.\n",
    "\n",
    "Given $A$ and $B$, two $N$-by-$N$ matrices, compute the matrix-matrix product $C=AB$. Compute it in parallel and efficiently."
   ]
  },
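For reference, this is the kind of sequential three-loop kernel the rest of the notebook builds on. The following is a minimal sketch, not the notebook's own cell; the name `matmul_seq!` and the exact loop order are assumptions.

```julia
# Hedged sketch: naive sequential matrix-matrix product C = A*B.
function matmul_seq!(C, A, B)
    m, n = size(C)
    l = size(A, 2)
    @assert size(A, 1) == m && size(B) == (l, n)
    for j in 1:n
        for i in 1:m
            Cij = zero(eltype(C))
            for k in 1:l
                @inbounds Cij += A[i, k] * B[k, j]
            end
            C[i, j] = Cij
        end
    end
    C
end

# Usage with square N-by-N matrices, as in the problem statement.
N = 100
A = rand(N, N); B = rand(N, N); C = similar(A)
matmul_seq!(C, A, B)
@assert C ≈ A * B   # check against Julia's built-in product
```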
@@ -113,7 +125,7 @@
    "\n",
    "- compute the product in parallel using more than one process (distributed implementation)\n",
    "- study the performance of different parallelization alternatives\n",
    "- implement the algorithms using Julia\n"
    "- implement the algorithms using Julia's task-based programming model\n"
   ]
  },
  {
@@ -124,7 +136,7 @@
    "### Assumptions\n",
    "\n",
    "- All matrices `A`, `B`, and `C` are initially stored in the master process\n",
    "- The result will be overwritten in `C`"
    "- The result will be overwritten in `C` (in the master process)"
   ]
  },
  {
@@ -148,7 +160,8 @@
    "\n",
    "- Identify the parts of the sequential algorithm that can be parallelized\n",
    "- Consider different parallelization strategies\n",
    "- Discuss the (theoretical) performance of these implementations\n"
    "- Discuss the (theoretical) performance of these implementations\n",
    "- Measure the actual performance of these implementations\n"
   ]
  },
  {
@@ -222,7 +235,7 @@
    "<b>Note:</b> The matrix-matrix multiplication naively implemented with 3 nested loops as above is known to be very inefficient (memory bound). Libraries such as BLAS provide much more efficient implementations, which are the ones used in practice (e.g., by the `*` operator in Julia). We consider our hand-written implementation as a simple way of expressing the algorithm we are interested in.\n",
    "</div>\n",
    "\n",
    "Run the following cell to compare the performance of our hand-written function with respect to the built-in function `mul!`.\n"
    "Just to satisfy your curiosity, run the following cell to compare the performance of our hand-written function with respect to the built-in function `mul!`.\n"
   ]
  },
  {
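The comparison cell itself is not part of this diff. To reproduce such a comparison, a sketch along these lines would work, assuming a hand-written kernel such as the `matmul_seq!` sketch above is in scope:

```julia
using LinearAlgebra   # provides mul!
using BenchmarkTools  # provides @btime

N = 500
A = rand(N, N); B = rand(N, N); C = zeros(N, N)

@btime matmul_seq!($C, $A, $B)  # hand-written three-loop kernel (illustrative name)
@btime mul!($C, $A, $B)         # BLAS-backed in-place product used by Julia
```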
@@ -272,9 +285,14 @@
   "id": "0eedd28a",
   "metadata": {},
   "source": [
    "### Where can we exploit parallelism?\n",
    "## Where can we exploit parallelism?\n",
    "\n",
    "Look at the three nested loops in the sequential implementation:\n",
    "\n",
    "The matrix-matrix multiplication is an example of an [embarrassingly parallel algorithm](https://en.wikipedia.org/wiki/Embarrassingly_parallel). An embarrassingly parallel (also known as trivially parallel) algorithm is an algorithm that can be split into parallel tasks with no (or very few) dependences between them. Such algorithms are typically easy to parallelize.\n",
    "\n",
    "Which parts of an algorithm are completely independent and thus trivially parallel? To answer this question, it is useful to inspect the for loops, which are potential sources of parallelism. If the iterations are independent of each other, then they are trivial to parallelize. An easy check to find out whether the iterations are dependent or not is to change their order (for instance, replacing `for j in 1:n` by `for j in n:-1:1`, i.e., doing the loop in reverse). If the result changes, then the iterations are not independent.\n",
    "\n",
    "Look at the three nested loops in the sequential implementation of the matrix-matrix product:\n",
    "\n",
    "```julia\n",
    "for j in 1:n\n",
@@ -287,8 +305,12 @@
    " end\n",
    "end\n",
    "```\n",
    "- Loops over `i` and `j` are trivially parallelizable.\n",
    "- The loop over `k` can be parallelized but it requires a reduction."
    "\n",
    "Note that:\n",
    "\n",
    "- Loops over `i` and `j` are trivially parallel.\n",
    "- The loop over `k` is not trivially parallel. The accumulation into the reduction variable `Cij` introduces extra dependences. In addition, remember that the addition of floating-point numbers is not strictly associative due to rounding errors. Thus, the result of this loop may change with the loop order. In any case, this loop can also be parallelized, but it requires a parallel *fold* or a parallel *reduction*.\n",
    "\n"
   ]
  },
  {
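A quick way to try the loop-reversal check described in the cell above; this is an illustrative experiment, not code from the notebook:

```julia
# Compute C with configurable loop orders to test iteration independence.
function matmul_loops(A, B; reverse_j=false, reverse_k=false)
    n = size(A, 1)
    C = zeros(n, n)
    jrange = reverse_j ? (n:-1:1) : (1:n)
    krange = reverse_k ? (n:-1:1) : (1:n)
    for j in jrange, i in 1:n
        Cij = 0.0
        for k in krange
            Cij += A[i, k] * B[k, j]
        end
        C[i, j] = Cij
    end
    C
end

N = 200
A = rand(N, N); B = rand(N, N)
C_ref = matmul_loops(A, B)

# Reversing the j loop gives exactly the same result: its iterations are independent.
@show C_ref == matmul_loops(A, B; reverse_j=true)

# Reversing the k loop changes the summation order: the result is only approximately
# equal, because floating-point addition is not associative.
@show C_ref ≈ matmul_loops(A, B; reverse_k=true)
```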
@@ -298,7 +320,7 @@
   "source": [
    "### Parallel algorithms\n",
    "\n",
    "All the entries of matrix C can be potentially computed in parallel, but *is it the most efficient solution to solve all these entries in parallel in a distributed system?* To find this we will consider different parallelization strategies:\n",
    "Parallelizing the loops over `i` and `j` means that all the entries of matrix C can potentially be computed in parallel. However, *which is the most efficient way to compute these entries in parallel in a distributed system?* To find out, we will consider different parallelization strategies:\n",
    "\n",
    "- Algorithm 1: each worker computes a single entry of C\n",
    "- Algorithm 2: each worker computes a single row of C\n",
@@ -330,7 +352,7 @@
   "source": [
    "### Data dependencies\n",
    "\n",
    "Moving data through the network is expensive and reducing data movement is one of the key points in a distributed algorithm. To this end, we determine which is the minimum data needed by a worker to perform its computations.\n",
    "Moving data through the network is expensive, and reducing data movement is one of the key points in a distributed algorithm. To this end, we need to determine the minimum data needed by a worker to perform its computations. These are called the *data dependencies*. Later, this will give us information about the performance of the parallel algorithm.\n",
    "\n",
    "In algorithm 1, each worker computes only an entry of the result matrix C."
   ]
@@ -380,7 +402,7 @@
    "\n",
    "Taking into account the data dependencies, the parallel algorithm 1 can be efficiently implemented following these steps from the worker perspective:\n",
    "\n",
    "1. The worker receives the corresponding row A[i,:] and column B[:,j] from the master process\n",
    "1. The worker receives the data dependencies, i.e., the corresponding row A[i,:] and column B[:,j] from the master process\n",
    "2. The worker computes the dot product of A[i,:] and B[:,j]\n",
    "3. The worker sends back the result of C[i,j] to the master process"
   ]
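A minimal Distributed.jl sketch of these three steps under the stated assumptions. The worker setup, the helper name `dot_entry`, and the way entries are assigned to workers are illustrative choices, not the notebook's implementation; the entry-by-entry dispatch below is deliberately naive (it sends two vectors per entry of C, which is exactly the communication overhead discussed later):

```julia
using Distributed
addprocs(4)   # illustrative worker setup

# Runs on a worker: receives one row of A and one column of B (the data
# dependencies), computes their dot product, and returns the entry of C.
@everywhere function dot_entry(Ai, Bj)
    Cij = zero(eltype(Ai))
    for k in eachindex(Ai)
        Cij += Ai[k] * Bj[k]
    end
    Cij
end

N = 4
A = rand(N, N); B = rand(N, N); C = zeros(N, N)

# Master process: send A[i,:] and B[:,j] to some worker, get C[i,j] back.
for j in 1:N, i in 1:N
    w = workers()[mod1(i + (j - 1) * N, nworkers())]
    C[i, j] = remotecall_fetch(dot_entry, w, A[i, :], B[:, j])
end

@assert C ≈ A * B
```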
@@ -471,19 +493,19 @@
   "source": [
    "### Performance\n",
    "\n",
    "Let us study the performance of this algorithm. To this end, we will analyze if algorithm 1 is able to achieve the optimal parallel *speedup*. The parallel speedup on $P$ processes is defined as \n",
    "We have a first parallel algorithm, but how efficient is it? Let us study its performance. To this end, we need a performance baseline as a reference. In this case, we will use the so-called optimal parallel *speedup*. The parallel speedup on $P$ processes is defined as \n",
    "\n",
    "$$\n",
    "S_P = \\frac{T_1}{T_P},\n",
    "$$\n",
    "\n",
    "where $T_1$ denotes the runtime of the sequential algorithm on one node and $T_P$ denotes the runtime of the parallel algorithm on $P$ processes. If we run an optimal parallel algorithm with $P$ processes we expect it to run $p$ times faster than the sequential implementation. I.e., the *optimal* speedup of a parallel algorithm on $p$ processes is equal to $P$:\n",
    "where $T_1$ denotes the runtime of the sequential algorithm and $T_P$ denotes the runtime of the parallel algorithm on $P$ processes. If we run an optimal parallel algorithm with $P$ processes, we expect it to run $P$ times faster than the sequential implementation. I.e., the *optimal* speedup of a parallel algorithm on $P$ processes is equal to $P$:\n",
    "\n",
    "$$\n",
    "S^{*}_P = P.\n",
    "$$\n",
    "\n",
    "The ratio of the actual speedup over the optimal one is called the parallel efficiency\n",
    "The ratio of the actual speedup over the optimal one is called the parallel efficiency (the closer to one, the better).\n",
    "\n",
    "$$\n",
    "E_P = \\frac{S_P}{S^{*}_P} = \\frac{T_1/T_P}{P}.\n",
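A small worked example of these formulas; the timing numbers are made up purely for illustration:

```julia
T1 = 12.0   # hypothetical runtime (seconds) of the sequential algorithm
TP = 3.5    # hypothetical runtime (seconds) of the parallel algorithm on P processes
P  = 4

S = T1 / TP      # measured speedup S_P ≈ 3.43
S_opt = P        # optimal speedup S*_P = P = 4
E = S / S_opt    # parallel efficiency E_P ≈ 0.86

println("speedup = ", S, ", efficiency = ", E)
```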
@@ -754,7 +776,9 @@
   "id": "f1f30faf",
   "metadata": {},
   "source": [
    "### Experimental speedup"
    "### Experimental speedup\n",
    "\n",
    "Measure the speedup with the following cell. Is the speedup better this time?"
   ]
  },
  {
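The measurement cell itself is not shown in this diff. A sketch of how one could take such a measurement with BenchmarkTools, assuming the sequential `matmul_seq!` sketch from above and using `matmul_parallel!` as a stand-in name for whichever parallel implementation is being tested:

```julia
using Distributed, BenchmarkTools

# Placeholder for the notebook's parallel implementation; here it just forwards
# to the sequential kernel so that the snippet runs end to end.
matmul_parallel!(C, A, B) = matmul_seq!(C, A, B)

N = 500
A = rand(N, N); B = rand(N, N); C = zeros(N, N)

T1 = @belapsed matmul_seq!($C, $A, $B)       # sequential baseline
TP = @belapsed matmul_parallel!($C, $A, $B)  # parallel candidate

P = nworkers()
speedup = T1 / TP
efficiency = speedup / P
@show speedup efficiency
```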
@@ -964,7 +988,9 @@
    "\n",
    "- Matrix-matrix multiplication is trivially parallelizable (all entries in the result matrix can be computed in parallel, at least in theory)\n",
    "- However, we cannot exploit all the potential parallelism in a distributed system due to communication overhead\n",
    "- We need a sufficiently large grain size to obtain a near optimal speedup\n"
    "- We need a sufficiently large grain size to obtain a near optimal speedup\n",
    "- We estimated the theoretical parallel performance from the ratio of communication complexity to computation complexity\n",
    "- We measured the actual performance using the parallel speedup and parallel efficiency\n"
   ]
  },
  {