Merge pull request #38 from fverdugo/francesc
More changes in notebooks
commit 567534f594
@@ -124,6 +124,11 @@ $ julia --version

If this runs without error and you see a version number, you are good to go!

You can also run Julia code from the terminal using the `-e` flag:

```
$ julia -e 'println("Hello, world!")'
```

!!! note
    In this tutorial, when a code snippet starts with `$`, it should be run in the terminal. Otherwise, the code is to be run in the Julia REPL.
@@ -366,9 +371,62 @@ is equivalent to calling `status` in package mode.

(@v1.10) pkg> status
```

### Creating your own package

In many situations it is useful to create your own package, for instance, when working with a large code base, when you want to reduce compilation latency using [`Revise.jl`](https://github.com/timholy/Revise.jl), or if you want to eventually [register your package](https://github.com/JuliaRegistries/Registrator.jl) and share it with others.

The simplest way of generating a package (called `MyPackage`) is as follows. Open Julia, go to package mode, and type

```julia
(@v1.10) pkg> generate MyPackage
```

This will create a minimal package consisting of a new folder `MyPackage` with two files:

* `MyPackage/Project.toml`: Project file defining the direct dependencies of your package.
* `MyPackage/src/MyPackage.jl`: Main source file of your package. You can split your code into several files if needed, and include them in the package main file using the function `include` (see the sketch below).
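As an illustration, the generated `MyPackage/src/MyPackage.jl` contains a small example function, and you can pull in additional source files from it with `include`. A minimal sketch (the exact generated content may differ slightly, and `utils.jl` is a hypothetical extra file, not something `generate` creates):

```julia
module MyPackage

# Example function created by `generate` (exact generated content may differ).
greet() = print("Hello World!")

# To split the package into several files, include them from this main file, e.g.:
# include("utils.jl")   # hypothetical extra file in MyPackage/src/

end # module MyPackage
```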
!!! tip
    This approach only generates a very minimal package. To create a more sophisticated package skeleton (including unit testing, code coverage, a README file, a license, etc.) use [`PkgTemplates.jl`](https://github.com/JuliaCI/PkgTemplates.jl) or [`BestieTemplate.jl`](https://github.com/abelsiqueira/BestieTemplate.jl). The latter is developed in Amsterdam at the [Netherlands eScience Center](https://www.esciencecenter.nl/).

You can add dependencies to the package by activating the `MyPackage` folder in package mode and adding new dependencies as usual:

```julia
(@v1.10) pkg> activate MyPackage

(MyPackage) pkg> add MPI
```

This will add MPI to your package dependencies.
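Once added, the dependency can be loaded inside the package module with `using`. A minimal sketch under that assumption (the function `mpi_hello` is an illustrative name, not something generated for you):

```julia
module MyPackage

using MPI  # available because MPI was added to MyPackage/Project.toml

greet() = print("Hello World!")

# Illustrative function using the MPI dependency.
function mpi_hello()
    MPI.Init()
    rank = MPI.Comm_rank(MPI.COMM_WORLD)
    println("Hello from MPI rank $rank")
end

end # module MyPackage
```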
### Using your own package

To use your package, you first need to add it to a package environment of your choice. This is done with `develop path/to/the/package/folder` in package mode. For instance:

```julia
(@v1.10) pkg> develop MyPackage
```

!!! note
    You do not need to "develop" your package if you activated the package folder `MyPackage`.

Now, we can go back to standard Julia mode and use it as any other package:

```julia
using MyPackage
MyPackage.greet()
```

Here, we just called the example function defined in `MyPackage/src/MyPackage.jl`.
## Conclusion

We have learned the basics of how to work with Julia, including how to run serial and parallel code, and how to manage, create, and use Julia packages. This knowledge will allow you to follow the course effectively!

If you want to further dig into the topics we have covered here, you can take a look at the following links:

- [Julia Manual](https://docs.julialang.org/en/v1/manual/getting-started/)
- [Package manager](https://pkgdocs.julialang.org/v1/getting-started/)
@@ -27,9 +27,9 @@
"\n",
"In this notebook, we will learn\n",
"\n",
"- How to parallelize the Jacobi method\n",
"- How the data partition can impact the performance of a distributed algorithm\n",
"- How to use latency hiding to improve parallel performance\n",
"\n"
]
},
@@ -93,9 +93,12 @@
"id": "93e84ff8",
"metadata": {},
"source": [
"When solving a [Laplace equation](https://en.wikipedia.org/wiki/Laplace%27s_equation) in 1D, the Jacobi method leads to the following iterative scheme: The entry $i$ of vector $u$ at iteration $t+1$ is computed as:\n",
"\n",
"$u^{t+1}_i = \\dfrac{u^t_{i-1}+u^t_{i+1}}{2}$\n",
"\n",
"This iterative scheme is simple, but it shares fundamental challenges with many other algorithms used in scientific computing. This is why we are studying it here.\n"
]
},
{
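For reference, a sequential implementation consistent with this update rule might look like the sketch below. The notebook's full `jacobi(n, niters)` definition is not shown in this excerpt, so the boundary values and structure here are assumptions inferred from the calls `jacobi(5,0)` and `jacobi(5,100)` and from `jacobi_with_tol` further down.

```julia
# Sketch of a sequential 1D Jacobi iteration (assumed boundary values -1 and 1).
function jacobi(n, niters)
    u = zeros(n + 2)
    u[1] = -1
    u[end] = 1
    u_new = copy(u)
    for t in 1:niters
        for i in 2:(n + 1)
            u_new[i] = 0.5 * (u[i - 1] + u[i + 1])  # average of the two neighbors
        end
        u, u_new = u_new, u
    end
    u
end

jacobi(5, 100)  # approaches values varying linearly from -1 to 1
```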
@@ -130,6 +133,14 @@
"end"
]
},
{
"cell_type": "markdown",
"id": "432bd862",
"metadata": {},
"source": [
"If you run it for zero iterations, you will see the initial condition."
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -140,14 +151,78 @@
"jacobi(5,0)"
]
},
{
"cell_type": "markdown",
"id": "c75cb9a6",
"metadata": {},
"source": [
"If you run it for enough iterations, you will see the expected solution of the Laplace equation: values that vary linearly from -1 to 1."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b52be374",
"metadata": {},
"outputs": [],
"source": [
"jacobi(5,100)"
]
},
{
"cell_type": "markdown",
"id": "22fda724",
"metadata": {},
"source": [
"In our version of the Jacobi method, we return after a given number of iterations. Other stopping criteria are possible. For instance, iterate until the difference between `u` and `u_new` is below a tolerance:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "15de7bf5",
"metadata": {},
"outputs": [],
"source": [
"using LinearAlgebra: norm\n",
"function jacobi_with_tol(n,tol)\n",
"    u = zeros(n+2)\n",
"    u[1] = -1\n",
"    u[end] = 1\n",
"    u_new = copy(u)\n",
"    increment = similar(u)\n",
"    while true\n",
"        for i in 2:(n+1)\n",
"            u_new[i] = 0.5*(u[i-1]+u[i+1])\n",
"        end\n",
"        increment .= u_new .- u\n",
"        if norm(increment)/norm(u_new) < tol\n",
"            return u_new\n",
"        end\n",
"        u, u_new = u_new, u\n",
"    end\n",
"    u\n",
"end"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "697ad307",
"metadata": {},
"outputs": [],
"source": [
"n = 5\n",
"tol = 1e-9\n",
"jacobi_with_tol(n,tol)"
]
},
{
"cell_type": "markdown",
"id": "6e085701",
"metadata": {},
"source": [
"However, we are not going to parallelize this more complex version in this notebook (we will consider it later in this course)."
]
},
{
@@ -156,7 +231,7 @@
"metadata": {},
"source": [
"\n",
"## Where can we exploit parallelism?\n",
"\n",
"Look at the two nested loops in the sequential implementation:\n",
"\n",
@@ -169,8 +244,8 @@
"end\n",
"```\n",
"\n",
"- The outer loop over `t` cannot be parallelized. The value of `u` at step `t+1` depends on the value at the previous step `t`.\n",
"- The inner loop is trivially parallel. The loop iterations are independent (any order is possible).\n",
"\n"
]
},
@@ -386,19 +461,11 @@
"source": [
"### Communication overhead\n",
"- We update $N/P$ entries in each process at each iteration, where $N$ is the total length of the vector and $P$ the number of processes\n",
"- Thus, computation complexity is $O(N/P)$\n",
"- We need to get remote entries from 2 neighbors (2 messages per iteration)\n",
"- We need to communicate 1 entry per message\n",
"- Thus, communication complexity is $O(1)$\n",
"- Communication/computation ratio is $O(P/N)$, making the algorithm potentially scalable if $P \\ll N$.\n"
]
},
{
@@ -457,8 +524,10 @@
"id": "8ed4129c",
"metadata": {},
"source": [
"## MPI implementation\n",
"\n",
"We consider the implementation using MPI. The programming model of MPI is generally better suited for data-parallel algorithms like this one than the task-based model provided by Distributed.jl. In any case, one can also implement it using Distributed.jl, but it requires some extra effort to set up the remote channels for the communication between neighboring processes.\n",
"\n",
"Take a look at the implementation below and try to understand it.\n"
]
},
@@ -749,7 +818,7 @@
"```\n",
"\n",
"- The outer loop cannot be parallelized (like in the 1D case).\n",
"- The two inner loops are trivially parallel\n"
]
},
{
@@ -687,7 +687,12 @@
"* `rcvbuf` space to store the incoming data.\n",
"* `source` rank of the sender.\n",
"* `dest` rank of the receiver.\n",
"* `tag`. Might be used to distinguish between different kinds of messages from the same sender to the same receiver (similar to the \"subject\" in an email).\n",
"\n",
"<div class=\"alert alert-block alert-info\">\n",
"<b>Note:</b> The C interface provides additional arguments `MPI_Datatype` (type of data to send/receive) and `count` (number of items to send/receive). In Julia, send and receive buffers are usually arrays or references, from which the data type and the count can be inferred. This is true for many other MPI functions.\n",
"</div>"
]
},
{
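As an illustration of how these arguments appear in MPI.jl, a minimal point-to-point exchange between two ranks might look like the sketch below (assuming the keyword-argument form of `MPI.Send` and `MPI.Recv!` from recent MPI.jl versions; run with something like `mpiexec -n 2 julia script.jl`):

```julia
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

if rank == 0
    sndbuf = [3.14]    # an array: data type and count are inferred from it
    MPI.Send(sndbuf, comm; dest=1, tag=0)
elseif rank == 1
    rcvbuf = zeros(1)  # space to store the incoming data
    MPI.Recv!(rcvbuf, comm; source=0, tag=0)
    println("rank 1 received ", rcvbuf[1])
end

MPI.Finalize()
```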
@@ -29,7 +29,8 @@
"\n",
"- Parallelize a simple algorithm\n",
"- Study the performance of different parallelization strategies\n",
"- Learn the importance of \"grain size\" in a parallel algorithm\n",
"- Implement and measure the performance of parallel algorithms"
]
},
{
@@ -54,10 +55,18 @@
},
{
"cell_type": "code",
"execution_count": 8,
"id": "2f8ba040",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"🥳 Well done! \n"
]
}
],
"source": [
"using Distributed\n",
"using BenchmarkTools\n",
@@ -79,7 +88,8 @@
"alg_2_complex_check(answer) = answer_checker(answer, \"b\")\n",
"alg_2_deps_check(answer) = answer_checker(answer,\"d\")\n",
"alg_3_deps_check(answer) = answer_checker(answer, \"c\")\n",
"alg_3_complex_check(answer) = answer_checker(answer, \"b\")\n",
"println(\"🥳 Well done! \")"
]
},
{
@@ -89,6 +99,8 @@
"source": [
"## Problem Statement\n",
"\n",
"We consider matrix-matrix multiplication as our first algorithm to parallelize. The problem we want to solve is defined as follows.\n",
"\n",
"Given $A$ and $B$, two $N$-by-$N$ matrices, compute the matrix-matrix product $C=AB$. Compute it in parallel and efficiently."
]
},
@@ -113,7 +125,7 @@
"\n",
"- compute the product in parallel using more than one process (distributed implementation)\n",
"- study the performance of different parallelization alternatives\n",
"- implement the algorithms using Julia's task-based programming model\n"
]
},
{
@@ -124,7 +136,7 @@
"### Assumptions\n",
"\n",
"- All matrices `A`, `B`, and `C` are initially stored in the master process\n",
"- The result will be overwritten in `C` (in the master process)"
]
},
{
@@ -148,7 +160,8 @@
"\n",
"- Identify the parts of the sequential algorithm that can be parallelized\n",
"- Consider different parallelization strategies\n",
"- Discuss the (theoretical) performance of these implementations\n",
"- Measure the actual performance of these implementations\n"
]
},
{
@@ -222,7 +235,7 @@
"<b>Note:</b> The matrix-matrix multiplication naively implemented with 3 nested loops as above is known to be very inefficient (memory bound). Libraries such as BLAS provide much more efficient implementations, which are the ones used in practice (e.g., by the `*` operator in Julia). We consider our hand-written implementation as a simple way of expressing the algorithm we are interested in.\n",
"</div>\n",
"\n",
"Just to satisfy your curiosity, run the following cell to compare the performance of our hand-written function with the built-in function `mul!`.\n"
]
},
{
@@ -272,9 +285,14 @@
"id": "0eedd28a",
"metadata": {},
"source": [
"## Where can we exploit parallelism?\n",
"\n",
"The matrix-matrix multiplication is an example of an [embarrassingly parallel algorithm](https://en.wikipedia.org/wiki/Embarrassingly_parallel). An embarrassingly parallel (also known as trivially parallel) algorithm is an algorithm that can be split into parallel tasks with no (or very few) dependences between them. Such algorithms are typically easy to parallelize.\n",
"\n",
"Which parts of an algorithm are completely independent and thus trivially parallel? To answer this question, it is useful to inspect the for loops, which are potential sources of parallelism. If the iterations are independent of each other, then they are trivial to parallelize. An easy check to find out whether the iterations are dependent or not is to change their order (for instance, changing `for j in 1:n` to `for j in n:-1:1`, i.e., doing the loop in reverse). If the result changes, then the iterations are not independent.\n",
"\n",
"Look at the three nested loops in the sequential implementation of the matrix-matrix product:\n",
"\n",
"```julia\n",
"for j in 1:n\n",
@@ -287,8 +305,12 @@
"    end\n",
"end\n",
"```\n",
"\n",
"Note that:\n",
"\n",
"- Loops over `i` and `j` are trivially parallel.\n",
"- The loop over `k` is not trivially parallel. The accumulation into the reduction variable `Cij` introduces extra dependences. In addition, remember that the addition of floating point numbers is not strictly associative due to rounding errors. Thus, the result of this loop may change with the loop order when using floating point numbers. In any case, this loop can also be parallelized, but it requires a parallel *fold* or a parallel *reduction*.\n",
"\n"
]
},
{
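To illustrate the reverse-order check described above, here is a small, self-contained sketch (the matrix size and variable names are chosen for illustration; this is not part of the notebook):

```julia
# Reverse-order check: if running a loop backwards gives the same result,
# its iterations are independent and hence trivially parallel.
n = 3
A = rand(n, n); B = rand(n, n)
C_fwd = zeros(n, n); C_rev = zeros(n, n)

for j in 1:n, i in 1:n              # forward order over i and j
    Cij = 0.0
    for k in 1:n
        Cij += A[i, k] * B[k, j]
    end
    C_fwd[i, j] = Cij
end

for j in n:-1:1, i in n:-1:1        # reversed order over i and j
    Cij = 0.0
    for k in 1:n
        Cij += A[i, k] * B[k, j]
    end
    C_rev[i, j] = Cij
end

@assert C_fwd == C_rev  # identical results: the i and j iterations are independent
```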
@@ -298,7 +320,7 @@
"source": [
"### Parallel algorithms\n",
"\n",
"Parallelizing the loops over `i` and `j` means that all the entries of matrix C can be potentially computed in parallel. However, *which is the most efficient way to compute all these entries in parallel in a distributed system?* To find out, we will consider different parallelization strategies:\n",
"\n",
"- Algorithm 1: each worker computes a single entry of C\n",
"- Algorithm 2: each worker computes a single row of C\n",
@@ -330,7 +352,7 @@
"source": [
"### Data dependencies\n",
"\n",
"Moving data through the network is expensive and reducing data movement is one of the key points in a distributed algorithm. To this end, we need to determine the minimum data needed by a worker to perform its computations. These are called the *data dependencies*. They will later give us information about the performance of the parallel algorithm.\n",
"\n",
"In algorithm 1, each worker computes only an entry of the result matrix C."
]
@@ -380,7 +402,7 @@
"\n",
"Taking into account the data dependencies, the parallel algorithm 1 can be efficiently implemented following these steps from the worker perspective:\n",
"\n",
"1. The worker receives the data dependencies, i.e., the corresponding row A[i,:] and column B[:,j], from the master process\n",
"2. The worker computes the dot product of A[i,:] and B[:,j]\n",
"3. The worker sends back the result C[i,j] to the master process"
]
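As an illustration of these three steps, a minimal Distributed.jl sketch for computing a single entry might look as follows (the function name `compute_entry` and the matrix sizes are illustrative assumptions, not the notebook's actual implementation):

```julia
using Distributed
addprocs(1)  # make sure at least one worker process is available

# Step 2, run on the worker: dot product of one row of A and one column of B.
@everywhere compute_entry(Ai, Bj) = sum(Ai .* Bj)

A = rand(4, 4); B = rand(4, 4); C = zeros(4, 4)
i, j = 2, 3

# Steps 1 and 3: the master sends A[i,:] and B[:,j] to the worker and
# fetches back the computed entry, which is stored in C[i,j].
C[i, j] = remotecall_fetch(compute_entry, workers()[1], A[i, :], B[:, j])
```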
@@ -471,19 +493,19 @@
"source": [
"### Performance\n",
"\n",
"We have a first parallel algorithm, but how efficient is it? Let us study its performance. To this end, we need a performance baseline as reference. In this case, we will use the so-called optimal parallel *speedup*. The parallel speedup on $P$ processes is defined as\n",
"\n",
"$$\n",
"S_P = \\frac{T_1}{T_P},\n",
"$$\n",
"\n",
"where $T_1$ denotes the runtime of the sequential algorithm and $T_P$ denotes the runtime of the parallel algorithm on $P$ processes. If we run an optimal parallel algorithm with $P$ processes, we expect it to run $P$ times faster than the sequential implementation. I.e., the *optimal* speedup of a parallel algorithm on $P$ processes is equal to $P$:\n",
"\n",
"$$\n",
"S^{*}_P = P.\n",
"$$\n",
"\n",
"The ratio of the actual speedup over the optimal one is called the parallel efficiency (the closer to one, the better):\n",
"\n",
"$$\n",
"E_P = \\frac{S_P}{S^{*}_P} = \\frac{T_1/T_P}{P}.\n",
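As a small worked example of these formulas (the timings below are made up for illustration):

```julia
# Hypothetical measured runtimes in seconds (illustrative values only).
T1 = 12.0   # sequential runtime
TP = 3.5    # runtime of the parallel algorithm on P processes
P  = 4

S = T1 / TP   # parallel speedup, about 3.43
E = S / P     # parallel efficiency, about 0.86 (1.0 would be optimal)
println("speedup = $S, efficiency = $E")
```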
@@ -754,7 +776,9 @@
"id": "f1f30faf",
"metadata": {},
"source": [
"### Experimental speedup\n",
"\n",
"Measure the speedup with the following cell. Is it better this time?"
]
},
{
@@ -964,7 +988,9 @@
"\n",
"- Matrix-matrix multiplication is trivially parallelizable (all entries in the result matrix can be computed in parallel, at least in theory)\n",
"- However, we cannot exploit all the potential parallelism in a distributed system due to communication overhead\n",
"- We need a sufficiently large grain size to obtain a near-optimal speedup\n",
"- We estimated the theoretical parallel performance by comparing the complexity of communication with the complexity of computation\n",
"- We measured the actual performance using the parallel speedup and parallel efficiency\n"
]
},
{
notebooks/mpi_collectives.ipynb (new file, +901)
File diff suppressed because one or more lines are too long