
\chapter{Implementation}\label{ch4}
This chapter discusses the implementation of the IoT device testbed, \iottbsc, which is developed in the Python programming language. This choice is motivated by Python's wide availability and the familiarity many users have with it, which lowers the barrier to extending and modifying the testbed in the future. The testbed is delivered as a Python package and provides the \iottb command with various subcommands. A full command reference can be found in \cref{appendix:cmdref}.\\
Conceptually, the software implements two separate aspects: data collection and data storage.
The \iottbsc database schema is implicitly implemented by \iottb. Users mainly use \iottb to operate on the database or to initiate data collection. Since the database schema is transparent to the user during operation, we begin with a brief description of the database layout as a directory hierarchy before turning to the \iottb \cli.

\section{Database Schema}
The storage for \iottbsc is implemented on top of the file system of the user.
Since user folder structures provide little standardization, we require a configuration file, which gives \iottb some basic information about the execution environment.
The testbed is configured in a configuration file in JSON format, following the scheme in \cref{lst:cfg-shema}.
\verb|DefaultDatabase| is a string representing the name of the database, which, once initialized, is a directory in \verb|DefaultDatabasePath|.
\iottb assumes these values during execution unless the user specifies otherwise.
If the user specifies a different database location as an option to a subcommand, \verb|DatabaseLocations| is consulted.
\verb|DatabaseLocations| is a mapping from every known database name to the full path of its parent directory in the file system.
The configuration file is loaded on every invocation of \iottb and provides the minimal operating information.
\begin{listing}[!ht]
    \inputminted[]{json}{cfg-shema.json}
    \caption{Schema of the testbed configuration file.}
    \label{lst:cfg-shema}
\end{listing}
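
The lookup described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used by \iottb: the function \verb|resolve_database| and its parameters are hypothetical, and only the three key names are taken from \cref{lst:cfg-shema}.
\begin{minted}[fontsize=\small]{python}
import json
from pathlib import Path


def resolve_database(config_file, db_name=None):
    """Return the directory of the database to operate on.

    Hypothetical helper: only the key names come from the config schema.
    """
    with open(config_file, encoding="utf-8") as fh:
        cfg = json.load(fh)

    if db_name is None:
        # No override given: fall back to the configured defaults.
        return Path(cfg["DefaultDatabasePath"]) / cfg["DefaultDatabase"]

    # DatabaseLocations maps a database name to its parent directory.
    try:
        parent = cfg["DatabaseLocations"][db_name]
    except KeyError:
        raise LookupError(f"unknown database: {db_name!r}") from None
    return Path(parent) / db_name
\end{minted}
Resolving the path in a single helper of this kind would let every subcommand locate the database in the same way.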
\newpage
\section{High Level Description}
\iottb is used from the command line and is invoked following the schema below. In all cases, a subcommand must be specified for anything to happen.
\begin{minted}[fontsize=\small]{bash}
iottb [<global options>] <subcommand> [<subcommand options>] [<argument(s)>]
\end{minted}
\todoRevise{Better listing}
When \iottb is invoked, it first checks whether it can find the database directory in the \os user's home directory\footnote{The default can be changed.}.

\section{Database Initialization}\label{sec:db-init}
The IoT testbed database is defined to be a directory named \db. Currently, \iottb creates this directory in the user's home directory (commonly located at \texttt{/home/<username>} on Linux systems) the first time any subcommand is used. All data and metadata are placed under this directory. Invoking \verb|iottb init-db| without arguments causes defaults to be loaded from the configuration file. If the file does not exist, it is created with default values following \cref{lst:cfg-shema}. Otherwise, the database is created with the default name or the user-supplied name as a directory in the file system, unless a database under that name is already registered in the \verb|DatabaseLocations| map. The commands described in the later sections all depend on the existence of an \iottbsc database.
It is neither possible to add a device nor initiate data collection without an existing database.
The full command line specification can be found in \cref{cmdref:init-db}.
Once a database is initialized, devices may be added to that database.

\section{Adding Devices}\label{sec:add-dev}
Before we capture the traffic of an \iot device, \iottb requires that a dedicated
directory exists for it.
We add a device to the database by passing a string representing the name of the device to the \addev subcommand.
This does three things:
\begin{enumerate}
    \item A Python object is initialized from the class shown in \cref{lst:dev-meta-python}.
    \item A directory for the device is created as \verb|<db-path>/<device_canonical_name>|.
    \item A metadata file \verb|device_metadata.json| is created and placed in the newly created directory. This file is in JSON format and follows the schema seen in \cref{lst:dev-meta-python}.
\end{enumerate}

\begin{listing}[!ht]
    \inputminted[firstline=12, lastline=29, linenos]{python}{device_metadata.py}
    \caption{Device Metadata}
    \label{lst:dev-meta-python}
\end{listing}

The Device ID is automatically generated using a UUID to be FAIR compliant. \verb|canonical_name| is generated by the \verb|make_canonical_name()| function provided in \cref{lst:dev-canonical}.
Fields not supplied to \verb|__init__| in \cref{lst:dev-meta-python} are kept empty. The other fields are currently not used by \iottb itself, but provide metadata
which can be used during a processing step. Optionally, one can manually create such a file with pre-set values and pass it to the setup.
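
The steps listed above for adding a device can be sketched roughly as follows. The sketch rests on several assumptions and is not the code shown in \cref{lst:dev-meta-python}: the functions \verb|canonicalize| and \verb|add_device| are hypothetical, the metadata keys other than \verb|canonical_name| are illustrative, and the canonicalization rule (lowercase, runs of other characters replaced by a hyphen) is only a stand-in for whatever \verb|make_canonical_name()| actually does.
\begin{minted}[fontsize=\small]{python}
import json
import re
import uuid
from pathlib import Path


def canonicalize(name):
    # Hypothetical rule: lowercase, other character runs become "-".
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def add_device(db_path, device_name):
    """Create the device directory and its metadata file (sketch)."""
    canonical = canonicalize(device_name)
    device_dir = Path(db_path) / canonical
    # exist_ok=False: refuse to silently reuse an existing device directory.
    device_dir.mkdir(parents=True, exist_ok=False)

    metadata = {
        "device_id": str(uuid.uuid4()),  # assumed key name
        "device_name": device_name,      # assumed key name
        "canonical_name": canonical,
    }
    meta_file = device_dir / "device_metadata.json"
    meta_file.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return meta_file
\end{minted}
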
For example, suppose the testbed contains the configuration shown in \cref{lst:appendix:appendixa:config-file}.

\begin{listing}[!ht]
    \inputminted[firstline=1, lastline=8, linenos]{json}{appendixa-after-add-device-dir.txt}
    \caption{Example testbed configuration file.}
    \label{lst:cfg-file-post-add}
\end{listing}

If we then add two devices, \verb|'iPhone 13 (year 2043)'| and \verb|roomba|, the layout of the database resembles \cref{lst:cfg-db-layout-post-add} and, for instance, the metadata file of the \verb|roomba| device contains the metadata listed in \cref{lst:meta-roomba-post-add}. See \cref{appendixA:add-dev-cfg} for a complete overview.

\begin{listing}[!ht]
    \lstinputlisting[firstline=11, lastline=16]{appendixa-after-add-device-dir.txt}
    \caption{Directory layout after adding device 'default' and 'Roomba'}
    \label{lst:cfg-db-layout-post-add}
\end{listing}

\begin{listing}[!ht]
    \lstinputlisting[firstline=39, lastline=55]{appendixa-after-add-device-dir.txt}
    \caption{Metadata of the device 'Roomba' after adding it.}
    \label{lst:meta-roomba-post-add}
\end{listing}

\newpage
\section{Traffic Sniffing}\label{sec:sniff}
Automated network capture is a key component of \iottb. The standard network capture is provided by the \texttt{sniff} subcommand, which wraps the common traffic capture utility \emph{tcpdump}~\citep{tcpdump}. \cref{cmdref:sniff} shows the usage of the command.

Unless the command is explicitly run in \texttt{unsafe} mode, an IPv4 or MAC address \emph{must} be provided. IP addresses are only accepted in dot-decimal notation\footnote{e.g., 172.168.1.1} and MAC addresses must be specified as six groups of two hexadecimal digits\footnote{e.g., 12:34:56:78:AA:BB}. Failing to provide either results in the capture being aborted. The rationale behind this is simple: these addresses are the only way to identify the traffic of interest. Of course, it is possible to retrieve the IP or MAC address after a capture. Still, the merits outweigh the annoyance. The hope is that this makes \iottb easier to use \emph{correctly}. For example, consider the situation where a student is tasked with performing multiple captures across multiple devices. If the student is not aware that an address is needed for the captured data to be usable, this policy avoids the headache and frustration of wasted time and unusable data.
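
The address policy can be illustrated with a short validation sketch. It is not taken from the \texttt{sniff} implementation: the names \verb|is_ipv4|, \verb|is_mac| and \verb|check_target| are hypothetical, and the sketch assumes that the six MAC groups are separated by colons, as in the example given in the footnote.
\begin{minted}[fontsize=\small]{python}
import ipaddress
import re

# Six groups of two hexadecimal digits, colon-separated (assumption).
MAC_RE = re.compile(r"(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}")


def is_ipv4(value):
    """True if value is an IPv4 address in dot-decimal notation."""
    try:
        ipaddress.IPv4Address(value)
    except ValueError:
        return False
    return True


def is_mac(value):
    """True if the whole string matches the MAC pattern above."""
    return MAC_RE.fullmatch(value) is not None


def check_target(ip=None, mac=None, unsafe=False):
    """Raise ValueError if the capture should be aborted."""
    if ip is not None and not is_ipv4(ip):
        raise ValueError(f"not a dot-decimal IPv4 address: {ip!r}")
    if mac is not None and not is_mac(mac):
        raise ValueError(f"not a MAC address: {mac!r}")
    if ip is None and mac is None and not unsafe:
        raise ValueError("an IPv4 or MAC address is required")
\end{minted}
Checking the address before the capture starts is what allows the capture to be aborted early instead of producing data that cannot be attributed to a device.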

To comply with \ref{req:auto_config_start} and \ref{req:fair_data_meta_inventory}, each capture also stores some metadata in \texttt{capture\_metadata.json}. \cref{lst:cap-meta} shows the metadata file's schema.


\begin{listing}[!ht]
\inputminted[firstline=288, lastline=319]{python}{sniff.py}
\caption{Metadata Stored for sniff command}
\label{lst:cap-meta}
\end{listing}

The \texttt{device\_id} is the \uuid \ of the device for which the capture was performed. This ensures that the capture metadata remains associated with the device even if files are moved. Furthermore, each capture also gets a \uuid. This \uuid \ is used as the suffix for the PCAP file and the log files. The exact naming scheme is given in \cref{lst:cap-naming}.

\begin{listing}
\inputminted[firstline=179, lastline=181]{python}{sniff.py}
\caption{Naming scheme for files created during capture.}
\label{lst:cap-naming}
\end{listing}


\section{Working with Metadata}
The \texttt{meta} subcommand provides a facility for manipulating metadata files. It allows users to get the value of any key in a metadata file as well as introduce new key-value pairs. However, it is not possible to change the value of any key already present in the metadata. This restriction is in place to prevent metadata corruption.

The most crucial value in any metadata file is the \texttt{uuid} of the device or capture the metadata belongs to. Changing the \texttt{uuid} would cause \iottb to mishandle the data, as all references to data associated with that \texttt{uuid} would become invalid. Changing any other value might not cause mishandling by \iottb, but these values nonetheless represent essential information about the data. Therefore, \iottb does not allow changes to existing keys once they are set.
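
The write-once behaviour can be sketched as follows. The functions \verb|get_value| and \verb|add_value| are hypothetical names chosen for this illustration and do not necessarily mirror the interface of the \texttt{meta} subcommand; the sketch only assumes that a metadata file is a flat JSON object.
\begin{minted}[fontsize=\small]{python}
import json
from pathlib import Path


def get_value(meta_file, key):
    """Return the value stored under key in a JSON metadata file."""
    data = json.loads(Path(meta_file).read_text(encoding="utf-8"))
    return data[key]


def add_value(meta_file, key, value):
    """Add a new key-value pair; existing keys are never overwritten."""
    path = Path(meta_file)
    data = json.loads(path.read_text(encoding="utf-8"))
    if key in data:
        raise KeyError(f"{key!r} is already set and cannot be changed")
    data[key] = value
    path.write_text(json.dumps(data, indent=2), encoding="utf-8")
\end{minted}
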

Future improvements might relax this restriction by implementing stricter checks on which keys can be modified. This would involve defining a strict set of keys that are write-once and then read-only.

\section{Raw Captures}
The \texttt{raw} subcommand offers a flexible way to run virtually any command wrapped in \iottb. Of course, the intended use is with other capture tools, like \textit{mitmproxy}~\citep{mitmproxy}, and not arbitrary shell commands.
While some benefits, particularly those related to standardized capture, are diminished, users still retain the advantages of the database.


The syntax of the \texttt{raw} subcommand is as follows:
\begin{minted}{bash}
iottb raw <device> <command-name> "<command-options-string>" # or
iottb raw <device> "<string-executable-by-a-shell>" #
\end{minted}

\iottb does not provide error checking for user-supplied arguments or strings.
Users benefit from the fact that captures will be registered in the database, assigned a \texttt{uuid}, and associated with the device.
The metadata file of the capture can then be edited manually if needed.



However, each incorrect or unintended invocation that adheres to the database syntax (i.e., the specified device exists) will create a new capture directory with a metadata file and \texttt{uuid}. Therefore, users are advised to thoroughly test commands beforehand to avoid creating unnecessary clutter.

\section{Integrating User Scripts}\label{sec:integrating-user-scripts}
The \texttt{--pre} and \texttt{--post} options allow users to run any executable before and after any subcommand, respectively.
Both options take a string as their argument, which is passed as input to a shell and launched as a subprocess.
The rationale for running the process in a shell is that Python's standard library process management module, \texttt{subprocess}\footnote{\url{https://docs.python.org/3/library/subprocess.html}}, does not accept arguments for the target subprocess when a single string is passed for execution.

Execution is synchronous: the subcommand does not begin execution until the \texttt{--pre} script finishes, and the \texttt{--post} script only starts executing after the subcommand has completed its execution. \iottb always runs in that order.
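
The ordering described above can be sketched with \verb|subprocess.run|. The function \verb|run_with_hooks| is a hypothetical name, and the use of \verb|check=True| reflects an assumption made for this sketch, namely that a failing \verb|--pre| script should abort the run; the text above does not specify how \iottb reacts to a failing script.
\begin{minted}[fontsize=\small]{python}
import subprocess


def run_with_hooks(subcommand, pre=None, post=None):
    """Run pre, then the subcommand, then post, strictly in that order."""
    if pre:
        # shell=True hands the whole string to a shell, so the string
        # may contain arguments; run() blocks until the script exits.
        subprocess.run(pre, shell=True, check=True)
    result = subcommand()
    if post:
        subprocess.run(post, shell=True, check=True)
    return result
\end{minted}
Because \verb|subprocess.run| waits for the child process to finish, no additional synchronization is needed to obtain the sequential behaviour.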

There may be cases where a script provides some type of relevant interaction intended to run in parallel with the capture. Currently, the recommended way to achieve this is to wrap the target executable in a script that forks a process to execute the target script, detaches from it, and returns.
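
One way to write such a wrapper in Python is sketched below. Instead of an explicit fork, the sketch uses \verb|subprocess.Popen| with \verb|start_new_session=True|, which on POSIX systems starts the child in its own session; the function name \verb|launch_detached| is hypothetical, and redirecting the standard streams to \verb|DEVNULL| is a choice made for this illustration.
\begin{minted}[fontsize=\small]{python}
import subprocess


def launch_detached(argv):
    """Start argv in the background and return its PID immediately."""
    proc = subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # POSIX only: child gets its own session
    )
    # Deliberately no proc.wait(): the caller returns while argv runs on.
    return proc.pid
\end{minted}
A script built around such a function returns right away when passed to \verb|--pre|, so the capture can start while the target executable keeps running.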

These options are a gateway for more complex environment setups and, in particular, allow users to reuse their scripts, thus lowering the barrier to adopting \iottb.

\section{Extending and Modifying the Testbed}
One of the key design goals of \iottb is easy extensibility. \iottb uses the Click library \citep{click} to handle argument parsing. Adding a new command amounts to no more than writing a function and decorating it according to the Click specification.
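
As a rough illustration of what this looks like in practice, the following sketch defines a command group and one subcommand with Click. The group \verb|cli| merely stands in for the actual \iottb entry point, and the \verb|hello-device| command with its \verb|--shout| flag is invented for this example.
\begin{minted}[fontsize=\small]{python}
import click


@click.group()
def cli():
    """Hypothetical stand-in for the iottb command group."""


@cli.command("hello-device")
@click.argument("device")
@click.option("--shout", is_flag=True, help="Print in upper case.")
def hello_device(device, shout):
    """Print a greeting for DEVICE (illustrative only)."""
    message = f"hello {device}"
    click.echo(message.upper() if shout else message)
\end{minted}
Argument parsing, help text and error messages for the new subcommand are then generated by Click from the decorators.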

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%% Figures
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%