{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "915c3bfd-e01f-465f-8cb7-8f3691735d1f",
   "metadata": {},
   "source": [
    "# SW11 InClass - Repetition and programming\n",
    "\n",
    "This file is intended for interactive repetition in the first lesson in class. You write directly into the empty cells."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ff6b108c-f402-4e91-8312-a478f6123ae3",
   "metadata": {},
   "source": [
    "---\n",
    "---\n",
    "## Exercise 1\n",
    "Conditional probabilities I\n",
    "\n",
    "In this task, we want to review the result of Example 2.4.1 on page 73 from the script [SW11_Probability](SW11_Probability.pdf) and deepen it on the basis of a simulation.\n",
    "\n",
    "We consider the following case (analogous to Example 2.4.1):\n",
    "\n",
    "We use a fair die and define the two events:\n",
    "\n",
    "- $A=\\{\\text{number of eyes} < 4\\}$\n",
    "- $B=\\{\\text{number of eyes even}\\}$\n",
    "\n",
    "---\n",
    "\n",
    "a) Determine the probability $P(A|B)$ i.e. that the number of eyes is less than 4 under the condition that the number of eyes is even. First, represent the events $A$, $B$ and $A \\cap B$ graphically as a Venn diagram.\n",
    "\n",
    "### Solution\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "52309702-acc5-42da-8eba-aa7b29661b42",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "ba300354-4943-439f-a978-2b4d06e26421",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "b) Now use the [numpy](https://numpy.org) random generator [numpy.random.Generator](https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.choice.html#numpy.random.Generator.choice) to simulate the above probabilities. \n",
    "\n",
    "To do this, generate a random series - of parameterizable length - of the values $\\{1,2,3,4,5,6\\}$ with equal probability, since it is a fair dice. Then determine the frequencies of the events $A$ and $B$ and of $A \\cap B$. Then determine the conditional probability $P(A|B)$ and compare with the theoretical value. Also vary the length of the random series and examine the effects on the frequencies and the probability.\n",
    "\n",
    "### Solution\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ff753311-191c-4cb4-b6f5-211b530a637d",
   "metadata": {},
   "outputs": [],
   "source": []
  },
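  {
   "cell_type": "markdown",
   "id": "sketch-ex1b",
   "metadata": {},
   "source": [
    "The simulation asked for in b) could be sketched as follows. This is only one possible approach, not the reference solution: the helper name `simulate_die`, the seed and the series lengths are illustrative assumptions that do not come from the exercise.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "def simulate_die(n, seed=0):\n",
    "    # Hypothetical helper (name and seed are assumptions, not from the exercise).\n",
    "    rng = np.random.default_rng(seed)\n",
    "    # choice() draws all values with equal probability when no p is given.\n",
    "    rolls = rng.choice([1, 2, 3, 4, 5, 6], size=n)\n",
    "    a = rolls < 4        # event A\n",
    "    b = rolls % 2 == 0   # event B\n",
    "    p_a, p_b, p_ab = a.mean(), b.mean(), (a & b).mean()\n",
    "    # Relative-frequency estimate of P(A|B); undefined if B never occurred.\n",
    "    p_a_given_b = p_ab / p_b if p_b > 0 else float('nan')\n",
    "    return p_a, p_b, p_ab, p_a_given_b\n",
    "\n",
    "\n",
    "for n in (100, 10_000, 1_000_000):\n",
    "    p_a, p_b, p_ab, p_a_given_b = simulate_die(n)\n",
    "    print(f'n={n}: P(A)={p_a:.4f} P(B)={p_b:.4f} P(A and B)={p_ab:.4f} P(A|B)={p_a_given_b:.4f}')\n",
    "```\n",
    "\n",
    "With growing series length, the estimate should settle near the theoretical value $P(A|B) = \\frac{1/6}{1/2} = \\frac{1}{3}$."
   ]
  },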
  {
   "cell_type": "markdown",
   "id": "fc9f88f5-6511-4377-9256-e73f85b1d8c0",
   "metadata": {},
   "source": [
    "---\n",
    "---\n",
    "## Exercise 2\n",
    "Conditional probabilities II\n",
    "\n",
    "In this exercise, we want to review the result of Example 2.4.3 on page 77 from the script [SW11_Probability](SW11_Probability.pdf) and deepen it on the basis of a simulation.\n",
    "\n",
    "We consider the following case (analogous to Example 2.4.3):\n",
    "\n",
    "Given a medical test for the detection of a virus in which the disease is initially asymptomatic - i.e. without recognizable symptoms - but can lead to a severe illness at a later stage. We assume that $1\\mskip3mu \\%$ of the population carries the virus (hereafter referred to as “sick” for simplicity) and $99\\mskip3mu \\%$ does not (\"healthy\"). The test was applied during the approval phase by the pharmaceutical company to a sample of healthy and sick people - information obtained by use of other tests - and provided the following results:\n",
    "\n",
    "- For sick people, the test is positive in $95\\mskip3mu \\%$ of cases (and therefore negative in $5\\mskip3mu \\%$)\n",
    "- For healthy individuals, the test is negative in $98\\mskip3mu \\%$ of cases (and thus positive in $2\\mskip3mu \\%$)\n",
    "\n",
    "We use the labels below:\n",
    "\n",
    "- ${K}$: Person is sick (from the German word \"Krank\")\n",
    "- $\\overline{K}$: Person is healthy\n",
    "- $T^+$: Test is positive\n",
    "- $T^-$: Test is negative\n",
    "\n",
    "Now the test is applied to an arbitrary person from the entire population.  \n",
    "\n",
    "---\n",
    "\n",
    "a) Assign the following probabilities correctly based on the information given in the exercise:\n",
    "\n",
    "- $P(K)$: Probability that the person is sick.\n",
    "- $P(\\overline{K})$: Probability that the person is healthy.\n",
    "- $P(T^+|K)$: Probability that the test is positive under the condition that the person is sick.\n",
    "- $P(T^-|K)$: Probability that the test is negative under the condition that the person is sick.\n",
    "- $P(T^+|\\overline{K})$: Probability that the test is positive under the condition that the person is healthy.\n",
    "- $P(T^-|\\overline{K})$: Probability that the test is negative under the condition that the person is healthy.\n",
    "\n",
    "### Solution:\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "50255139-4f1f-441c-9b87-408b55b33953",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "cafbcaae-da80-45b6-93ce-83a465d84950",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "b) Create the corresponding probability tree, starting at the root with the decision $K \\leftrightarrow \\overline{K}$. Indicate the corresponding probability at the branches (symbol + numerical value).\n",
    "\n",
    "### Solution:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "38ca33c4-3419-4011-962a-aac8814cbaa6",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "ea76bd6b-d23f-4f7f-aec3-01437bbbe855",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "b) Now determine the following four probabilities and complete the probability tree:\n",
    "\n",
    "- $P(K \\cap T^+)$: Probability that the person is sick *and* that the test is positive.\n",
    "- $P(K \\cap T^-)$: Probability that the person is sick *and* that the test is negative.\n",
    "- $P(\\overline{K}\\cap T^+)$: Probability that the person is healthy *and* that the test is positive.\n",
    "- $P(\\overline{K}\\cap T^-)$: Probability that the person is healthy *and* that the test is negative.\n",
    "\n",
    "\n",
    "### Solution:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7f765126-03f9-4d34-ad59-a2851bf406d4",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "7b5c2234-5495-4b1c-8c8d-39dd11e1dcbd",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "c) Now we want to illustrate these values again with a simulation. Therefore, we use the [numpy](https://numpy.org) random generator [numpy.random.Generator](https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.choice.html#numpy.random.Generator.choice) again to generate random series that simulate the probabilities assumed above. To do this, proceed in two steps.\n",
    "\n",
    "1. Generate a series of random numbers $\\in \\{0,1\\}$, which represents the ratio of healthy to sick people in the population. Use a variable length for the series.\n",
    "2. Then determine the *simulated* frequencies for the healthy and sick people (these values will differ slightly from the theoretical ones) and then generate another series of random numbers $\\in \\{0,1\\}$ representing the result of the respective medical test. This means that the test is positive (negative) for 95% (98%) of the simulated number of sick (healthy) people.\n",
    "\n",
    "Then compare the values found - i.e. simulated - for the probabilities from b) with the theoretical values determined above. Again, vary the length of the random series and examine the influence on the result.\n",
    "\n",
    "### Solution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2bd8b38a-afae-4389-a514-03db803de094",
   "metadata": {},
   "outputs": [],
   "source": []
  },
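  {
   "cell_type": "markdown",
   "id": "sketch-ex2c",
   "metadata": {},
   "source": [
    "The two-step simulation from c) could be sketched as follows. The helper names `simulate_population` and `joint_frequencies`, the encoding (1 = sick or positive, 0 = healthy or negative), the seed and the series lengths are assumptions made for this illustration; other solutions are equally valid.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "def simulate_population(n, seed=0):\n",
    "    # Hypothetical helper; encoding 1 = sick / positive is an assumption.\n",
    "    rng = np.random.default_rng(seed)\n",
    "    # Step 1: health status with P(K) = 0.01.\n",
    "    sick = rng.choice([0, 1], size=n, p=[0.99, 0.01]).astype(bool)\n",
    "    n_sick = int(sick.sum())\n",
    "    # Step 2: test results, drawn separately for the simulated sick and healthy people.\n",
    "    positive = np.zeros(n, dtype=bool)\n",
    "    positive[sick] = rng.choice([0, 1], size=n_sick, p=[0.05, 0.95]).astype(bool)\n",
    "    positive[~sick] = rng.choice([0, 1], size=n - n_sick, p=[0.98, 0.02]).astype(bool)\n",
    "    return sick, positive\n",
    "\n",
    "\n",
    "def joint_frequencies(sick, positive):\n",
    "    # Relative frequencies of the four leaves of the probability tree.\n",
    "    return {\n",
    "        'K and T+': (sick & positive).mean(),\n",
    "        'K and T-': (sick & ~positive).mean(),\n",
    "        'not K and T+': (~sick & positive).mean(),\n",
    "        'not K and T-': (~sick & ~positive).mean(),\n",
    "    }\n",
    "\n",
    "\n",
    "for n in (1_000, 100_000, 1_000_000):\n",
    "    sick, positive = simulate_population(n)\n",
    "    print(n, joint_frequencies(sick, positive))\n",
    "```\n",
    "\n",
    "For comparison, multiplying along the branches of the tree gives the theoretical values $0.0095$, $0.0005$, $0.0198$ and $0.9702$. For short series, the rare event $K$ may occur only a handful of times, so the first two frequencies fluctuate strongly."
   ]
  },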
  {
   "cell_type": "markdown",
   "id": "b711fcfe-08b9-4312-9c60-8820be4b9e11",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "d) Now create an alternative probability tree in which you start at the root with the decision $T^+\\leftrightarrow T^-$. Label all branches and the leaves with the corresponding (symbolic) probabilities and make sure you understand their meaning. We will postpone the calculations for a little while.\n",
    "\n",
    "### Solution"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3d77fdf7-d3cb-449f-b5e8-1691be5e94a5",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "8c9006c2-5858-47ed-8efa-3b3e04a079ee",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "e) What can you say about the probabilities of the leaves (without calculation)?\n",
    "\n",
    "### Solution"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1ce656c1-85c9-44ff-b174-10168a9475a8",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "5b76c955-a3ac-4790-9ed3-a326cbee45b4",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "f) Now determine the values of the probabilities at the branches of the new tree, but first from the simulated data, i.e. purely by counting the corresponding (simulated) events.\n",
    "\n",
    "### Solution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "15472026-b55f-4d3c-a7eb-5175d60c02bc",
   "metadata": {},
   "outputs": [],
   "source": []
  },
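  {
   "cell_type": "markdown",
   "id": "sketch-ex2f",
   "metadata": {},
   "source": [
    "One way to obtain the branch probabilities of the new tree purely by counting is sketched below. To keep the block self-contained, it re-simulates the data with a compact shortcut (comparing uniform random numbers with a per-person threshold) instead of the two-step `choice` procedure described in c); the variable names, the seed and the series length are assumptions.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(0)  # seed chosen for reproducibility (an assumption)\n",
    "n = 1_000_000                   # series length, an illustrative choice\n",
    "\n",
    "# Compact re-simulation: True = sick, with probability 0.01.\n",
    "sick = rng.random(n) < 0.01\n",
    "# Probability of a positive test per person: 0.95 if sick, 0.02 if healthy.\n",
    "positive = rng.random(n) < np.where(sick, 0.95, 0.02)\n",
    "\n",
    "# Root of the new tree: share of positive and negative tests.\n",
    "p_pos = positive.mean()\n",
    "p_neg = 1 - p_pos\n",
    "\n",
    "# Branches: share of sick people within the positive and negative groups.\n",
    "p_k_given_pos = sick[positive].mean()\n",
    "p_k_given_neg = sick[~positive].mean()\n",
    "\n",
    "print(f'P(T+)={p_pos:.4f}  P(T-)={p_neg:.4f}')\n",
    "print(f'P(K|T+)={p_k_given_pos:.4f}  P(not K|T+)={1 - p_k_given_pos:.4f}')\n",
    "print(f'P(K|T-)={p_k_given_neg:.6f}  P(not K|T-)={1 - p_k_given_neg:.6f}')\n",
    "```\n",
    "\n",
    "Boolean indexing such as `sick[positive]` restricts the count to the people with a positive test, which is exactly the conditioning on $T^+$."
   ]
  },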
  {
   "cell_type": "markdown",
   "id": "9fdfccfa-0226-4928-8aba-77c214ce6f6f",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "g) Now determine the theoretical values for the probabilities from f) and compare them with the simulated values. Use Bayes' formula for the conditional probabilities.\n",
    "\n",
    "### Solution"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "24d3e42b-d0f9-485a-a148-11d4998c4c91",
   "metadata": {},
   "source": []
  },
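  {
   "cell_type": "markdown",
   "id": "sketch-ex2g",
   "metadata": {},
   "source": [
    "The theoretical values for g) can be checked numerically with a few lines of Python. This is a sketch using the numbers given in the exercise; the variable names are illustrative assumptions.\n",
    "\n",
    "```python\n",
    "# Given values from the exercise.\n",
    "p_k = 0.01                # P(K)\n",
    "p_pos_given_k = 0.95      # P(T+|K)\n",
    "p_neg_given_not_k = 0.98  # P(T-|not K)\n",
    "\n",
    "# Complementary probabilities.\n",
    "p_not_k = 1 - p_k\n",
    "p_neg_given_k = 1 - p_pos_given_k\n",
    "p_pos_given_not_k = 1 - p_neg_given_not_k\n",
    "\n",
    "# Law of total probability for the root of the new tree.\n",
    "p_pos = p_pos_given_k * p_k + p_pos_given_not_k * p_not_k\n",
    "p_neg = 1 - p_pos\n",
    "\n",
    "# Bayes' formula for the branches of the new tree.\n",
    "p_k_given_pos = p_pos_given_k * p_k / p_pos\n",
    "p_k_given_neg = p_neg_given_k * p_k / p_neg\n",
    "\n",
    "print(f'P(T+)={p_pos:.4f}  P(T-)={p_neg:.4f}')\n",
    "print(f'P(K|T+)={p_k_given_pos:.4f}  P(not K|T+)={1 - p_k_given_pos:.4f}')\n",
    "print(f'P(K|T-)={p_k_given_neg:.6f}  P(not K|T-)={1 - p_k_given_neg:.6f}')\n",
    "```\n",
    "\n",
    "Here Bayes' formula is used in the form $P(K|T^+) = \\dfrac{P(T^+|K)\\,P(K)}{P(T^+)}$, with $P(T^+)$ obtained from the law of total probability."
   ]
  },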
  {
   "cell_type": "markdown",
   "id": "be1f7e2a-0c94-41c0-897b-c6e927716b8a",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "h) Finally, add the theoretical (numerical) values to all the probabilities on the branches and leaves in the new tree.\n",
    "\n",
    "### Solution"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9b33040e-ae02-41af-b146-4da577001b9e",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "97466433-7d01-4df2-b740-e452f77315c7",
   "metadata": {},
   "source": [
    "Supplementary note:\n",
    "\n",
    "The two probabilities $P(T^+|K)$ and $P(K|T^+)$ are highly relevant in the field of machine learning (ML) - a branch of artificial intelligence. ML often deals with decisions, so-called classifiers (here our test) for the automated distinction of a given set into different categories (here sick and healthy persons). The probability $P(T^+|K)$ denotes the so-called sensitivity (recall) and is a measure of the proportion of the category ($K$) - which should be detected by the classifier (here sick person) - that is actually recognized:\n",
    "\n",
    "$P(T^+|K) = \\dfrac{P(T^+ \\cap K)}{P(K)}$\n",
    "\n",
    "$P(K|T^+)$, on the other hand, is the so-called precision and indicates how large the proportion of correct detections $P(K \\cap T^+)$ within all detections of the classifier $(T^+)$ really is:\n",
    "\n",
    "$P(K|T^+) = \\dfrac{P(K \\cap T^+)}{P(T^+)}$\n",
    "\n",
    "Sensitivity and relevance are mutually dependent and an increase in one value inevitably leads to a decrease in the other. Depending on the problem, a classifier with high sensitivity or high relevance is sought.\n",
    "\n",
    "- For the medical test, we are looking for a high sensitivity (here $P(T^+|K)=0.95$) so that sick people are reliably detected. Although there are relatively many errors (relevance “only” $P(K|T^+) = 0.3242$), if the person is healthy - despite a positive test - this will quickly become apparent in the subsequent examinations. “Missing” a sick person and the outbreak of the disease would have much more serious consequences.\n",
    "- In contrast to this, we are looking for a high relevance e.g. in case of a SPAM filter. I.e. an email marked as SPAM - which is automatically sorted out - should be one with a probability close to certainty. Even if not all SPAM are recognized - i.e. sensitivity is not close to 100% - this is less serious, as we can usually recognize  a SPAM at first sight and simply delete it. However, if 80% or even 90% of SPAM emails are automatically pre-filtered, this is already a great help.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4d6c713a-cfe6-4d8f-8b0b-a5dc2b5922e2",
   "metadata": {},
   "source": [
    "---\n",
    "---\n",
    "## Exercise 3\n",
    "Independence with probabilities\n",
    "\n",
    "In this exercise, we want to review the considerations of Example 2.6.1 on page 92 from the script [SW11_Probability](SW11_Probability.pdf) and deepen them on the basis of a simulation.\n",
    "\n",
    "We consider the following somewhat more complex case:\n",
    "\n",
    "We flip a fair coin three times and define the two events:\n",
    "\n",
    "- $A=\\{$at least 1 time heads in the 1st and 2nd toss$\\}$\n",
    "- $B=\\{$heads on the 3rd toss$\\}$\n",
    "\n",
    "We will now first simulate the probabilities $P(A)$ and $P(B)$ and try to check empirically - i.e. using the simulation - whether the two outcomes are dependent or independent. What does your intuition say?\n",
    "\n",
    "---\n",
    "\n",
    "a) Use the [numpy](https://numpy.org) random generator [numpy.random.Generator](https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.choice.html#numpy.random.Generator.choice) to simulate the above probabilities. \n",
    "\n",
    "To do this, generate a random series - of parameterizable length - of the values $\\{0,1,2,...7\\}$ with equal probability, whereby the bit sequence of the number corresponds to the three successive coin tosses as follows ($H$=heads,$T$=Tails):\n",
    "\n",
    "- $0 = 000 \\leftrightarrow HHH$\n",
    "- $1 = 001 \\leftrightarrow HHT$\n",
    "- $2 = 010 \\leftrightarrow HTH$\n",
    "- $3 = 011 \\leftrightarrow HTT$\n",
    "- $4 = 100 \\leftrightarrow THH$\n",
    "- $5 = 101 \\leftrightarrow THT$\n",
    "- $6 = 110 \\leftrightarrow TTH$\n",
    "- $7 = 111 \\leftrightarrow TTT$\n",
    "\n",
    "Then determine the frequencies of the events $A$ and $B$ and check their independence empirically. \n",
    "\n",
    "### Solution\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "77caf167-f35a-4612-8ce8-62e048f7e54d",
   "metadata": {},
   "outputs": [],
   "source": []
  },
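  {
   "cell_type": "markdown",
   "id": "sketch-ex3a",
   "metadata": {},
   "source": [
    "The simulation from a) could be sketched as follows, assuming the bit encoding from the table above (bit value 0 = heads, 1 = tails, leftmost bit = first toss). The variable names, the seed and the series length are illustrative assumptions.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(0)  # seed chosen for reproducibility (an assumption)\n",
    "n = 200_000                     # series length, an illustrative choice\n",
    "\n",
    "# choice(8, ...) draws the integers 0..7 with equal probability.\n",
    "outcomes = rng.choice(8, size=n)\n",
    "\n",
    "# Decode the three bits; bit value 0 means heads, 1 means tails.\n",
    "first = (outcomes >> 2) & 1\n",
    "second = (outcomes >> 1) & 1\n",
    "third = outcomes & 1\n",
    "\n",
    "a = (first == 0) | (second == 0)  # A: at least one head in tosses 1 and 2\n",
    "b = third == 0                    # B: heads on the 3rd toss\n",
    "\n",
    "p_a, p_b, p_ab = a.mean(), b.mean(), (a & b).mean()\n",
    "print(f'P(A)={p_a:.4f}  P(B)={p_b:.4f}')\n",
    "print(f'P(A and B)={p_ab:.4f}  P(A)*P(B)={p_a * p_b:.4f}')\n",
    "```\n",
    "\n",
    "If the two events are independent, the simulated $P(A \\cap B)$ and the product $P(A) \\cdot P(B)$ should agree up to sampling noise, and the gap should shrink as the series gets longer."
   ]
  },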
  {
   "cell_type": "markdown",
   "id": "fbbef20c-7ad6-4ee6-8391-43262a6bba12",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "b) Now verify theoretically that the two events $A$ and $B$ are indeed independent of each other.\n",
    "\n",
    "### Solution"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "93fc7f07-1e9c-491e-b6a3-097032af9c90",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "33d28dd3-0efe-410b-b59e-c2f4a75c67cf",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}