{ "cells": [ { "cell_type": "markdown", "id": "915c3bfd-e01f-465f-8cb7-8f3691735d1f", "metadata": {}, "source": [ "# SW11 InClass - Repetition and programming\n", "\n", "This file is intended for interactive repetition in the first lesson in class. You write directly into the empty cells." ] }, { "cell_type": "markdown", "id": "ff6b108c-f402-4e91-8312-a478f6123ae3", "metadata": {}, "source": [ "---\n", "---\n", "## Exercise 1\n", "Conditional probabilities I\n", "\n", "In this task, we want to review the result of Example 2.4.1 on page 73 from the script [SW11_Probability](SW11_Probability.pdf) and deepen it on the basis of a simulation.\n", "\n", "We consider the following case (analogous to Example 2.4.1):\n", "\n", "We use a fair die and define the two events:\n", "\n", "- $A=\\{\\text{number rolled} < 4\\}$\n", "- $B=\\{\\text{number rolled is even}\\}$\n", "\n", "---\n", "\n", "a) Determine the probability $P(A|B)$, i.e. the probability that the number rolled is less than 4 given that it is even. First, represent the events $A$, $B$ and $A \\cap B$ graphically as a Venn diagram.\n", "\n", "### Solution\n" ] }, { "cell_type": "markdown", "id": "52309702-acc5-42da-8eba-aa7b29661b42", "metadata": {}, "source": [] }, { "cell_type": "markdown", "id": "ba300354-4943-439f-a978-2b4d06e26421", "metadata": {}, "source": [ "---\n", "\n", "b) Now use the [numpy](https://numpy.org) random generator [numpy.random.Generator](https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.choice.html#numpy.random.Generator.choice) to simulate the above probabilities.\n", "\n", "To do this, generate a random series - of parameterizable length - of the values $\\{1,2,3,4,5,6\\}$ with equal probability, since it is a fair die. Then determine the frequencies of the events $A$ and $B$ and of $A \\cap B$. Then determine the conditional probability $P(A|B)$ from these frequencies and compare it with the theoretical value. 
Also vary the length of the random series and examine the effects on the frequencies and the probability.\n", "\n", "### Solution\n" ] }, { "cell_type": "code", "execution_count": null, "id": "ff753311-191c-4cb4-b6f5-211b530a637d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "fc9f88f5-6511-4377-9256-e73f85b1d8c0", "metadata": {}, "source": [ "---\n", "---\n", "## Exercise 2\n", "Conditional probabilities II\n", "\n", "In this exercise, we want to review the result of Example 2.4.3 on page 77 from the script [SW11_Probability](SW11_Probability.pdf) and deepen it on the basis of a simulation.\n", "\n", "We consider the following case (analogous to Example 2.4.3):\n", "\n", "Consider a medical test for detecting a virus. The infection is initially asymptomatic - i.e. without recognizable symptoms - but can lead to severe illness at a later stage. We assume that $1\\mskip3mu \\%$ of the population carries the virus (hereafter referred to as \"sick\" for simplicity) and $99\\mskip3mu \\%$ does not (\"healthy\"). During the approval phase, the pharmaceutical company applied the test to a sample of people whose status (healthy or sick) was known from other tests, with the following results:\n", "\n", "- For sick people, the test is positive in $95\\mskip3mu \\%$ of cases (and therefore negative in $5\\mskip3mu \\%$)\n", "- For healthy people, the test is negative in $98\\mskip3mu \\%$ of cases (and therefore positive in $2\\mskip3mu \\%$)\n", "\n", "We use the labels below:\n", "\n", "- $K$: Person is sick (from the German word \"krank\")\n", "- $\\overline{K}$: Person is healthy\n", "- $T^+$: Test is positive\n", "- $T^-$: Test is negative\n", "\n", "Now the test is applied to a randomly chosen person from the entire population. 
\n", "\n", "---\n", "\n", "a) Assign the following probabilities correctly based on the information given in the exercise:\n", "\n", "- $P(K)$: Probability that the person is sick.\n", "- $P(\\overline{K})$: Probability that the person is healthy.\n", "- $P(T^+|K)$: Probability that the test is positive under the condition that the person is sick.\n", "- $P(T^-|K)$: Probability that the test is negative under the condition that the person is sick.\n", "- $P(T^+|\\overline{K})$: Probability that the test is positive under the condition that the person is healthy.\n", "- $P(T^-|\\overline{K})$: Probability that the test is negative under the condition that the person is healthy.\n", "\n", "### Solution:\n" ] }, { "cell_type": "markdown", "id": "50255139-4f1f-441c-9b87-408b55b33953", "metadata": {}, "source": [] }, { "cell_type": "markdown", "id": "cafbcaae-da80-45b6-93ce-83a465d84950", "metadata": {}, "source": [ "---\n", "\n", "b) Create the corresponding probability tree, starting at the root with the decision $K \\leftrightarrow \\overline{K}$. 
Indicate the corresponding probability at the branches (symbol + numerical value).\n", "\n", "### Solution:" ] }, { "cell_type": "markdown", "id": "38ca33c4-3419-4011-962a-aac8814cbaa6", "metadata": {}, "source": [] }, { "cell_type": "markdown", "id": "ea76bd6b-d23f-4f7f-aec3-01437bbbe855", "metadata": {}, "source": [ "---\n", "\n", "c) Now determine the following four probabilities and complete the probability tree:\n", "\n", "- $P(K \\cap T^+)$: Probability that the person is sick *and* that the test is positive.\n", "- $P(K \\cap T^-)$: Probability that the person is sick *and* that the test is negative.\n", "- $P(\\overline{K}\\cap T^+)$: Probability that the person is healthy *and* that the test is positive.\n", "- $P(\\overline{K}\\cap T^-)$: Probability that the person is healthy *and* that the test is negative.\n", "\n", "\n", "### Solution:" ] }, { "cell_type": "markdown", "id": "7f765126-03f9-4d34-ad59-a2851bf406d4", "metadata": {}, "source": [] }, { "cell_type": "markdown", "id": "7b5c2234-5495-4b1c-8c8d-39dd11e1dcbd", "metadata": {}, "source": [ "---\n", "\n", "d) Now we want to illustrate these values again with a simulation. To do so, we again use the [numpy](https://numpy.org) random generator [numpy.random.Generator](https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.choice.html#numpy.random.Generator.choice) to generate random series that simulate the probabilities assumed above. Proceed in two steps.\n", "\n", "1. Generate a series of random numbers $\\in \\{0,1\\}$ that represents the healthy and sick people in the population in the assumed proportions. Use a variable length for the series.\n", "2. Then determine the *simulated* frequencies for the healthy and sick people (these values will differ slightly from the theoretical ones) and then generate another series of random numbers $\\in \\{0,1\\}$ representing the result of the respective medical test. 
This means that the test is positive (negative) for 95% (98%) of the simulated number of sick (healthy) people.\n", "\n", "Then compare the values found - i.e. simulated - for the probabilities from c) with the theoretical values determined above. Again, vary the length of the random series and examine the influence on the result.\n", "\n", "### Solution" ] }, { "cell_type": "code", "execution_count": null, "id": "2bd8b38a-afae-4389-a514-03db803de094", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "b711fcfe-08b9-4312-9c60-8820be4b9e11", "metadata": {}, "source": [ "---\n", "\n", "e) Now create an alternative probability tree in which you start at the root with the decision $T^+\\leftrightarrow T^-$. Label all branches and the leaves with the corresponding (symbolic) probabilities and make sure you understand their meaning. We will postpone the calculations for a little while.\n", "\n", "### Solution" ] }, { "cell_type": "markdown", "id": "3d77fdf7-d3cb-449f-b5e8-1691be5e94a5", "metadata": {}, "source": [] }, { "cell_type": "markdown", "id": "8c9006c2-5858-47ed-8efa-3b3e04a079ee", "metadata": {}, "source": [ "---\n", "\n", "f) What can you say about the probabilities of the leaves (without calculation)?\n", "\n", "### Solution" ] }, { "cell_type": "markdown", "id": "1ce656c1-85c9-44ff-b174-10168a9475a8", "metadata": {}, "source": [] }, { "cell_type": "markdown", "id": "5b76c955-a3ac-4790-9ed3-a326cbee45b4", "metadata": {}, "source": [ "---\n", "\n", "g) Now determine the values of the probabilities at the branches of the new tree, but first from the simulated data, i.e. 
purely by counting the corresponding (simulated) events.\n", "\n", "### Solution" ] }, { "cell_type": "code", "execution_count": null, "id": "15472026-b55f-4d3c-a7eb-5175d60c02bc", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "9fdfccfa-0226-4928-8aba-77c214ce6f6f", "metadata": {}, "source": [ "---\n", "\n", "h) Now determine the theoretical values for the probabilities from g) and compare them with the simulated values. Use Bayes' formula for the conditional probabilities.\n", "\n", "### Solution" ] }, { "cell_type": "markdown", "id": "24d3e42b-d0f9-485a-a148-11d4998c4c91", "metadata": {}, "source": [] }, { "cell_type": "markdown", "id": "be1f7e2a-0c94-41c0-897b-c6e927716b8a", "metadata": {}, "source": [ "---\n", "\n", "i) Finally, add the theoretical (numerical) values to all the probabilities on the branches and leaves in the new tree.\n", "\n", "### Solution" ] }, { "cell_type": "markdown", "id": "9b33040e-ae02-41af-b146-4da577001b9e", "metadata": {}, "source": [] }, { "cell_type": "markdown", "id": "97466433-7d01-4df2-b740-e452f77315c7", "metadata": {}, "source": [ "Supplementary note:\n", "\n", "The two probabilities $P(T^+|K)$ and $P(K|T^+)$ are highly relevant in the field of machine learning (ML) - a branch of artificial intelligence. ML often deals with classifiers (here our test), i.e. decision procedures that automatically sort the elements of a given set into different categories (here sick and healthy persons). 
The probability $P(T^+|K)$ is called the sensitivity (or recall). It measures the proportion of the category to be detected by the classifier ($K$, here sick persons) that is actually recognized:\n", "\n", "$P(T^+|K) = \\dfrac{P(T^+ \\cap K)}{P(K)}$\n", "\n", "$P(K|T^+)$, on the other hand, is called the precision. It indicates the proportion of correct detections $P(K \\cap T^+)$ among all detections of the classifier $(T^+)$:\n", "\n", "$P(K|T^+) = \\dfrac{P(K \\cap T^+)}{P(T^+)}$\n", "\n", "Sensitivity and precision are usually in tension: for a given classifier, tuning it to increase one value typically decreases the other. Depending on the problem, a classifier with high sensitivity or high precision is sought.\n", "\n", "- For the medical test, we are looking for a high sensitivity (here $P(T^+|K)=0.95$) so that sick people are reliably detected. This comes with relatively many false positives (precision of “only” $P(K|T^+) = 0.3242$), but if a person is healthy despite a positive test, this will quickly become apparent in the subsequent examinations. “Missing” a sick person, followed by an outbreak of the disease, would have much more serious consequences.\n", "- In contrast, we are looking for a high precision in the case of a SPAM filter, for example. An email marked as SPAM - which is automatically sorted out - should really be SPAM with a probability close to certainty. It is less serious if not all SPAM is recognized - i.e. if the sensitivity is not close to 100% - as we can usually recognize SPAM at first sight and simply delete it. 
However, if 80% or even 90% of SPAM emails are automatically pre-filtered, this is already a great help.\n" ] }, { "cell_type": "markdown", "id": "4d6c713a-cfe6-4d8f-8b0b-a5dc2b5922e2", "metadata": {}, "source": [ "---\n", "---\n", "## Exercise 3\n", "Independence of events\n", "\n", "In this exercise, we want to review the considerations of Example 2.6.1 on page 92 from the script [SW11_Probability](SW11_Probability.pdf) and deepen them on the basis of a simulation.\n", "\n", "We consider the following somewhat more complex case:\n", "\n", "We flip a fair coin three times and define the two events:\n", "\n", "- $A=\\{$at least one head in the first two tosses$\\}$\n", "- $B=\\{$heads on the 3rd toss$\\}$\n", "\n", "We will now first simulate the probabilities $P(A)$ and $P(B)$ and try to check empirically - i.e. using the simulation - whether the two events are dependent or independent. What does your intuition say?\n", "\n", "---\n", "\n", "a) Use the [numpy](https://numpy.org) random generator [numpy.random.Generator](https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.choice.html#numpy.random.Generator.choice) to simulate the above probabilities.\n", "\n", "To do this, generate a random series - of parameterizable length - of the values $\\{0,1,2,\\dots,7\\}$ with equal probability, where the bit sequence of each number corresponds to the three successive coin tosses as follows ($H$ = heads, $T$ = tails):\n", "\n", "- $0 = 000 \\leftrightarrow HHH$\n", "- $1 = 001 \\leftrightarrow HHT$\n", "- $2 = 010 \\leftrightarrow HTH$\n", "- $3 = 011 \\leftrightarrow HTT$\n", "- $4 = 100 \\leftrightarrow THH$\n", "- $5 = 101 \\leftrightarrow THT$\n", "- $6 = 110 \\leftrightarrow TTH$\n", "- $7 = 111 \\leftrightarrow TTT$\n", "\n", "Then determine the frequencies of the events $A$, $B$ and $A \\cap B$ and check the independence of $A$ and $B$ empirically. 
\n", "\n", "### Solution\n" ] }, { "cell_type": "code", "execution_count": null, "id": "77caf167-f35a-4612-8ce8-62e048f7e54d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "fbbef20c-7ad6-4ee6-8391-43262a6bba12", "metadata": {}, "source": [ "---\n", "\n", "b) Now verify theoretically that the two events $A$ and $B$ are indeed independent of each other.\n", "\n", "### Solution" ] }, { "cell_type": "markdown", "id": "93fc7f07-1e9c-491e-b6a3-097032af9c90", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "id": "33d28dd3-0efe-410b-b59e-c2f4a75c67cf", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.18" } }, "nbformat": 4, "nbformat_minor": 5 }