
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SW11 - Flipped Classroom\n",
    "\n",
    "*This Jupyter notebook is intended for practicing the theory accompanying the slides and is used as a “practical review of the theory input”. There are no sample solutions and the file is not corrected and does not have to be handed in.*\n",
    "\n",
    "In SW11, the theory script includes various simulations to illustrate the theoretical concepts of the lecture notes [SW11_Probability](SW11_Probability.pdf).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Conditional probability\n",
    "\n",
    "A statistic has shown that all women in the population of a country ($52\\%$) indicated the following preferences for their favorite color: \n",
    "\n",
    "- Red: $36\\%$\n",
    "- Blue: $16\\%$\n",
    "- Green: $48\\%$\n",
    "\n",
    "For men ($48\\%$), the results were as follows: \n",
    "\n",
    "- Red: $32\\%$\n",
    "- Blue: $53\\%$\n",
    "- Green: $15\\%$\n",
    "\n",
    "We use the following labels:\n",
    "\n",
    "- $F$: Person is female\n",
    "- $M$: Person is a male\n",
    "- $R$: Favorite color is red\n",
    "- $B$: Favorite color is blue\n",
    "- $G$: Favorite color is green\n",
    "\n",
    "We will now create a probability tree, starting at the root with the decision $F \\leftrightarrow M$ - i.e. person is woman or man. We will plot the following probabilities on the branches of the tree: \n",
    "\n",
    "- $P(F)=0.52$: Probability that the person is a woman.\n",
    "- $P(M)=0.48$: Probability that the person is a man.\n",
    "- $P(R|F)=0.36$: Probability that the favorite color is red under the condition that the person is a woman. (analogous $B$ and $G$)\n",
    "- $P(R|M)=0.32$: Probability that the favorite color is red under the condition that the person is a man. (analogous $B$ and $G$).\n",
    "\n",
    "This results in the following probability tree:\n",
    "\n",
    "<img src=img/Wahrscheinlichkeitsbaum1.png alt=Drawing width=400 />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "Now we add the probabilities at the leaves of the tree. There the following probabilities are required:\n",
    "\n",
    "- $P(F \\cap R)$: Probability that the person is a woman *and* that the favorite color is red. (analogous $B$ and $G$)\n",
    "- $P(M \\cap R)$: Probability that the person is a man *and* that the favorite color is red. (analogous $B$ and $G$)\n",
    "\n",
    "To determine the corresponding values, simply the product of the probabilities along the relevant path has to be calculated. E.g:\n",
    "\n",
    "$P(F \\cap R) = P(F) \\cdot P(R|F)=0.52 \\cdot 0.36 = 0.1872$\n",
    "\n",
    "This results in the complete probability tree as follows\n",
    "\n",
    "<img src=img/Wahrscheinlichkeitsbaum_full1.png alt=Drawing width=650 />\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "\n",
    "We now use the [numpy](https://numpy.org) random generator [numpy.random.Generator](https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.choice.html#numpy.random.Generator.choice) to simulate the above probabilities. In a first step, we create a random vector that represents the distribution of women ($52\\%$) to men ($48\\%$) in the population."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of women:  507 in percent:  50.7 %\n",
      "Number of man:  493 in percent:  49.3 %\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "\n",
    "#Definition of the probabilities \n",
    "#P for woman\n",
    "PF = 0.52\n",
    "#P for man\n",
    "PM = 1 - PF\n",
    "\n",
    "#Number of sample size (max. 10^6 !!)\n",
    "sampleSize = 1000\n",
    "\n",
    "#Random generator (set seed for reproducible values)\n",
    "rng = np.random.default_rng()\n",
    "\n",
    "#Random vector, 0 = woman, 1 = man\n",
    "randArr = rng.choice([0,1], sampleSize, p=[PF, PM])\n",
    "\n",
    "#for visual control of the random vector\n",
    "#print(randArr, '\\n')\n",
    "\n",
    "#number of women and men\n",
    "nF = np.sum(randArr==0)\n",
    "nM = np.sum(randArr==1)\n",
    "\n",
    "print('Number of women: ', nF, 'in percent: ', 100*nF/sampleSize, '%')\n",
    "print('Number of man: ', nM, 'in percent: ', 100*nM/sampleSize, '%')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we start different runs, the results vary due to randomness. As the length of the random vector increases (parameter `sampleSize`), the variations tend to decrease."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "Now we generate - first for women - another random vector that represents the distribution of favorite colors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of women with favorite color red:  190 in percent:  19.0 %\n",
      "Number of women with favorite color blue:  89 in percent:  8.9 %\n",
      "Number of women with favorite color green:  228 in percent:  22.8 %\n",
      "\n",
      "Control (must be zero):  0\n"
     ]
    }
   ],
   "source": [
    "#The cell [1] above must be executed first!\n",
    "\n",
    "#Definition of the probabilities \n",
    "#R, B, G for women\n",
    "PFR = 0.36\n",
    "PFB = 0.16\n",
    "PFG = 1 - PFR - PFB\n",
    "\n",
    "#Random vector, 0 = R, 1 = B, 2 = G, for women, therefore length of vector is 'nF' from above\n",
    "randArr = rng.choice([0,1,2], nF, p=[PFR, PFB, PFG])\n",
    "\n",
    "#for visual control of the random vector\n",
    "#print(randArr, '\\n')\n",
    "\n",
    "#number of women with the respective favorite color\n",
    "nFR = np.sum(randArr==0)\n",
    "nFB = np.sum(randArr==1)\n",
    "nFG = np.sum(randArr==2)\n",
    "\n",
    "print('Number of women with favorite color red: ', nFR, 'in percent: ', 100*nFR/sampleSize, '%')\n",
    "print('Number of women with favorite color blue: ', nFB, 'in percent: ', 100*nFB/sampleSize, '%')\n",
    "print('Number of women with favorite color green: ', nFG, 'in percent: ', 100*nFG/sampleSize, '%\\n')\n",
    "\n",
    "#Control of the total number\n",
    "print('Control (must be zero): ', nF - nFR - nFB - nFG)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that the number correspond to the theoretical probabilities within the scope of random uncertainty.\n",
    "\n",
    "---\n",
    "\n",
    "Now generate the corresponding random vector or the results for the men.\n",
    "\n",
    "### Solution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of women with favorite color red:  144 in percent:  14.4 %\n",
      "Number of women with favorite color blue:  270 in percent:  27.0 %\n",
      "Number of women with favorite color green:  79 in percent:  7.9 %\n",
      "\n",
      "Control (must be zero):  0\n"
     ]
    }
   ],
   "source": [
    "# Your code here\n",
    "\n",
    "PMR = 0.32\n",
    "PMB = 0.53\n",
    "PMG = 1 - PMR - PMB\n",
    "\n",
    "randArr = rng.choice([0,1,2],  nM, p=[PMR, PMB, PMG])\n",
    "\n",
    "nMR = np.sum(randArr==0)\n",
    "nMB = np.sum(randArr==1)\n",
    "nMG = np.sum(randArr==2)\n",
    "\n",
    "print('Number of women with favorite color red: ', nMR, 'in percent: ', 100*nMR/sampleSize, '%')\n",
    "print('Number of women with favorite color blue: ', nMB, 'in percent: ', 100*nMB/sampleSize, '%')\n",
    "print('Number of women with favorite color green: ', nMG, 'in percent: ', 100*nMG/sampleSize, '%\\n')\n",
    "\n",
    "print('Control (must be zero): ', nM - nMR - nMB - nMG)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "Now we determine the simulated conditional probabilities. For this, we only need to perform the normalization correctly, i.e. e.g:\n",
    "\n",
    "$P(R|F)= \\frac{P(F \\cap R)}{P(F)}$\n",
    "\n",
    "In other words, we need to divide by the number of women instead of the total length of the random vector."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Probability for favorite color red, under the condition that the person is a woman:  0.3747534516765286\n",
      "Probability for favorite color blue, under the condition that person is a woman:  0.1755424063116371\n",
      "Probability for favorite color green, under the condition that the person is a woman:  0.44970414201183434\n"
     ]
    }
   ],
   "source": [
    "print('Probability for favorite color red, under the condition that the person is a woman: ', nFR/nF)\n",
    "print('Probability for favorite color blue, under the condition that person is a woman: ', nFB/nF)\n",
    "print('Probability for favorite color green, under the condition that the person is a woman: ', nFG/nF)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that the numbers within the scope of the random uncertainty match the theoretical probabilities again.\n",
    "\n",
    "---\n",
    "\n",
    "Now determine the corresponding results for the men.\n",
    "\n",
    "### Solution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Probability for favorite color red, under the condition that the person is a man: 29.2%\n",
      "Probability for favorite color blue, under the condition that person is a man: 54.8%\n",
      "Probability for favorite color green, under the condition that the person is a man: 16.0%\n"
     ]
    }
   ],
   "source": [
    "# Your code here\n",
    "\n",
    "print(f'Probability for favorite color red, under the condition that the person is a man: {100*nMR/nM:.1f}%')\n",
    "print(f'Probability for favorite color blue, under the condition that person is a man: {100*nMB/nM:.1f}%')\n",
    "print(f'Probability for favorite color green, under the condition that the person is a man: {100*nMG/nM:.1f}%')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "---\n",
    "\n",
    "### Independence\n",
    "\n",
    "Finally, we want to examine whether the choice of color is independent of gender. If this was the case, the following would apply:\n",
    "\n",
    "$P(F \\cap R) = P(F) \\cdot P(R)$\n",
    "\n",
    "This means that the probability that a woman's favorite color is red $P(F \\cap R)$ is equal to the product of the probabilities that any person is a woman $P(F)$ and that the favorite color for any person - i.e. woman or man - is red $P(R)$. The same should then also apply to the other colors.\n",
    "\n",
    "We first test this on the basis of the simulations. However, we use the probabilities for the color green for the test. Think about why? \n",
    "\n",
    "We have already simulated the two probabilities $P(F \\cap G)$ and $P(F)$. We only need to determine the probability $P(G)$ that the favorite color for any person is green."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Probability for a woman:  0.507\n",
      "Probability that a woman`s favorite color is green:  0.228\n",
      "Probability that the favorite color of any person is green:  0.307 \n",
      "\n",
      "Product of P(F)*P(G):  0.155649\n"
     ]
    }
   ],
   "source": [
    "#the solution above for the men must be available\n",
    "\n",
    "#from above\n",
    "print('Probability for a woman: ', nF/sampleSize)\n",
    "print('Probability that a woman`s favorite color is green: ', nFG/sampleSize)\n",
    "\n",
    "#Number of all persons with favorite color green\n",
    "nG = nFG + nMG\n",
    "print('Probability that the favorite color of any person is green: ', nG/sampleSize,'\\n')\n",
    "\n",
    "#Independence test\n",
    "print('Product of P(F)*P(G): ', nF/sampleSize*nG/sampleSize)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There appears to be an obvious dependency between color affinity and gender, which is particularly evident for the colors blue and green. In the case of red, the simulated difference would hardly be recognizable within the uncertainties.\n",
    "\n",
    "Let's determine the theoretical values as a control:\n",
    "\n",
    "- $P(F)=0.25$\n",
    "- $P(F \\cap G)=0.2496$\n",
    "- $P(G)=P(F \\cap G) + P(M \\cap G) = 0.2496 + 0.072 = 0.3216$\n",
    "\n",
    "This results in:\n",
    "\n",
    "$P(F)\\cdot P(G) = 0.52 \\cdot 0.3216 = 0.1672$\n",
    "\n",
    "This is obviously not equal to $P(F \\cap G)=0.2496$, which also proves the dependency on the basis of the theoretical numbers."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
