{ "cells": [ { "cell_type": "markdown", "id": "4fbfd25e-2fb3-49e3-a6eb-f156645adf92", "metadata": {}, "source": [ "# SW09 InClass - Repetition and programming\n", "\n", "This file is intended for interactive repetition in the first lesson in class. You write directly into the empty cells." ] }, { "cell_type": "markdown", "id": "4d6c713a-cfe6-4d8f-8b0b-a5dc2b5922e2", "metadata": {}, "source": [ "---\n", "---\n", "## Exercise 1\n", "Significant digits and decimal places \n", "\n", "We want to understand the meaning of significant digits and decimal places and their influence on the accuracy of a calculation result." ] }, { "cell_type": "markdown", "id": "911a2878-e827-4176-b6f6-9b66d75445f8", "metadata": {}, "source": [ "---\n", "Two different devices are used to measure a voltage.\n", "\n", "- An inexpensive multimeter from Conrad shows a voltage value of $-1.00\\mskip3mu V$.\n", "- A high-precision measuring device displays the value $-1.000604\\mskip3mu V$.\n", "\n", "![img01.png](img/Messgeraet.png \"High-precision measuring device\")\n", "\n", "a) How many significant digits (labeled 'N') does each of the two measurements have?\n", "\n", "b) What is the relative accuracy of the measurements in percent? Assume that the measurement error, due to rounding, is at most 5 units of the `N+1`-th digit in each case.

\n", "*Note*: We should actually distinguish between the accuracy and resolution of the measuring devices. For the sake of simplicity, we are omitting this here.\n", "\n", "### Solution:" ] }, { "cell_type": "markdown", "id": "1e0ef709-3d94-4f68-b85f-7f6ae2c97b39", "metadata": {}, "source": [] }, { "cell_type": "markdown", "id": "e5107fe4-028d-4429-92d9-200ea992c4ac", "metadata": {}, "source": [ "---\n", "c) Enter the number of significant digits of the measured value in each case:\n", "\n", "- $0.001235 \\mskip3mu mm$\n", "- $0.0195 \\mskip3mu m$\n", "- $12.36 \\mskip3mu \\mu m$\n", "- $234.5 \\mskip3mu km$\n", "\n", "### Solution" ] }, { "cell_type": "markdown", "id": "5d0b392e-9d67-4e8f-bfd4-3f10ee1b1935", "metadata": {}, "source": [] }, { "cell_type": "markdown", "id": "20738f8a-f7da-4c8b-a443-68b53846276a", "metadata": {}, "source": [ "---\n", "You have two measured values, $L_1 = 102.3 \\mskip3mu m$ and $L_2 = 24.2 \\mskip3mu cm$.\n", "\n", "d) Which of the two values has the higher *relative* accuracy?\n", "\n", "e) Which of the two values has the higher *absolute* accuracy?\n", "\n", "f) Determine the result of the calculations $\\frac{L_1}{L_2}$ and $L_1+L_2$ to a reasonable number of digits.\n", "\n", "### Solution" ] }, { "cell_type": "markdown", "id": "99c027fb-613b-4a37-8843-4c2a5a76f162", "metadata": {}, "source": [] }, { "cell_type": "markdown", "id": "478f4cb9-595d-4545-b0e1-a9e7218c1d11", "metadata": {}, "source": [ "---\n", "---\n", "## Exercise 2\n", "`DataFrames` and `Series`\n", "\n", "We want to familiarize ourselves with the classes [DataFrames](https://pandas.pydata.org/docs/reference/frame.html) and [Series](https://pandas.pydata.org/docs/reference/series.html) of [pandas](https://pandas.pydata.org/)." 
] }, { "cell_type": "markdown", "id": "1d57632f-2965-43ff-95d1-ce07ccbbb0fd", "metadata": {}, "source": [ "---\n", "The following measured values (in $m$) are given: \n", "\n", " listA = \\\n", " \\[19.16, 21.08, 19.81, 18.84, 20.25, 20.03, 21.13, 20.19, 19.76, 19.82, 20.76, 20.55, 18.51, 21.36, 20.10, 20.81, 21.01, 18.63,\\\n", " 19.18, 21.17, 19.96, 20.95, 19.76, 19.41, 21.22, 20.38, 19.56, 21.43, 18.96, 19.42, 19.00, 19.81, 21.07, 18.69, 19.05, 20.53,\\\n", " 20.08, 19.57, 20.52, 18.95\\]\n", "\n", " listB = \\\n", " \\[20.19, 20.43, 19.42, 20.79, 21.74, 18.35, 20.46, 19.45, 19.82, 19.90, 20.52, 19.69, 20.01, 21.51, 18.48, 22.22, 20.43, 19.22,\\\n", " 19.91, 20.22, 20.24, 20.11, 18.34, 19.17, 19.92, 20.59, 20.18, 19.76, 19.22, 18.78, 20.75, 19.07, 20.69, 18.77, 19.60, 21.55,\\\n", " 19.47, 20.48, 19.82, 20.43\\]\n", "\n", "a) Create a [pandas](https://pandas.pydata.org/) [series](https://pandas.pydata.org/docs/reference/series.html) from the values and determine the mean value and the standard deviation for both sets.\n", "\n", "### Solution" ] }, { "cell_type": "code", "execution_count": null, "id": "12e8b37b-c0db-48bf-b1e6-2abf9aba9756", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "b2f4633a-1baf-4a11-ae98-4d8d07cab2dd", "metadata": {}, "source": [ "---\n", "b) Both series appear to be characterized by similar mean values and standard deviations. Plot the histograms for both series to examine the distribution of the values. 
Use the method [DataFrame.plot()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) with `kind='hist'`, which works for both `DataFrame` and `Series` objects.\n", "\n", "What do you find?\n", "\n", "### Solution" ] }, { "cell_type": "code", "execution_count": null, "id": "45bb0c8e-ac81-4eef-aeac-3a3d3f1b3f66", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "a0d59c2a-9385-4201-b3df-237b673d285c", "metadata": {}, "source": [ "---\n", "\n", "c) Now also create the cumulative histograms with [DataFrame.plot()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) (`kind='hist'` and parameter `cumulative=True`) and check whether their shapes are consistent with the standard histograms above.\n", "\n", "### Solution" ] }, { "cell_type": "code", "execution_count": null, "id": "3589304a-d7ab-4bb5-9f73-5347ef6bbc3c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "c3b45da5-9ff4-4d78-836d-8d7bdb3c03e8", "metadata": {}, "source": [ "---\n", "d) Combine both series into one [DataFrame](https://pandas.pydata.org/docs/reference/frame.html) and create a scatter plot of the values of series B over series A using the method [DataFrame.plot()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) with `kind='scatter'`.\n", "\n", "### Solution" ] }, { "cell_type": "code", "execution_count": null, "id": "dae94d5a-e7a1-4726-bfe1-e193cd3f61b2", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "27c91ca8-af9e-4bf8-9eb1-efa576744cd8", "metadata": {}, "source": [ "---\n", "e) Now create a boxplot for the two data series using the method [DataFrame.plot()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) with `kind='box'`.\n", "\n", "### Solution" ] }, { "cell_type": "code", "execution_count": null, "id": "2d78a099-9677-4751-aa20-f79961bf5426", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": 
"markdown", "id": "01afd477-c849-4956-8234-037fe79f08e7", "metadata": {}, "source": [ "---\n", "\n", "f) Determine numerically the values of the quartile differences of both series and compare them with the boxplot.\n", "\n", "### Solution" ] }, { "cell_type": "code", "execution_count": null, "id": "4e8758ef-0264-4e5d-941f-5853052857e2", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "bd657575", "metadata": {}, "source": [ "---\n", "---\n", "## Exercise 3\n", "Import of a data file using [pandas](https://pandas.pydata.org/).\n", "\n", "We want to deepen the use of [pandas](https://pandas.pydata.org/) using the [Iris data set](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html). The dataset contains characteristic sizes (e.g. petal lengths and widths) of three different iris species. It is a popular data set in the field of machine learning for testing simple methods for the automated categorization of types on the basis of different characterization features." ] }, { "cell_type": "markdown", "id": "071376db-e357-416d-aa08-970499c9da97", "metadata": {}, "source": [ "---\n", "a) Read the data file [`iris.txt`](iris.txt) using [pandas](https://pandas.pydata.org/) and output the labels of the columns using the method [DataFrame.keys()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.keys.html#pandas.DataFrame.keys).\n", "\n", "### Solution:" ] }, { "cell_type": "code", "execution_count": null, "id": "7610ef6c-70c8-44fd-b274-0aae5ff68e81", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "a85f3230-d96a-4235-90fa-4574f46bafb6", "metadata": {}, "source": [ "---\n", "b) The iris species of each sample is given in the 'Species' column. Output only this column and determine the different iris species. To do this, you can use e.g. 
the Python data type [set()](https://docs.python.org/3/tutorial/datastructures.html#sets) by applying its constructor to the output of the column.\n", "\n", "Why does this procedure work?\n", "\n", "### Solution:" ] }, { "cell_type": "code", "execution_count": null, "id": "0f34dae7-b061-46de-a8ed-99f28e48fc87", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "bf38f8bd-fe38-4a76-b33c-48e003a6e463", "metadata": {}, "source": [ "---\n", "c) In a first step, determine the mean values and standard deviations of the lengths and widths of the [sepals](https://de.wikipedia.org/wiki/Kelchblatt) (`Sepal.Length`, `Sepal.Width`) independently of the iris species - i.e. over the entire columns - using the methods [DataFrame.mean()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html#pandas.DataFrame.mean) and [DataFrame.std()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.std.html#pandas.DataFrame.std).\n", "\n", "### Solution:" ] }, { "cell_type": "code", "execution_count": null, "id": "44d153ad-db20-48a6-b033-faea2ea32d18", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "e7169efe-18a0-4388-bf89-da54b8987f08", "metadata": {}, "source": [ "---\n", "d) Now create a histogram for the lengths and widths of the sepals (`Sepal.Length`, `Sepal.Width`) independently of the iris species - i.e. 
over the entire columns - using the method [DataFrame.plot()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) with `kind='hist'`.\n", "\n", "How can you plot both histograms in one single plot?\n", "\n", "### Solution:" ] }, { "cell_type": "code", "execution_count": null, "id": "e7a95ff8-29a3-4648-b442-62467254fe4b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "6317aa9d-d82d-4f4b-9ab1-144918792ea5", "metadata": {}, "source": [ "---\n", "e) Now create a boxplot for the lengths and widths of the sepals (`Sepal.Length`, `Sepal.Width`) independently of the iris species - i.e. over the entire columns - using the method [DataFrame.plot()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) with `kind='box'`.\n", "\n", "To select multiple columns, you can either use the syntax `dataIris[['Sepal.Length','Sepal.Width']]` or use the indexer [DataFrame.loc[]](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc).\n", "\n", "### Solution:" ] }, { "cell_type": "code", "execution_count": null, "id": "f3703508-0f73-4412-b634-a85d683ea634", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "c1e893a5-3027-44f7-8e50-5e72cbb656b6", "metadata": {}, "source": [ "---\n", "f) Now plot the area of the sepals (`Sepal.Area`) as a function of their length and width (`Sepal.Length`, `Sepal.Width`) again for all iris species using the method [DataFrame.plot()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html).\n", "\n", "Do you find a correlation in each case? 
If yes, why?\n", "\n", "### Solution:" ] }, { "cell_type": "code", "execution_count": null, "id": "be0890f6-6603-4a7a-b7ae-c0b2a0a45013", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "a52d7db4-dc76-4b53-9e3a-dcd33ac36c16", "metadata": {}, "source": [ "---\n", "g) Now determine the correlation coefficient between the area of the sepals (`Sepal.Area`) and their length or width (`Sepal.Length`, `Sepal.Width`) again for all iris species using the method [DataFrame.corr()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html).\n", "\n", "Is your finding from f) confirmed?\n", "\n", "### Solution:" ] }, { "cell_type": "code", "execution_count": null, "id": "1b62cf91-bb5f-40ca-a49e-1dec18a2e42e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "b4560b97-37c1-4dae-ad37-1263ad5e78d6", "metadata": {}, "source": [ "---\n", "h) Now determine the regression line for one of the two graphs from f) and plot it. To do this, use the code template on page 33 from the script [SW09_DescriptiveStatistics](SW09_DescriptiveStatistics.pdf).\n", "\n", "### Solution:" ] }, { "cell_type": "code", "execution_count": null, "id": "e01636c3-7c98-4a28-92f2-eb182b6c2845", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "f49ec561-931f-485d-a918-d1dc98708a10", "metadata": {}, "source": [ "---\n", "---\n", "## Exercise 4\n", "In this task, we now want to perform the various steps from exercise 3 individually for each iris species and, as an outlook, consider how a given iris species can be determined from a set of characteristic petal and sepal quantities.\n" ] }, { "cell_type": "markdown", "id": "50da7aae-fb54-43bc-bfd1-3522fee184c0", "metadata": {}, "source": [ "---\n", "a) In a first step, we only want to select the values of a single iris species. 
To do this, we use the [DataFrame.loc[]](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc) attribute, which provides access to multiple column (and row) values. Among other things, [DataFrame.loc[]](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc) can work with a Boolean array, which we can create by applying the `== 'type'` operator to the `Species` column. Here `'type'` is one of the three iris species `'setosa', 'versicolor', 'virginica'`.\n", "\n", "Create a new `DataFrame` which contains only the data of the species `'setosa'`.\n", "\n", "\n", "### Solution:" ] }, { "cell_type": "code", "execution_count": null, "id": "7f466980-1799-4083-918b-17ce08f80024", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "1c8a8962-586d-4358-afc9-f319318dead4", "metadata": {}, "source": [ "---\n", "b) For easy access, create a [Dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries) `irisDict` with the three different iris species. 
\n", "\n", "### Solution" ] }, { "cell_type": "code", "execution_count": null, "id": "d498fd46-280b-4b36-ac4d-76e8812ebd5b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "0b158e93-8828-44a1-ae6f-34b99b00a4ea", "metadata": {}, "source": [ "---\n", "After this preliminary work, we are now in a position to determine the quantities from exercise 3 individually for the three iris species.\n", "\n", "c) In a first step, determine the mean values and standard deviations of the lengths and widths of the [sepals](https://de.wikipedia.org/wiki/Kelchblatt) (`Sepal.Length`, `Sepal.Width`) of the three iris species.\n", "\n", "### Solution:" ] }, { "cell_type": "code", "execution_count": null, "id": "1399548b-7cc5-4067-a818-ea4d8c978be6", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "25f88df9-fd11-48bb-88f7-de355849a5ec", "metadata": {}, "source": [ "---\n", "d) The iris species obviously differ in the size of their sepals, as we can see from their mean values. We now want to confirm this observation by means of the corresponding histograms.\n", "\n", "Now for each of the four quantities - lengths and widths of the sepals (`Sepal.Length`, `Sepal.Width`) and the petals (`Petal.Length`, `Petal.Width`) - create a graph with the histograms of all three iris species.\n", "\n", "### Solution:" ] }, { "cell_type": "code", "execution_count": null, "id": "5c6afc98-cf8d-4af0-b398-9f32f2101df0", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "9c94ec52-0d1a-4dec-bc04-5c1d0a4b717b", "metadata": {}, "source": [ "---\n", "We can see from the last graph that the petal width offers the best criterion for differentiating between the iris species. Thus, the category `setosa` could be recognized perfectly - i.e. without errors - with a criterion of the form `Petal.Width < 0.7`. 
\n", "\n", "The distinction between `versicolor` and `virginica` is not possible without errors. An extension of the decision criterion simultaneously to several characterization variables (so-called “features”) often helps with such decision problems. \n", "\n", "Various feature pairs can be selected below and the result displayed graphically in 2D." ] }, { "cell_type": "code", "execution_count": null, "id": "ab4ba1ca-b479-4727-987e-5dd99f3f87e1", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "e8d9eafb-a228-4d7e-8d08-21db8af289ac", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.18" } }, "nbformat": 4, "nbformat_minor": 5 }