{ "cells": [ { "cell_type": "markdown", "id": "0b62d225", "metadata": {}, "source": [ "**MTH786 Machine Learning with Python**\n", "\n", "Semester A, 2023/2024\n", "\n", "**Lab Coursework 4**\n", "\n", "Dr Nicola Perra"
 ] }, { "cell_type": "markdown", "id": "deff7880", "metadata": {}, "source": [ "We start by loading the necessary libraries, including NumPy (used for linear algebra calculations) and Matplotlib (used for visualisation)." ] }, { "cell_type": "code", "execution_count": null, "id": "755166c6", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "id": "f3d6fd67", "metadata": {}, "source": [ "### Linear regression\n", "By completing this exercise you will write a set of functions for building a linear regression from given data samples. You will then finish by calculating a linear regression for a height-weight dataset.\n", "\n", "\n", "1. Implement a function **linear_regression_data** that computes (and outputs) the linear regression data matrix defined as\n", "$$\n", "\\mathbf{X} = \n", "\\begin{pmatrix}\n", "1 & x^{(1)}_1 & x^{(1)}_2 & \\ldots & x^{(1)}_d \\\\\n", "1 & x^{(2)}_1 & x^{(2)}_2 & \\ldots & x^{(2)}_d \\\\\n", "\\vdots & \\vdots & \\vdots & \\ddots & \\vdots \\\\\n", "1 & x^{(s)}_1 & x^{(s)}_2 & \\ldots & x^{(s)}_d \\\\\n", "\\end{pmatrix}\n", "$$\n", "The function **linear_regression_data** should take the NumPy array *data_inputs* as its argument. Here, *data_inputs* is a data matrix containing all $s$ input samples, each of dimension $d$, in the following matrix form\n", "$$\n", "data\\_inputs = \n", "\\begin{pmatrix}\n", "x^{(1)}_1 & x^{(1)}_2 & \\ldots & x^{(1)}_d \\\\\n", "x^{(2)}_1 & x^{(2)}_2 & \\ldots & x^{(2)}_d \\\\\n", "\\vdots & \\vdots & \\ddots & \\vdots \\\\\n", "x^{(s)}_1 & x^{(s)}_2 & \\ldots & x^{(s)}_d \\\\\n", "\\end{pmatrix}.\n", "$$\n", "The function should output the data matrix $\\mathbf{X}$."
] }, { "cell_type": "code", "execution_count": null, "id": "17087197", "metadata": {}, "outputs": [], "source": [ "def linear_regression_data(data_inputs):\n", "    # column of ones that multiplies the bias (intercept) weight\n", "    first_column = np.ones((len(data_inputs), 1))\n", "    # prepend the column of ones to the inputs\n", "    X_matrix = np.c_[first_column, data_inputs]\n", "    return X_matrix" ] }, { "cell_type": "markdown", "id": "b846478b", "metadata": {}, "source": [ "Test your function with the following unit tests" ] }, { "cell_type": "code", "execution_count": null, "id": "d856d5a6", "metadata": {}, "outputs": [], "source": [ "from numpy.testing import assert_array_almost_equal, assert_array_equal\n", "test_inputs = np.array([[1], [2], [3], [4]])\n", "assert_array_equal(linear_regression_data(test_inputs), \n", "                   np.array([[1, 1], [1, 2], [1, 3], [1, 4]]))" ] }, { "cell_type": "code", "execution_count": null, "id": "4ed0cb05", "metadata": {}, "outputs": [], "source": [ "test_inputs = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])\n", "assert_array_equal(linear_regression_data(test_inputs), np.array([[1, 1, 2], [1, 2, 3], [1, 3, 4], [1, 4, 5]]))" ] }, { "cell_type": "markdown", "id": "94be19f5", "metadata": {}, "source": [ "Try your function with these random samples" ] }, { "cell_type": "code", "execution_count": null, "id": "17c9fb0d", "metadata": {}, "outputs": [], "source": [ "samples, dimensions = np.random.randint(low = 2, high = 10,size = 2)\n", "test_inputs = np.random.rand(samples, dimensions)\n", "\n", "print(samples, dimensions)\n", "print(test_inputs)\n", "print(linear_regression_data(test_inputs))" ] }, { "cell_type": "markdown", "id": "6db7ac77", "metadata": {}, "source": [ "2. 
Write a function **linear_regression** that takes two arguments, *data_matrix* and *data_outputs*, and computes and returns the solution $\\hat{\\mathbf{W}}$ of the normal equation\n", "$$\n", "\\mathbf{X}^{\\top}\\mathbf{X} \\hat{\\mathbf{W}} = \\mathbf{X}^{\\top}\\mathbf{Y}\n", "$$\n", "Here $\\mathbf{X}$ is the mathematical representation of *data_matrix*\n", "and $\\mathbf{Y}$ is the mathematical representation of *data_outputs*, while $\\hat{\\mathbf{W}}$ is the mathematical representation of the weights/coefficients of the linear regression.\n", "\n", "**Hint**: use the function np.linalg.solve\n", "https://numpy.org/doc/stable/reference/generated/numpy.linalg.solve.html" ] }, { "cell_type": "code", "execution_count": null, "id": "d046954b", "metadata": {}, "outputs": [], "source": [ "def linear_regression(data_matrix, data_outputs):\n", "    # left- and right-hand sides of the normal equation\n", "    a = data_matrix.T @ data_matrix\n", "    b = data_matrix.T @ data_outputs\n", "    return np.linalg.solve(a, b)" ] }, { "cell_type": "markdown", "id": "ee259627", "metadata": {}, "source": [ "Let's try an example with the following data\n", "\n", "$(x^{(1)},y^{(1)})=(0.5,1)$\n", "\n", "$(x^{(2)},y^{(2)})=(\\frac32,0)$\n", "\n", "We can plot it" ] }, { "cell_type": "code", "execution_count": null, "id": "a5ee5655", "metadata": {}, "outputs": [], "source": [ "# define the two data points before plotting them\n", "data=[0.5, 1.5]\n", "output=[1,0]\n", "\n", "plt.scatter(data,output)\n", "plt.xlabel('$x$')\n", "plt.ylabel('$y$')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "a699e426", "metadata": {}, "outputs": [], "source": [ "# first step: build the data matrix from the data defined in the previous cell\n", "data_matrix = linear_regression_data(data)\n", "\n", "# now we can call the regression function\n", "\n", "w = linear_regression(data_matrix, output)\n", "\n", "print(w)\n", "# these are the optimal weights!" 
] }, { "cell_type": "code", "execution_count": null, "id": "f843f8c1", "metadata": {}, "outputs": [], "source": [ "# the predicted model is then\n", "\n", "y_predict=data_matrix@w\n", "\n", "\n", "plt.scatter(data,output)\n", "plt.plot(data,y_predict,color=\"Red\")\n", "plt.xlabel('$x$')\n", "plt.ylabel('$y$')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "3c24d62d", "metadata": {}, "source": [ "Test your function with the following unit tests" ] }, { "cell_type": "code", "execution_count": null, "id": "85f7a539", "metadata": {}, "outputs": [], "source": [ "test_data_matrix = np.array([[1,0.98],[1,1.02]])\n", "test_outputs = np.array([[-0.1],[0.3]])\n", "assert_array_almost_equal(linear_regression(test_data_matrix, test_outputs),\n", "                          np.array([[-9.9], [10]]))" ] }, { "cell_type": "markdown", "id": "df10054b", "metadata": {}, "source": [ "3. Write a function **prediction_error** that evaluates the mean squared error over the set of data inputs and outputs. The function **prediction_error** takes the arguments _data_matrix_, _data_outputs_ and _weights_ as inputs and returns the mean squared error defined by\n", "$$\n", "\\mathrm{MSE} = \\frac{1}{2s} \\left\\|\\mathbf{X}\\mathbf{W} - \\mathbf{Y} \\right\\|^2,\n", "$$\n", "where $\\mathbf{X}$ is the mathematical representation of _data_matrix_, $\\mathbf{Y}$ is the mathematical representation of _data_outputs_, $\\mathbf{W}$ is the mathematical representation of _weights_ and $s$ is the number of data samples."
] }, { "cell_type": "code", "execution_count": null, "id": "3684486c", "metadata": {}, "outputs": [], "source": [ "def prediction_error(data_matrix, data_outputs, weights):\n", "    # residuals between the predictions X W and the targets Y\n", "    residuals = data_matrix @ weights - data_outputs\n", "    # squared norm of the residuals divided by 2s\n", "    return np.linalg.norm(residuals)**2 / (2 * len(data_outputs))" ] }, { "cell_type": "markdown", "id": "9cef1567", "metadata": {}, "source": [ "Test your function with the following unit tests" ] }, { "cell_type": "code", "execution_count": null, "id": "ffd432c1", "metadata": {}, "outputs": [], "source": [ "test_data_matrix = np.array([[1,0.98],[1,1.02]])\n", "test_data_outputs = np.array([[-0.1],[0.3]])\n", "test_weights = np.array([[-9.9],[10]])\n", "assert_array_almost_equal(prediction_error(test_data_matrix, test_data_outputs, test_weights), 0)" ] }, { "cell_type": "code", "execution_count": null, "id": "f9271352", "metadata": {}, "outputs": [], "source": [ "test_data_matrix = np.array([[1,1,-1],[1,2,2]])\n", "test_data_outputs = np.array([[-1,2],[1,3]])\n", "test_weights = np.array([[0,0],[1,2],[3,4]])\n", "assert_array_almost_equal(prediction_error(test_data_matrix, test_data_outputs, test_weights), 36.75)" ] }, { "cell_type": "markdown", "id": "2bf0e33c", "metadata": {}, "source": [ "4. In the next two parts we apply the above to the height-weight-gender data considered in the lectures. Our goal is to build a linear regression for weight as a function of height, or of height and gender. We start by reading the data from the attached .csv file. **Important:** please check that the file *height_weight_genders.csv* is located in the same folder as your Jupyter notebook."
] }, { "cell_type": "code", "execution_count": null, "id": "003f7425", "metadata": {}, "outputs": [], "source": [ "# encode gender as a number: 0 for Male, 1 otherwise (Female)\n", "converter_function=lambda x: 0 if b\"Male\" in x else 1\n", "\n", "genders = np.genfromtxt(\"height_weight_genders.csv\", delimiter=\",\", skip_header=1, usecols=[0],\n", "                        converters={0:converter_function}) # 0 here is the reference to the column\n", "heights = np.genfromtxt(\"height_weight_genders.csv\", delimiter=\",\", skip_header=1, usecols=[1])\n", "weights = np.genfromtxt(\"height_weight_genders.csv\", delimiter=\",\", skip_header=1, usecols=[2])\n", "\n", "print(genders)\n", "print(heights)\n", "print(weights)" ] }, { "cell_type": "code", "execution_count": null, "id": "ab3fde43", "metadata": {}, "outputs": [], "source": [ "# how does the lambda function work?\n", "# (the name examples avoids shadowing the built-in list)\n", "examples=[b\"Male\",b\"Male\",b\"Female\",b\"Female\"]\n", "\n", "for label in examples:\n", "    print(converter_function(label))" ] }, { "cell_type": "markdown", "id": "89cc3596", "metadata": {}, "source": [ "Lambda functions are compact anonymous functions, well suited to being passed as arguments to other functions. Their general form is\n", "\n", "lambda arguments: expression\n", "\n", "The function takes some arguments, evaluates the expression and returns its value." ] }, { "cell_type": "markdown", "id": "4a658195", "metadata": {}, "source": [ "Let us first build a scatter plot of the weight-height data (excluding gender)." ] }, { "cell_type": "code", "execution_count": null, "id": "95f4dcdb", "metadata": {}, "outputs": [], "source": [ "# YOUR CODE HERE\n", "plt.scatter(heights, weights, s = 1)\n", "plt.xlabel('Height', fontsize=16)\n", "plt.xticks(fontsize=16)\n", "plt.ylabel('Weight', fontsize=16)\n", "plt.yticks(fontsize=16)\n", "plt.tight_layout()" ] }, { "cell_type": "markdown", "id": "07a20def", "metadata": {}, "source": [ "In the next cell you use the functions defined above to find the optimal regression weights.
You are then asked to evaluate your training error and plot the linear regression together with the scatter plot above." ] }, { "cell_type": "code", "execution_count": null, "id": "1f183cbd", "metadata": {}, "outputs": [], "source": [ "# YOUR CODE HERE\n", "data_inputs = heights\n", "data_outputs = weights\n", "data_matrix = linear_regression_data(data_inputs)\n", "regression_weights = linear_regression(data_matrix, data_outputs)" ] }, { "cell_type": "markdown", "id": "45e4adb2", "metadata": {}, "source": [ "Test your results with the following unit tests" ] }, { "cell_type": "code", "execution_count": null, "id": "a5c6f7b9", "metadata": {}, "outputs": [], "source": [ "assert_array_almost_equal(regression_weights,np.array([-350.737192, 7.717288]))" ] }, { "cell_type": "markdown", "id": "e580844f", "metadata": {}, "source": [ "Print the prediction error below" ] }, { "cell_type": "code", "execution_count": null, "id": "bea2d96c", "metadata": {}, "outputs": [], "source": [ "# WRITE YOUR CODE HERE\n", "print(prediction_error(data_matrix,data_outputs,regression_weights))" ] }, { "cell_type": "markdown", "id": "8ab0bf3b", "metadata": {}, "source": [ "Add a plot of the linear regression (in red) to the scatter plot above." ] }, { "cell_type": "code", "execution_count": null, "id": "adee408a", "metadata": {}, "outputs": [], "source": [ "print(regression_weights)" ] }, { "cell_type": "code", "execution_count": null, "id": "d2876114", "metadata": {}, "outputs": [], "source": [ "# YOUR CODE HERE\n", "y_predict=data_matrix@regression_weights\n", "\n", "plt.scatter(heights, weights, s = 1)\n", "plt.plot(heights,y_predict,c='Red')\n", "plt.xlabel('Height', fontsize=16)\n", "plt.xticks(fontsize=16)\n", "plt.ylabel('Weight', fontsize=16)\n", "plt.yticks(fontsize=16)\n", "plt.tight_layout()" ] }, { "cell_type": "markdown", "id": "0438db79", "metadata": {}, "source": [ "6. In this part we include the gender parameter in our linear regression.
This means that you are now predicting the weight of a person from their height and gender data. As before, we start with the scatter plot, which is now a 3D one." ] }, { "cell_type": "code", "execution_count": null, "id": "10d929a9", "metadata": {}, "outputs": [], "source": [ "# YOUR CODE HERE\n", "fig = plt.figure()\n", "ax = fig.add_subplot(projection='3d')\n", "ax.scatter(heights, weights, genders, marker=\"^\")\n", "ax.set_xlabel('Height', fontsize=16)\n", "ax.set_ylabel('Weight', fontsize=16)\n", "ax.set_zlabel('Gender', fontsize=16)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "e018eb5f", "metadata": {}, "source": [ "In the next cell you use the functions defined above to find the optimal regression weights. You are then asked to evaluate your training error and plot the linear regression together with the scatter plot above. Note that your inputs are now not just the heights but also the genders." ] }, { "cell_type": "code", "execution_count": null, "id": "234af2ff", "metadata": {}, "outputs": [], "source": [ "# YOUR CODE HERE\n", "# stack heights and genders as the two input columns\n", "data_inputs = np.c_[heights, genders]\n", "data_outputs = weights\n", "data_matrix = linear_regression_data(data_inputs)\n", "regression_weights = linear_regression(data_matrix, data_outputs)" ] }, { "cell_type": "markdown", "id": "a67c00a5", "metadata": {}, "source": [ "Test your results with the following unit tests" ] }, { "cell_type": "code", "execution_count": null, "id": "0d92de2a", "metadata": {}, "outputs": [], "source": [ "assert_array_almost_equal(regression_weights,np.array([-225.545792, 5.976941, -19.377711]))" ] }, { "cell_type": "markdown", "id": "5d90bc8f", "metadata": {}, "source": [ "What is the prediction error?"
] }, { "cell_type": "code", "execution_count": null, "id": "dde4e100", "metadata": {}, "outputs": [], "source": [ "print(prediction_error(data_matrix, data_outputs, regression_weights))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.17" } }, "nbformat": 4, "nbformat_minor": 5 }