{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0b62d225",
   "metadata": {},
   "source": [
    "<table width=100%>\n",
    "    <tr>\n",
    "        <td width = 40% align = \"left\">\n",
    "            <h3> MTH786 Machine Learning with Python</h3>\n",
    "        </td>\n",
    "        <td width = 35%>            \n",
    "        </td>\n",
    "        <td width = 25% align = \"left\">\n",
    "            <h3>Semester A, 2023/2024 </h3>\n",
    "        </td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td width = 40% align = \"left\">\n",
    "            <h3> Lab Coursework 4</h3>\n",
    "        </td>\n",
    "        <td width = 35%>            \n",
    "        </td>\n",
    "        <td width = 25% align = \"left\">\n",
    "            <h3>Dr Nicola Perra </h3>\n",
    "        </td>\n",
    "    </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "deff7880",
   "metadata": {},
   "source": [
    "We start by loading the necessary libraries: NumPy (used for linear algebra calculations) and Matplotlib (used for visualisation)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "755166c6",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f3d6fd67",
   "metadata": {},
   "source": [
    "### Linear regression\n",
    "By completing this exercise you will write a set of functions for building a linear regression from given data samples. You will then finish by computing a linear regression for a height-weight dataset.\n",
    "\n",
    "\n",
    "1. Implement the function **linear_regression_data** that computes (and outputs) the linear regression data matrix defined as\n",
    "$$\n",
    "\\mathbf{X} = \n",
    "\\begin{pmatrix}\n",
    "1 & x^{(1)}_1 & x^{(1)}_2 & \\ldots & x^{(1)}_d \\\\\n",
    "1 & x^{(2)}_1 & x^{(2)}_2 & \\ldots & x^{(2)}_d \\\\\n",
    "\\vdots & \\vdots & \\vdots & \\ddots & \\vdots \\\\\n",
    "1 & x^{(s)}_1 & x^{(s)}_2 & \\ldots & x^{(s)}_d \\\\\n",
    "\\end{pmatrix}\n",
    "$$\n",
    "The function **linear_regression_data** should take the NumPy array *data_inputs* as its argument. Here, *data_inputs* is a data matrix containing all inputs, with one sample per row:\n",
    "$$\n",
    "data\\_inputs = \n",
    "\\begin{pmatrix}\n",
    "x^{(1)}_1 & x^{(1)}_2 & \\ldots & x^{(1)}_d \\\\\n",
    "x^{(2)}_1 & x^{(2)}_2 & \\ldots & x^{(2)}_d \\\\\n",
    "\\vdots & \\vdots & \\ddots & \\vdots \\\\\n",
    "x^{(s)}_1 & x^{(s)}_2 & \\ldots & x^{(s)}_d \\\\\n",
    "\\end{pmatrix}.\n",
    "$$\n",
    "The function should return the data matrix $\\mathbf{X}$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "17087197",
   "metadata": {},
   "outputs": [],
   "source": [
    "def linear_regression_data(data_inputs):\n",
    "    first_column=np.ones((len(data_inputs),1))\n",
    "    X_matrix = np.c_[first_column,data_inputs]\n",
    "    return X_matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b846478b",
   "metadata": {},
   "source": [
    "Test your function with the following unit tests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d856d5a6",
   "metadata": {},
   "outputs": [],
   "source": [
    "from numpy.testing import assert_array_almost_equal, assert_array_equal\n",
    "test_inputs = np.array([[1], [2], [3], [4]])\n",
    "assert_array_equal(linear_regression_data(test_inputs), \n",
    "                   np.array([[1, 1], [1, 2], [1, 3], [1, 4]]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4ed0cb05",
   "metadata": {},
   "outputs": [],
   "source": [
    "test_inputs = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])\n",
    "assert_array_equal(linear_regression_data(test_inputs), np.array([[1, 1, 2], [1, 2, 3], [1, 3, 4], [1, 4, 5]]))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "94be19f5",
   "metadata": {},
   "source": [
    "Try your function with these random samples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "17c9fb0d",
   "metadata": {},
   "outputs": [],
   "source": [
    "samples, dimensions = np.random.randint(low=2, high=10, size=2)\n",
    "test_inputs = np.random.rand(samples, dimensions)\n",
    "\n",
    "print(samples, dimensions)\n",
    "print(test_inputs)\n",
    "# the result should have shape (samples, dimensions + 1), with a first column of ones\n",
    "print(linear_regression_data(test_inputs))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6db7ac77",
   "metadata": {},
   "source": [
    "2. Write a function **linear_regression** that takes two arguments, *data_matrix* and *data_outputs*, and computes and returns the solution $\\hat{\\mathbf{W}}$ of the normal equation\n",
    "$$\n",
    "\\mathbf{X}^{\\top}\\mathbf{X} \\hat{\\mathbf{W}} = \\mathbf{X}^{\\top}\\mathbf{Y}\n",
    "$$\n",
    "Here $\\mathbf{X}$ is the mathematical representation of *data_matrix*\n",
    "and $\\mathbf{Y}$ is the mathematical representation of *data_outputs*, while $\\hat{\\mathbf{W}}$ is the mathematical representation of the weights/coefficients of the linear regression.\n",
    "\n",
    "**Hint**: use the function np.linalg.solve\n",
    "https://numpy.org/doc/stable/reference/generated/numpy.linalg.solve.html"
   ]
  },
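  {
   "cell_type": "markdown",
   "id": "e7a10c51",
   "metadata": {},
   "source": [
    "As a brief aside, the normal equation can be read as the stationarity condition of a least-squares objective. The sketch below assumes the objective $\\frac{1}{2s}\\left\\|\\mathbf{X}\\mathbf{W} - \\mathbf{Y}\\right\\|^2$, where $s$ is the number of samples, and a matrix $\\mathbf{X}$ with linearly independent columns. It is meant as a reminder rather than a full derivation.\n",
    "$$\n",
    "\\nabla_{\\mathbf{W}} \\, \\frac{1}{2s} \\left\\| \\mathbf{X}\\mathbf{W} - \\mathbf{Y} \\right\\|^2 = \\frac{1}{s} \\mathbf{X}^{\\top} \\left( \\mathbf{X}\\mathbf{W} - \\mathbf{Y} \\right)\n",
    "$$\n",
    "Setting this gradient to zero at $\\mathbf{W} = \\hat{\\mathbf{W}}$ gives\n",
    "$$\n",
    "\\mathbf{X}^{\\top}\\mathbf{X} \\hat{\\mathbf{W}} = \\mathbf{X}^{\\top}\\mathbf{Y}\n",
    "$$\n",
    "Under the independence assumption the matrix $\\mathbf{X}^{\\top}\\mathbf{X}$ is invertible, so this linear system has a unique solution."
   ]
  },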
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d046954b",
   "metadata": {},
   "outputs": [],
   "source": [
    "def linear_regression(data_matrix, data_outputs):\n",
    "    a=data_matrix.T@data_matrix\n",
    "    b=data_matrix.T@data_outputs\n",
    "    return np.linalg.solve(a, b)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ee259627",
   "metadata": {},
   "source": [
    "Let's try an example with the following data:\n",
    "\n",
    "$(x^{(1)},y^{(1)})=(0.5,1)$\n",
    "\n",
    "$(x^{(2)},y^{(2)})=(\\frac32,0)$\n",
    "\n",
    "We can plot it:"
   ]
  },
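  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a5ee5655",
   "metadata": {},
   "outputs": [],
   "source": [
    "# the two data points from the example above\n",
    "data = [0.5, 1.5]\n",
    "output = [1, 0]\n",
    "\n",
    "plt.scatter(data, output)\n",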
    "plt.xlabel('$x$')\n",
    "plt.ylabel('$y$')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a699e426",
   "metadata": {},
   "outputs": [],
   "source": [
    "data=[0.5, 1.5]\n",
    "output=[1,0]\n",
    "# first step: build the data matrix\n",
    "data_matrix = linear_regression_data(data)\n",
    "\n",
    "# now we can call the regression function\n",
    "\n",
    "w = linear_regression(data_matrix, output)\n",
    "\n",
    "print (w)\n",
    "# these are the optimal weights!"
   ]
  },
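  {
   "cell_type": "markdown",
   "id": "b3d20f47",
   "metadata": {},
   "source": [
    "Working through this two-point example by hand shows what the code above computes. This is only a sketch of the arithmetic, writing the two weights as $w_0$ (intercept) and $w_1$ (slope), a notation introduced here for illustration. The printed weights should agree with it up to floating-point rounding.\n",
    "$$\n",
    "\\mathbf{X} = \\begin{pmatrix} 1 & 0.5 \\\\ 1 & 1.5 \\end{pmatrix}, \\quad\n",
    "\\mathbf{Y} = \\begin{pmatrix} 1 \\\\ 0 \\end{pmatrix}, \\quad\n",
    "\\mathbf{X}^{\\top}\\mathbf{X} = \\begin{pmatrix} 2 & 2 \\\\ 2 & 2.5 \\end{pmatrix}, \\quad\n",
    "\\mathbf{X}^{\\top}\\mathbf{Y} = \\begin{pmatrix} 1 \\\\ 0.5 \\end{pmatrix}\n",
    "$$\n",
    "Subtracting the first equation, $2 w_0 + 2 w_1 = 1$, from the second, $2 w_0 + 2.5 w_1 = 0.5$, gives $0.5\\, w_1 = -0.5$, so\n",
    "$$\n",
    "\\hat{\\mathbf{W}} = \\begin{pmatrix} w_0 \\\\ w_1 \\end{pmatrix} = \\begin{pmatrix} 1.5 \\\\ -1 \\end{pmatrix}, \\qquad \\hat{y}(x) = 1.5 - x\n",
    "$$\n",
    "The line passes through both data points, since $\\hat{y}(0.5) = 1$ and $\\hat{y}(1.5) = 0$."
   ]
  },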
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f843f8c1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# the predicted model is then\n",
    "\n",
    "y_predict=data_matrix@w\n",
    "\n",
    "\n",
    "plt.scatter(data,output)\n",
    "plt.plot(data,y_predict,color=\"Red\")\n",
    "plt.xlabel('$x$')\n",
    "plt.ylabel('$y$')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3c24d62d",
   "metadata": {},
   "source": [
    "Test your function with the following unit tests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "85f7a539",
   "metadata": {},
   "outputs": [],
   "source": [
    "test_data_matrix = np.array([[1,0.98],[1,1.02]])\n",
    "test_outputs = np.array([[-0.1],[0.3]])\n",
    "assert_array_almost_equal(linear_regression(test_data_matrix, test_outputs),\n",
    "                          np.array([[-9.9], [10]]))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "df10054b",
   "metadata": {},
   "source": [
    "3. Write a function **prediction_error** that evaluates the mean squared error over the set of data inputs and outputs. The function **prediction_error** takes the arguments _data_matrix_, _data_outputs_ and _weights_ as inputs and returns the mean squared error defined by\n",
    "$$\n",
    "\\mathrm{MSE} = \\frac{1}{2s} \\left\\|\\mathbf{X}\\mathbf{W} - \\mathbf{Y} \\right\\|^2,\n",
    "$$\n",
    "where $\\mathbf{X}$ is a mathematical representation of _data_matrix_, $\\mathbf{Y}$ is a mathematical representation of _data_outputs_ and $\\mathbf{W}$ is a mathematical representation of _weights_."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3684486c",
   "metadata": {},
   "outputs": [],
   "source": [
    "def prediction_error(data_matrix,data_outputs,weights):\n",
    "    inside=data_matrix@weights-data_outputs\n",
    "    return np.linalg.norm(inside)**2/(2.*(len(data_outputs)))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9cef1567",
   "metadata": {},
   "source": [
    "Test your function with the following unit tests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ffd432c1",
   "metadata": {},
   "outputs": [],
   "source": [
    "test_data_matrix = np.array([[1,0.98],[1,1.02]])\n",
    "test_data_outputs = np.array([[-0.1],[0.3]])\n",
    "test_weights = np.array([[-9.9],[10]])\n",
    "assert_array_almost_equal(prediction_error(test_data_matrix, test_data_outputs, test_weights), 0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f9271352",
   "metadata": {},
   "outputs": [],
   "source": [
    "test_data_matrix = np.array([[1,1,-1],[1,2,2]])\n",
    "test_data_outputs = np.array([[-1,2],[1,3]])\n",
    "test_weights = np.array([[0,0],[1,2],[3,4]])\n",
    "assert_array_almost_equal(prediction_error(test_data_matrix, test_data_outputs, test_weights), 36.75)"
   ]
  },
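  {
   "cell_type": "markdown",
   "id": "c9f4a6d2",
   "metadata": {},
   "source": [
    "As a sanity check on the second unit test, the expected value 36.75 can be reproduced by hand. The computation below assumes that $\\left\\|\\cdot\\right\\|$ is the Frobenius norm (which is what np.linalg.norm returns by default for a 2D array) and that $s = 2$ is the number of rows of the outputs.\n",
    "$$\n",
    "\\mathbf{X}\\mathbf{W} = \\begin{pmatrix} 1 & 1 & -1 \\\\ 1 & 2 & 2 \\end{pmatrix} \\begin{pmatrix} 0 & 0 \\\\ 1 & 2 \\\\ 3 & 4 \\end{pmatrix} = \\begin{pmatrix} -2 & -2 \\\\ 8 & 12 \\end{pmatrix}, \\qquad\n",
    "\\mathbf{X}\\mathbf{W} - \\mathbf{Y} = \\begin{pmatrix} -1 & -4 \\\\ 7 & 9 \\end{pmatrix}\n",
    "$$\n",
    "$$\n",
    "\\mathrm{MSE} = \\frac{1}{2 \\cdot 2} \\left( 1 + 16 + 49 + 81 \\right) = \\frac{147}{4} = 36.75\n",
    "$$"
   ]
  },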
  {
   "cell_type": "markdown",
   "id": "2bf0e33c",
   "metadata": {},
   "source": [
    "4. In the next two parts we apply the functions above to the height-weight-gender data considered in the lectures. Our goal is to build a linear regression for weight as a function of height, or of height and gender. We start by reading the data from the attached .csv file. **Important:** please check that the file *height_weight_genders.csv* is located in the same folder as your Jupyter notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "003f7425",
   "metadata": {},
   "outputs": [],
   "source": [
    "converter_function=lambda x: 0 if b\"Male\" in x else 1\n",
    "\n",
    "genders = np.genfromtxt(\"height_weight_genders.csv\", delimiter=\",\", skip_header=1, usecols=[0], \\\n",
    "                        converters={0:converter_function}) # 0 here is the reference to the column\n",
    "heights = np.genfromtxt(\"height_weight_genders.csv\", delimiter=\",\", skip_header=1, usecols=[1])\n",
    "weights = np.genfromtxt(\"height_weight_genders.csv\", delimiter=\",\", skip_header=1, usecols=[2])\n",
    "\n",
    "print (genders)\n",
    "print (heights)\n",
    "print (weights)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ab3fde43",
   "metadata": {},
   "outputs": [],
   "source": [
    "# how does the lambda function work?\n",
    "# (the list is named 'examples' to avoid shadowing the built-in 'list')\n",
    "examples = [b\"Male\", b\"Male\", b\"Female\", b\"Female\"]\n",
    "\n",
    "for entry in examples:\n",
    "    print(converter_function(entry))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "89cc3596",
   "metadata": {},
   "source": [
    "Lambda functions are compact, anonymous functions that are convenient to pass as arguments to other functions, as we did with the *converters* argument above. Their general form is\n",
    "\n",
    "`lambda arguments: expression`\n",
    "\n",
    "The function takes the given arguments, evaluates the expression and returns its value."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4a658195",
   "metadata": {},
   "source": [
    "Let us first build a scatter plot of weight-height data (excluding gender)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "95f4dcdb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# YOUR CODE HERE\n",
    "plt.scatter(heights, weights, s = 1)\n",
    "plt.xlabel('Height', fontsize=16)\n",
    "plt.xticks(fontsize=16)\n",
    "plt.ylabel('Weight', fontsize=16)\n",
    "plt.yticks(fontsize=16)\n",
    "plt.tight_layout()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "07a20def",
   "metadata": {},
   "source": [
    "In the next cell, use the functions defined above to find the optimal regression weights. You are then asked to evaluate your training error and plot the linear regression together with the scatter plot above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1f183cbd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# YOUR CODE HERE\n",
    "data_inputs = heights\n",
    "data_outputs = weights\n",
    "data_matrix = linear_regression_data(data_inputs)\n",
    "regression_weights = linear_regression(data_matrix, data_outputs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "45e4adb2",
   "metadata": {},
   "source": [
    "Test your results with the following unit tests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a5c6f7b9",
   "metadata": {},
   "outputs": [],
   "source": [
    "assert_array_almost_equal(regression_weights,np.array([-350.737192, 7.717288]))"
   ]
  },
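  {
   "cell_type": "markdown",
   "id": "d58b7e19",
   "metadata": {},
   "source": [
    "Written out, and taking the values in the unit test above at face value, the fitted model is the line below. The first weight is the intercept (it multiplies the column of ones) and the second is the slope with respect to height. The height value $h = 70$ is a hypothetical input chosen only for illustration, in whatever units the dataset uses.\n",
    "$$\n",
    "\\hat{y}(h) = -350.737192 + 7.717288 \\, h, \\qquad\n",
    "\\hat{y}(70) = -350.737192 + 540.210160 = 189.472968\n",
    "$$"
   ]
  },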
  {
   "cell_type": "markdown",
   "id": "e580844f",
   "metadata": {},
   "source": [
    "Print the prediction error below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bea2d96c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# WRITE YOUR CODE HERE\n",
    "print (prediction_error(data_matrix,data_outputs,regression_weights))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8ab0bf3b",
   "metadata": {},
   "source": [
    "Add a plot of the linear regression (in red) to the scatter plot above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "adee408a",
   "metadata": {},
   "outputs": [],
   "source": [
    "print (regression_weights)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d2876114",
   "metadata": {},
   "outputs": [],
   "source": [
    "# YOUR CODE HERE\n",
    "y_predict=data_matrix@regression_weights\n",
    "\n",
    "plt.scatter(heights, weights, s = 1)\n",
    "plt.plot(heights,y_predict,c='Red')\n",
    "plt.xlabel('Height', fontsize=16)\n",
    "plt.xticks(fontsize=16)\n",
    "plt.ylabel('Weight', fontsize=16)\n",
    "plt.yticks(fontsize=16)\n",
    "plt.tight_layout()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0438db79",
   "metadata": {},
   "source": [
    "5. In this part we add the gender variable to our linear regression. This means that you are now predicting the weight of a person from their height and gender. As before we start with the scatter plot, which is now a 3D one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "10d929a9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# YOUR CODE HERE\n",
    "fig = plt.figure()\n",
    "ax = fig.add_subplot(projection='3d')\n",
    "ax.scatter(heights, weights, genders, marker=\"^\")\n",
    "ax.set_xlabel('Height', fontsize=16)\n",
    "ax.set_ylabel('Weight', fontsize=16)\n",
    "ax.set_zlabel('Gender', fontsize=16)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e018eb5f",
   "metadata": {},
   "source": [
    "In the next cell, use the functions defined above to find the optimal regression weights. You are then asked to evaluate your training error and plot the linear regression together with the scatter plot above. Note that your inputs are now not just the heights but also the genders."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "234af2ff",
   "metadata": {},
   "outputs": [],
   "source": [
    "# YOUR CODE HERE\n",
    "# inputs are now two columns: height and gender\n",
    "data_inputs = np.c_[heights, genders]\n",
    "data_outputs = weights\n",
    "data_matrix = linear_regression_data(data_inputs)\n",
    "regression_weights = linear_regression(data_matrix, data_outputs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a67c00a5",
   "metadata": {},
   "source": [
    "Test your results with the following unit tests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0d92de2a",
   "metadata": {},
   "outputs": [],
   "source": [
    "assert_array_almost_equal(regression_weights,np.array([-225.545792,    5.976941,  -19.377711]))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5d90bc8f",
   "metadata": {},
   "source": [
    "What is the prediction error?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dde4e100",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(prediction_error(data_matrix, data_outputs, regression_weights))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.17"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}