{ "cells": [ { "cell_type": "markdown", "id": "22f22906", "metadata": {}, "source": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "

MTH786 Machine Learning with Python

\n", "
\n", " \n", "

Semester A, 2023/2024

\n", "
\n", "

Coursework 5

\n", "
\n", " \n", "

Dr Nicola Perra

\n", "
" ] }, { "cell_type": "markdown", "id": "2c1f8cb7", "metadata": {}, "source": [ "We start by loading necessary libraries, including NumPy (used for linear algebra calculations) and MatPlotLib (used for visualisation)." ] }, { "cell_type": "code", "execution_count": null, "id": "7ff2022f", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from numpy.testing import assert_array_almost_equal, assert_array_equal\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "id": "a305fb4a", "metadata": {}, "source": [ "### Polynomial regression\n", "By completing this exercise you will write a set of functions that are used for building a polynomial regression for a given data samples. You will then learn how to select an optimal degree of the polynomial regression.\n", "\n", "1. Implement function **polynomial_basis** that computes (and outputs) the polynomial data matrix $\\Phi\\left(\\mathbf{X}\\right)$ defined as\n", "$$\n", "\\Phi\\left(\\mathbf{X}\\right) = \n", "\\begin{pmatrix}\n", "1 & \\left(x^{(1)}\\right)^1 & \\left(x^{(1)}\\right)^2 & \\ldots & \\left(x^{(1)}\\right)^d \\\\\n", "1 & \\left(x^{(2)}\\right)^1 & \\left(x^{(2)}\\right)^2 & \\ldots & \\left(x^{(2)}\\right)^d \\\\\n", "\\vdots & \\vdots & \\vdots & \\ddots & \\vdots & \\\\\n", "1 & \\left(x^{(s)}\\right)^1 & \\left(x^{(s)}\\right)^2 & \\ldots & \\left(x^{(s)}\\right)^d \n", "\\end{pmatrix}\n", "$$\n", "The function **polynomial_basis**\n", "should take the $2$ arguments\n", "- NumPy array *data_inputs* representing a list of one-dimensional inputs $x^{(1)}, x^{(2)}, \\ldots, x^{(s)}$\n", "- and integer *degree* representing the degree $d$ of a polynomial." ] }, { "cell_type": "code", "execution_count": null, "id": "6f6c77aa", "metadata": {}, "outputs": [], "source": [ "def polynomial_basis(data_inputs, degree):\n", " # first column is made of ones\n", " X_matrix = np.ones((len(data_inputs), 1))\n", " # then we take powers of 1, 2, d of the data\n", " for i in range(degree):\n", " X_matrix = np.c_[X_matrix, np.power(data_inputs,i+1)]\n", " return X_matrix" ] }, { "cell_type": "markdown", "id": "6091a0be", "metadata": {}, "source": [ "Test your function with the following unit tests" ] }, { "cell_type": "code", "execution_count": null, "id": "5e4a6924", "metadata": {}, "outputs": [], "source": [ "test_inputs = np.array([1, 2, 3, 4])\n", "print (\"input = \",test_inputs)\n", "test_degree = 2;\n", "\n", "print (\"output = \"+\"\\n\",polynomial_basis(test_inputs, test_degree))" ] }, { "cell_type": "code", "execution_count": null, "id": "eafde5d0", "metadata": {}, "outputs": [], "source": [ "# let's try another example\n", "test_samples = np.random.randint(low = 2, high = 5)\n", "test_inputs = np.random.rand(test_samples)\n", "test_degree = np.random.randint(low = 2, high = 5)\n", "\n", "print (\"input = \",test_inputs)\n", "test_degree = 2;\n", "\n", "print (\"output = \"+\"\\n\",polynomial_basis(test_inputs, test_degree))" ] }, { "cell_type": "markdown", "id": "3b5c35d0", "metadata": {}, "source": [ "2. 
{ "cell_type": "markdown", "id": "3b5c35d0", "metadata": {}, "source": [ "2. Write a function **polynomial_regression** that takes two arguments, *data_matrix* and *data_outputs*, and computes and returns the solution $\hat{\mathbf{W}}$ of the normal equation\n", "$$\n", "\Phi^{\top}\left(\mathbf{X}\right)\Phi\left(\mathbf{X}\right) \hat{\mathbf{W}} = \Phi^{\top}\left(\mathbf{X}\right)\mathbf{Y}.\n", "$$\n", "Here $\Phi\left(\mathbf{X}\right)$ is the mathematical representation of *data_matrix*\n", "and $\mathbf{Y}$ is the mathematical representation of *data_outputs*, while $\hat{\mathbf{W}}$ represents the coefficients of the polynomial regression." ] },
{ "cell_type": "code", "execution_count": null, "id": "262b293b", "metadata": {}, "outputs": [], "source": [ "def polynomial_regression(data_matrix, data_outputs):\n", "    return np.linalg.solve(data_matrix.T @ data_matrix, data_matrix.T @ data_outputs)" ] },
{ "cell_type": "markdown", "id": "1ae795da", "metadata": {}, "source": [ "Test your function with the following unit tests." ] },
{ "cell_type": "code", "execution_count": null, "id": "dc558266", "metadata": {}, "outputs": [], "source": [ "test_inputs = np.array([0, 1/4, 1/2, 3/4, 1])\n", "test_outputs = np.array([0, 1, 0, -1, 0])\n", "test_degree = 2\n", "test_data_matrix = polynomial_basis(test_inputs, test_degree)\n", "print(polynomial_regression(test_data_matrix, test_outputs))" ] },
{ "cell_type": "code", "execution_count": null, "id": "d0780729", "metadata": {}, "outputs": [], "source": [ "# what happens if you change d?" ] },
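{ "cell_type": "markdown", "id": "f1a2b3c6", "metadata": {}, "source": [ "As an optional cross-check (an illustration, not one of the tasks): *np.linalg.lstsq* returns the least-squares solution of $\Phi\left(\mathbf{X}\right)\mathbf{w} \approx \mathbf{Y}$ directly, which for a full-rank data matrix coincides with the solution of the normal equation. The sketch below reuses the test data from the previous cells:" ] },
{ "cell_type": "code", "execution_count": null, "id": "f1a2b3c7", "metadata": {}, "outputs": [], "source": [ "# optional check: np.linalg.lstsq minimises ||data_matrix @ w - data_outputs||^2,\n", "# so here its solution coincides with the normal-equation solution above\n", "w_lstsq, *_ = np.linalg.lstsq(test_data_matrix, test_outputs, rcond=None)\n", "assert_array_almost_equal(polynomial_regression(test_data_matrix, test_outputs), w_lstsq)\n", "print(\"normal equation and np.linalg.lstsq agree\")" ] },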
{ "cell_type": "markdown", "id": "ad71869c", "metadata": {}, "source": [ "3. Write a function **prediction_function** that evaluates your fitted polynomial at the given points $\mathbf{X} = \left\{x^{(1)}, x^{(2)}, \ldots, x^{(s)}\right\}$ for given coefficients $\mathbf{w} = \left(w^{(0)}, w^{(1)},\ldots,w^{(d)}\right)$. The function **prediction_function** takes the arguments _data_matrix_ and _weights_ as inputs and returns the values of the polynomial prediction function\n", "$$\n", "f_{\mathbf{w}}\left(x\right)\n", "= w^{(0)}+w^{(1)}x+\ldots+w^{(d)}x^d\n", "$$\n", "evaluated at every $x \in \mathbf{X}$. Here _data_matrix_ is a NumPy array representing the data matrix $\Phi\left(\mathbf{X}\right)$ and _weights_ is a NumPy representation of the coefficient vector $\mathbf{w}$. The function should return the vector of polynomial values $\left(f_{\mathbf{w}}\left(x^{(1)}\right), f_{\mathbf{w}}\left(x^{(2)}\right), \ldots, f_{\mathbf{w}}\left(x^{(s)}\right)\right)^{\top}$." ] },
{ "cell_type": "code", "execution_count": null, "id": "35a46942", "metadata": {}, "outputs": [], "source": [ "def prediction_function(data_matrix, weights):\n", "    return data_matrix @ weights" ] },
{ "cell_type": "markdown", "id": "0c4c0235", "metadata": {}, "source": [ "Test your function with the following unit tests." ] },
{ "cell_type": "code", "execution_count": null, "id": "55f50bde", "metadata": {}, "outputs": [], "source": [ "test_inputs = np.array([0, 1/4, 1/2, 3/4, 1])\n", "test_weights = np.array([2/5, -4/5])\n", "test_data_matrix = polynomial_basis(test_inputs, len(test_weights)-1)\n", "# these are the predicted outputs\n", "prediction_function(test_data_matrix, test_weights)" ] },
{ "cell_type": "code", "execution_count": null, "id": "30aa0399", "metadata": {}, "outputs": [], "source": [ "# same inputs, but a different degree: compare the predicted outputs\n", "test_inputs = np.array([0, 1/4, 1/2, 3/4, 1])\n", "test_weights = np.array([0, 32/3, -32, 64/3])\n", "test_data_matrix = polynomial_basis(test_inputs, len(test_weights)-1)\n", "prediction_function(test_data_matrix, test_weights)" ] },
{ "cell_type": "markdown", "id": "3b966e4c", "metadata": {}, "source": [ "4. Write a function **prediction_error** that evaluates the mean squared error over a set of data inputs and outputs. The function **prediction_error** takes the arguments _data_matrix_, _data_outputs_ and _weights_ as inputs and returns the mean squared error defined by\n", "$$\n", "\mathrm{MSE} = \frac{1}{2s} \left\|\Phi\left(\mathbf{X}\right)\mathbf{W} - \mathbf{Y} \right\|^2,\n", "$$\n", "where $\Phi\left(\mathbf{X}\right)$ is a mathematical representation of _data_matrix_, $\mathbf{Y}$ is a mathematical representation of _data_outputs_, $\mathbf{W}$ is a mathematical representation of _weights_, and $s$ is the number of data samples." ] },
{ "cell_type": "code", "execution_count": null, "id": "b15ff915", "metadata": {}, "outputs": [], "source": [ "def prediction_error(data_matrix, data_outputs, weights):\n", "    return 1/(2*len(data_matrix)) * np.linalg.norm(data_matrix @ weights - data_outputs)**2" ] },
{ "cell_type": "markdown", "id": "e3fe137b", "metadata": {}, "source": [ "Test your function with the following unit tests." ] },
{ "cell_type": "code", "execution_count": null, "id": "41303ef5", "metadata": {}, "outputs": [], "source": [ "test_inputs = np.array([0, 1/4, 1/2, 3/4, 1])\n", "test_data_outputs = np.array([0, 1, 0, -1, 0])\n", "test_weights = np.array([2/5, -4/5])\n", "test_data_matrix = polynomial_basis(test_inputs, len(test_weights)-1)\n", "\n", "prediction_error(test_data_matrix, test_data_outputs, test_weights)" ] },
{ "cell_type": "code", "execution_count": null, "id": "7938fb35", "metadata": {}, "outputs": [], "source": [ "test_inputs = np.array([0, 1/4, 1/2, 3/4, 1])\n", "test_data_outputs = np.array([0, 1, 0, -1, 0])\n", "test_weights = np.array([0, 32/3, -32, 64/3])\n", "test_data_matrix = polynomial_basis(test_inputs, len(test_weights)-1)\n", "\n", "prediction_error(test_data_matrix, test_data_outputs, test_weights)\n", "# what do we learn?" ] },
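{ "cell_type": "markdown", "id": "f1a2b3c8", "metadata": {}, "source": [ "As an optional consistency check, the cell below recomputes the error of the last test directly from the definition $\frac{1}{2s}\sum_{i=1}^{s}\left(f_{\mathbf{w}}\left(x^{(i)}\right)-y^{(i)}\right)^2$ and compares it with **prediction_error** (a sketch reusing the variables from the previous cell):" ] },
{ "cell_type": "code", "execution_count": null, "id": "f1a2b3c9", "metadata": {}, "outputs": [], "source": [ "# optional check: recompute the MSE directly from its definition as a sum of squared residuals\n", "residuals = test_data_matrix @ test_weights - test_data_outputs\n", "manual_mse = np.sum(residuals**2) / (2*len(test_data_matrix))\n", "print(manual_mse, prediction_error(test_data_matrix, test_data_outputs, test_weights))" ] },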
{ "cell_type": "markdown", "id": "3e5e84f8", "metadata": {}, "source": [ "### Cross-validation and the optimal degree finder\n", "In this part you will first create the dataset discussed in the lecture: noisy sine data. You will then determine the optimal polynomial degree by minimising the validation error. At the end of this section you will be asked to plot your optimal polynomial.\n", "\n", "5. We start by creating our own dataset: we define a function $f(x) := \sin(2\pi x)$, sample it at points $\left\{x^{(i)}\right\}_{i=1}^s$ forming an equidistant sequence in the interval $\left[0, 1\right]$, and compute the corresponding output samples $y^{(i)} = f\left(x^{(i)}\right)$ for all $1\leq i\leq s$. We then perturb the data with normally distributed random noise with mean zero and variance $\sigma^2$ in order to obtain noisy, simulated output samples $\left\{\tilde{y}^{(i)}\right\}_{i=1}^s$. We finish this part by visualising the data points." ] },
{ "cell_type": "code", "execution_count": null, "id": "92a82be4", "metadata": {}, "outputs": [], "source": [ "# let's get some data\n", "def data_generator(data_inputs, noise_variance):\n", "    # np.random.randn(x) generates x samples from a Gaussian with mu = 0 and sigma = 1,\n", "    # so scaling by sqrt(noise_variance) gives noise with the required variance\n", "    return np.sin(2*np.pi*data_inputs) + (noise_variance)**(1/2)*np.random.randn(len(data_inputs))\n", "\n", "data_size = 100\n", "noise_variance = 0.01\n", "data_inputs = np.arange(0, 1, 1/data_size)  # equidistant steps from 0 to 1 of width 1/data_size\n", "data_outputs = data_generator(data_inputs, noise_variance)\n", "\n", "plt.scatter(data_inputs, data_outputs, s = 10)\n", "plt.axis([0, 1, -1, 1])\n", "plt.xlabel('Input', fontsize=16)\n", "plt.xticks(fontsize=8)\n", "plt.ylabel('Output', fontsize=16)\n", "plt.yticks(fontsize=8)\n", "plt.tight_layout()" ] },
{ "cell_type": "markdown", "id": "d7c5f762", "metadata": {}, "source": [ "6. Write a function **data_split** that splits a given dataset into two sets: one for training and one for validation. Your function should take two arguments: _data_matrix_, representing the whole dataset (inputs and outputs together), and _validation_ratio_, representing the fraction of the dataset placed in the validation set. The function **data_split(data_matrix, validation_ratio)** should return two new data matrices whose union coincides with the complete dataset represented by _data_matrix_. One possible approach is sketched below; compare it with the implementation that follows." ] },
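{ "cell_type": "markdown", "id": "f1a2b3ca", "metadata": {}, "source": [ "A minimal sketch of one possible split (the helper name *data_split_perm* is ours, introduced for illustration only): shuffle the row indices with *np.random.permutation* and cut the shuffled list at the required position." ] },
{ "cell_type": "code", "execution_count": null, "id": "f1a2b3cb", "metadata": {}, "outputs": [], "source": [ "# illustration only: split by shuffling the row indices and cutting the list\n", "def data_split_perm(data_matrix, validation_ratio):\n", "    data_size = len(data_matrix)\n", "    validation_size = int(round(data_size*validation_ratio))\n", "    permuted_rows = np.random.permutation(data_size)  # random ordering of all row indices\n", "    validation_data = data_matrix[permuted_rows[:validation_size], :]\n", "    training_data = data_matrix[permuted_rows[validation_size:], :]\n", "    return training_data, validation_data" ] },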
{ "cell_type": "code", "execution_count": null, "id": "13d73228", "metadata": {}, "outputs": [], "source": [ "def data_split(data_matrix, validation_ratio):\n", "    data_size = len(data_matrix)\n", "    validation_size = int(round(data_size*validation_ratio, 0))\n", "    # we use int(round(...)) since round(1.6) -> 2, whereas int(1.6) alone -> 1\n", "    rows_indexes = np.random.choice(data_size, validation_size, replace = False)\n", "    # the line above selects validation_size row indices at random, without replacement\n", "    validation_data = data_matrix[rows_indexes, :]\n", "    # we take these random rows as the validation set (remember: samples run along the rows)\n", "    training_data = np.delete(data_matrix, obj = rows_indexes, axis = 0)\n", "    # the training data is what is left\n", "    return training_data, validation_data" ] },
{ "cell_type": "markdown", "id": "b540ea25", "metadata": {}, "source": [ "Test your results with the following unit tests." ] },
{ "cell_type": "code", "execution_count": null, "id": "eaa03a6a", "metadata": {}, "outputs": [], "source": [ "test_shape = np.random.randint(low = 4, high = 10, size = 2)\n", "test_data_matrix = np.random.rand(test_shape[0], test_shape[1])\n", "test_training, test_validation = data_split(test_data_matrix, 0.3)\n", "test_combined_matrix = np.r_[test_training, test_validation]\n", "assert_array_almost_equal(sorted(test_data_matrix, key=lambda x: x[0]), sorted(test_combined_matrix, key=lambda x: x[0]))" ] },
{ "cell_type": "code", "execution_count": null, "id": "c0243ec2", "metadata": {}, "outputs": [], "source": [ "# let's see the input and output\n", "print('Initial data = ')\n", "print(test_data_matrix, 'Size of data = ', len(test_data_matrix))\n", "print('---- Training data = ')\n", "print(test_training, 'Size of training data = ', len(test_training))\n", "print('---- Validation data = ')\n", "print(test_validation, 'Size of validation data = ', len(test_validation))" ] },
{ "cell_type": "code", "execution_count": null, "id": "21718270", "metadata": {}, "outputs": [], "source": [ "test_shape = np.random.randint(low = 2, high = 10, size = 2)\n", "test_data_matrix = np.random.rand(test_shape[0], test_shape[1])\n", "test_training, test_validation = data_split(test_data_matrix, np.random.rand())\n", "test_combined_matrix = np.r_[test_training, test_validation]\n", "assert_array_almost_equal(sorted(test_data_matrix, key=lambda x: x[0]), sorted(test_combined_matrix, key=lambda x: x[0]))" ] },
{ "cell_type": "markdown", "id": "a0b74163", "metadata": {}, "source": [ "7. Write a function **grid_search** that searches for the minimum value of a given function over a given set of grid points. Your function should have the signature *grid_search(objective, grid)*, where *objective* is the function to minimise and *grid* is a list of grid points. The function should return the grid point at which the objective function attains its minimal value. The toy example below illustrates the idea before you implement it." ] },
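{ "cell_type": "markdown", "id": "f1a2b3cc", "metadata": {}, "source": [ "A toy illustration of the idea (illustration only): grid search simply evaluates the objective at every grid point and keeps the point with the smallest value. For $g(x) = (x-2)^2$ on the grid $\{0, 1, 2, 3, 4\}$ the minimiser is $x = 2$." ] },
{ "cell_type": "code", "execution_count": null, "id": "f1a2b3cd", "metadata": {}, "outputs": [], "source": [ "# toy example: evaluate the objective on every grid point and take the argmin\n", "toy_grid = np.arange(0, 5)\n", "toy_values = (toy_grid - 2)**2\n", "print(\"minimiser on the grid:\", toy_grid[np.argmin(toy_values)])  # prints 2" ] },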
{ "cell_type": "code", "execution_count": null, "id": "e8dca95b", "metadata": {}, "outputs": [], "source": [ "def grid_search(objective, grid):\n", "    values = np.array([])\n", "    for point in grid:\n", "        values = np.append(values, objective(point))\n", "    return grid[np.argmin(values)]" ] },
{ "cell_type": "code", "execution_count": null, "id": "212a17e0", "metadata": {}, "outputs": [], "source": [ "# let's take the initial data\n", "data_matrix = np.c_[data_inputs, data_outputs]\n", "\n", "# split the data 80-20, as in the lecture\n", "training_data, validation_data = data_split(data_matrix, 0.2)\n", "\n", "# with the data split into training and validation sets, we can do some model selection!\n", "# which value of d is the best?\n", "\n", "# there are at least two possible ways to implement this;\n", "# the first, below, is more explicit\n", "\n", "def validation_error_function(d):\n", "    # let's get the data for validation first\n", "    PHI_validation = polynomial_basis(validation_data[:, :-1], d)  # all columns except the last are inputs\n", "    Y_validation = validation_data[:, -1]  # the last column holds the outputs\n", "    \n", "    # let's get the data for training\n", "    PHI_training = polynomial_basis(training_data[:, :-1], d)\n", "    Y_training = training_data[:, -1]\n", "    \n", "    # fit the model considering ONLY the training set\n", "    w_hat = polynomial_regression(PHI_training, Y_training)\n", "    \n", "    # the objective function is the MSE on the validation set\n", "    return prediction_error(PHI_validation, Y_validation, w_hat)\n", "\n", "\n", "# let's create a grid for the search\n", "# (for large d the normal equations become very ill-conditioned, but such fits give large validation errors anyway)\n", "degree_grid = np.arange(0, 100, 1)\n", "# assess the validation error; the best model is the one with the smallest error\n", "optimal_degree = grid_search(validation_error_function, degree_grid)\n", "\n", "# we can now see the output, meaning the best fit among those we tried\n", "optimal_weights = polynomial_regression(polynomial_basis(training_data[:, :-1], optimal_degree), training_data[:, -1])\n", "polynomial_values = prediction_function(polynomial_basis(data_inputs, optimal_degree), optimal_weights)\n", "\n", "plt.scatter(data_inputs, data_outputs, s = 10)\n", "plt.plot(data_inputs, polynomial_values, linewidth = 3, c = 'r')\n", "plt.axis([-0.1, 1.1, -1.1, 1.1])\n", "plt.xlabel('Input', fontsize=16)\n", "plt.xticks(fontsize=8)\n", "plt.ylabel('Output', fontsize=16)\n", "plt.yticks(fontsize=8)\n", "plt.tight_layout()\n", "\n", "print(\"The optimal polynomial has degree {} and coefficients {}.\".format(optimal_degree, optimal_weights))" ] },
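{ "cell_type": "markdown", "id": "f1a2b3ce", "metadata": {}, "source": [ "Optionally, it is instructive to look at the whole validation-error curve rather than only its minimiser. A minimal sketch, reusing *validation_error_function* and *degree_grid* from the cell above and restricting to low degrees (the split is random, so your curve will differ from run to run):" ] },
{ "cell_type": "code", "execution_count": null, "id": "f1a2b3cf", "metadata": {}, "outputs": [], "source": [ "# optional: plot the validation error against the polynomial degree\n", "low_degrees = degree_grid[:10]\n", "errors = [validation_error_function(d) for d in low_degrees]\n", "plt.plot(low_degrees, errors, marker='o')\n", "plt.yscale('log')  # the errors span several orders of magnitude\n", "plt.xlabel('Degree', fontsize=16)\n", "plt.ylabel('Validation MSE', fontsize=16)\n", "plt.tight_layout()" ] },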
{ "cell_type": "code", "execution_count": null, "id": "21709415", "metadata": {}, "outputs": [], "source": [ "# second, cleaner approach: a lambda function\n", "data_matrix = np.c_[data_inputs, data_outputs]\n", "\n", "# split the data 80-20, as in the lecture\n", "training_data, validation_data = data_split(data_matrix, 0.2)\n", "\n", "# recall the signature prediction_error(data_matrix, data_outputs, weights)\n", "validation_error = lambda d: prediction_error(polynomial_basis(validation_data[:, :-1], d),\n", "                                              validation_data[:, -1],\n", "                                              polynomial_regression(\n", "                                                  polynomial_basis(\n", "                                                      training_data[:, :-1], d), \n", "                                                  training_data[:, -1]))\n", "degree_grid = np.arange(0, 100, 1)\n", "optimal_degree = grid_search(validation_error, degree_grid)\n", "optimal_weights = polynomial_regression(polynomial_basis(training_data[:, :-1], optimal_degree), training_data[:, -1])\n", "polynomial_values = prediction_function(polynomial_basis(data_inputs, optimal_degree), optimal_weights)\n", "\n", "plt.scatter(data_inputs, data_outputs, s = 10)\n", "plt.plot(data_inputs, polynomial_values, linewidth = 3, c = 'r')\n", "plt.axis([-0.1, 1.1, -1.1, 1.1])\n", "plt.xlabel('Input', fontsize=16)\n", "plt.xticks(fontsize=8)\n", "plt.ylabel('Output', fontsize=16)\n", "plt.yticks(fontsize=8)\n", "plt.tight_layout()\n", "\n", "print(\"The optimal polynomial has degree {} and coefficients {}.\".format(optimal_degree, optimal_weights))" ] },
{ "cell_type": "code", "execution_count": null, "id": "fe9b2ff6", "metadata": {}, "outputs": [], "source": [ "# what happens if we change the data?\n", "\n", "data_size = 100\n", "noise_variance = 0.01\n", "data_inputs = np.arange(0, 2, 1/data_size)  # equidistant steps from 0 to 2 of width 1/data_size\n", "data_outputs = data_generator(data_inputs, noise_variance)\n", "\n", "plt.scatter(data_inputs, data_outputs, s = 10)\n", "plt.axis([0, 2, -1, 1])\n", "plt.xlabel('Input', fontsize=16)\n", "plt.xticks(fontsize=8)\n", "plt.ylabel('Output', fontsize=16)\n", "plt.yticks(fontsize=8)\n", "plt.tight_layout()" ] },
{ "cell_type": "code", "execution_count": null, "id": "7e94dc50", "metadata": {}, "outputs": [], "source": [ "# the same lambda-based model selection, now on the new data\n", "data_matrix = np.c_[data_inputs, data_outputs]\n", "\n", "# split the data 80-20, as in the lecture\n", "training_data, validation_data = data_split(data_matrix, 0.2)\n", "\n", "# recall the signature prediction_error(data_matrix, data_outputs, weights)\n", "validation_error = lambda d: prediction_error(polynomial_basis(validation_data[:, :-1], d),\n", "                                              validation_data[:, -1],\n", "                                              polynomial_regression(\n", "                                                  polynomial_basis(\n", "                                                      training_data[:, :-1], d), \n", "                                                  training_data[:, -1]))\n", "degree_grid = np.arange(0, 100, 1)\n", "optimal_degree = grid_search(validation_error, degree_grid)\n", "optimal_weights = polynomial_regression(polynomial_basis(training_data[:, :-1], optimal_degree), training_data[:, -1])\n", "polynomial_values = prediction_function(polynomial_basis(data_inputs, optimal_degree), optimal_weights)\n", "\n", "plt.scatter(data_inputs, data_outputs, s = 10)\n", "plt.plot(data_inputs, polynomial_values, linewidth = 3, c = 'r')\n", "plt.axis([-0.1, 2.1, -1.1, 1.1])\n", "plt.xlabel('Input', fontsize=16)\n", "plt.xticks(fontsize=8)\n", "plt.ylabel('Output', fontsize=16)\n", "plt.yticks(fontsize=8)\n", "plt.tight_layout()\n", "\n", "print(\"The optimal polynomial has degree {} and coefficients {}.\".format(optimal_degree, optimal_weights))" ] },
{ "cell_type": "code", "execution_count": null, "id": "2884928d", "metadata": {}, "outputs": [], "source": [ "# what happens if we re-run it?" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.17" } }, "nbformat": 4, "nbformat_minor": 5 }