{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "

MTH786 Machine Learning with Python

\n", "
\n", " \n", "

Semester A, 2023/2024

\n", "
\n", "

Coursework 9

\n", "
\n", " \n", "

Dr Nicola Perra

\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from numpy.testing import assert_array_almost_equal, assert_array_equal\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import seaborn as sns\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Nearest neighbour classification\n", "In this exercise you will work with a data set you are familiar with: $\\mathtt{heigh\\_weight.csv}$. By completing this exercise you will learn how to solve binary classification problem with non-parametric method. In the second part we will learn how to solve the classification problem for titanic data using logistic regression method.\n", "\n", "We start by loading data. **Important:** please check that file $\\mathtt{height\\_weight.csv}$ is located in the same folder with your Jupyter notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "types = np.genfromtxt(\"height_weight.csv\", delimiter=\",\", skip_header=1, usecols=[0], \\\n", " converters={0: lambda x: 0 if b\"A\" in x else 1})\n", "heights = np.genfromtxt(\"height_weight.csv\",\n", " delimiter=\",\",\n", " skip_header=1,\n", " usecols=[1])\n", "weights = np.genfromtxt(\"height_weight.csv\",\n", " delimiter=\",\",\n", " skip_header=1,\n", " usecols=[2])\n", "indexes = np.argsort(heights)\n", "\n", "height_weight_input = np.c_[heights[indexes], weights[indexes]]\n", "height_weight_output = types[indexes]\n", "\n", "\n", "def standardise(data_matrix):\n", " row_of_means = np.mean(data_matrix, axis=0)\n", " standardised_matrix = data_matrix - row_of_means\n", " row_of_stds = np.std(standardised_matrix, axis=0)\n", " return (standardised_matrix / row_of_stds), row_of_means, row_of_stds\n", "\n", "\n", "def de_standardise(standardised_matrix, row_of_means, row_of_stds):\n", " matrix = np.copy(standardised_matrix * row_of_stds)\n", " return matrix + row_of_means\n", "\n", "\n", "height_weight_input, height_weight_row_of_means, height_weight_row_of_stds = standardise(\n", " height_weight_input)\n", "\n", "\n", "def plot_height_weight(data_input, data_output):\n", " height_weight_de_standardised = de_standardise(data_input,\n", " height_weight_row_of_means,\n", " height_weight_row_of_stds)\n", " plt.scatter(height_weight_de_standardised[:,0][data_output == 0], \\\n", " height_weight_de_standardised[:,1][data_output == 0], s=1, c = 'r')\n", " plt.scatter(height_weight_de_standardised[:,0][data_output == 1], \\\n", " height_weight_de_standardised[:,1][data_output == 1], s=1, c = 'b')\n", " plt.xlabel('Height', fontsize=16)\n", " plt.xticks(fontsize=16)\n", " plt.ylabel('Weight', fontsize=16)\n", " plt.yticks(fontsize=16)\n", " plt.tight_layout" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_height_weight(height_weight_input, height_weight_output)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our main goal is to identify the type by studying their height and weight. We will end up with building a classifier which will split the height-weight plane into two regions: one corresponding to type A and another to type B. To achieve this we suggest to use the $K$-nearest neighbour classification method.\n", "\n", "1. Write the function $\\mathtt{pairwise\\_distances}$ that takes two arguments: $\\mathtt{from\\_data}$, $\\mathtt{to\\_data}$ - both are NumPy arrays of the same width that may have different height. 
These arrays represent two collections (later the testing data and the training data) of row-vectors, each describing a data point in $d$-dimensional input space. In other words, for $$ \n", "\\mathtt{from\\_data} =\n", "\\begin{pmatrix}\n", "x^{(1)}_1 & x^{(1)}_2 & \\ldots & x^{(1)}_d \\\\\n", "x^{(2)}_1 & x^{(2)}_2 & \\ldots & x^{(2)}_d \\\\\n", "\\vdots & \\vdots & \\ddots & \\vdots \\\\\n", "x^{(p)}_1 & x^{(p)}_2 & \\ldots & x^{(p)}_d \n", "\\end{pmatrix}\n", ", \n", "\\quad\n", "\\mathtt{to\\_data} =\n", "\\begin{pmatrix}\n", "y^{(1)}_1 & y^{(1)}_2 & \\ldots & y^{(1)}_d \\\\\n", "y^{(2)}_1 & y^{(2)}_2 & \\ldots & y^{(2)}_d \\\\\n", "\\vdots & \\vdots & \\ddots & \\vdots \\\\\n", "y^{(q)}_1 & y^{(q)}_2 & \\ldots & y^{(q)}_d \n", "\\end{pmatrix}\n", "$$\n", "your function should return a NumPy array of size $p\\times q$ of the form\n", "$$\n", "\\mathtt{pairwise\\_distances} = \n", "\\begin{pmatrix}\n", "\\mathrm{dist}\\left(\\mathbf{x^{(1)}},\\mathbf{y^{(1)}}\\right) &\n", "\\mathrm{dist}\\left(\\mathbf{x^{(1)},y^{(2)}}\\right) & \\ldots & \\mathrm{dist}\\left(\\mathbf{x^{(1)},y^{(q)}}\\right) \\\\\n", "\\mathrm{dist}\\left(\\mathbf{x^{(2)},y^{(1)}}\\right) &\n", "\\mathrm{dist}\\left(\\mathbf{x^{(2)},y^{(2)}}\\right) & \\ldots & \\mathrm{dist}\\left(\\mathbf{x^{(2)},y^{(q)}}\\right) \\\\\n", "\\vdots &\n", "\\vdots & \\ddots & \\vdots\\\\\n", "\\mathrm{dist}\\left(\\mathbf{x^{(p)},y^{(1)}}\\right) &\n", "\\mathrm{dist}\\left(\\mathbf{x^{(p)},y^{(2)}}\\right) & \\ldots & \\mathrm{dist}\\left(\\mathbf{x^{(p)},y^{(q)}}\\right)\n", "\\end{pmatrix}.\n", "$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# the pairwise distances are computed between every row of from_data and every row of to_data\n", "def pairwise_distances(from_data, to_data):\n", " return np.sqrt(np.sum((from_data[:, np.newaxis, :] - to_data[np.newaxis, :, :])**2,axis=2))\n", "# note how np.newaxis adds an axis to each array so that broadcasting forms all row pairs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# this is a more explicit alternative implementation of the same computation\n", "def pairwise_distances_2(from_data, to_data):\n", " distance = np.zeros((len(from_data), len(to_data)), float)\n", " for counter_1, vector_x in enumerate(from_data):\n", " for counter_2, vector_y in enumerate(to_data):\n", " distance[counter_1, counter_2] = np.sqrt(np.sum((vector_x - vector_y)**2))\n", " return distance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Test your function with the following unit tests" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_from_data = np.array([[0, 1, 2], [3, 4, 5]])\n", "test_to_data = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]])\n", "assert_array_almost_equal(pairwise_distances(test_from_data, test_to_data), np.array([[ 0., 5.19615242, 10.39230485, 15.58845727],\\\n", " [ 5.19615242, 0., 5.19615242, 10.39230485]]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. 
Write a function $\\mathtt{nearest\\_neighbour\\_classification}$ that performs nearest neighbour classification as introduced in the lecture notes and that takes four arguments: \n", "- $\\mathtt{testing\\_inputs}$ - a NumPy array of input samples to be classified\n", "- $\\mathtt{training\\_inputs}$ - a NumPy array of data points for which the classification is known\n", "- $\\mathtt{training\\_outputs}$ - a NumPy array of class labels for the $\\mathtt{training\\_inputs}$ data\n", "- $\\mathtt{no\\_of\\_neighbours}$ - an integer argument that specifies how many neighbours we take into account to determine the class labels for the $\\mathtt{testing\\_inputs}$. Use the Euclidean norm to measure the distance between samples.\n", "\n", "To implement the classification algorithm you may follow the steps below: \n", "\n", "a. Calculate the pairwise distances between the testing and training data\n", "\n", "b. Sort the pairwise distances by rows\n", "\n", "c. For every row (data input) evaluate the labels of the closest $K$ points \n", "\n", "d. Assign to each data input the class label that appears most often among the $K$ closest points\n", "\n", "### Example\n", "To better understand how the general code works, let us consider a simple example where\n", "\n", "$\\text{data_inputs} =\n", "\\begin{pmatrix}\n", "15\\\\\n", "6 \\\\\n", "18\n", "\\end{pmatrix}$\n", "These are the points we would like to classify. We do that by considering training data for which we know the correct labels. \n", "\n", "$\\text{training_inputs} =\n", "\\begin{pmatrix}\n", "10\\\\\n", "8 \\\\\n", "17\\\\\n", "5\\\\\n", "12\\\\\n", "10\\\\\n", "20\\\\\n", "1\\\\\n", "11\n", "\\end{pmatrix}$\n", "\n", "The pairwise distance matrix will then be a $3\\times 9$ matrix. The first entry will be $\\sqrt{(15-10)^2}=5$. \n", "Let's see:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_inputs = np.array([[15], [6], [18]])\n", "\n", "training_inputs = np.array([[10], [8], [17], [5], [12], [10], [20], [1],[11]])\n", "\n", "\n", "distances = pairwise_distances(data_inputs, training_inputs)\n", "print (distances)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now assume that the classifications/labels for the 9 data points in the training set are\n", "\n", "$\\text{training_outputs} =\n", "\\begin{pmatrix}\n", "1\\\\\n", "1 \\\\\n", "1\\\\\n", "0\\\\\n", "0\\\\\n", "1\\\\\n", "1\\\\\n", "0\\\\\n", "1\n", "\\end{pmatrix}$\n", "\n", "We have two classes in this case. This is an example of binary classification." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "training_outputs = np.array([1, 1, 1, 0, 0, 1, 1, 0, 1])\n", " \n", "no_of_classes = 1 + np.max(training_outputs) # since the labels are 0 or 1 and the max is 1, this is the number of classes\n", "\n", "# we can use this line to get the indices of the sorted distances\n", "sorted_indices = np.argsort(distances, axis=1)\n", "# let's check the distances\n", "print (distances)\n", "# and let's check the indices of the sorted values\n", "print (sorted_indices)\n", "# note how in the first row the smallest distance is 2, which sits at index 2, and so on"
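] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sanity check (a sketch, not part of the coursework), the first column of $\\mathtt{sorted\\_indices}$ should pick out each row's minimum distance:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# sketch of a sanity check: the first sorted index of each row points at that row's nearest training point\n", "print (np.min(distances, axis=1))\n", "print (distances[np.arange(len(distances)), sorted_indices[:, 0]])\n", "# the two printed arrays should coincide"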
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# to simplify drastically the code we can make use of some numpy functionalities\n", "# Consider the following example\n", "\n", "a=np.array([3,4,5])\n", "\n", "np.broadcast_to(a,(4,3)) # this function takes an array and creates another with different dimensions replicating it" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# also consider an array\n", "a = np.array([[10, 30, 20], [60, 40, 50]])\n", "\n", "# we can use argsort to get the indices of the sorted values\n", "ai = np.argsort(a, axis=1)\n", "print (ai) # let's check the output\n", "\n", "# than we can use take_along_axis to get the array sorted! Basically we take the values of a but ordered following ai\n", "np.take_along_axis(a, ai, axis=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# we can use these two functions to get the labels of the closest training value for each data input!\n", "\n", "no_of_points= len(training_inputs)\n", "no_of_inputs = len(data_inputs)\n", "\n", "# we can use broadcast_to to create, for each data_input, a row with all the training outputs\n", "new_array_to_sort = np.broadcast_to(training_outputs,(no_of_inputs, no_of_points))\n", "print (new_array_to_sort)\n", "print ('----')\n", "\n", "# now we can sort them using their distances\n", "sorted_labels=np.take_along_axis(new_array_to_sort,sorted_indices, 1)\n", "print (sorted_labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This means that for the data point 15, which we would like to classify, the closest point has label 1, the second closest label 0, all the other closest, but for the last two (which are the one more distant) have label 1. \n", "\n", "Now we are done, because we can classify a data input as the most common class considering the first k neighbors!\n", "So we can see the classification already looking at the matrix above. If we use k=1, the first data point (15) will be 1, the second (6) will be 0 and the third (8) will be 1. \n", "\n", "How can we do this more in general? 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# we need to speficy the number of K to consider\n", "no_of_neighbours=1\n", "\n", "\n", "# we can do the classification!\n", "predicted_labels=np.zeros(no_of_inputs,int)\n", "for id_input in range(no_of_inputs): # for each input we compute the average of label of the k closest\n", " probability_labels=np.zeros(no_of_classes,float)\n", " # this is an array that stores the probabilty that each input is member of a particular class\n", " # by looking that the fraction of k neighbors that are in the class\n", " for id_neighbor in range(0,no_of_neighbours):\n", " label=sorted_labels[id_input][id_neighbor] # this is the label of the neighbors\n", " probability_labels[label]+=1./no_of_neighbours\n", "\n", " predicted_labels[id_input]=np.argmax(probability_labels)\n", " \n", "print (predicted_labels)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# we can put all together!\n", "def nearest_neighbour_classification(testing_inputs, training_inputs,training_outputs, no_of_neighbours):\n", " \n", " # first thing is to compute pairwise distances\n", " distances = pairwise_distances(testing_inputs, training_inputs)\n", " \n", " no_of_classes = 1 + np.max(training_outputs)\n", " sorted_indices = np.argsort(distances, axis=1)\n", " \n", " no_of_inputs = len(testing_inputs) # these are the data that we are trying to classify\n", " no_of_points = len(training_inputs)\n", " \n", " new_array_to_sort = np.broadcast_to(training_outputs,(no_of_inputs, no_of_points))\n", " sorted_labels = np.take_along_axis(new_array_to_sort,sorted_indices, 1)\n", " \n", " predicted_labels=np.zeros(no_of_inputs,int)\n", " for id_input in range(no_of_inputs):\n", " \n", " probability_labels=np.zeros(no_of_classes,float)\n", " \n", " for id_neighbor in range(0,no_of_neighbours):\n", " probability_labels[sorted_labels[id_input][id_neighbor]]+=1./no_of_neighbours\n", "\n", " predicted_labels[id_input]=np.argmax(probability_labels)\n", " \n", " \n", " return predicted_labels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Test your function with the following unit tests" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_data_inputs = np.array([[15], [6], [18]])\n", "test_training_inputs = np.array([[10], [8], [17], [5], [12], [10], [20], [1],\n", " [11]])\n", "test_training_outputs = np.array([1, 1, 1, 0, 0, 1, 1, 0, 1])\n", "test_no_of_neighbours = 1\n", "nearest_neighbour_classification(test_data_inputs, test_training_inputs,test_training_outputs, test_no_of_neighbours)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# it works also with multiple classes\n", "test_data_inputs = np.array([[16, 0], [13, 6], [6, 5], [14, 15]])\n", "test_training_inputs = np.array([[ 7, 13], [ 2, 4], [ 8, 15], [16, 19], [18, 3], [11, 19], [13, 4], [ 7, 18], [ 4, 14],\\\n", " [10, 6], [17, 19], [19, 7], [ 4, 19], [18, 11], [18, 3], [14, 15], [14, 0], [ 5, 17],\\\n", " [ 5, 15], [ 7, 19]])\n", "test_training_outputs = np.array(\n", " [0, 2, 1, 0, 1, 0, 1, 1, 2, 1, 2, 2, 1, 1, 0, 1, 0, 0, 1, 2])\n", "test_no_of_neighbours = 3\n", "assert_array_almost_equal(nearest_neighbour_classification(test_data_inputs, test_training_inputs, \\\n", " test_training_outputs, test_no_of_neighbours),\\\n", " np.array([0, 1, 1, 0]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. 
Write a function $\\mathtt{classification\\_accuracy}$ that takes two NumPy array arguments, $\\mathtt{predicted\\_labels}$ and $\\mathtt{true\\_labels}$, and evaluates the ratio of the number of labels that coincide in both arrays to the total number of labels." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def classification_accuracy(predicted_labels, true_labels):\n", " equal_labels = predicted_labels == true_labels # element-wise comparison returns a boolean array \n", " return np.mean(equal_labels) # the mean of a boolean array is the fraction of True entries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Test your function with the following unit tests" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_estimated_labels = np.array([0,4,0,0,2,4,0,0,2,2,2,3,3,1,3,4,0,3,4,0,1,1,2,0,0,0,\\\n", " 1,4,4,4,3,0,4,2,4,4,4,2,2,1,4,3,2,2,1,1,4,3,3,0,4,3,\\\n", " 0,0,0,2,4,3,4,3,1,3,2,4,2,3,2,3,2,3,1,0,4,3,2,3,1,3,\\\n", " 4,1,3,1,4,0,4,4,1,2,3,1,1,4,3,1,3,0,2,0,0,1])\n", "test_true_labels = np.array([4,4,4,4,1,3,0,4,1,0,2,3,1,2,0,1,3,4,3,4,4,4,2,0,3,4,3,2,3,\\\n", " 0,1,0,4,3,2,2,1,2,2,1,4,0,4,4,3,0,0,0,3,2,3,0,0,3,3,4,2,2,\\\n", " 3,4,2,3,3,0,0,2,0,3,0,4,4,4,4,1,2,3,1,3,1,1,2,0,2,3,3,3,4,\\\n", " 0,0,3,1,0,3,0,3,1,0,0,4,4])\n", "\n", "assert_array_almost_equal(\n", " classification_accuracy(test_estimated_labels, test_true_labels), 0.26)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. After we have learned how to classify the data, we would like to learn how to optimise the number of neighbours participating in the classification. This is done by performing standard $K$-fold cross validation. Below we provide you with some functions discussed in previous assignments. These functions are used to perform $K$-fold cross validation and a subsequent grid search for hyperparameter optimisation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def KFold_split(data_size, K):\n", " np.random.seed(123456789)\n", " indexes = np.random.permutation(data_size)\n", " m, r = divmod(data_size, K)\n", " indexes_split = [\n", " indexes[i * m + min(i, r):(i + 1) * m + min(i + 1, r)]\n", " for i in range(K)\n", " ]\n", " return indexes_split\n", "\n", "\n", "def KFold_cross_validation_knn(data_inputs, data_outputs, K, labels_evaluation,\n", " missclassification_evaluation,knn):\n", " \n", " data_size = len(data_inputs)\n", " indexes_split = KFold_split(data_size, K)\n", "\n", " average_accuracy = 0\n", " for i in range(K):\n", " training_indexes = np.concatenate([indexes_split[j] for j in range(K) if (j != i)])\n", " \n", " predicted_labels = labels_evaluation(data_inputs[indexes_split[i]],\n", " data_inputs[training_indexes],\n", " data_outputs[training_indexes],knn)\n", " \n", " accuracy = missclassification_evaluation(predicted_labels,\n", " data_outputs[indexes_split[i]])\n", " average_accuracy += accuracy / K\n", " \n", " error = 1.-average_accuracy\n", " return error" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5. We also need a grid search function to find the optimal number of neighbours to be included in the classification algorithm; a minimal sketch of the idea follows below, before the implementation." 
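] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The idea (a sketch on a made-up toy objective, not part of the coursework): evaluate the objective at every grid point and keep the grid point with the smallest value." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# sketch of the grid-search idea on a made-up toy objective (k - 7)^2, minimised at k = 7\n", "toy_grid = np.arange(1, 21)\n", "toy_values = np.array([(k - 7.0)**2 for k in toy_grid])\n", "print (toy_grid[np.argmin(toy_values)])"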
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def grid_search(objective, grid):\n", " values = np.array([])\n", " for point in grid:\n", " values = np.append(values, objective(point))\n", " return grid[np.argmin(values)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "6. Now define a grid of $K$ (number of neighbours to be included in the classification) values to be $\\left[1,2,\\ldots,20\\right]$ and determine an optimal parameter $\\mathtt{no\\_of\\_neighbours}$ based on K-fold cross validation and grid search. Assign a variable $\\mathtt{optimal\\_no\\_of\\_neighbours}$ with your result. For $K$-fold cross validation use $K = 5$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# note that this code takes a bit to run!\n", "\n", "# as first step let's write first very explicitely the code for what we need, then we can make slicker\n", "\n", "# we need an objective function to evaluate the values of the grid as we change the number\n", "# of nearest neighbors\n", "def objective_fun(Knn,K=5):\n", " # for a given number of nearest neighbors we first get a split of the data\n", " # into training and testing\n", " indexes_split = KFold_split(data_size, K)\n", "\n", " average_accuracy = 0\n", " # then for each split we compute the predicted labels and quantify the error\n", " for i in range(K):\n", " # this is the \"leave one out\" for training which is used for valdiation\n", " training_indexes = np.concatenate([indexes_split[j] for j in range(K) if (j != i)])\n", "\n", " predicted_labels = nearest_neighbour_classification(height_weight_input[indexes_split[i]],\n", " height_weight_input[training_indexes],\n", " height_weight_output[training_indexes], Knn)\n", "\n", " accuracy = classification_accuracy(predicted_labels, height_weight_output[indexes_split[i]])\n", " average_accuracy += accuracy / K\n", " \n", " error = 1.-average_accuracy\n", "\n", " return error\n", "\n", "\n", "Knn=2\n", "data_size = len(height_weight_input)\n", "Knn = list(range(1,21,1))\n", "\n", "grid_search(objective_fun, Knn)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# we can do the same thing more concisely using the KFold_cross_validation function defined above\n", "\n", "K=5\n", "data_size = len(height_weight_input)\n", "Knn = list(range(1,21,1))\n", "\n", "\n", "evaluation = lambda testing_data_inputs,training_data_inputs, training_data_outputs, k: nearest_neighbour_classification(testing_data_inputs,training_data_inputs,training_data_outputs,no_of_neighbours=k)\n", "\n", "missclassification_evaluation= lambda predicted_labels,true_labels: classification_accuracy(predicted_labels,true_labels)\n", "\n", "\n", "K_objective_function = lambda k: KFold_cross_validation_knn(height_weight_input,height_weight_output, K,evaluation,missclassification_evaluation,k)\n", "\n", "optimal_no_of_neighbours = grid_search(K_objective_function, Knn)\n", "optimal_classification_error = K_objective_function(optimal_no_of_neighbours)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Optimal number of neighbours one needs to consider is equal to {n}. 
\\nIn this case the prediction error would be as little as {e:2.2f}%.\"\\\n", " .format(n = optimal_no_of_neighbours, e = 100*optimal_classification_error))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Test your answers with the unit tests" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "assert_array_almost_equal(optimal_no_of_neighbours, 15)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "assert_array_almost_equal(optimal_classification_error, 0.0796)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7. Visualise your results with your favourite visualisation tools, e.g. Pyplot from Matplotlib. You need to plot the data points together with the decision regions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grid_seed_heights = 100\n", "grid_seed_weights = 100\n", "\n", "# first we build the grid for the plot\n", "grid_heights = np.linspace(np.min(height_weight_input[:, 0]),\n", " np.max(height_weight_input[:, 0]),grid_seed_heights)\n", "\n", "grid_weights = np.linspace(np.min(height_weight_input[:, 1]),\n", " np.max(height_weight_input[:, 1]),grid_seed_weights)\n", "\n", "height_grid_data, weight_grid_data = np.meshgrid(grid_heights,grid_weights,sparse=False) \n", "\n", "full_grid_data = np.c_[height_grid_data.reshape(-1, 1),weight_grid_data.reshape(-1, 1)]\n", "\n", "# then we get the labels out of the best model\n", "full_grid_labels =nearest_neighbour_classification(full_grid_data,height_weight_input,\n", " height_weight_output,\n", " optimal_no_of_neighbours)\n", "\n", "\n", "# then we can plot it; extent=[54, 82, 50, 270] gives the axis ranges in the original (de-standardised) height and weight units\n", "plt.imshow(full_grid_labels.reshape(grid_seed_heights, grid_seed_weights),\n", " cmap='Spectral',origin='lower',extent=[54, 82, 50, 270],aspect=\"auto\",alpha=0.7)\n", "\n", "# we can also overlay the data points\n", "height_weight_de_standardised = de_standardise(height_weight_input,\n", " height_weight_row_of_means,\n", " height_weight_row_of_stds)\n", "plt.scatter(height_weight_de_standardised[:,0][height_weight_output == 0], \\\n", " height_weight_de_standardised[:,1][height_weight_output == 0], s=1, c = 'Red',alpha=0.1)\n", "plt.scatter(height_weight_de_standardised[:,0][height_weight_output == 1], \\\n", " height_weight_de_standardised[:,1][height_weight_output == 1], s=1, c = 'Blue',alpha=0.1)\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Binary logistic regression\n", "In this second part we will learn how to solve a binary classification problem for the Titanic data using the logistic regression method. The data consists of details of the passengers of the famous Titanic, including:\n", "- $\\mathtt{Age}$ - age of the passenger;\n", "- $\\mathtt{Embarkation\\,\\,\\,port}$ - one of three ports (Southampton, Queenstown, Cherbourg) where the passenger joined the trip;\n", "- $\\mathtt{Fare}$ - price the passenger paid for a ticket;\n", "- $\\mathtt{Parent/children}$ - number of parents/children of the passenger on board;\n", "- $\\mathtt{Ticket\\,\\,\\,class}$ - one of three ticket classes (1,2,3);\n", "- $\\mathtt{Sex}$ - sex of the passenger;\n", "- $\\mathtt{Siblings/spouses}$ - number of siblings/spouses of the passenger on board;\n", "- $\\mathtt{Survival}$ - binary value representing whether the passenger survived (1) or not (0).\n", "\n", "We start by loading the data. 
Please pay attention to how we have dealt with the embarkation port column: instead of one column with categorical data we introduce three binary columns, each indicating whether or not the passenger joined at a specific port. **Important:** please check that the file $\\mathtt{titanic.csv}$ is located in the same folder as your Jupyter notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "titanic_dataset_path = \"titanic.csv\"\n", "age = np.genfromtxt(titanic_dataset_path,\n", " delimiter=\",\",\n", " skip_header=1,\n", " usecols=[0]).astype(float)\n", "embarkation_port = np.genfromtxt(titanic_dataset_path, delimiter=\",\", \\\n", " skip_header=1, usecols=[1], dtype = None, encoding=None)\n", "embarkation_port_S = np.where(embarkation_port == 'S', 1, 0).astype(int)\n", "embarkation_port_Q = np.where(embarkation_port == 'Q', 1, 0).astype(int)\n", "embarkation_port_C = np.where(embarkation_port == 'C', 1, 0).astype(int)\n", "fare = np.genfromtxt(titanic_dataset_path,\n", " delimiter=\",\",\n", " skip_header=1,\n", " usecols=[2]).astype(float)\n", "parent_children = np.genfromtxt(titanic_dataset_path,\n", " delimiter=\",\",\n", " skip_header=1,\n", " usecols=[3]).astype(int)\n", "ticket_class = np.genfromtxt(titanic_dataset_path,\n", " delimiter=\",\",\n", " skip_header=1,\n", " usecols=[4]).astype(int)\n", "sex = np.genfromtxt(titanic_dataset_path, delimiter=\",\", skip_header=1, usecols=[5],\\\n", " converters = {5:lambda x: 0 if x == b'male' else 1}).astype(int)\n", "siblings_spouses = np.genfromtxt(titanic_dataset_path,\n", " delimiter=\",\",\n", " skip_header=1,\n", " usecols=[6]).astype(int)\n", "survival = np.genfromtxt(titanic_dataset_path,\n", " delimiter=\",\",\n", " skip_header=1,\n", " usecols=[7]).astype(int)\n", "titanic_input = np.c_[age,fare,parent_children,ticket_class,\\\n", " sex,siblings_spouses,embarkation_port_S,embarkation_port_Q,embarkation_port_C]\n", "titanic_labels = survival.reshape(-1, 1)\n", "titanic_input, titanic_row_of_means, titanic_row_of_stds = standardise(\n", " titanic_input)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_titanic(data_input, data_output):\n", " characteristics_names = \\\n", " ['Age', 'Fare', 'ParentChildren', 'Class', 'Sex', 'SiblingsSpouses', 'Southampton', 'Queenstown','Cherbourg']\n", " titanic_de_standardised = de_standardise(data_input, titanic_row_of_means,\n", " titanic_row_of_stds)\n", " titanic_dataframe = pd.DataFrame(data=titanic_de_standardised,\n", " columns=characteristics_names)\n", " titanic_dataframe['Survival'] = data_output\n", " sns.set_theme(style=\"ticks\", color_codes=True)\n", " # vars lists the columns to plot (all characteristics, excluding the Survival hue)\n", " g = sns.PairGrid(titanic_dataframe,\n", " vars=characteristics_names,\n", " hue='Survival')\n", " g.map_offdiag(sns.scatterplot)\n", " g.map_diag(plt.hist)\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_titanic(titanic_input, titanic_labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. 
Implement a function $\\mathtt{linear\\_regression\\_data}$ that outputs the linear regression data matrix defined as\n", "$$\n", "\\mathbf{\\Phi\\left(X\\right)} = \n", "\\begin{pmatrix}\n", "1 & x^{(1)}_1 & x^{(1)}_2 & \\ldots & x^{(1)}_d \\\\\n", "1 & x^{(2)}_1 & x^{(2)}_2 & \\ldots & x^{(2)}_d \\\\\n", "\\vdots & \\vdots & \\vdots & \\ddots & \\vdots \\\\\n", "1 & x^{(s)}_1 & x^{(s)}_2 & \\ldots & x^{(s)}_d \\\\\n", "\\end{pmatrix}\n", "$$\n", "The function $\\mathtt{linear\\_regression\\_data}$ should take the NumPy array *data_inputs* as its argument. Here, *data_inputs* is supposed to be a data matrix containing all inputs in the matrix form as follows\n", "$$\n", "\\mathbf{X} = \n", "\\begin{pmatrix}\n", "x^{(1)}_1 & x^{(1)}_2 & \\ldots & x^{(1)}_d \\\\\n", "x^{(2)}_1 & x^{(2)}_2 & \\ldots & x^{(2)}_d \\\\\n", "\\vdots & \\vdots & \\ddots & \\vdots \\\\\n", "x^{(s)}_1 & x^{(s)}_2 & \\ldots & x^{(s)}_d \\\\\n", "\\end{pmatrix}.\n", "$$\n", "The function should output the data matrix $\\mathbf{\\Phi\\left(X\\right)}$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def linear_regression_data(data_inputs):\n", " first_column = np.ones((len(data_inputs), 1))\n", " X_matrix = np.c_[first_column, data_inputs]\n", " return X_matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Implement a function $\\mathtt{model\\_function}$ that takes two arguments _data_matrix_ and _weights_ and outputs the values of the model function $f\\left(\\mathbf{x},\\mathbf{w}\\right) = \\left\\langle \\phi\\left(\\mathbf{x}\\right),\\mathbf{w} \\right\\rangle$ evaluated for all data samples $\\mathbf{x}_i$, $i=\\overline{1,s}$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def model_function(data_matrix, weights):\n", " return data_matrix @ weights" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Test your function with the following unit tests" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_data_matrix = np.array([[1, -1, 2, -3], [1, 0, 2, 4], [1, -3, 0, 5]])\n", "test_weights = np.array([[2], [0], [1], [-1]])\n", "assert_array_almost_equal(model_function(test_data_matrix, test_weights),\n", " np.array([[7], [0], [-3]]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. Write a function $\\mathtt{logistic\\_function}$ that takes an argument named *inputs* and returns the output of the sigmoid function $\\sigma(x)$, i.e.\n", "\\begin{equation*} \n", "\\sigma(x) := \\frac{1}{1+\\mathrm{e}^{-x}} \\, , \n", "\\end{equation*}\n", "applied element-wise to *inputs*. Here $x$ is the mathematical notation for the argument *inputs*." 
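] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before implementing it, note that $\\sigma$ maps the whole real line into $(0,1)$ with $\\sigma(0)=1/2$, which is what lets us interpret its output as a probability. A quick illustrative sketch (not part of the coursework):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# sketch: the sigmoid squashes the real line into the interval (0, 1)\n", "x_values = np.linspace(-6, 6, 200)\n", "plt.plot(x_values, 1 / (1 + np.exp(-x_values)))\n", "plt.axhline(0.5, linestyle='--')\n", "plt.xlabel('x', fontsize=16)\n", "plt.ylabel('sigma(x)', fontsize=16)\n", "plt.show()"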
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def logistic_function(inputs):\n", " return 1 / (1 + np.exp(-inputs))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Test your function with the following unit tests" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_inputs = np.array([[0], [np.log(25)], [-6], [np.log(9)], [2]])\n", "assert_array_almost_equal(logistic_function(test_inputs), np.array([[1/2], [25/26], [0.0024726231566347743], \\\n", " [9/10], [(np.exp(2))/(1 + np.exp(2))]]))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def binary_prediction_function(data_matrix, weights):\n", " # the binary classification can be obtained by applying the logistic function to the predicted ys i.e., Xw\n", " # in this first implementation we use classic regression first then use the logistic function later\n", " probability = logistic_function(model_function(data_matrix, weights))\n", " return probability > 1/2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Test your function with the following unit tests" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_data_matrix = np.array([[1, -1, 2, -3], [1, 0, 2, 4], [1, -3, 0, 5]])\n", "test_weights = np.array([[2], [0], [1], [-1]])\n", "assert_array_almost_equal(\n", " binary_prediction_function(test_data_matrix, test_weights),\n", " np.array([[1], [0], [0]]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5. Implement a function $\\mathtt{gradient\\_descent}$ that performs gradient descent to numerically approximate a minimiser of a convex function. The function should take the following arguments\n", "- *objective* - a lambda-function representing function $E$. This itself should take a NumPy array as its argument and return a real number.\n", "- *gradient* - a lambda-function representing function $\\nabla E$. This itself should take a NumPy array as its argument and return a NumPy array representation of the gradient $\\nabla E$.\n", "- *initial_ weights* - a NumPy array with initial values $\\mathbf{w}^{(0)}$ for the first iterate \n", "- *step_size* - a step-size parameter $\\tau$ for the gradient descent step\n", "- *no_of_iterations* - an integer parameter that controls the number of iterations\n", "- *print_output* - an integer parameter that controls how often you are printing an intermediate result. If say *print_output = 100*, then after every 100th iteration you are asked to print your current iterate and a value of the objective as *Iteration k/m, objective = o.*, where $k$ is a number of current iteration, $m$ is a total number of iterations, and $o$ is a value of the objective evaluated at current iterate.\n", "\n", "Implement the function so that it returns a NumPy array of the weights obtained after gradient descent together with a list of objective values for all iterates." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def gradient_descent(objective,gradient,initial_weights,step_size=1,no_of_iterations=100,print_output=10):\n", " \n", " objective_values = []\n", " weights = np.copy(initial_weights)\n", " objective_values.append(objective(weights))\n", " \n", " for counter in range(no_of_iterations):\n", " weights -= step_size * gradient(weights)\n", " objective_values.append(objective(weights))\n", " if (counter + 1) % print_output == 0:\n", " print(\"Iteration {k}/{m}, objective = {o}.\".format(k=counter+1,m=no_of_iterations, \n", " o=objective_values[counter]))\n", " \n", " print(\"Iteration completed after {k}/{m}, objective = {o}.\".format(k=counter+ 1,\n", " m=no_of_iterations, \n", " o=objective_values[counter]))\n", " return weights, objective_values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Test your function with the following unit tests" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_matrix_m = np.array([[3, 1], [2, 4]])\n", "test_vector_v = np.array([5, 6])\n", "# let us consider the usual E(x) function\n", "test_objective = lambda x: x.T @ (test_matrix_m @ x) + x @ test_vector_v\n", "#we know the gradient\n", "test_gradient = lambda x: (test_matrix_m + test_matrix_m.T) @ x + test_vector_v\n", "\n", "test_initial_weights = np.array([0.0, 0.0])\n", "test_step_size = 0.9 / (np.linalg.norm(test_matrix_m + test_matrix_m.T))\n", "test_no_of_iterations = 100\n", "test_print_output = 10\n", "assert_array_almost_equal(gradient_descent(test_objective, test_gradient, \\\n", " test_initial_weights,test_step_size,\\\n", " test_no_of_iterations, test_print_output)[0],np.array([-0.564103, -0.538462]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "6. Write two functions that implement the cost function for binary logistic regression as well as its gradient, as defined below. \n", "$$\n", "\\mathrm{L}\\left(\\mathbf{w}\\right) = \\frac{1}{s} \\left(\n", "\\sum\\limits_{i=1}^s \\log\\left[1+\\exp\\left(f\\left(\\mathbf{x}^{(i)},\\mathbf{w}\\right)\\right)\\right] - y_i\\cdot f\\left(\\mathbf{x}^{(i)},\\mathbf{w}\\right)\n", "\\right),\n", "$$\n", "where $\\mathbf{x}^{(i)}$ is a vector representing $i$-th data sample and $f$ is a model function. In the case of linear model function $f\\left(\\mathbf{x},\\mathbf{w}\\right) = \\left\\langle \\phi\\left(\\mathbf{x}\\right),\\mathbf{w} \\right\\rangle$ one has\n", "$$\n", "\\nabla \\mathrm{L}\\left(\\mathbf{w}\\right) = \\frac{1}{s} \\left(\n", "\\sum\\limits_{i=1}^s \\phi\\left(\\mathbf{x}^{(i)}\\right)\\cdot\\sigma\n", "\\left(\\left\\langle \\phi\\left(\\mathbf{x}^{(i)}\\right),\\mathbf{w} \\right\\rangle \\right) - y_i\\cdot \\phi\\left(\\mathbf{x}^{(i)}\\right)\n", "\\right),\n", "$$\n", "with $\\phi\\left(\\mathbf{x}^{(i)}\\right)$ being an augmented $i$-th data vector containing additional coordinate $1$.\n", "\n", "The function for the cost function is named $\\mathtt{binary\\_logistic\\_regression\\_cost\\_function}$ and should take the NumPy arrays _data_matrix_, _data_labels_ and _weights_ as arguments. Subsequently, write a method $\\mathtt{binary\\_logistic\\_regression\\_gradient}$ that takes the same inputs as $\\mathtt{binary\\_logistic\\_regression\\_cost\\_function}$ and computes the gradient of the binary logistic regression cost function as defined in the lecture." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def binary_logistic_regression_cost_function(data_matrix, data_labels,weights):\n", " regression_outputs = model_function(data_matrix, weights)\n", " return np.mean(np.log(1 + np.exp(regression_outputs)) -data_labels * regression_outputs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def binary_logistic_regression_gradient(data_matrix, data_labels, weights):\n", " return data_matrix.T @ (logistic_function(model_function(data_matrix, weights)) - data_labels) / len(data_matrix)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Test your function with the following unit tests" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_data_matrix = np.array([[1, -1, 2, -3], [1, 0, 2, 4], [1, -3, 0, 5]])\n", "test_data_labels = np.array([[1], [0], [0]])\n", "test_weights = np.array([[2], [0], [1], [-1]])\n", "assert_array_almost_equal(binary_logistic_regression_cost_function(test_data_matrix,\n", " test_data_labels, test_weights),0.24754867)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_data_matrix = np.array([[1, -1, 2, -3], [1, 0, 2, 4], [1, -3, 0, 5]])\n", "test_data_labels = np.array([[1], [0], [0]])\n", "test_weights = np.array([[2], [0], [1], [-1]])\n", "assert_array_almost_equal(binary_logistic_regression_gradient(test_data_matrix,test_data_labels,test_weights), \\\n", " np.array([[0.182172],[-0.047122],[0.332726], [0.746621]]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7. Finally use all the above methods to evaluate predicted class labels (survival) for all Titanic passengers. Use all data to train your model and evaluate classification corresponding classification accuracy. You are expected to:\n", "- build a linear regression data matrix _titanic_data_matrix_ based on the _titanic_input_ data\n", "- evaluate optimal weights by running the gradient descent method: take initial weights to be zero and run $N = 2000$ iterations with a step-size parameter equal to $\\tau = 3.9\\cdot s \\cdot\\left\\|\\mathbf{\\Phi\\left(X\\right)}\\right\\|^{-2}$, where $s$ is a number of data samples, $\\mathbf{\\Phi\\left(X\\right)}$ is a mathematical representation of _data_matrix_. Assign a variable _titanic_optimal_weights_ with your result.\n", "- evaluate predicted class labels for all data samples. 
Store the result in the _titanic_predicted_labels_ variable.\n", "- evaluate the accuracy rate of your prediction (should be a value between 0 and 1) and store its value in a variable _titanic_accuracy_rate_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# first let's get the data matrix from the inputs\n", "titanic_data_matrix = linear_regression_data(titanic_input)\n", "\n", "# we then define the cost function\n", "titanic_cost_function = lambda weights: binary_logistic_regression_cost_function(titanic_data_matrix,\n", " titanic_labels, weights)\n", "# then we get the gradient\n", "titanic_gradient_function = lambda weights: binary_logistic_regression_gradient(titanic_data_matrix,\n", " titanic_labels, weights)\n", "\n", "# start with w^0\n", "titanic_initial_weights = np.zeros((len(titanic_data_matrix.T), 1))\n", "\n", "# set the step size\n", "titanic_step_size = 3.9 * len(titanic_data_matrix) / (np.linalg.norm(titanic_data_matrix))**2\n", "\n", "# apply gradient descent\n", "titanic_optimal_weights, titanic_objective_values = gradient_descent(titanic_cost_function,\n", " titanic_gradient_function,\n", " titanic_initial_weights,\n", " titanic_step_size,2000, 200)\n", "\n", "# get the predicted labels\n", "titanic_predicted_labels = binary_prediction_function(titanic_data_matrix,titanic_optimal_weights)\n", "\n", "# measure the accuracy by comparing the predicted vs the true labels\n", "titanic_accuracy_rate = classification_accuracy(titanic_predicted_labels,titanic_labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Test your function with the following unit tests" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "assert_array_almost_equal(titanic_optimal_weights, np.array([[-0.65370587], [-0.55872717],\\\n", " [0.09439554], [-0.0750632], [-0.94173469],\\\n", " [1.29812358], [-0.38153153], [-0.08951093], \\\n", " [0.01944579], [0.08813609]]))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "assert_array_almost_equal(titanic_accuracy_rate, 0.8024691358024691)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"The optimal weights are w = {w}.T.\".format(w=titanic_optimal_weights.T))\n", "print(\"The classification accuracy for the training set is {p} %.\".format(\n", " p=100 * titanic_accuracy_rate))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.17" } }, "nbformat": 4, "nbformat_minor": 2 }