{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<table width=100%>\n",
    "    <tr>\n",
    "        <td width = 40% align = \"left\">\n",
    "            <h3> MTH786 Machine Learning with Python</h3>\n",
    "        </td>\n",
    "        <td width = 35%>            \n",
    "        </td>\n",
    "        <td width = 25% align = \"left\">\n",
    "            <h3>Semester A, 2023/2024 </h3>\n",
    "        </td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td width = 40% align = \"left\">\n",
    "            <h3> Coursework 9</h3>\n",
    "        </td>\n",
    "        <td width = 35%>            \n",
    "        </td>\n",
    "        <td width = 25% align = \"left\">\n",
    "            <h3>Dr Nicola Perra </h3>\n",
    "        </td>\n",
    "    </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from numpy.testing import assert_array_almost_equal, assert_array_equal\n",
    "import matplotlib.pyplot as plt\n",
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Nearest neighbour classification\n",
    "In this exercise you will work with a data set you are familiar with: $\\mathtt{heigh\\_weight.csv}$. By completing this exercise you will learn how to solve binary classification problem with non-parametric method. In the second part we will learn how to solve the classification problem for titanic data using logistic regression method.\n",
    "\n",
    "We start by loading data. **Important:** please check that file $\\mathtt{height\\_weight.csv}$ is located in the same folder with your Jupyter notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "types = np.genfromtxt(\"height_weight.csv\", delimiter=\",\", skip_header=1, usecols=[0], \\\n",
    "                        converters={0: lambda x: 0 if b\"A\" in x else 1})\n",
    "heights = np.genfromtxt(\"height_weight.csv\",\n",
    "                        delimiter=\",\",\n",
    "                        skip_header=1,\n",
    "                        usecols=[1])\n",
    "weights = np.genfromtxt(\"height_weight.csv\",\n",
    "                        delimiter=\",\",\n",
    "                        skip_header=1,\n",
    "                        usecols=[2])\n",
    "indexes = np.argsort(heights)\n",
    "\n",
    "height_weight_input = np.c_[heights[indexes], weights[indexes]]\n",
    "height_weight_output = types[indexes]\n",
    "\n",
    "\n",
    "def standardise(data_matrix):\n",
    "    row_of_means = np.mean(data_matrix, axis=0)\n",
    "    standardised_matrix = data_matrix - row_of_means\n",
    "    row_of_stds = np.std(standardised_matrix, axis=0)\n",
    "    return (standardised_matrix / row_of_stds), row_of_means, row_of_stds\n",
    "\n",
    "\n",
    "def de_standardise(standardised_matrix, row_of_means, row_of_stds):\n",
    "    matrix = np.copy(standardised_matrix * row_of_stds)\n",
    "    return matrix + row_of_means\n",
    "\n",
    "\n",
    "height_weight_input, height_weight_row_of_means, height_weight_row_of_stds = standardise(\n",
    "    height_weight_input)\n",
    "\n",
    "\n",
    "def plot_height_weight(data_input, data_output):\n",
    "    height_weight_de_standardised = de_standardise(data_input,\n",
    "                                                   height_weight_row_of_means,\n",
    "                                                   height_weight_row_of_stds)\n",
    "    plt.scatter(height_weight_de_standardised[:,0][data_output == 0], \\\n",
    "                height_weight_de_standardised[:,1][data_output == 0], s=1, c = 'r')\n",
    "    plt.scatter(height_weight_de_standardised[:,0][data_output == 1], \\\n",
    "                height_weight_de_standardised[:,1][data_output == 1], s=1, c = 'b')\n",
    "    plt.xlabel('Height', fontsize=16)\n",
    "    plt.xticks(fontsize=16)\n",
    "    plt.ylabel('Weight', fontsize=16)\n",
    "    plt.yticks(fontsize=16)\n",
    "    plt.tight_layout"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_height_weight(height_weight_input, height_weight_output)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our main goal is to identify the type by studying their height and weight. We will end up with building a classifier which will split the height-weight plane into two regions: one corresponding to type A and another to type B. To achieve this we suggest to use the $K$-nearest neighbour classification method.\n",
    "\n",
    "1. Write the function $\\mathtt{pairwise\\_distances}$  that takes two arguments:  $\\mathtt{from\\_data}$, $\\mathtt{to\\_data}$ - both are NumPy arrays of the same width that may have different height. These arrays represent two collections (later testing data and training data) of row-vectors, each describing a data point in  $d$ -dimensional input space. In other words for $$ \n",
    "\\mathtt{from\\_data} =\n",
    "\\begin{pmatrix}\n",
    "x^{(1)}_1 & x^{(1)}_2 & \\ldots & X^{(1)}_d \\\\\n",
    "x^{(2)}_1 & x^{(2)}_2 & \\ldots & x^{(2)}_d \\\\\n",
    "\\vdots & \\vdots & \\ddots & \\vdots \\\\\n",
    "x^{(p)}_1 & x^{(p)}_2 & \\ldots & x^{(p)}_d \n",
    "\\end{pmatrix}\n",
    ", \n",
    "\\quad\n",
    "\\mathtt{to\\_data} =\n",
    "\\begin{pmatrix}\n",
    "y^{(1)}_1 & y^{(1)}_2 & \\ldots & y^{(1)}_d \\\\\n",
    "y^{(2)}_1 & y^{(2)}_2 & \\ldots & y^{(2)}_d \\\\\n",
    "\\vdots & \\vdots & \\ddots & \\vdots \\\\\n",
    "y^{(q)}_1 & y^{(q)}_2 & \\ldots & y^{(q)}_d \n",
    "\\end{pmatrix}\n",
    "$$\n",
    "your function should return a NumPy array of size $p\\times q$ of the form\n",
    "$$\n",
    "\\mathtt{pairwise\\_distances} = \n",
    "\\begin{pmatrix}\n",
    "\\mathrm{dist}\\left(x^{(1)},y^{(1)}\\right) &\n",
    "\\mathrm{dist}\\left(x^{(1)},y^{(2)}\\right) & \\ldots & \\mathrm{dist}\\left(x^{(2)},y^{(q)}\\right) \\\\\n",
    "\\mathrm{dist}\\left(x^{(2)},y^{(1)}\\right) &\n",
    "\\mathrm{dist}\\left(x^{(2)},y^{(2)}\\right) & \\ldots & \\mathrm{dist}\\left(x^{(1)},y^{(q)}\\right) \\\\\n",
    "\\vdots &\n",
    "\\vdots & \\ddots & \\vdots\\\\\n",
    "\\mathrm{dist}\\left(x^{(p)},y^{(1)}\\right) &\n",
    "\\mathrm{dist}\\left(x^{(p)},y^{(2)}\\right) & \\ldots & \\mathrm{dist}\\left(x^{(p)},y^{(q)}\\right)\n",
    "\\end{pmatrix}.\n",
    "$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def pairwise_distances(from_data, to_data):\n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Test your function with the following unit tests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_from_data = np.array([[0, 1, 2], [3, 4, 5]])\n",
    "test_to_data = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]])\n",
    "assert_array_almost_equal(pairwise_distances(test_from_data, test_to_data), np.array([[ 0., 5.19615242, 10.39230485, 15.58845727],\\\n",
    "                                                                                      [ 5.19615242,  0., 5.19615242, 10.39230485]]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. Write a function $\\mathtt{nearest\\_neighbour\\_classification}$ that performs nearest neighbour classiffication as introduced in the lecture notes and that takes four arguments \n",
    "- $\\mathtt{testing\\_inputs}$ - a NumPy array of input samples to be classiffied\n",
    "- $\\mathtt{training\\_inputs}$ - a NumPy array of data points for which the classification is known\n",
    "- $\\mathtt{training\\_outputs}$ - a NumPy array of class labels for $\\mathtt{training\\_inputs}$ data\n",
    "- $\\mathtt{no\\_of\\_neighbours}$ - an integer argument that specifies how many neighbours we take into account to determine the class labels for the $\\mathtt{testing\\_inputs}$. Use the Euclidean norm to measure the distance between samples.\n",
    "\n",
    "To implement the classification algorithm you may follow the below steps:  \n",
    "\n",
    "a. Calculate the pairwise distances between testing and training data\n",
    "\n",
    "b. Sort the pairwise distances by rows (sort rather indices)\n",
    "\n",
    "c. For every row (data input) evaluate labels of the closest $K$ points \n",
    "\n",
    "d. Assign a data input with the class label that appears the most in between $K$ closest points"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def nearest_neighbour_classification(testing_inputs, training_inputs,\n",
    "                                     training_outputs, no_of_neighbours):\n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Test your function with the following unit tests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_data_inputs = np.array([[15], [6], [18]])\n",
    "test_training_inputs = np.array([[10], [8], [17], [5], [12], [10], [20], [1],\n",
    "                                 [11]])\n",
    "test_training_outputs = np.array([1, 1, 1, 0, 0, 1, 1, 0, 1])\n",
    "test_no_of_neighbours = 1\n",
    "assert_array_almost_equal(nearest_neighbour_classification(test_data_inputs, test_training_inputs, \\\n",
    "                                                           test_training_outputs, test_no_of_neighbours),\\\n",
    "                          np.array([1, 0, 1]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_data_inputs = np.array([[16, 0], [13, 6], [6, 5], [14, 15]])\n",
    "test_training_inputs = np.array([[ 7, 13], [ 2,  4], [ 8, 15], [16, 19], [18,  3], [11, 19], [13,  4], [ 7, 18], [ 4, 14],\\\n",
    "                                 [10,  6], [17, 19], [19,  7], [ 4, 19], [18, 11], [18,  3], [14, 15], [14,  0], [ 5, 17],\\\n",
    "                                 [ 5, 15], [ 7, 19]])\n",
    "test_training_outputs = np.array(\n",
    "    [0, 2, 1, 0, 1, 0, 1, 1, 2, 1, 2, 2, 1, 1, 0, 1, 0, 0, 1, 2])\n",
    "test_no_of_neighbours = 3\n",
    "assert_array_almost_equal(nearest_neighbour_classification(test_data_inputs, test_training_inputs, \\\n",
    "                                                           test_training_outputs, test_no_of_neighbours),\\\n",
    "                          np.array([0, 1, 1, 0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3. Write a function $\\mathtt{classification\\_accuracy}$ that takes two NumPy array arguments $\\mathtt{predicted\\_labels}$ and $\\mathtt{true\\_labels}$ and evaluates the ratio of labels that coincide in both arrays to the number of all labels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def classification_accuracy(estimated_labels, true_labels):\n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Test your function with the following unit tests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_estimated_labels = np.array([0,4,0,0,2,4,0,0,2,2,2,3,3,1,3,4,0,3,4,0,1,1,2,0,0,0,\\\n",
    "                                  1,4,4,4,3,0,4,2,4,4,4,2,2,1,4,3,2,2,1,1,4,3,3,0,4,3,\\\n",
    "                                  0,0,0,2,4,3,4,3,1,3,2,4,2,3,2,3,2,3,1,0,4,3,2,3,1,3,\\\n",
    "                                  4,1,3,1,4,0,4,4,1,2,3,1,1,4,3,1,3,0,2,0,0,1])\n",
    "test_true_labels = np.array([4,4,4,4,1,3,0,4,1,0,2,3,1,2,0,1,3,4,3,4,4,4,2,0,3,4,3,2,3,\\\n",
    "                             0,1,0,4,3,2,2,1,2,2,1,4,0,4,4,3,0,0,0,3,2,3,0,0,3,3,4,2,2,\\\n",
    "                             3,4,2,3,3,0,0,2,0,3,0,4,4,4,4,1,2,3,1,3,1,1,2,0,2,3,3,3,4,\\\n",
    "                             0,0,3,1,0,3,0,3,1,0,0,4,4])\n",
    "assert_array_almost_equal(\n",
    "    classification_accuracy(test_estimated_labels, test_true_labels), 0.26)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "4. After we have learned how to classify the data, we would like to learn how one can optimise a value $K$ of neighbours participating in the classification. This would be done by performing standard $K$-fold cross validation. Below we provide you with some functions discussed in previous assignments. These functions are used to perform $K$-fold cross validation and consecutive grid search for hyperparameter optimisation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def KFold_split(data_size, K):\n",
    "    np.random.seed(123456789)\n",
    "    indexes = np.random.permutation(data_size)\n",
    "    m, r = divmod(data_size, K)\n",
    "    indexes_split = [\n",
    "        indexes[i * m + min(i, r):(i + 1) * m + min(i + 1, r)]\n",
    "        for i in range(K)\n",
    "    ]\n",
    "    return indexes_split\n",
    "\n",
    "\n",
    "def KFold_cross_validation_knn(data_inputs, data_outputs, K, labels_evaluation,\n",
    "                           missclassification_evaluation,knn):\n",
    "    \n",
    "    data_size = len(data_inputs)\n",
    "    indexes_split = KFold_split(data_size, K)\n",
    "\n",
    "    average_accuracy = 0\n",
    "    for i in range(K):\n",
    "        training_indexes = np.concatenate([indexes_split[j] for j in range(K) if (j != i)])\n",
    "        \n",
    "        predicted_labels = labels_evaluation(data_inputs[indexes_split[i]],\n",
    "                                             data_inputs[training_indexes],\n",
    "                                             data_outputs[training_indexes],knn)\n",
    "        \n",
    "        accuracy = missclassification_evaluation(predicted_labels,\n",
    "                                              data_outputs[indexes_split[i]])\n",
    "        average_accuracy += accuracy / K\n",
    "        \n",
    "    error = 1.-average_accuracy\n",
    "    return error"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "5. We will also need a grid search function being implemented to find an optimal value of a number of neighbours to be included in the classification algorithm."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def grid_search(objective, grid):\n",
    "    values = np.array([])\n",
    "    for point in grid:\n",
    "        values = np.append(values, objective(point))\n",
    "    return grid[np.argmin(values)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "6. Now define a grid of $K$ (number of neighbours to be included in the classification) values to be $\\left[1,2,\\ldots,20\\right]$ and determine an optimal parameter $\\mathtt{no\\_of\\_neighbours}$ based on K-fold cross validation and grid search. Assign a variable $\\mathtt{optimal\\_no\\_of\\_neighbours}$ with your result. For $K$-fold cross validation use $K = 5$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "optimal_no_of_neighbours = 0\n",
    "optimal_classification_error = 0\n",
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Optimal number of neighbours one needs to consider is equal to {n}. \\nIn this case the prediction error would be as little as {e:2.2f}%.\"\\\n",
    "      .format(n = optimal_no_of_neighbours, e = 100*optimal_classification_error))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Test your answers with the unit tests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert_array_almost_equal(optimal_no_of_neighbours, 15)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert_array_almost_equal(optimal_classification_error, 0.0796)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "7. Visualise your results with your favourite visualisation tools, e.g. Pyplot from Matplotlib. You need to plot data points together with the decision areas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Binary logistic regression\n",
    "In the second part we will learn how to solve the binary classification problem for titanic data using logistic regression method. The data is titatic dataset:\n",
    "\n",
    "- $\\mathtt{Age}$ - age of the passenger;\n",
    "- $\\mathtt{Embarkation\\,\\,\\,port}$ - one of three ports (Southampton, Queenstown, Cherbourg) where the passenger has joined the trip;\n",
    "- $\\mathtt{Fare}$ - price the passenger paid for a ticket;\n",
    "- $\\mathtt{Parent/children}$ - number of parents/children of the passenger on board;\n",
    "- $\\mathtt{Ticket class}$ - one of three ticket classes (1,2,3);\n",
    "- $\\mathtt{Sex}$ - sex of the passenger;\n",
    "- $\\mathtt{Siblings/spouses}$ - number of siblings/spouses of the passenger on board;\n",
    "- $\\mathtt{Survival}$ - binary value representing whether the passenger has survived (1) or not (0).\n",
    "\n",
    "We start by loading data. Please pay attention to how we have dealt with the embarkation port column: instead of one column with categorical data we introduce three binary columns, each representing the information on whether the passenger joined at a specific port or not. **Important:** please check that files $\\mathtt{titanic.csv}$ is located in the same folder with your Jupyter notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "titanic_dataset_path = \"titanic.csv\"\n",
    "age = np.genfromtxt(titanic_dataset_path,\n",
    "                    delimiter=\",\",\n",
    "                    skip_header=1,\n",
    "                    usecols=[0]).astype(float)\n",
    "embarkation_port = np.genfromtxt(titanic_dataset_path, delimiter=\",\", \\\n",
    "                                 skip_header=1, usecols=[1], dtype = None, encoding=None)\n",
    "embarkation_port_S = np.where(embarkation_port == 'S', 1, 0).astype(int)\n",
    "embarkation_port_Q = np.where(embarkation_port == 'Q', 1, 0).astype(int)\n",
    "embarkation_port_C = np.where(embarkation_port == 'C', 1, 0).astype(int)\n",
    "fare = np.genfromtxt(titanic_dataset_path,\n",
    "                     delimiter=\",\",\n",
    "                     skip_header=1,\n",
    "                     usecols=[2]).astype(float)\n",
    "parent_children = np.genfromtxt(titanic_dataset_path,\n",
    "                                delimiter=\",\",\n",
    "                                skip_header=1,\n",
    "                                usecols=[3]).astype(int)\n",
    "ticket_class = np.genfromtxt(titanic_dataset_path,\n",
    "                             delimiter=\",\",\n",
    "                             skip_header=1,\n",
    "                             usecols=[4]).astype(int)\n",
    "sex = np.genfromtxt(titanic_dataset_path, delimiter=\",\", skip_header=1, usecols=[5],\\\n",
    "                    converters = {5:lambda x: 0 if x == b'male' else 1}).astype(int)\n",
    "siblings_spouses = np.genfromtxt(titanic_dataset_path,\n",
    "                                 delimiter=\",\",\n",
    "                                 skip_header=1,\n",
    "                                 usecols=[6]).astype(int)\n",
    "survival = np.genfromtxt(titanic_dataset_path,\n",
    "                         delimiter=\",\",\n",
    "                         skip_header=1,\n",
    "                         usecols=[7]).astype(int)\n",
    "titanic_input = np.c_[age,fare,parent_children,ticket_class,\\\n",
    "                      sex,siblings_spouses,embarkation_port_S,embarkation_port_Q,embarkation_port_C]\n",
    "titanic_labels = survival.reshape(-1, 1)\n",
    "titanic_input, titanic_row_of_means, titanic_row_of_stds = standardise(\n",
    "    titanic_input)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_titanic(data_input, data_output):\n",
    "    characteristics_names = \\\n",
    "    ['Age', 'Fare', 'ParentChildren', 'Class', 'Sex', 'SiblingsSpouses', 'Southampton', 'Queenstown','Cherbourg']\n",
    "    number_of_characteristics = data_input.shape[1]\n",
    "    titanic_de_standardised = de_standardise(data_input, titanic_row_of_means,\n",
    "                                             titanic_row_of_stds)\n",
    "    titanic_dataframe = pd.DataFrame(data=titanic_de_standardised,\n",
    "                                     columns=characteristics_names)\n",
    "    titanic_dataframe['Survival'] = data_output\n",
    "    sns.set_theme(style=\"ticks\", color_codes=True)\n",
    "    g = sns.PairGrid(titanic_dataframe,\n",
    "                     vars=titanic_dataframe[:-1],\n",
    "                     hue='Survival')\n",
    "    g.map_offdiag(sns.scatterplot)\n",
    "    g.map_diag(plt.hist)\n",
    "    plt.show()\n",
    "    plt.tight_layout"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_titanic(titanic_input, titanic_labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Implement function $\\mathtt{linear\\_regression\\_data}$ that outputs the linear regression data matrix defined as\n",
    "$$\n",
    "\\mathbf{\\Phi\\left(X\\right)} = \n",
    "\\begin{pmatrix}\n",
    "1 & x^{(1)}_1 & x^{(1)}_2 & \\ldots & x^{(1)}_d \\\\\n",
    "1 & x^{(2)}_1 & x^{(2)}_2 & \\ldots & x^{(2)}_d \\\\\n",
    "\\vdots & \\vdots & \\vdots & \\ddots & \\vdots & \\\\\n",
    "1 & x^{(s)}_1 & x^{(s)}_2 & \\ldots & x^{(s)}_d \\\\\n",
    "\\end{pmatrix}\n",
    "$$\n",
    "The function $\\mathtt{linear\\_regression\\_data}$  should take the NumPy array *data_inputs* as argument. Here, *data_inputs* is supposed to be a data matrix containing all inputs in the matrix form as follows\n",
    "$$\n",
    "\\mathbf{X} = \n",
    "\\begin{pmatrix}\n",
    "x^{(1)}_1 & x^{(1)}_2 & \\ldots & x^{(1)}_d \\\\\n",
    "x^{(2)}_1 & x^{(2)}_2 & \\ldots & x^{(2)}_d \\\\\n",
    "\\vdots & \\vdots & \\ddots & \\vdots & \\\\\n",
    "x^{(s)}_1 & x^{(s)}_2 & \\ldots & x^{(s)}_d \\\\\n",
    "\\end{pmatrix}.\n",
    "$$\n",
    "The function should output data matrix $\\mathbf{\\Phi\\left(X\\right)}$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def linear_regression_data(data_inputs):\n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. Implement a function $\\mathtt{model\\_function}$ that takes two arguments _data_matrix_ and _weights_ and outputs the values of the model function $f\\left(\\mathbf{x},\\mathbf{w}\\right) = \\left\\langle \\phi\\left(\\mathbf{x}\\right),\\mathbf{w} \\right\\rangle$ evaluated for all data samples $\\mathbf{x}_i$, $i=[1,\\ldots,s]$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def model_function(data_matrix, weights):\n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Test your function with the following unit tests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_data_matrix = np.array([[1, -1, 2, -3], [1, 0, 2, 4], [1, -3, 0, 5]])\n",
    "test_weights = np.array([[2], [0], [1], [-1]])\n",
    "assert_array_almost_equal(model_function(test_data_matrix, test_weights),\n",
    "                          np.array([[7], [0], [-3]]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3. Write a function $\\mathtt{logistic\\_function}$ that takes an argument named *inputs* and returns the output of the $\\sigma(x)$ sigmoid function, i.e.\n",
    "\\begin{equation*} \n",
    "\\sigma(x): = \\frac{1}{1+\\mathrm{e}^{-x}} \\, , \n",
    "\\end{equation*}\n",
    "applied to the *input*. Here $x$ is the mathematical notation for the argument inputs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def logistic_function(inputs):\n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Test your function with the following unit tests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_inputs = np.array([[0], [np.log(25)], [-6], [np.log(9)], [2]])\n",
    "assert_array_almost_equal(logistic_function(test_inputs), np.array([[1/2], [25/26], [0.0024726231566347743], \\\n",
    "                            [9/10], [(np.exp(2))/(1 + np.exp(2))]]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def binary_prediction_function(data_matrix, weights):\n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Test your function with the following unit tests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_data_matrix = np.array([[1, -1, 2, -3], [1, 0, 2, 4], [1, -3, 0, 5]])\n",
    "test_weights = np.array([[2], [0], [1], [-1]])\n",
    "assert_array_almost_equal(\n",
    "    binary_prediction_function(test_data_matrix, test_weights),\n",
    "    np.array([[1], [0], [0]]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "5. Implement a function $\\mathtt{gradient\\_descent}$ that performs gradient descent to numerically approximate a minimiser of a convex function. The function should take the following arguments\n",
    "- *objective* - a lambda-function representing function $E$. This itself should take a NumPy array as its argument and return a real number.\n",
    "- *gradient* - a lambda-function representing function $\\nabla E$. This itself should take a NumPy array as its argument and return a NumPy array representation of the gradient $\\nabla E$.\n",
    "- *initial_ weights* - a NumPy array with initial values $\\mathbf{w}^{(0)}$ for the first iterate \n",
    "- *step_size* - a step-size parameter $\\tau$ for the gradient descent step\n",
    "- *no_of_iterations* - an integer parameter that controls the number of iterations\n",
    "- *print_output* - an integer parameter that controls how often you are printing an intermediate result. If say *print_output = 100*, then after every 100th iteration you are asked to print your current iterate and a value of the objective as *Iteration k/m, objective = o.*, where $k$ is a number of current iteration, $m$ is a total number of iterations, and $o$ is a value of the objective evaluated at current iterate.\n",
    "\n",
    "Implement the function so that it returns a NumPy array of the weights obtained after gradient descent together with a list of objective values for all iterates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def gradient_descent(objective,\n",
    "                     gradient,\n",
    "                     initial_weights,\n",
    "                     step_size=1,\n",
    "                     no_of_iterations=100,\n",
    "                     print_output=10):\n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Test your function with the following unit tests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_matrix_m = np.array([[3, 1], [2, 4]])\n",
    "test_vector_v = np.array([5, 6])\n",
    "test_objective = lambda x: x.T @ (test_matrix_m @ x) + x @ test_vector_v\n",
    "test_gradient = lambda x: (test_matrix_m + test_matrix_m.T) @ x + test_vector_v\n",
    "test_initial_weights = np.array([0.0, 0.0])\n",
    "test_step_size = 0.9 / (np.linalg.norm(test_matrix_m + test_matrix_m.T))\n",
    "test_no_of_iterations = 100\n",
    "test_print_output = 10\n",
    "assert_array_almost_equal(gradient_descent(test_objective, test_gradient, \\\n",
    "                                           test_initial_weights,test_step_size,\\\n",
    "                                           test_no_of_iterations, test_print_output)[0],np.array([-0.564103, -0.538462]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "6. Write two functions that implement the cost function for binary logistic regression as well as its gradient, as defined below. \n",
    "$$\n",
    "\\mathrm{L}\\left(\\mathbf{w}\\right) = \\frac{1}{s} \\left(\n",
    "\\sum\\limits_{i=1}^s \\log\\left[1+\\exp\\left(f\\left(\\mathbf{x}^{(i)},\\mathbf{w}\\right)\\right)\\right] - y_i\\cdot f\\left(\\mathbf{x}^{(i)},\\mathbf{w}\\right)\n",
    "\\right),\n",
    "$$\n",
    "where $\\mathbf{x}^{(i)}$ is a vector representing $i$-th data sample and $f$ is a model function. In the case of linear model function $f\\left(\\mathbf{x},\\mathbf{w}\\right) = \\left\\langle \\phi\\left(\\mathbf{x}\\right),\\mathbf{w} \\right\\rangle$ one has\n",
    "$$\n",
    "\\nabla \\mathrm{L}\\left(\\mathbf{w}\\right) = \\frac{1}{s} \\left(\n",
    "\\sum\\limits_{i=1}^s \\phi\\left(\\mathbf{x}^{(i)}\\right)\\cdot\\sigma\n",
    "\\left(\\left\\langle \\phi\\left(\\mathbf{x}^{(i)}\\right),\\mathbf{w} \\right\\rangle \\right) - y_i\\cdot \\phi\\left(\\mathbf{x}^{(i)}\\right)\n",
    "\\right),\n",
    "$$\n",
    "with $\\phi\\left(\\mathbf{x}^{(i)}\\right)$ being an augmented $i$-th data vector containing additional coordinate $1$.\n",
    "\n",
    "The function for the cost function is named $\\mathtt{binary\\_logistic\\_regression\\_cost\\_function}$ and should take the NumPy arrays _data_matrix_, _data_labels_ and _weights_ as arguments. Subsequently, write a method $\\mathtt{binary\\_logistic\\_regression\\_gradient}$ that takes the same inputs as $\\mathtt{binary\\_logistic\\_regression\\_cost\\_function}$ and computes the gradient of the binary logistic regression cost function as defined in the lecture."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def binary_logistic_regression_cost_function(data_matrix, data_labels,\n",
    "                                             weights):\n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def binary_logistic_regression_gradient(data_matrix, data_labels, weights):\n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Test your function with the following unit tests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_data_matrix = np.array([[1, -1, 2, -3], [1, 0, 2, 4], [1, -3, 0, 5]])\n",
    "test_data_labels = np.array([[1], [0], [0]])\n",
    "test_weights = np.array([[2], [0], [1], [-1]])\n",
    "assert_array_almost_equal(\n",
    "    binary_logistic_regression_cost_function(test_data_matrix,\n",
    "                                             test_data_labels, test_weights),\n",
    "    0.24754867)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_data_matrix = np.array([[1, -1, 2, -3], [1, 0, 2, 4], [1, -3, 0, 5]])\n",
    "test_data_labels = np.array([[1], [0], [0]])\n",
    "test_weights = np.array([[2], [0], [1], [-1]])\n",
    "assert_array_almost_equal(binary_logistic_regression_gradient(test_data_matrix,test_data_labels,test_weights), \\\n",
    "                          np.array([[0.182172],[-0.047122],[0.332726], [0.746621]]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "7. Finally use all the above methods to evaluate predicted class labels (survival) for all Titanic passengers. Use all data to train your model and evaluate classification corresponding classification accuracy. You are expected to:\n",
    "- build a linear regression data matrix _titanic_data_matrix_ based on the _titanic_input_ data\n",
    "- evaluate optimal weights by running the gradient descent method: take initial weights to be zero and run $N = 2000$ iterations with a step-size parameter equal to $\\tau = 3.9\\cdot s \\cdot\\left\\|\\mathbf{\\Phi\\left(X\\right)}\\right\\|^{-2}$, where $s$ is a number of data samples, $\\mathbf{\\Phi\\left(X\\right)}$ is a mathematical representation of _data_matrix_. Assign a variable _titanic_optimal_weights_ with your result.\n",
    "- evaluate predicted class labels for all data samples. Store the result in _titanic_predicted_labels_ variable.\n",
    "- evaluate accuracy rate of your prediction (should be a value between 0 and 1) and store its value in a variable _titanic_accuracy_rate_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "titanic_optimal_weights = 0\n",
    "titanic_accuracy_rate = 0\n",
    "\n",
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Test your function with the following unit tests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert_array_almost_equal(titanic_optimal_weights, np.array([[-0.65370587], [-0.55872717],\\\n",
    "                                                             [0.09439554], [-0.0750632],  [-0.94173469],\\\n",
    "                                                             [1.29812358], [-0.38153153], [-0.08951093], \\\n",
    "                                                             [0.01944579],  [0.08813609]]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert_array_almost_equal(titanic_accuracy_rate, 0.8024691358024691)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"The optimal weights are w = {w}.T.\".format(w=titanic_optimal_weights.T))\n",
    "print(\"The classification accuracy for the training set is {p} %.\".format(\n",
    "    p=100 * titanic_accuracy_rate))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.17"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}