Optimising the Minkowski distance, part 3: removing redundant calculations

written in python, machine learning

In our previous blog post we discussed how to implement the Minkowski distance formula in a couple of functions which relied heavily on for loops. On our full data, this led to a processing time of over an hour. With some simple tricks in NumPy which exploit the properties of vectors and matrices, we were able to bring this down to around 5 minutes.

While this is not bad, we still have a couple more tricks up our sleeves to bring this time down further. In this blog post we’ll talk about cutting out redundant calculations to bring the processing time down around 5-fold. Let’s get started.

We’ll again read in the dry bean dataset and extract the same sample arrays we used to time our functions in the previous blog post.

import numpy as np
import pandas as pd
beans = pd.read_excel("data/Dry_Bean_Dataset.xlsx")

array_sample_1 = beans[["MajorAxisLength", "MinorAxisLength", "roundness"]][:100].to_numpy()
array_sample_2 = beans[["MajorAxisLength", "MinorAxisLength", "roundness"]].to_numpy()
array_sample_3 = beans.drop(columns=["Class"]).to_numpy()

Removing duplicate and self calculations

Something you have no doubt noticed in the previous blog post is that we were calculating every distance twice, as well as unnecessarily calculating the distance between each vector and itself. Removing these redundant calculations is likely to give us some extra speed.

Within linear algebra, we can divide square matrices (that is, matrices with the same number of rows and columns) into different areas. The area that runs down the diagonal is called, unsurprisingly, the diagonal. The area above the diagonal is called the upper triangle, and the area below the diagonal is called the lower triangle. In our matrix of returned results, we are getting our distance metrics in the upper and lower triangles of the distance matrix, and the distances between vectors and themselves on the diagonal. Let’s have a look at this by recreating the distances matrix with our three example vectors, \(X\), \(Y\) and \(Z\).

def calculate_minkowski_distance_vectorised(array1: np.ndarray,
                                            array2: np.ndarray,
                                            p: int
                                            ) -> np.ndarray:
    """
    Generalised formula for calculating both Manhattan and Euclidean distances.
    Calculates pairwise distances between every point in two n-dimensional arrays.
    array1: first set of points;
    array2: second set of points;
    p: power parameter which determines the distance metric used, with 1 = Manhattan and 2 = Euclidean.
    """

    # Broadcast so we get the difference between every row of array1 and every row of array2
    diffs = array1[:, None, :] - array2[None, :, :]
    abs_diffs = np.power(np.abs(diffs), p)
    # Sum over the feature axis and take the p-th root
    return abs_diffs.sum(axis=-1) ** (1 / p)
X = np.array([5, 7, 3, 9])
Y = np.array([1, 6, 2, 4])
Z = np.array([8, 8, 3, 1])
base_matrix = np.array([X, Y, Z])

calculate_minkowski_distance_vectorised(base_matrix, base_matrix, 2)
array([[0.        , 6.55743852, 8.60232527],
       [6.55743852, 0.        , 7.93725393],
       [8.60232527, 7.93725393, 0.        ]])
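
To make these regions concrete, here is a quick illustrative aside (not part of the distance functions themselves): NumPy’s triu function zeroes out everything outside the upper triangle, and passing k = 1 also drops the diagonal of self-distances.

distance_matrix = calculate_minkowski_distance_vectorised(base_matrix, base_matrix, 2)

# Keep only the upper triangle; k = 1 also zeroes the diagonal of self-distances
np.triu(distance_matrix, k = 1)
array([[0.        , 6.55743852, 8.60232527],
       [0.        , 0.        , 7.93725393],
       [0.        , 0.        , 0.        ]])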

Looking at this matrix, if we only return the results in either the upper or lower triangle, we will cut the number of calculations we need to do by more than half. We can do this using a NumPy function called triu_indices (the one for the lower triangle is called tril_indices). As an aside, thanks to this wonderful Medium post by Shailesh Kumar for much of the below code.

In order to use this function, we first need to find out the number of rows our matrix has. We can then pass this value into triu_indices, along with the argument k = 1, to get the row and column indices of the upper-triangle entries of our distance matrix.

n = base_matrix.shape[0]
row_indices, column_indices = np.triu_indices(n, k = 1)
pd.DataFrame({
    "row_indices": row_indices,
    "column_indices": column_indices
})
   row_indices  column_indices
0            0               1
1            0               2
2            1               2

What this is essentially telling us is that in order to calculate the upper triangle of the distance matrix, we need to calculate the distance between vectors 0 and 1, vectors 0 and 2, and vectors 1 and 2. We can exploit this to split our base_matrix into two new matrices: one whose rows are the vectors at the “row” indices, and another whose rows are the vectors at the “column” indices.

base_matrix[row_indices]
array([[5, 7, 3, 9],
       [5, 7, 3, 9],
       [1, 6, 2, 4]])
base_matrix[column_indices]
array([[1, 6, 2, 4],
       [8, 8, 3, 1],
       [8, 8, 3, 1]])

You can see that the first matrix has rows corresponding to vectors 0, 0 and 1, and the second matrix has rows corresponding to vectors 1, 2 and 2. As they are the same size, we can simply subtract them, ending up with a matrix containing the elementwise differences between vectors 0 and 1, vectors 0 and 2, and vectors 1 and 2.

base_matrix[row_indices] - base_matrix[column_indices]
array([[ 4,  1,  1,  5],
       [-3, -1,  0,  8],
       [-7, -2, -1,  3]])

You’ll likely recognise these values as the three difference vectors we’ve calculated on this matrix in the previous blog posts. We’ve now significantly simplified the most expensive step in our distance metric calculation, and can move on to applying the rest of the calculations to get the distance metrics. In order to be able to trace each distance back to the rows it compares, I bundled the distances into a Pandas DataFrame and included the row and column indices as additional columns.

def calculate_minkowski_distance_upper_triangle(s_array: np.ndarray,
                                                p: int
                                                ) -> pd.DataFrame:
    """
    Calculates pairwise Minkowski distances for the upper triangle only,
    returning a DataFrame with the row indices being compared.
    """
    n = s_array.shape[0]
    # Row and column indices of the upper triangle, excluding the diagonal
    row_indices, column_indices = np.triu_indices(n, k = 1)
    diffs = s_array[row_indices] - s_array[column_indices]
    abs_diffs = np.power(np.abs(diffs), p)
    return pd.DataFrame({
        "row1": row_indices,
        "row2": column_indices,
        "distance": abs_diffs.sum(axis=-1) ** (1 / p)
    })
calculate_minkowski_distance_upper_triangle(base_matrix, 2)
   row1  row2  distance
0     0     1  6.557439
1     0     2  8.602325
2     1     2  7.937254
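
As a quick sanity check (purely illustrative), these values line up exactly with the upper triangle of the full distance matrix we calculated earlier:

full_matrix = calculate_minkowski_distance_vectorised(base_matrix, base_matrix, 2)
triangle = calculate_minkowski_distance_upper_triangle(base_matrix, 2)

# Compare the new distances against the upper-triangle entries of the full matrix
np.allclose(full_matrix[np.triu_indices(base_matrix.shape[0], k = 1)], triangle["distance"])
True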

As you can see, we’re getting the exact same distance values as when calculating previously. Let’s now time it with our samples from the beans dataset.

%time triu_min_1 = calculate_minkowski_distance_upper_triangle(array_sample_1, 2)
CPU times: user 1.68 ms, sys: 1.09 ms, total: 2.76 ms
Wall time: 1.69 ms
%time triu_min_2 = calculate_minkowski_distance_upper_triangle(array_sample_2, 2)
CPU times: user 11.8 s, sys: 3.06 s, total: 14.8 s
Wall time: 14.9 s
%time triu_min_3 = calculate_minkowski_distance_upper_triangle(array_sample_3, 2)
CPU times: user 54.5 s, sys: 50.3 s, total: 1min 44s
Wall time: 2min 21s

Impressive! Although this change hasn’t made much of a difference for the already quite fast second sample, it has more than halved the calculation time for the third sample. Let’s end by exploring one final optimisation that we can make.

Splitting the distance calculations

One remaining inefficiency in our code is that, under the hood, NumPy still loops over the elements of each vector to calculate both the absolute values and the powers. This means that we inevitably incur a time penalty when running these operations over larger vectors versus smaller ones. While we can’t avoid this looping, we can take advantage of some mathematical properties of the powers we’re working with.

In the previous blog post when calculating the Manhattan distance, I pointed out that raising the difference calculations, as well as the final distance metric, to the power of 1 was a redundant operation as it just returns the same value. This means that when we’re applying our Minkowski distance function with p = 1, we’re wasting processing by applying the power function. Similarly, any number raised to the power of 2 automatically becomes non-negative (e.g., \((-2)^2 = 2^2 = 4\)). Therefore, when we use our Minkowski distance function with p = 2, it is pointless to use the abs function prior to squaring our values. We can therefore do one final optimisation by splitting our Minkowski function into separate Manhattan and Euclidean distance functions and removing any redundant operations.
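
Before we do, here is a quick throwaway check of both identities on an arbitrary array (purely illustrative):

values = np.array([-2, 3, -4])

# Raising an absolute value to the power of 1 changes nothing
print(np.array_equal(np.power(np.abs(values), 1), np.abs(values)))
# Squaring already removes the sign, so abs is redundant
print(np.array_equal(np.power(values, 2), np.power(np.abs(values), 2)))
True
True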

def calculate_manhattan_distance(s_array: np.ndarray) -> pd.DataFrame:
    """Manhattan (p = 1) distances for the upper triangle only."""
    n = s_array.shape[0]
    i, j = np.triu_indices(n, k = 1)
    diffs = s_array[i] - s_array[j]
    # No power needed: |x| ** 1 is just |x|
    abs_diffs = np.abs(diffs)
    return pd.DataFrame({
        "row1": i,
        "row2": j,
        "distance": abs_diffs.sum(axis=-1)
    })

def calculate_euclidean_distance(s_array: np.ndarray) -> pd.DataFrame:
    """Euclidean (p = 2) distances for the upper triangle only."""
    n = s_array.shape[0]
    i, j = np.triu_indices(n, k = 1)
    diffs = s_array[i] - s_array[j]
    # No abs needed: squaring already removes the sign
    sq_diffs = np.power(diffs, 2)
    return pd.DataFrame({
        "row1": i,
        "row2": j,
        "distance": sq_diffs.sum(axis=-1) ** (1 / 2)
    })
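
Before timing them, here is a quick, purely illustrative check that the two specialised functions still agree with the general Minkowski function on our small example matrix:

minkowski_p1 = calculate_minkowski_distance_upper_triangle(base_matrix, 1)
minkowski_p2 = calculate_minkowski_distance_upper_triangle(base_matrix, 2)

# Both specialised functions should return the same distances as the general version
print(np.allclose(calculate_manhattan_distance(base_matrix)["distance"], minkowski_p1["distance"]))
print(np.allclose(calculate_euclidean_distance(base_matrix)["distance"], minkowski_p2["distance"]))
True
True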

This time, let’s see how long it takes both of these functions to run over our samples.

%time vect_manhattan_1 = calculate_manhattan_distance(array_sample_1)
CPU times: user 1.37 ms, sys: 1.02 ms, total: 2.39 ms
Wall time: 1.58 ms
%time vect_manhattan_2 = calculate_manhattan_distance(array_sample_2)
CPU times: user 5.94 s, sys: 2.37 s, total: 8.31 s
Wall time: 8.33 s
%time vect_manhattan_3 = calculate_manhattan_distance(array_sample_3)
CPU times: user 20.2 s, sys: 32.8 s, total: 52.9 s
Wall time: 1min 12s

For the Manhattan distance, we’ve managed to almost halve the computational time. Now let’s check the Euclidean distance.

%time vect_euclidean_1 = calculate_euclidean_distance(array_sample_1)
CPU times: user 1.89 ms, sys: 1.14 ms, total: 3.03 ms
Wall time: 1.81 ms
%time vect_euclidean_2 = calculate_euclidean_distance(array_sample_2)
CPU times: user 11.3 s, sys: 2.38 s, total: 13.6 s
Wall time: 13.7 s
%time vect_euclidean_3 = calculate_euclidean_distance(array_sample_3)
CPU times: user 49.3 s, sys: 32.6 s, total: 1min 21s
Wall time: 1min 37s

Bam! With this final optimisation we get the calculation of the Manhattan distance down to around a minute, and the Euclidean distance down to around a minute and a half, on our largest sample. The Manhattan distance calculation is slightly faster because splitting the functions let us drop two power calculations: one over the differences and another over the final distances. Comparing this to where we started, our calculation of the exact same distance metrics is now between 50 and 120 times faster!

I hope this series has given you some idea of how vectorisation works, and the confidence to know when and how to use it. With just a small amount of linear algebra knowledge and a few small changes to your code, you could be seeing significant performance gains without ever having to leave Python. Happy optimising!