In my continuing quest to improve my scores for the discrete optimization class, I stumbled upon a data structure called ‘k-d tree’ (short for k-dimensional tree). Supposedly it is a really good data structure for spatial data, allowing for fast nearest neighbor searches. The scipy.spatial library has an object called KDTree and cKDTree, both of which are implementations of the k-d tree data structure. The primary difference between the two is that KDTree is implemented in python, whereas cKDTree is implemented in Cython.

I wrote up a quick test case to get an idea of the performance improvements provided by the special data structures. For the baseline, I wrote a function that simply loops through a set of coordinates and grabs the nearest neighbor to a specified set of coordinates by calculating and comparing the distances of the specified set of coordinates with every other set. In the second test case, I used the scipy.spatial KDTree object to store the coordinates, and then used the query method to get the nearest neighbor. The third method was identical to the KDTree method except I used the cKDTree object.

I created a random set of coordinates and then ran each function 1000 times to get an idea of the speed of each method. The code is posted on gist. The results are below:

Method 1 (using for loops) time: 2.42 Method 2 (using kdtree) time: 0.37 Method 3 (using ckdtree) time: 0.12

As you can see from the results, both of the k-d tree methods were significantly faster than using a `for loop`

. Additionally, the cKDTree implementation was fairly faster than the KDTree function. One issue I ran into was an overflow error on the KDTree implementation when using a list of coordinates with more than 100,000 points. Due to this issue, and because the use of cKDTree is nearly identical to KDTree (and faster), I would recommend using cKDTree. I am going to throw this into my latest TSP solver and see if I an get some performance gains. I’ll let you know how it goes.