32647

# High performance calculation of least squares difference from all possible combinations (n lists)

I'm looking for a very efficient way to calculate all possible combinations from n lists and then keep the combination with the smallest least squares difference.

I already have a code that does it, but when it gets to several million combinations things get slow.

<strong>candidates_len</strong> contains a list of lists with lengths i.e. [[500, 490, 510, 600][300, 490, 520][305, 497, 515]] <strong>candidates_name</strong> contains a list of lists with names i.e. [['a', 'b', 'c', 'd']['mi', 'mu', 'ma']['pi', 'pu', 'pa']]

Both lists have <strong>n</strong> lists.

```# Creating the possible combinations and store the lists of lengths in vector r r=[[]] for x in candidates_len: r = [ i + [y] for y in x for i in r ] #Storing the names of the combinations and store the lists of identifiers in vector z z=[[]] for x in candidates_name: z = [ i + [y] for y in x for i in z ] #Calculating distances and storing the minimum one min_index = 0 min_dist = 0 list_best = [] for index, item in enumerate(r): n = 0 dist = 0 while n < len(candidates_len): for i in range(n,len(candidates_len)): dist = dist + (item[n]-item[i])**2 n=n+1 if index==0: min_dist = dist min_index = index list_best.append(item) elif dist < min_dist: min_dist = dist min_index = index list_best = [] list_best.append(z[index]) least_combination = min_index ```

A hard case: http://pastebin.com/BkVQTQWK

Here are some test times. Would be good to get under a minute or so. I don't know if it's possible though.

```combinations time(s) 77760 1.255663 41184 1.580333 69120 6.214786 8960 1.131834 537600 14.855361 89100 1.264126 16384 3.247404 4199040 666.853284 226800 3.935878 9289728 679.064149 ```

My first thought here is that you're spending an awful lot of time building up lists that you don't need. At the very least, scrapping them would make things a lot simpler, but no guarantees it would actually make things faster:

```r = itertools.product(*candidates_len) z = itertools.product(*candidates_name) min_dist = None for item, names in itertools.izip(r, z): dist = 0 for n in range(len(item)): for i in range(n, len(item)): dist += (item[n]-item[i])**2 if min_dist is None or dist < min_dist: min_dist = dist best = item, names print(best) ```

With your test data, the explicit lists are taking up gigabytes of memory. I'm not sure how much—my poor 4GB laptop went into swap thrash hell before it even finished generating the `z` list, and everything slowed to a crawl. The whole operation takes less time with `itertools` than just the setup part without it… That probably isn't true on a machine with, say, 16GB of RAM, but still, why use the memory if you don't need it?

My next thought is that all you're doing is calculating the LSD on a bunch of arrays. Do you have a huge array of tiny arrays? If so, can you de-jagged them (e.g., filling them out with None) and `numpy` the whole thing? On the other hand, if it's an array of big arrays, you probably want a list (or, as above, an iterator) of `numpy` arrays, so at least you can vectorize one dimension.

Either way, vectorization is the key to optimizing anything that involves simple operations over big arrays, and `numpy` will often do vectorization better than anything an expert C++-and-Fortran-and-platform-specific-assembly coder is likely to do manually.

Without either thinking through the code or trying to understand the algorithm at a deep level, my first attempt would be to generate `r` as a sequence (as in my code above) but of `numpy` row vectors (something like `matrix(x, dtype=int) for x in itertools.product(*candidates_len)`). Then you can calculate the differences on each `item` by just `item - item.T`, then sum up the squares of the lower triangle (which I'd have to look up to figure out how to do). You could probably then further improve the performance by figuring out a way to only calculate the lower triangle in the first place. The typical trick to this is figuring out how to get the lower-triangle sum into the diagonal as part of the vectorized operation, and then you just extract the diagonal, but that isn't always appropriate. See the broadcasting docs for some ideas on how to vectorize the inner loop without creating explicit matrices. Finally, see if there's a way you can create a 3D array out of the whole thing (which may mean padding the individual items out to a fixed width) and then vectorizing the whole operation. (The memory use won't be nearly as bad, as `numpy` only needs to allocate 4 bytes for each value instead of a whole `PyObject`… but it may still be bad enough that you lose more than you gain.) Sorry if this is all a little vague, but hopefully it's enough to get you started on the experimentation.

Another thought is that you can probably parallelize this. Any machine with enough memory to handle that massive list, I'm willing to bet it has at least 4 cores on it. And you've got a long sequence of completely independent operations, which is the easiest thing in the world to parallelize. As a first step, create a `multiprocessing.Pool`, and make the outer loop submit jobs to the pool instead of doing the work directly. You'll probably find that the jobs are too small, so you're drowning in overhead, but then you can always batch up each N items (either explicitly, or see the `grouper` recipe in the `itertools` docs), and make the job be "loop over these N items, and return the one with the min LSD". (It may take some tweaking to find the optimal N.) You can even do this together with the top-level `numpy`, by splitting the giant array up into chunks along the x axis and farming those out as jobs.

One more thought: Your algorithm is starting with a product of N*M, each element of which is of length N. Then, for each element, you loop over it twice. So, the best possible performance is going to be O(N^3*M). Is this really the right algorithm? If so, are you actually getting N^3*M performance from your algorithm? If the answer to either question is no, you shouldn't be trying to micro-optimize it. Only when you've actually got the most efficient algorithm, coded properly, is it worth doing things like vectorizing, avoiding excess work, moving tight loops into C++ and Fortran, etc. Otherwise, you're just going to come back and say "But it still blows up when I go to 4x as big as my last test run".