High performance calculation of least squares difference from all possible combinations (n lists)

I'm looking for a very efficient way to calculate all possible combinations from n lists and then keep the combination with the smallest least squares difference.

I already have code that does this, but once it reaches several million combinations things get slow.

**candidates_len** is a list of lists of lengths, e.g. `[[500, 490, 510, 600], [300, 490, 520], [305, 497, 515]]`. **candidates_name** is a list of lists of names, e.g. `[['a', 'b', 'c', 'd'], ['mi', 'mu', 'ma'], ['pi', 'pu', 'pa']]`.

Both contain **n** sublists.

```python
# Create the possible combinations and store the lists of lengths in r
r = [[]]
for x in candidates_len:
    r = [i + [y] for y in x for i in r]

# Do the same for the names and store the combinations of identifiers in z
z = [[]]
for x in candidates_name:
    z = [i + [y] for y in x for i in z]

# Calculate the distance of each combination and keep the minimum
min_index = 0
min_dist = 0
list_best = []
for index, item in enumerate(r):
    n = 0
    dist = 0
    while n < len(candidates_len):
        for i in range(n, len(candidates_len)):
            dist = dist + (item[n] - item[i]) ** 2
        n = n + 1
    if index == 0:
        min_dist = dist
        min_index = index
        list_best.append(z[index])
    elif dist < min_dist:
        min_dist = dist
        min_index = index
        list_best = []
        list_best.append(z[index])
least_combination = min_index
```

A hard case: http://pastebin.com/BkVQTQWK

Here are some test times. It would be good to get under a minute or so, though I don't know if that's possible.

    combinations    time (s)
    77760           1.255663
    41184           1.580333
    69120           6.214786
    8960            1.131834
    537600          14.855361
    89100           1.264126
    16384           3.247404
    4199040         666.853284
    226800          3.935878
    9289728         679.064149


My first thought here is that you're spending an awful lot of time building up lists that you don't need. At the very least, scrapping them would make things a lot simpler, but no guarantees it would actually make things faster:

```python
import itertools

r = itertools.product(*candidates_len)
z = itertools.product(*candidates_name)
min_dist = None
for item, names in itertools.izip(r, z):  # izip is Python 2; use zip in Python 3
    dist = 0
    for n in range(len(item)):
        for i in range(n, len(item)):
            dist += (item[n] - item[i]) ** 2
    if min_dist is None or dist < min_dist:
        min_dist = dist
        best = item, names
print(best)
```

With your test data, the explicit lists take up gigabytes of memory. I'm not sure how much: my poor 4GB laptop went into swap-thrash hell before it even finished generating the z list, and everything slowed to a crawl. With itertools, the whole operation takes less time than the list setup alone takes without it. That probably isn't true on a machine with, say, 16GB of RAM, but still, why use the memory if you don't need it?

My next thought is that all you're doing is calculating the LSD on a bunch of arrays. Do you have a huge array of tiny arrays? If so, can you de-jag them (e.g., by padding them out to a fixed width) and numpy the whole thing? On the other hand, if it's an array of big arrays, you probably want a list (or, as above, an iterator) of numpy arrays, so at least you can vectorize one dimension.
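A minimal sketch of the "numpy the whole thing" idea, assuming the combinations fit in memory as one rectangular 2D array (the names `combos` and `lsd` here are mine, and the data is a toy example):

```python
import numpy as np

# One row per combination, one column per list (toy data, already rectangular).
combos = np.array([[500, 300, 305],
                   [490, 490, 497],
                   [600, 520, 515]])

# All pairwise differences within each row via broadcasting: shape (rows, n, n).
diffs = combos[:, :, None] - combos[:, None, :]

# The full difference matrix counts each unordered pair twice, so halve the sum.
lsd = (diffs ** 2).sum(axis=(1, 2)) // 2

best = lsd.argmin()
```

The intermediate `diffs` array is rows × n × n, so this trades memory for speed; for millions of rows you'd apply it to chunks rather than the whole product at once.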

Either way, vectorization is the key to optimizing anything that involves simple operations over big arrays, and numpy will often do vectorization better than anything an expert C++-and-Fortran-and-platform-specific-assembly coder is likely to do manually.

Without either thinking through the code or trying to understand the algorithm at a deep level, my first attempt would be to generate r as a sequence (as in my code above) but of numpy row vectors (something like `matrix(x, dtype=int) for x in itertools.product(*candidates_len)`). Then you can calculate the differences on each item with just `item - item.T`, then sum up the squares of the lower triangle (which I'd have to look up to figure out how to do). You could probably then further improve the performance by figuring out a way to only calculate the lower triangle in the first place. The typical trick is figuring out how to get the lower-triangle sum into the diagonal as part of the vectorized operation, and then you just extract the diagonal, but that isn't always appropriate. See the broadcasting docs for some ideas on how to vectorize the inner loop without creating explicit matrices.

Finally, see if there's a way you can create a 3D array out of the whole thing (which may mean padding the individual items out to a fixed width) and then vectorizing the whole operation. The memory use won't be nearly as bad, as numpy only needs to allocate 4 bytes for each value instead of a whole PyObject… but it may still be bad enough that you lose more than you gain.

Sorry if this is all a little vague, but hopefully it's enough to get you started on the experimentation.
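A rough sketch of that per-item approach, using plain broadcasting instead of `matrix`, and halving the symmetric sum instead of extracting the lower triangle (`candidates_len` here is toy data, and `lsd` is a name I'm making up):

```python
import itertools
import numpy as np

def lsd(item):
    v = np.asarray(item, dtype=np.int64)
    d = v[:, None] - v[None, :]       # full n x n difference matrix
    return int((d ** 2).sum()) // 2   # symmetric, so each pair is counted twice

candidates_len = [[500, 490], [300, 490]]   # toy data
best = min(itertools.product(*candidates_len), key=lsd)
```

For tiny items the numpy call overhead can outweigh the vectorization win, which is why batching items into a 2D array (as above) is usually the bigger payoff.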

Another thought is that you can probably parallelize this. Any machine with enough memory to handle that massive list almost certainly has at least 4 cores. And you've got a long sequence of completely independent operations, which is the easiest thing in the world to parallelize. As a first step, create a multiprocessing.Pool, and make the outer loop submit jobs to the pool instead of doing the work directly. You'll probably find that the jobs are too small, so you're drowning in overhead, but then you can always batch up each N items (either explicitly, or see the grouper recipe in the itertools docs), and make the job "loop over these N items, and return the one with the min LSD". (It may take some tweaking to find the optimal N.) You can even do this together with the top-level numpy, by splitting the giant array up into chunks along the x axis and farming those out as jobs.
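A sketch of that batched-Pool idea (Python 3; `batch_size`, `workers`, and the helper names are placeholders of mine to tune and rename):

```python
import itertools
from multiprocessing import Pool

def lsd(item):
    # sum of squared pairwise differences for one combination
    return sum((item[n] - item[i]) ** 2
               for n in range(len(item))
               for i in range(n + 1, len(item)))

def best_in_batch(batch):
    # each job scans one batch and returns its local (dist, item) minimum
    return min((lsd(item), item) for item in batch)

def batches(iterable, size):
    # lazily slice an iterator into lists of `size` items
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk

def parallel_best(candidates_len, batch_size=10000, workers=4):
    combos = itertools.product(*candidates_len)
    with Pool(workers) as pool:
        return min(pool.imap_unordered(best_in_batch, batches(combos, batch_size)))
```

`imap_unordered` keeps memory bounded: only a few batches are in flight at once, and the global minimum of the per-batch minima is the overall winner.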

One more thought: Your algorithm starts with the product of N lists of roughly M candidates each, i.e. M^N combinations, each of length N. Then, for each combination, you loop over it twice, which is O(N^2) work. So, the best possible performance is going to be O(N^2 * M^N). Is this really the right algorithm? If so, are you actually getting that performance from your code? If the answer to either question is no, you shouldn't be trying to micro-optimize it. Only when you've actually got the most efficient algorithm, coded properly, is it worth doing things like vectorizing, avoiding excess work, moving tight loops into C++ and Fortran, etc. Otherwise, you're just going to come back and say "But it still blows up when I go to 4x as big as my last test run".
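One concrete example of an algorithmic (rather than micro-) optimization: the sum of squared pairwise differences has a closed form, `n * sum(x_i**2) - (sum(x_i))**2`, which you get by expanding the squares; it turns the O(N^2) inner double loop into O(N) per combination. A sketch (both function names are mine):

```python
def lsd_pairwise(x):
    # the original double loop: O(n^2) per combination
    return sum((x[i] - x[j]) ** 2
               for i in range(len(x))
               for j in range(i + 1, len(x)))

def lsd_closed_form(x):
    # n * sum(x_i^2) - (sum(x_i))^2, same value in O(n)
    n, s = len(x), sum(x)
    return n * sum(v * v for v in x) - s * s
```

This doesn't change the M^N factor, but it shrinks the per-combination cost, and it composes with everything above (numpy, batching, multiprocessing).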


The first thing I would do is put as much of this as you can into numpy arrays. Array-based operations in numpy execute at more or less C speed. It looks like much of this could be done in numpy to begin with…

If that doesn't get your blood flowing, then I would profile the code and write a Cython function for the bottleneck. Assuming you can put static types on the lists/arrays, Cython is probably your best bet if you want to stay in the Pythonic world. I've personally seen 100x speedups for some bottlenecks using Cython.

Here's an example of an image convolution with Cython from their docs.

