I have several processes, each completing tasks that require a single large numpy array. The array is only ever read (each process searches it for appropriate values).
If each process loads the data I receive a memory error.
I am therefore trying to minimise the memory usage by using a Manager to share the same array between the processes.
However, I still receive a memory error. I <strong>can load the array</strong> once in the main process, but the moment I try to make it an <strong>attribute</strong> of the manager namespace I receive a <strong>memory error</strong>. I assumed Managers acted like pointers, allowing separate processes (which normally only have access to their own memory) to access the shared memory as well. However, the error mentions pickling:<pre class="lang-none prettyprint-override">
Traceback (most recent call last):
  File <PATH>, line 63, in <module>
    ns.pp = something
  File "C:\Program Files (x86)\Python35-32\lib\multiprocessing\managers.py", line 1021, in __setattr__
    return callmethod('__setattr__', (key, value))
  File "C:\Program Files (x86)\Python35-32\lib\multiprocessing\managers.py", line 716, in _callmethod
    conn.send((self._id, methodname, args, kwds))
  File "C:\Program Files (x86)\Python35-32\lib\multiprocessing\connection.py", line 206, in send
    self._send_bytes(ForkingPickler.dumps(obj))
  File "C:\Program Files (x86)\Python35-32\lib\multiprocessing\reduction.py", line 50, in dumps
    cls(buf, protocol).dump(obj)
MemoryError
</pre>
I assume the numpy array is actually being copied when assigned to the manager, but I may be wrong.
To make matters a little more irritating, I am on a machine with 32GB of memory, and watching the memory usage it only increases a little before crashing, by maybe 5%-10% at most.
Could someone explain <strong>why making the array an attribute of the namespace takes up even more memory?</strong> and <strong>why my program won't use some of the spare memory available?</strong> (I have already read the <a href="https://docs.python.org/3.6/tutorial/classes.html#python-scopes-and-namespaces" rel="nofollow">namespace</a> and <a href="https://docs.python.org/3.6/library/multiprocessing.html#multiprocessing.managers" rel="nofollow">manager</a> docs, as well as these <a href="https://stackoverflow.com/questions/22487296/multiprocessing-in-python-sharing-large-object-e-g-pandas-dataframe-between" rel="nofollow">managers</a> and <a href="https://stackoverflow.com/questions/3913217/what-are-python-namespaces-all-about" rel="nofollow">namespace</a> threads on SO.)
I am running Windows Server 2012 R2 and Python 3.5.2 32bit.
Here is some code demonstrating my problem (you will need to substitute your own file for
large.txt; mine is ~75MB of tab-delimited strings):
<pre class="lang-py prettyprint-override">
import multiprocessing
import numpy as np

if __name__ == '__main__':
    # load Price Paid Data and assign to manager
    mgr = multiprocessing.Manager()
    ns = mgr.Namespace()
    ns.data = np.genfromtxt('large.txt')
    # Alternative proving this works for smaller objects
    # ns.data = 'Test PP data'
</pre>

Answer 1:
Manager types are built for flexibility, not efficiency. They create a server process that holds the values and can return proxy objects to each process that needs them. The server and proxy communicate over a connection (a pipe or socket), which even allows them to be on different machines, but this necessarily means pickling and copying whatever object is in question. I haven't traced the source all the way, so it's possible the extra copy may be garbage collected after use, but at least initially there has to be a copy.
If you want shared physical memory, I suggest using <a href="https://docs.python.org/3.6/library/multiprocessing.html#shared-ctypes-objects" rel="nofollow">Shared ctypes Objects</a>. These actually do point to a common location in memory, and are therefore much faster and resource-light. They do not support all the same things full-fat Python objects do, but they can be extended by creating <a href="https://docs.python.org/3.6/library/ctypes.html#ctypes.Structure" rel="nofollow">structs</a> to organize your data.