I developed a model in Keras and trained it quite a few times. At one point I forcefully stopped a training run, and since then I have been getting the following error:
Traceback (most recent call last):
  File "inception_resnet.py", line 246, in <module>
    callbacks=[checkpoint, saveEpochNumber]) ##
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/engine/training.py", line 2042, in fit_generator
    class_weight=class_weight)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/engine/training.py", line 1762, in train_on_batch
    outputs = self.train_function(ins)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2270, in __call__
    session = get_session()
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 163, in get_session
    _SESSION = tf.Session(config=config)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1486, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 621, in __init__
    self._session = tf_session.TF_NewDeprecatedSession(opts, status)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.
So the actual error is:<blockquote>
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.</blockquote>
Most probably the GPU memory is still occupied; I can't even create a simple TensorFlow session.
I have seen an answer <a href="https://stackoverflow.com/a/44775164/5254777" rel="nofollow">here</a>, but when I execute the command suggested there in the terminal, training of the model starts without GPU acceleration.
Also, since I am training the model on a server where I have no root access, I can't restart the server or clear the GPU memory as root. What is the solution?

Answer1:
I found the solution in a comment on <a href="https://stackoverflow.com/questions/15197286/how-can-i-flush-gpu-memory-using-cuda-physical-reset-is-unavailable" rel="nofollow">this question</a>.
The command given there lists all the processes (and their PIDs) occupying GPU memory. I killed them one by one using
kill -9 PID
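The same lookup can be scripted from Python. This is a minimal sketch, assuming nvidia-smi is on the PATH; the helper names are my own, not from the answer above:

```python
import subprocess

def parse_pids(csv_output):
    """Extract integer PIDs from the output of
    `nvidia-smi --query-compute-apps=pid --format=csv,noheader`."""
    return [int(line) for line in csv_output.splitlines() if line.strip()]

def gpu_pids():
    """Ask nvidia-smi which processes currently hold GPU memory."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"],
        universal_newlines=True,  # text mode; works on Python 3.5
    )
    return parse_pids(out)
```

Each returned PID can then be terminated with kill -9 from the shell (or os.kill); no root access is needed as long as the processes belong to your own user.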
Now everything is running smoothly again.

Answer2:
I am using Anaconda 4.5.12 with Python 3.5 and NVIDIA driver 390.116, and I faced the same issue. In my case it was caused by an incompatible cudatoolkit version:
conda install tensorflow-gpu
installed cudatoolkit 9.3.0 with cudnn 7.3.x. However, after going through the answers <a href="https://github.com/tensorflow/tensorflow/issues/9549" rel="nofollow">here</a>, and comparing with my other virtual environment where I use PyTorch on the GPU without any problem, I inferred that cudatoolkit 9.0.0 would be compatible with my driver version. Running
conda install cudatoolkit==9.0.0
installed cudatoolkit 9.0.0 and cudnn 7.3.0 from the cuda 9.0_0 build. After this I was able to create a TensorFlow session with the GPU.
Now coming to the options for killing jobs:<ul><li>If your GPU memory is occupied by other jobs, killing them one by one as suggested by @Preetam saha arko will free up the GPU and may allow you to create a tf.Session with the GPU (provided the compatibility issues are already resolved).</li> <li>To create a session on a specific GPU, kill the previous tf.Session() process after finding its PID from nvidia-smi, then set CUDA_VISIBLE_DEVICES to an available GPU ID (0 in this example). tf.Session() can then create a session on the specified GPU device.</li> <li>Otherwise, if nothing works with the GPU, kill the previous tf.Session() process after finding its PID from nvidia-smi and set CUDA_VISIBLE_DEVICES to an empty value; tf.Session() will then create a CPU-only session.</li></ul>
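The two device-selection settings can be sketched as follows. Note that CUDA_VISIBLE_DEVICES must be set before TensorFlow is imported, which is why the TensorFlow lines are shown commented out:

```python
import os

# Option 1: pin the process to a specific, free GPU (GPU 0 in this example)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Option 2: hide all GPUs so the session falls back to the CPU
# os.environ["CUDA_VISIBLE_DEVICES"] = ""

# import tensorflow as tf   # import only AFTER the variable is set
# sess = tf.Session()       # GPU 0 with option 1, CPU-only with option 2
```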
Answer3:
I had a similar problem while working on a cluster. When I submitted the job script to the Slurm server it ran fine, but while training the model in a Jupyter notebook I would get the following error:
InternalError: Failed to create session
Reason: I was running multiple Jupyter notebooks on the same GPU (all of them using TensorFlow), so no new TensorFlow session could be created. The problem was solved by stopping all the Jupyter notebooks and then running only one or two at a time.
Below is the error log from the Jupyter notebook:
Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 12786073600
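If you do need a couple of notebooks sharing one GPU, TensorFlow 1.x (the API used throughout this page) can be told not to grab all GPU memory up front. A session-config sketch; whether two notebooks actually fit still depends on your models' memory footprints:

```python
import tensorflow as tf

config = tf.ConfigProto()
# Allocate GPU memory on demand instead of claiming it all at startup
config.gpu_options.allow_growth = True
# Alternatively, cap each process at a fixed share of GPU memory (~40% here)
# config.gpu_options.per_process_gpu_memory_fraction = 0.4
sess = tf.Session(config=config)
```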