
Tensorflow: failed to create session in server

Question:

I developed a model in Keras and trained it quite a few times. At one point I forcefully stopped a training run, and since then I have been getting the following error:

Traceback (most recent call last):
  File "inception_resnet.py", line 246, in <module>
    callbacks=[checkpoint, saveEpochNumber]) ##
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/engine/training.py", line 2042, in fit_generator
    class_weight=class_weight)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/engine/training.py", line 1762, in train_on_batch
    outputs = self.train_function(ins)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2270, in __call__
    session = get_session()
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 163, in get_session
    _SESSION = tf.Session(config=config)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1486, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 621, in __init__
    self._session = tf_session.TF_NewDeprecatedSession(opts, status)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

So the error is actually

<blockquote>

tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

</blockquote>

Most probably, the GPU memory is still occupied. I can't even create a simple TensorFlow session.

I have seen an answer <a href="https://stackoverflow.com/a/44775164/5254777" rel="nofollow">here</a>, but when I execute the following command in terminal

export CUDA_VISIBLE_DEVICES=''

training of the model starts, but without GPU acceleration.

Also, I am training my model on a server to which I have no root access, so I can't restart it or clear the GPU memory as root. What is the solution now?

Answer1:

I found the solution in a comment of <a href="https://stackoverflow.com/questions/15197286/how-can-i-flush-gpu-memory-using-cuda-physical-reset-is-unavailable" rel="nofollow">this question</a>.

nvidia-smi -q

This gives a list of all the processes (and their PIDs) occupying GPU memory. I killed them one by one by using

kill -9 PID

Now everything is running smoothly again.
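The same two steps (parse the `nvidia-smi` process table, then `kill -9` each PID) can be sketched in Python. This is only an illustration: the `gpu_pids` and `kill_gpu_processes` names are my own, and the regex assumes the process-table layout of driver-390-era `nvidia-smi`, which varies between driver versions.

```python
import os
import re
import signal
import subprocess

def gpu_pids(smi_output):
    """Extract PIDs from the process table at the bottom of `nvidia-smi` output.

    Assumes rows shaped roughly like (newer drivers add extra columns):
    |    0     12345      C   python                                  11000MiB |
    """
    pids = []
    for line in smi_output.splitlines():
        m = re.match(r"\|\s+\d+\s+(\d+)\s+\S+\s+\S+", line)
        if m:
            pids.append(int(m.group(1)))
    return pids

def kill_gpu_processes():
    """Equivalent of running `kill -9 PID` for every process holding GPU memory."""
    out = subprocess.check_output(["nvidia-smi"]).decode()
    for pid in gpu_pids(out):
        if pid != os.getpid():            # never kill the current process
            # Without root access this only works for your own processes;
            # os.kill raises PermissionError for other users' jobs.
            os.kill(pid, signal.SIGKILL)  # SIGKILL == kill -9
```
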

Answer2:

I am using Anaconda 4.5.12 with Python 3.5 and NVIDIA driver 390.116, and I faced the same issue. In my case it was caused by an incompatible cudatoolkit version:

conda install tensorflow-gpu

installed cudatoolkit 9.3.0 with cudnn 7.3.x. However, after going through the answers <a href="https://github.com/tensorflow/tensorflow/issues/9549" rel="nofollow">here</a> and checking my other virtual environment, where I use PyTorch with the GPU without any problem, I inferred that cudatoolkit 9.0.0 would be compatible with my driver version.

conda install cudatoolkit==9.0.0

<strong>This installed cudatoolkit 9.0.0 and cudnn 7.3.0 from cuda 9.0_0 build. After this I was able to create tensorflow session with GPU.</strong>

Now, coming to the options for killing jobs:

<ul><li>If GPU memory is occupied by other jobs, then killing them one by one, as suggested by @Preetam saha arko, will free up the GPU and may allow you to create a tf session with it (provided the compatibility issues are already resolved)</li> <li>

To create a session on a specific GPU, kill the previous tf.Session() process after finding its PID from nvidia-smi, and set the CUDA visible devices to an available GPU ID (0 in this example):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = '0'

Then tf.Session() can create a session on the specified GPU device.

</li> <li>

Otherwise, if nothing works with the GPU, kill the previous tf.Session() process after finding its PID from nvidia-smi and set the CUDA visible devices to empty:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = ''

Then tf.Session() will create a session on the CPU.

</li> </ul>
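The two `CUDA_VISIBLE_DEVICES` settings above can be wrapped in one small helper. This is just a sketch (the `select_device` name is my own); the essential point is that the variable must be set before `import tensorflow`, because TensorFlow reads it when it initializes the GPU devices.

```python
import os

def select_device(gpu_id=None):
    """Set CUDA_VISIBLE_DEVICES; must run before TensorFlow is imported.

    gpu_id=None hides every GPU, forcing a CPU-only session;
    an integer exposes just that one GPU to TensorFlow.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = "" if gpu_id is None else str(gpu_id)
    return os.environ["CUDA_VISIBLE_DEVICES"]

# Usage -- the env var must be set *before* `import tensorflow`:
#   select_device(0)           # session will use GPU 0
#   import tensorflow as tf
#   sess = tf.Session()
```
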

Answer3:

I had a similar problem while working on a cluster. When I submitted the job script to the Slurm server, it would run fine, but while training the model in a Jupyter notebook I would get the following error:

InternalError: Failed to create session

Reason: I was running multiple Jupyter notebooks on the same GPU (all of them using TensorFlow), so a new TensorFlow session could not be created. The problem was solved by stopping all the Jupyter notebooks and then running only one or two at a time.

Below is the error log from the Jupyter notebook:

Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 12786073600
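The `CUDA_ERROR_OUT_OF_MEMORY` in that log happens because, by default, each TensorFlow 1.x process maps nearly the whole GPU when its first session is created, so a second notebook on the same card has nothing left to claim. If notebooks must share one GPU, the session config can be made less greedy. A sketch using the TF 1.x API from the traceback above (this only helps before memory is exhausted; it is not a substitute for stopping the extra notebooks):

```python
import tensorflow as tf          # TF 1.x API, matching the traceback above
import keras.backend as K

# allow_growth makes TensorFlow allocate GPU memory on demand instead of
# mapping almost the entire card up front, so another notebook can still start.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Alternatively, cap each process's share of the card explicitly:
# config.gpu_options.per_process_gpu_memory_fraction = 0.45

sess = tf.Session(config=config)
K.set_session(sess)              # make Keras use this less-greedy session
```
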
