I have seen the instructions here <a href="https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook" rel="nofollow">https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook</a> for setting up Jupyter notebooks with dataproc but I can't figure out how to alter the process in order to use Cloud shell instead of creating an SSH tunnel locally. I have been able to connect to a datalab notebook by running
datalab connect vmname
from the Cloud Shell and then using the preview function. I would like to do something similar, but with Jupyter notebooks and a Dataproc cluster.
Answer1:
In theory, you can mostly follow the same instructions found at <a href="https://cloud.google.com/shell/docs/features#web_preview" rel="nofollow">https://cloud.google.com/shell/docs/features#web_preview</a> and use local port forwarding to reach your Jupyter notebooks on Dataproc via Cloud Shell's "web preview" feature. Something like the following in your Cloud Shell:
gcloud compute ssh my-cluster-m -- -L 8080:my-cluster-m:8123
However, there are two issues which prevent this from working:<ol><li>
You need to modify the Jupyter config, adding the following line to the bottom of the config file:
c.NotebookApp.allow_origin = '*'</li> <li>
Cloud Shell's web preview needs to add support for websockets.</li> </ol>
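For (1), the change is a one-line append to the notebook server's Python config file. A minimal sketch, with the caveat that the config file's location is an assumption (it depends on how the initialization action installed Jupyter; `~/.jupyter/jupyter_notebook_config.py` is a common default):

```python
# Sketch of the step (1) change, appended to jupyter_notebook_config.py
# (the exact path is an assumption; adjust for your install).
c = get_config()  # provided by Jupyter when it loads this config file

# Accept cross-origin requests so the Cloud Shell web-preview proxy
# domain is not rejected by the notebook server.
c.NotebookApp.allow_origin = '*'
```

Note that `'*'` accepts any origin; if you want something tighter you can set `allow_origin` to the specific proxy domain instead.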
If you don't do (1), you'll get popup errors when trying to create a notebook, because Jupyter rejects the Cloud Shell proxy domain. Unfortunately, (2) requires deeper support from Cloud Shell itself; the missing websocket support manifests as errors like
A connection to the notebook server could not be established.
Another possible option, without waiting for (2), is to run your own nginx proxy as part of the Jupyter initialization action on the Dataproc cluster, if you can get it to proxy websockets suitably. See this thread for a similar situation: <a href="https://github.com/jupyter/notebook/issues/1311" rel="nofollow">https://github.com/jupyter/notebook/issues/1311</a>
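As a rough illustration of that workaround (the ports, upstream address, and file layout here are assumptions, not taken from the initialization action), an nginx site config that forwards websocket upgrades in front of Jupyter might look like:

```nginx
# Hypothetical reverse proxy in front of a Jupyter server on port 8123.
server {
    listen 8080;
    location / {
        proxy_pass http://localhost:8123;
        # Websocket support: HTTP/1.1 plus the Upgrade/Connection headers
        # are what let the notebook UI's ws:// connections through the proxy.
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }
}
```

Without the `Upgrade`/`Connection` headers, nginx would drop the websocket handshake and you'd see the same "connection to the notebook server could not be established" failures.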
Generally this type of broken websocket support in proxy layers is a common problem since it's still relatively new; over time more and more things will start to support websockets out of the box.
Dataproc also supports using a Datalab initialization action; this is set up such that the websockets proxying is already taken care of. Thus, if you're not too dependent on just Jupyter specifically, then the following works in cloud shell:
gcloud dataproc clusters create my-datalab-cluster \
    --initialization-actions gs://dataproc-initialization-actions/datalab/datalab.sh
gcloud compute ssh my-datalab-cluster-m -- -L 8080:my-datalab-cluster-m:8080
And then select the usual "Web Preview" on port 8080. Or you can select other Cloud Shell supported ports for the local binding like:
gcloud compute ssh my-datalab-cluster-m -- -L 8082:my-datalab-cluster-m:8080
in which case you'd select 8082 as the web preview port.
Answer2:
You can't connect to Dataproc through a Datalab instance installed on a standalone GCE VM.
As the documentation you mentioned explains, you must launch the Dataproc cluster with a Datalab initialization action.
The datalab connect command is only available if you created the Datalab instance with the datalab create command.
You must create an SSH tunnel to your master node ("vmname-m" if your cluster name is "vmname") with:
gcloud compute ssh --zone YOUR-ZONE --ssh-flag="-D 1080" --ssh-flag="-N" --ssh-flag="-n" "vmname-m"
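For completeness, the usual next step with that SOCKS tunnel is done from a local machine's browser (this follows the standard Dataproc cluster-web-interfaces pattern; it won't work inside Cloud Shell itself, which has no browser, and the port assumes Jupyter is listening on 8123):

```shell
# Point a browser at the master node through the SOCKS proxy opened by -D 1080.
# Flag values are illustrative; adjust the cluster name and port for your setup.
google-chrome \
    --proxy-server="socks5://localhost:1080" \
    --user-data-dir="/tmp/vmname-m" \
    "http://vmname-m:8123"
```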