
How do I connect to a Dataproc cluster with Jupyter notebooks from Cloud Shell?

Question:

I have seen the instructions at <a href="https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook" rel="nofollow">https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook</a> for setting up Jupyter notebooks with Dataproc, but I can't figure out how to alter the process to use Cloud Shell instead of creating an SSH tunnel locally. I have been able to connect to a Datalab notebook by running

datalab connect vmname

from Cloud Shell and then using the web preview function. I would like to do something similar, but with Jupyter notebooks and a Dataproc cluster.

Answer1:

In theory, you can mostly follow the instructions at <a href="https://cloud.google.com/shell/docs/features#web_preview" rel="nofollow">https://cloud.google.com/shell/docs/features#web_preview</a> and use local port forwarding to reach your Jupyter notebooks on Dataproc through Cloud Shell's "web preview" feature. Something like the following in Cloud Shell:

gcloud compute ssh my-cluster-m -- -L 8080:my-cluster-m:8123

However, there are two issues which prevent this from working:

<ol><li>

You need to modify the Jupyter config by adding the following to the bottom of /root/.jupyter/jupyter_notebook_config.py:

c.NotebookApp.allow_origin = '*'

</li> <li>

Cloud Shell's web preview needs to add support for websockets.

</li> </ol>

If you skip (1), you'll get popup errors when trying to create a notebook, because Jupyter refuses the Cloud Shell proxy domain. Unfortunately, (2) requires deeper support from Cloud Shell itself; it manifests as errors like "A connection to the notebook server could not be established."
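The config change in (1) can be scripted. The following is a minimal sketch, assuming it runs as root on the cluster's master node; the function name add_allow_origin is hypothetical, and the path in the usage comment is the one given above. It appends the setting only if the exact line is not already present, so rerunning it does not create duplicates.

```shell
# Hypothetical helper: append the allow_origin setting to a Jupyter config file
# unless the exact line is already there.
# $1 = path to jupyter_notebook_config.py
add_allow_origin() {
  config_file="$1"
  line="c.NotebookApp.allow_origin = '*'"
  # Create the directory and file if they do not exist yet.
  mkdir -p "$(dirname "$config_file")"
  touch "$config_file"
  # -x matches the whole line, -F treats the pattern as a fixed string.
  grep -qxF -- "$line" "$config_file" || printf '%s\n' "$line" >> "$config_file"
}

# Example invocation on the master node (left commented out on purpose):
# add_allow_origin /root/.jupyter/jupyter_notebook_config.py
```

Jupyter most likely has to be restarted before it picks up the new setting; how to do that depends on how the initialization action launched it.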

Another possible option without waiting for (2) is to run your own nginx proxy as part of the jupyter initialization action on a Dataproc cluster, if you can get it to proxy websockets suitably. See this thread for a similar situation: <a href="https://github.com/jupyter/notebook/issues/1311" rel="nofollow">https://github.com/jupyter/notebook/issues/1311</a>

Broken websocket support in proxy layers is a common problem because websockets are still relatively new; over time, more and more tools will support them out of the box.

<strong>Alternatively:</strong>

Dataproc also supports a Datalab initialization action, which is set up so that websocket proxying is already taken care of. So if you don't depend on Jupyter specifically, the following works in Cloud Shell:

gcloud dataproc clusters create my-datalab-cluster \
    --initialization-actions gs://dataproc-initialization-actions/datalab/datalab.sh

gcloud compute ssh my-datalab-cluster-m -- -L 8080:my-datalab-cluster-m:8080

Then select the usual "Web Preview" on port 8080. You can also bind a different local port that Cloud Shell supports, for example:

gcloud compute ssh my-datalab-cluster-m -- -L 8082:my-datalab-cluster-m:8080

In which case you'd select 8082 as the web preview port.
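Since the two tunnel commands differ only in the local port, the pattern can be captured in a small helper. This is a sketch only: tunnel_cmd is a hypothetical name, it assumes the master node is named after the cluster with a "-m" suffix, and it prints the command rather than running it so you can inspect it first.

```shell
# Hypothetical helper: print (not run) the port-forwarding command for a cluster.
# $1 = cluster name
# $2 = local port to select in Web Preview (defaults to 8080)
# $3 = remote port on the master node (defaults to 8080, as used by Datalab above)
tunnel_cmd() {
  cluster="$1"
  local_port="${2:-8080}"
  remote_port="${3:-8080}"
  # Assumption: the master node is "<cluster>-m".
  master="${cluster}-m"
  printf 'gcloud compute ssh %s -- -L %s:%s:%s\n' \
    "$master" "$local_port" "$master" "$remote_port"
}

# Print the command for local port 8082.
tunnel_cmd my-datalab-cluster 8082
```

Whatever local port you pass as the second argument is the one to choose in Web Preview.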

Answer2:

You can't connect to Dataproc through a Datalab instance installed on a plain GCE VM.

As the documentation you mentioned says, you must launch the Dataproc cluster with the Datalab initialization action.

Moreover, the datalab connect command only works for Datalab instances that were created with the datalab create command.

You must create an SSH tunnel to your master node ("vmname-m" if your cluster name is "vmname") with:

gcloud compute ssh --zone YOUR-ZONE --ssh-flag="-D 1080" --ssh-flag="-N" --ssh-flag="-n" "vmname-m"
