73968

Running periodic Dataflow job

Question:

I have to join data from Google Datastore and Google BigTable to produce some report. I need to execute that operation every minute. Is it possible to accomplish with Google Cloud Dataflow (assuming the processing itself should not take long time and/or can be split in independent parallel jobs)?

<ol><li>

Should I have endless loop inside the "main" creating and executing the same pipeline again and again?

</li> <li>

If most of time in such scenario is taken by bringing up the VMs, is it possible to instruct the Dataflow to use customer VMs instead?

</li> </ol>

Thanks,

Answer1:

If you expect that your job is small enough to complete in 60 seconds you could consider using the Datastore and BigTable APIs from within a DoFn in a Streaming job. Your pipeline might look something like:

PCollection<Long> impulse = p.apply( CountingInput.unbounded().withRate(1, Duration.standardMinutes(1))) PCollection<A> input1 = impulse.apply(ParDo.of(readFromDatastore)); PCollection<B> input2 = impulse.apply(ParDo.of(readFromBigTable)); ...

This produces a single input every minute, forever. Running as a streaming pipeline, the VMs will continue running.

After reading from both APIs you can then window/join as necessary.

Recommend

  • Find “command not found” when executed in bash loop
  • How to set diskSourceImage in google data flow pipeline
  • Can I manage guest VMs in vSphere free using Ansible without vCenter?
  • iOS: How to convert the self-drawn content of an UIView to an image (widespread general solution ret
  • jquery insert html into list before a child ul-tag
  • MYSQL: How to find player_id from surname?
  • mod_rewrite replace all slashes (/)
  • How to load the application in Web-View in a specific browser like Chrome or Firefox in Android
  • difference between three development modes in rails
  • Maven LifeCycleExecutor with an incomplete configuration error
  • Is there any way, that I can make Android Emulator run on Azure Virtual Machine?
  • Best match using MySQL and PHP
  • How to install and setup Testswarm?
  • Hudson - different build targets for different triggers
  • Is it good practice to put Edge Side Includes into my templates?
  • How can I get the maximum number of OpenMP threads that may be created during the whole execution of
  • Ray-tracing triangles
  • How do I get the number of jobs in a rq queue?
  • Moving data between processes in Spartan 3
  • How to start server for Selenium grid Java Maven setup
  • Number of threads being used during Parallel.ForEach
  • Python: Split a String Field into 3 Separate Fields using Lambda
  • Generic classes with Collection getter of other types
  • how can I compare dates in array to find the earliest one?
  • Divide a $1 by 3 and adjusting 1 cent
  • How do I shift the decimal place in Python?
  • Security issues with PHP's Readfile method
  • TextToSpeech.setEngineByPackageName() triggers NullPointerException
  • Mysterious problem with floating point in LISP - time axis generation
  • How to know when stdin is empty if it contains EOF?
  • output of program is not same as passed argument
  • Does CUDA 5 support STL or THRUST inside the device code?
  • Timeout for blocking function call, i.e., how to stop waiting for user input after X seconds?
  • Akka Routing: Reply's send to router ends up as dead letters
  • Can Visual Studio XAML designer handle font family names with spaces as a resource?
  • How can I remove ASP.NET Designer.cs files?
  • Are Kotlin's Float, Int etc optimised to built-in types in the JVM? [duplicate]
  • unknown Exception android
  • Checking variable from a different class in C#
  • Reading document lines to the user (python)