I have to join data from Google Datastore and Google BigTable to produce a report, and I need to execute that operation every minute. Is it possible to accomplish this with Google Cloud Dataflow (assuming the processing itself should not take a long time and/or can be split into independent parallel jobs)?

1. Should I have an endless loop inside "main", creating and executing the same pipeline again and again?
2. If most of the time in such a scenario is spent bringing up the VMs, is it possible to instruct Dataflow to use customer-provided VMs instead?
If you expect each run to be small enough to complete within 60 seconds, you could consider calling the Datastore and BigTable APIs from within a DoFn in a streaming job. Your pipeline might look something like this:
// Emit one element per minute, forever (e.g. with Beam's GenerateSequence).
PCollection<Long> impulse = p.apply(
    GenerateSequence.from(0).withRate(1, Duration.standardMinutes(1)));
PCollection<A> input1 = impulse.apply(ParDo.of(readFromDatastore));
PCollection<B> input2 = impulse.apply(ParDo.of(readFromBigTable));
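What readFromDatastore (and readFromBigTable) does is up to you. As a minimal sketch, assuming the google-cloud-datastore client and a hypothetical kind name "Report", a DoFn that queries Datastore on each impulse element could look like:

import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.DatastoreOptions;
import com.google.cloud.datastore.Entity;
import com.google.cloud.datastore.Query;
import com.google.cloud.datastore.QueryResults;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.DoFn.ProcessElement;
import org.apache.beam.sdk.transforms.DoFn.Setup;

// Sketch only: runs one Datastore query per impulse element and emits the results.
// The kind "Report" is a placeholder; in practice you would also need a coder
// registered for the output type.
class ReadFromDatastoreFn extends DoFn<Long, Entity> {
  private transient Datastore datastore;

  @Setup
  public void setup() {
    // Create the client once per DoFn instance, not once per element.
    datastore = DatastoreOptions.getDefaultInstance().getService();
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    Query<Entity> query = Query.newEntityQueryBuilder().setKind("Report").build();
    QueryResults<Entity> results = datastore.run(query);
    while (results.hasNext()) {
      c.output(results.next());
    }
  }
}

readFromBigTable would have the same shape, using the BigTable client to scan the rows you need.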
This produces a single input element every minute, forever. Because the job runs as a streaming pipeline, the VMs stay up between reads, so you do not pay the VM startup cost on every run.
After reading from both APIs, you can then window and join the results as necessary.
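For example, assuming the two reads have been keyed by a common String key into keyed1 and keyed2 (the keying step is not shown), a CoGroupByKey over fixed one-minute windows joins each minute's reads; A and B are the placeholder element types from above:

import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;
import org.joda.time.Duration;

final TupleTag<A> tag1 = new TupleTag<>();
final TupleTag<B> tag2 = new TupleTag<>();

// Align both inputs on the same fixed one-minute windows.
PCollection<KV<String, A>> windowed1 =
    keyed1.apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))));
PCollection<KV<String, B>> windowed2 =
    keyed2.apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))));

// Per key and per window, joined holds all A values (via tag1) and
// all B values (via tag2); retrieve them with CoGbkResult.getAll(tag).
PCollection<KV<String, CoGbkResult>> joined =
    KeyedPCollectionTuple.of(tag1, windowed1)
        .and(tag2, windowed2)
        .apply(CoGroupByKey.create());

From joined you can build the report rows and write them out with whatever sink you prefer.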