Returning a large data structure from Dataflow worker node, getting stuck in serializing graph


I have large graph ~100k vertices and ~1 million edges being constructed in a DoFn function. When I try to output that graph in DoFn function execution gets stuck at c.output(graph);.

public static class Prep extends DoFn<TableRow, TableRows> { @Override public void processElement(ProcessContext c) { //Graph creation logic runs very fast, no problem here LOG.info("Starting Graph Output"); // can see this in logs c.output(graph); //outputs data from DoFn function LOG.info("Ending Graph Output"); // never see this logs } }

My graph class is just a Map of vertices being serialized with AvroCoder.

import org.apache.avro.reflect.Nullable; import org.slf4j.Logger; import org.slf4j.LoggerFactory; import com.X.Prep; import com.google.cloud.dataflow.sdk.coders.AvroCoder; import com.google.cloud.dataflow.sdk.coders.DefaultCoder; //Class that creates Graph data structure for custom seg definitions @DefaultCoder(AvroCoder.class) public class MyGraph { @Nullable public Map<String,GraphVertex> vertexList = new HashMap<String,GraphVertex>(); }

I have tried json-simple, gson, jackson json serialization all of them take too long to serialize this graph.


The graph object is likely too large to be encoded and passed around as an element. You should explore other mechanisms for getting the graph to workers. For example, creating a multi-map-valued side input (keyed by vertex). This would allow you to have a PCollection (processed in parallel).

Alternatively, since the graph creation logic runs very fast just run that logic on each worker, rather than trying to serialize the entire graph.


  • K Shortest Path Python Not Working
  • firebase unauth with google doesn't allow change of user
  • How to add an object in my collection by only using add method? [closed]
  • Rely on Facebook user id as a permanent user identifier
  • System call time out?
  • c++ search a vector for element first seen position
  • How to add learning rate to summaries?
  • Create registry key in 32-bit hive on x64 PC using Installshield 2012 LE - Avoid redirection
  • Is there any purpose for h2-h6 headings in HTML5?
  • Double dispatch in Java example
  • Most efficient way to move table rows from one table to another
  • New Firebase failed: First argument must be a valid firebase URL and the path can't contain “.”
  • JSON encode and decode on PHP
  • Building Qt project for C++11 standard
  • Servlet stops working on Tomcat server after some hits or time
  • Check all values in string[] for length?
  • RxJava debounce by arbitrary value
  • onBackPressed() not being executed
  • Is there a way to do normal logging with EureakLog?
  • iOS: Detect app start via notification press
  • OpenGL 3.3 on Mac OSX El Capitan with LWJGL
  • Asynchronous UI Testing in Xcode With Swift
  • How to rebase a series of branches?
  • formatting the colorbar ticklabels with SymLogNorm normalization in matplotlib
  • Join two tables and save into third-sql
  • Deserializing XML into class C#
  • How to model a transition system with SPIN
  • Can a Chrome extension content script make an jQuery AJAX request for an html file that is itself a
  • ActionScript 2 vs ActionScript 3 performance
  • Upload files with Ajax and Jquery
  • Large data - storage and query
  • ORA-29908: missing primary invocation for ancillary operator
  • Function pointer “assignment from incompatible pointer type” only when using vararg ellipsis
  • AngularJs get employee from factory
  • Proper way to use connect-multiparty with express.js?
  • Why joiner is not used after Sequence generator or Update statergy
  • How can I remove ASP.NET Designer.cs files?
  • python draw pie shapes with colour filled
  • How to Embed XSL into XML
  • Converting MP3 duration time