77940

Returning a large data structure from Dataflow worker node, getting stuck in serializing graph

Question:

I have large graph ~100k vertices and ~1 million edges being constructed in a DoFn function. When I try to output that graph in DoFn function execution gets stuck at c.output(graph);.

public static class Prep extends DoFn<TableRow, TableRows> { @Override public void processElement(ProcessContext c) { //Graph creation logic runs very fast, no problem here LOG.info("Starting Graph Output"); // can see this in logs c.output(graph); //outputs data from DoFn function LOG.info("Ending Graph Output"); // never see this logs } }

My graph class is just a Map of vertices being serialized with AvroCoder.

import org.apache.avro.reflect.Nullable; import org.slf4j.Logger; import org.slf4j.LoggerFactory; import com.X.Prep; import com.google.cloud.dataflow.sdk.coders.AvroCoder; import com.google.cloud.dataflow.sdk.coders.DefaultCoder; //Class that creates Graph data structure for custom seg definitions @DefaultCoder(AvroCoder.class) public class MyGraph { @Nullable public Map<String,GraphVertex> vertexList = new HashMap<String,GraphVertex>(); }

I have tried json-simple, gson, jackson json serialization all of them take too long to serialize this graph.

Answer1:

The graph object is likely too large to be encoded and passed around as an element. You should explore other mechanisms for getting the graph to workers. For example, creating a multi-map-valued side input (keyed by vertex). This would allow you to have a PCollection (processed in parallel).

Alternatively, since the graph creation logic runs very fast just run that logic on each worker, rather than trying to serialize the entire graph.

Recommend

  • K Shortest Path Python Not Working
  • firebase unauth with google doesn't allow change of user
  • How to add an object in my collection by only using add method? [closed]
  • Rely on Facebook user id as a permanent user identifier
  • System call time out?
  • c++ search a vector for element first seen position
  • How to add learning rate to summaries?
  • Create registry key in 32-bit hive on x64 PC using Installshield 2012 LE - Avoid redirection
  • Is there any purpose for h2-h6 headings in HTML5?
  • Double dispatch in Java example
  • Most efficient way to move table rows from one table to another
  • New Firebase failed: First argument must be a valid firebase URL and the path can't contain “.”
  • JSON encode and decode on PHP
  • Building Qt project for C++11 standard
  • Servlet stops working on Tomcat server after some hits or time
  • Check all values in string[] for length?
  • RxJava debounce by arbitrary value
  • onBackPressed() not being executed
  • Is there a way to do normal logging with EureakLog?
  • iOS: Detect app start via notification press
  • OpenGL 3.3 on Mac OSX El Capitan with LWJGL
  • Asynchronous UI Testing in Xcode With Swift
  • How to rebase a series of branches?
  • formatting the colorbar ticklabels with SymLogNorm normalization in matplotlib
  • Join two tables and save into third-sql
  • Deserializing XML into class C#
  • How to model a transition system with SPIN
  • Can a Chrome extension content script make an jQuery AJAX request for an html file that is itself a
  • ActionScript 2 vs ActionScript 3 performance
  • Upload files with Ajax and Jquery
  • Large data - storage and query
  • ORA-29908: missing primary invocation for ancillary operator
  • Function pointer “assignment from incompatible pointer type” only when using vararg ellipsis
  • AngularJs get employee from factory
  • Proper way to use connect-multiparty with express.js?
  • Why joiner is not used after Sequence generator or Update statergy
  • How can I remove ASP.NET Designer.cs files?
  • python draw pie shapes with colour filled
  • How to Embed XSL into XML
  • Converting MP3 duration time