82297

Spark repartition is slow and shuffles too much data

Question:

My cluster:

<ul><li>5 data node</li> <li>each data node has: 8 CPUs, 45GB memory</li> </ul>

Due to some other configuration limit, I can only start 5 executors on each data node. So I did

spark-submit --num-executors 30 --executor-memory 2G ...

So each executor uses 1 core.

I have two data set, each is about 20 GB. In my code, I did:

val rdd1 = sc...cache() val rdd2 = sc...cache() val x = rdd1.cartesian(rdd2).repartition(30) map ...

In the Spark UI, I saw the repartition step took more than 30 mins, and it cause data shuffle of more than 150GB.

I don't think it is right. But I could not figure out what goes wrong...

Answer1:

Did you really mean "cartesian"?

You are multiplying every row in RDD1 by every row in RDD2. So if your rows were 1k each, you had about 20,000 rows per RDD. The cartesian product will return a set with 20,000 x 20,000 or 400 million records. And note that each row would now be double in width -- 2k -- so you'd have 800 GB in RDD3 whereas you only had 20 GB each in RDD1 and RDD2.

Perhaps try:

val x = rdd1.union(rdd2).repartition(30) map ...

or maybe even:

val x = rdd1.zip(rdd2).repartition(30) map ...

?

Recommend

  • What's the difference between join and cogroup in Apache Spark
  • What is Spark Job ?
  • Print the content of streams (Spark streaming) in windows system
  • Shutting down netty 4 application throws RejectedExecutionException
  • FATAL master.HMaster: Unexpected state : .. Cannot transit it to OFFLINE
  • Why does this query make my whole database freeze?
  • Exporting SAS DataSet on to UNIX as a text file…with delimiter '~|~'
  • Fraction length
  • Can't delete li from to-do list
  • Iterate twice through a DataReader
  • JQuery Auto-Complete: How do I handle modifications?
  • jquery code not working without breakpoint
  • Where these are stored?
  • HttpClient: disabling chunked encoding
  • get iframe content as string
  • Simple linked list-C
  • Clear fused location provider's location for testing
  • Group list of tuples by item
  • android.support.v7.widget.Toolbar VectorDrawableCompat IllegalStateException when using support lib
  • Spark job failing in YARN mode
  • IE11 throwing “SCRIPT1014: invalid character” where all other browsers work
  • Angular2 component view does not update on value change via method
  • CakePHP ACL tutorial initDB function warnings
  • What is the purpose of TaskExecutor in spring?
  • jQuery .attr() and value
  • DotNetZip - Calculate final zip size before calling Save(stream)
  • MySQL WHERE-condition in procedure ignored
  • Do I've to free mysql result after storing it?
  • SetUp method failed while running tests from teamcity
  • Benchmarking RAM performance - UWP and C#
  • Acquiring multiple attributes from .xml file in c#
  • How to set the response of a form post action to a iframe source?
  • Understanding cpu registers
  • How to CLICK on IE download dialog box i.e.(Open, Save, Save As…)
  • Change div Background jquery
  • Qt: Run a script BEFORE make
  • apache spark aggregate function using min value
  • How can I remove ASP.NET Designer.cs files?
  • reshape alternating columns in less time and using less memory
  • java string with new operator and a literal