87979

apache spark aggregate function using min value

I tried one example found on http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html

val z = sc.parallelize(List("12","23","345","4567"),2) z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y) res142: String = 11

Why the min length is 1? The first partition contains ["12", "23"] and the second one ["345","4567"]. Comparing the min from any partition with the initial value "", the min value should be 0. And the expected result in my understanding would be 00

val z = sc.parallelize(List("12","23","345",""),2) z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y) res143: String = 10

for this one I understand the same, the final result should be 00

Thanks in advance.

Answer1:

First lets see how parallelize splits your data between partitions:

val x = sc.parallelize(List("12","23","345","4567"), 2) x.glom.collect // Array[Array[String]] = Array(Array(12, 23), Array(345, 4567)) val y = sc.parallelize(List("12","23","345",""), 2) y.glom.collect // Array[Array[String]] = Array(Array(12, 23), Array(345, ""))

and define two helpers:

def seqOp(x: String, y: String) = math.min(x.length, y.length).toString def combOp(x: String, y: String) = x + y

Now lets trace execution for x. Ignoring parallelism it can be represented as follows:

(combOp (seqOp (seqOp "" "12") "23") (seqOp (seqOp "" "345") "4567")) (combOp (seqOp "0" "23") (seqOp (seqOp "" "345") "4567")) (combOp "1" (seqOp (seqOp "" "345") "4567")) (combOp "1" (seqOp "0" "4567")) (combOp "1" "1") "11"

The same thing for y:

(combOp (seqOp (seqOp "" "12") "23") (seqOp (seqOp "" "345") "")) (combOp (seqOp "0" "23") (seqOp (seqOp "" "345") "")) (combOp "1" (seqOp (seqOp "" "345") "")) (combOp "1" (seqOp "0" "")) (combOp "1" "0") "10"

That being said you shouldn't use aggregate here in the first place. Since operations you apply are not associative a whole idea is simply wrong.

Recommend

  • Delphi XE3, ugly StringGrid's borders
  • Compare variables PHP
  • SharePoint Designer 2010 - Determine if today's date is within x days of a start date column us
  • rails - convert DateTime to UTC before saving to server
  • Nodemailer with Gmail “From: address” does not change
  • SyntaxError: unterminated string literal … tag not working within a string variable
  • Level-order tree traversal
  • Textbox validation in jquery
  • javascript / jquery scope differences between jQuery.each and normal for loop?
  • Vue props data not updating in child component
  • smarty nested if condition is not working properly?
  • Oracle ListaGG, Top 3 most frequent values, given in one column, grouped by ID
  • Can't delete li from to-do list
  • Azure table query partial partitionkey guid match
  • JQuery Auto-Complete: How do I handle modifications?
  • Excel distinct count with conditions
  • jquery code not working without breakpoint
  • get iframe content as string
  • How do I remove all but some records based on a threshold?
  • Simple linked list-C
  • Group list of tuples by item
  • Merging rows to columns
  • IE11 throwing “SCRIPT1014: invalid character” where all other browsers work
  • D3 get axis values on zoom event
  • Custom validator control occupying space even though display set to dynamic
  • Admob requires api-13 or later can I not deploy on old API-8 phones?
  • JSON response opens as a file, but I can't access it with JavaScript
  • jQuery .attr() and value
  • Volley JsonObjectRequest send headers in GET Request
  • Javascript convert timezone issue
  • How to set my toolbar fixed while scrolling android
  • Hazelcast - OperationTimeoutException
  • AT Commands to Send SMS not working in Windows 8.1
  • retrieve vertices with no linked edge in arangodb
  • Windows forms listbox.selecteditem displaying “System.Data.DataRowView” instead of actual value
  • FormattedException instead of throw new Exception(string.Format(…)) in .NET
  • Change div Background jquery
  • IndexOutOfRangeException on multidimensional array despite using GetLength check
  • apache spark aggregate function using min value
  • Sorting a 2D array using the second column C++