In the following line, which RDD is being persisted: dropResultsN or dataSetN?
dropResultsN = dataSetN.map(s -> standin.call(s)).persist(StorageLevel.MEMORY_ONLY());
This question arose as a side issue from <a href="https://stackoverflow.com/questions/38296950/apache-spark-timing-foreach-operation-on-javardd" rel="nofollow">Apache Spark timing forEach operation on JavaRDD</a>, where I am still looking for a good answer to the core question of how best to time RDD creation.
Answer1:
dropResultsN is the persisted RDD, that is, the RDD produced by mapping
standin.call() over the elements of dataSetN. dataSetN itself is not persisted by this line.
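One way to check this yourself is to compare the storage level of both RDDs after the chained call. The following is only a minimal sketch, not the asker's actual program: the local master, the sample data, the class name PersistTarget, and the use of s * 2 in place of standin.call(s) are all assumptions made for illustration.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

class PersistTarget {
    // True if a storage level other than NONE has been assigned to the RDD.
    static boolean isMarkedForPersistence(JavaRDD<?> rdd) {
        return !rdd.getStorageLevel().equals(StorageLevel.NONE());
    }

    // Returns {dataSetN marked?, dropResultsN marked?}.
    static boolean[] check() {
        // Assumption: a local master and a tiny in-memory data set.
        SparkConf conf = new SparkConf().setAppName("persist-target").setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Integer> dataSetN = sc.parallelize(Arrays.asList(1, 2, 3, 4));
            // s * 2 is a stand-in for standin.call(s), which the question does not show.
            JavaRDD<Integer> dropResultsN =
                dataSetN.map(s -> s * 2).persist(StorageLevel.MEMORY_ONLY());
            return new boolean[] {
                isMarkedForPersistence(dataSetN),
                isMarkedForPersistence(dropResultsN)
            };
        }
    }

    public static void main(String[] args) {
        boolean[] marked = check();
        System.out.println("dataSetN marked for persistence: " + marked[0]);
        System.out.println("dropResultsN marked for persistence: " + marked[1]);
    }
}
```

Because persist() is called on the result of map(), only the mapped RDD gets a storage level, so this should report false for dataSetN and true for dropResultsN.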
I found a good example of persist() in Learning Spark (O'Reilly):
Example 3-40, persist() in Scala. I am assuming the Java API behaves the same way.
import org.apache.spark.storage.StorageLevel
val result = input.map(x => x * x)
result.persist(StorageLevel.[<your choice>])
<blockquote>
NOTE in Learning Spark: Notice that we called persist() on the RDD before the first action. The persist() call on its own doesn't force evaluation.</blockquote>
My note: in the book's example, persist() is called on its own line, which I think is much clearer than the chained call in the code in my question.
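For what it's worth, here is a rough Java sketch of the same two-line form that also tries to show the point from the book's note: persist() only marks the RDD, and nothing is computed until the first action. The class name LazyPersistSketch, the local master, the sample data, and the accumulator used to count map() calls are my own assumptions. Spark does not guarantee exactly-once accumulator updates inside transformations, so treat the counts as indicative for a small local run only.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.util.LongAccumulator;

class LazyPersistSketch {
    // Returns the number of map() calls seen after persist(),
    // after the first action, and after the second action.
    static long[] mapCallCounts() {
        // Assumption: a local master and a tiny in-memory data set.
        SparkConf conf = new SparkConf().setAppName("lazy-persist-sketch").setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            LongAccumulator calls = sc.sc().longAccumulator("mapCalls");
            JavaRDD<Integer> input = sc.parallelize(Arrays.asList(1, 2, 3, 4));

            // Same shape as the book's example: transformation first, persist on its own line.
            JavaRDD<Integer> result = input.map(x -> {
                calls.add(1L); // count each invocation of the mapping function
                return x * x;
            });
            result.persist(StorageLevel.MEMORY_ONLY());

            // No action has run yet, so the mapping function has not been called.
            long afterPersist = calls.value();

            result.count(); // first action: computes the RDD and caches its partitions
            long afterFirstAction = calls.value();

            result.count(); // second action: expected to read the cached partitions
            long afterSecondAction = calls.value();

            return new long[] {afterPersist, afterFirstAction, afterSecondAction};
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(mapCallCounts()));
    }
}
```

In a small local run the first count should be zero, and the second and third should match, since the second count() reuses the cached data instead of calling the mapping function again. That also bears on the timing question: the cost of building the RDD only shows up when the first action runs, not at the persist() call.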