How can I access computed metrics for each fold in a CrossValidatorModel

Question:

How can I get the computed metrics for each fold from a CrossValidatorModel in spark.ml? I know I can get the average metrics using model.avgMetrics, but is it possible to get the raw results for each fold, e.g. to look at the variance of the results?

I am using Spark 2.0.0.

Answer1:

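Looking at the <a href="https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala" rel="nofollow">Spark code here</a>, the metric of each fold is added into a running total per parameter setting and is not kept separately.

To get the per-fold values, you can run the iteration over the folds yourself. This is how CrossValidator does it: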

    // Excerpt from inside CrossValidator.fit: est, eval, epm, numModels, metrics
    // and schema are defined earlier in that method.
    val splits = MLUtils.kFold(dataset.toDF.rdd, $(numFolds), $(seed))
    // K-fold loop: for each fold, one model is trained per ParamMap in the param grid (epm).
    splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
      val trainingDataset = sparkSession.createDataFrame(training, schema).cache()
      val validationDataset = sparkSession.createDataFrame(validation, schema).cache()
      val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
      trainingDataset.unpersist()
      var i = 0
      while (i < numModels) {
        val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
        logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
        // The per-fold value is only accumulated here, not stored on its own.
        metrics(i) += metric
        i += 1
      }
      validationDataset.unpersist()
    }

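This is Scala, but the idea is clear: each fold's metric is computed in the inner loop and added to metrics(i), so to keep the raw per-fold values you need to record metric yourself before it is summed.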
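As a rough sketch of running that loop yourself, the helper below repeats the same steps outside of CrossValidator and returns the metrics indexed by fold and then by parameter setting. The object name PerFoldMetrics and the method foldMetrics are hypothetical names chosen for illustration; the Spark calls (MLUtils.kFold, Estimator.fit with an array of ParamMaps, Evaluator.evaluate) are the same ones used in the excerpt above. It assumes Spark 2.x on the classpath.

```scala
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.evaluation.Evaluator
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.DataFrame

// Hypothetical helper, not part of Spark: mirrors the loop in CrossValidator.fit
// but keeps every fold's metric instead of summing them.
object PerFoldMetrics {

  /** Returns result(foldIndex)(paramIndex) = metric on that fold's validation split. */
  def foldMetrics(
      data: DataFrame,
      est: Estimator[_],
      eval: Evaluator,
      epm: Array[ParamMap],
      numFolds: Int,
      seed: Long): Array[Array[Double]] = {
    val spark = data.sparkSession
    val schema = data.schema

    MLUtils.kFold(data.rdd, numFolds, seed).map { case (training, validation) =>
      val trainingDataset = spark.createDataFrame(training, schema).cache()
      val validationDataset = spark.createDataFrame(validation, schema).cache()

      // One model per ParamMap, all trained on this fold's training split.
      val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
      trainingDataset.unpersist()

      // Evaluate each model on this fold's validation split.
      val metricsForFold = models.zip(epm).map { case (model, paramMap) =>
        eval.evaluate(model.transform(validationDataset, paramMap))
      }.toArray

      validationDataset.unpersist()
      metricsForFold
    }
  }
}
```

Passing the same estimator, evaluator, param grid, numFolds and seed that were given to the CrossValidator should reproduce the same splits, but that is worth checking against model.avgMetrics rather than taking for granted.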

Also take a look at <a href="https://stackoverflow.com/questions/38874546/spark-crossvalidatormodel-access-other-models-than-the-bestmodel" rel="nofollow">this answer</a>, which outlines how to get results per fold.
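Once the per-fold metrics are available, however they were collected, the variance the question asks about is a small computation in plain Scala. The sketch below assumes the metrics are laid out as metrics(foldIndex)(paramIndex); the object name FoldStats and its methods are hypothetical, and it uses the sample variance (dividing by n - 1), which is one reasonable choice among several.

```scala
// Hypothetical helper: summary statistics over per-fold metrics.
object FoldStats {

  def mean(xs: Array[Double]): Double = xs.sum / xs.length

  /** Sample variance (n - 1 in the denominator); needs at least two folds. */
  def sampleVariance(xs: Array[Double]): Double = {
    require(xs.length > 1, "variance needs at least two folds")
    val m = mean(xs)
    xs.map(x => (x - m) * (x - m)).sum / (xs.length - 1)
  }

  /**
   * Input is metrics(foldIndex)(paramIndex).
   * Output has one (mean, variance) pair per parameter setting.
   */
  def summarize(metrics: Array[Array[Double]]): Array[(Double, Double)] =
    metrics.transpose.map(perFold => (mean(perFold), sampleVariance(perFold)))
}
```

The mean for each parameter setting should line up with the corresponding entry of model.avgMetrics when the folds are the same, which gives a quick sanity check on the collected values.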
