
How do I use a string array as a parameter in a Scala UDF?

Question:

My Spark dataframe (created from a Hive table) looks like:

    +------+------------------------------------------------------------------------------------------+
    |racist|filtered                                                                                  |
    +------+------------------------------------------------------------------------------------------+
    |false |[rt, @dope_promo:, crew, beat, high, scores, fugly, frog, ??, https://time.com/sxp3onz1w8]|
    |false |[rt, @axolrose:, yall, call, kermit, frog, lizard?, , https://time.com/wdaeaer1ay]        |

and I am trying to remove URLs from the filtered field.

I have tried:

    val regex = "(https?\\://)\\S+".r

    def removeRegex( input: Array[String] ) : Array[String] = {
      regex.replaceAllIn(input, "")
    }

    val removeRegexUDF = udf(removeRegex)

    filteredDF.withColumn("noURL", removeRegexUDF('filtered)).show

which gives this error:

    <console>:60: error: overloaded method value replaceAllIn with alternatives:
      (target: CharSequence,replacer: scala.util.matching.Regex.Match => String)String <and>
      (target: CharSequence,replacement: String)String
     cannot be applied to (Array[String], String)
           regex.replaceAllIn(input, "")
                 ^

I am very much a newbie at Scala so any guidance you can give on how to handle the filtered array in the udf is much appreciated. (Or if there is a better way of doing this I'm happy to hear it).

Answer1:

I would not replace the URLs with empty strings but rather remove them. This UDF will do the trick:

    import org.apache.spark.sql.functions.udf

    // Keep only the elements that are not entirely a URL
    val removeRegexUDF = udf(
      (input: Seq[String]) => input.filterNot(s => s.matches("(https?\\://)\\S+"))
    )
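To see what the predicate inside this UDF does without starting Spark, the same filterNot logic can be tried on a plain Scala Seq. This is only a minimal sketch: the names UrlFilter and dropUrls are made up for illustration, and the sample tokens are copied from the first row in the question.

```scala
object UrlFilter {
  // Same pattern as in the answer. String.matches requires the whole
  // element to match, so only tokens that are entirely a URL are dropped.
  val urlPattern = "(https?\\://)\\S+"

  // Hypothetical helper: the same function body that the UDF wraps.
  def dropUrls(input: Seq[String]): Seq[String] =
    input.filterNot(s => s.matches(urlPattern))

  def main(args: Array[String]): Unit = {
    // Sample tokens taken from the first row shown in the question.
    val tokens = Seq("rt", "@dope_promo:", "frog", "??", "https://time.com/sxp3onz1w8")
    println(dropUrls(tokens))
  }
}
```

The URL element is removed outright, so the result has four elements instead of five. An element that merely contains a URL somewhere in the middle would be kept, because matches only succeeds on a full match.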

Answer2:

Yes, you can.

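First, the parameter type should be Seq (or WrappedArray) instead of Array, since that is how Spark hands an array column to a Scala UDF. Second, replaceAllIn works on a single string (a CharSequence, as the error message says), not on a collection.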

Your UDF should be:

    // regex is the val defined in the question; replaceAllIn takes one string at a time
    def removeRegex(input: Seq[String]): Array[String] = {
      input.map(x => regex.replaceAllIn(x, "")).toArray
    }

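That is, map over the elements and apply the regular expression to each one. Note that an element that is a URL becomes an empty string; it is not removed from the array.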
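The per-element mapping can be checked on a plain Scala Seq before wrapping it in a UDF. The sketch below is illustrative only: UrlReplace, blankUrls and blankAndDrop are hypothetical names, and the extra filtering step is an assumption about what you might want next, not part of the answer.

```scala
object UrlReplace {
  // Same regex as in the question.
  val regex = "(https?\\://)\\S+".r

  // Replace URLs inside each element; the sequence keeps its length.
  def blankUrls(input: Seq[String]): Seq[String] =
    input.map(x => regex.replaceAllIn(x, ""))

  // Optional follow-up (an assumption, not part of the answer):
  // drop the elements that are empty after the replacement.
  def blankAndDrop(input: Seq[String]): Seq[String] =
    blankUrls(input).filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    // Sample tokens taken from the second row shown in the question.
    val tokens = Seq("rt", "frog", "lizard?", "", "https://time.com/wdaeaer1ay")
    println(blankUrls(tokens))
    println(blankAndDrop(tokens))
  }
}
```

Unlike a full-match filter, replaceAllIn also strips a URL that is embedded in a longer element. The second helper removes every empty element, including ones that were already empty in the input, such as the blank token in the question's second row.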

You can also use the regexp_replace function from Spark functions: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$@regexp_replace(e:org.apache.spark.sql.Column,pattern:org.apache.spark.sql.Column,replacement:org.apache.spark.sql.Column):org.apache.spark.sql.Column
