How do I use a string array as a parameter in a Scala UDF?


My Spark dataframe (created from a Hive table) looks like:

    +------+------------------------------------------------------------------------------------------+
    |racist|filtered                                                                                  |
    +------+------------------------------------------------------------------------------------------+
    |false |[rt, @dope_promo:, crew, beat, high, scores, fugly, frog, ??, https://time.com/sxp3onz1w8]|
    |false |[rt, @axolrose:, yall, call, kermit, frog, lizard?, , https://time.com/wdaeaer1ay]        |

and I am trying to remove urls from the filtered field.

I have tried:

    val regex = "(https?\\://)\\S+".r

    def removeRegex(input: Array[String]): Array[String] = {
      regex.replaceAllIn(input, "")
    }

    val removeRegexUDF = udf(removeRegex)

    filteredDF.withColumn("noURL", removeRegexUDF('filtered)).show

which gives this error:

    <console>:60: error: overloaded method value replaceAllIn with alternatives:
      (target: CharSequence,replacer: scala.util.matching.Regex.Match => String)String <and>
      (target: CharSequence,replacement: String)String
     cannot be applied to (Array[String], String)
               regex.replaceAllIn(input, "")
                     ^

I am very much a newbie at Scala so any guidance you can give on how to handle the filtered array in the udf is much appreciated. (Or if there is a better way of doing this I'm happy to hear it).


I would not replace the URLs with empty strings but rather remove them. This UDF will do the trick:

    val removeRegexUDF = udf(
      (input: Seq[String]) => input.filterNot(s => s.matches("(https?\\://)\\S+"))
    )
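To see what this filter does to one row, here is a minimal sketch in plain Scala that needs no Spark. The names `UrlFilter`, `removeUrls` and `sampleRow` are made up for illustration; the pattern and the sample tokens come from the question.

```scala
object UrlFilter {
  // Same pattern as in the question. String.matches only succeeds
  // when the whole token matches, so partial matches are not removed.
  val urlPattern: String = "(https?\\://)\\S+"

  // Drop every token that is a URL; all other tokens are kept unchanged.
  def removeUrls(input: Seq[String]): Seq[String] =
    input.filterNot(_.matches(urlPattern))

  // Second sample row from the question (note the empty token before the URL).
  val sampleRow: Seq[String] = Seq(
    "rt", "@axolrose:", "yall", "call", "kermit", "frog", "lizard?", "",
    "https://time.com/wdaeaer1ay"
  )
}
```

`UrlFilter.removeUrls(UrlFilter.sampleRow)` returns the row without the URL token. In Spark the same function could be wrapped with `udf(UrlFilter.removeUrls _)`.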


Yes, you can.

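First, the parameter type should be Seq (or WrappedArray) instead of Array, because Spark passes an array column to a Scala UDF as a Seq, not as an Array. Second, replaceAllIn turns one string into another string; it does not work on a collection.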

Your UDF should be:

    def removeRegex(input: Seq[String]): Array[String] = {
      input.map(x => regex.replaceAllIn(x, "")).toArray
    }

So you map over the elements and apply the regular expression to each one.
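One consequence of mapping is worth knowing: a URL token is replaced by an empty string, so the array keeps its length. The sketch below, in plain Scala without Spark, shows that behaviour and an optional follow-up step that drops the empty tokens. The names `UrlBlanker`, `blankUrls` and `blankThenDrop` are hypothetical; the pattern is the one from the question.

```scala
object UrlBlanker {
  val regex = "(https?\\://)\\S+".r

  // Replace the URL part of each token with "", keeping one output
  // element per input element.
  def blankUrls(input: Seq[String]): Array[String] =
    input.map(x => regex.replaceAllIn(x, "")).toArray

  // Optional extra step: also drop tokens that ended up empty.
  // Note this removes tokens that were already empty in the input too.
  def blankThenDrop(input: Seq[String]): Array[String] =
    blankUrls(input).filter(_.nonEmpty)
}
```

For example, `UrlBlanker.blankUrls(Seq("frog", "https://time.com/wdaeaer1ay"))` yields two elements, the second being an empty string, while `blankThenDrop` on the same input yields only `"frog"`.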

You can also use the regexp_replace function from Spark's built-in functions (org.apache.spark.sql.functions): http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$@regexp_replace(e:org.apache.spark.sql.Column,pattern:org.apache.spark.sql.Column,replacement:org.apache.spark.sql.Column):org.apache.spark.sql.Column
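Note that regexp_replace works on a string column, while filtered is an array column, so it cannot be applied to it directly. One possible workaround, sketched here for spark-shell, is to join the array into one string, strip the URLs, and split it again. This assumes that no token contains a space; the single sample row and the simplified pattern are for illustration only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws, regexp_replace, split, trim}

// In spark-shell, getOrCreate simply returns the existing session.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("regexp-replace-sketch")
  .getOrCreate()

// One illustrative row shaped like the table in the question.
val filteredDF = spark.createDataFrame(Seq(
  (false, Seq("rt", "@axolrose:", "frog", "https://time.com/wdaeaer1ay"))
)).toDF("racist", "filtered")

// 1. concat_ws joins the tokens with single spaces.
// 2. regexp_replace deletes anything that looks like a URL.
// 3. trim removes the leftover leading/trailing space.
// 4. split on runs of spaces turns the string back into an array.
val noUrlDF = filteredDF.withColumn(
  "noURL",
  split(trim(regexp_replace(concat_ws(" ", col("filtered")), "https?://\\S+", "")), " +")
)

noUrlDF.show(false)
```

This keeps everything in built-in column expressions, which Spark can optimize, but the UDF versions above are simpler if your tokens may contain spaces.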


