
Question:
My Spark dataframe (created from a Hive table) looks like:
+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
|racist|filtered |
+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
|false |[rt, @dope_promo:, crew, beat, high, scores, fugly, frog, ??, https://time.com/sxp3onz1w8] |
|false |[rt, @axolrose:, yall, call, kermit, frog, lizard?, , https://time.com/wdaeaer1ay] |
and I am trying to remove urls from the filtered field.
I have tried:
val regex = "(https?\\://)\\S+".r
def removeRegex(input: Array[String]): Array[String] = {
  regex.replaceAllIn(input, "")
}
val removeRegexUDF = udf(removeRegex)
filteredDF.withColumn("noURL", removeRegexUDF('filtered)).show
which gives this error:
<console>:60: error: overloaded method value replaceAllIn with alternatives:
(target: CharSequence,replacer: scala.util.matching.Regex.Match => String)String <and>
(target: CharSequence,replacement: String)String
cannot be applied to (Array[String], String)
regex.replaceAllIn(input, "")
^
I am very much a newbie at Scala, so any guidance you can give on how to handle the filtered array in the UDF is much appreciated. (Or if there is a better way of doing this, I'm happy to hear it.)
Answer1: I would not replace the URLs with empty strings but rather remove them. This UDF will do the trick:
val removeRegexUDF = udf(
  (input: Seq[String]) => input.filterNot(s => s.matches("(https?\\://)\\S+"))
)
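For completeness, applying it looks just like the call from the question (a sketch assuming the same filteredDF; show(false) keeps the arrays from being truncated):

filteredDF.withColumn("noURL", removeRegexUDF('filtered)).show(false)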
Answer2: Yes, you can.
First, the parameter type should be Seq (or WrappedArray) instead of Array, since Spark hands array columns to a UDF as a WrappedArray. Second, replaceAllIn turns one string into another string, not a whole collection.
Your UDF should be:
def removeRegex(input: Seq[String]): Array[String] = {
  input.map(x => regex.replaceAllIn(x, "")).toArray
}
So map over each element, applying the regular expression to it.
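Put together, a sketch of the whole pipeline under the same assumptions as the question (a filteredDF DataFrame with an array column named filtered):

import org.apache.spark.sql.functions.{col, udf}

val regex = "(https?\\://)\\S+".r

// replace any URL inside each token with an empty string, keeping the token itself
val removeRegexUDF = udf { input: Seq[String] =>
  input.map(x => regex.replaceAllIn(x, ""))
}

filteredDF.withColumn("noURL", removeRegexUDF(col("filtered"))).show(false)

Unlike Answer1, this keeps each element (possibly as an empty string) rather than dropping it, which mirrors the replaceAllIn intent of the original code.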
You can also use the regexp_replace function from Spark's built-in functions: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$@regexp_replace(e:org.apache.spark.sql.Column,pattern:org.apache.spark.sql.Column,replacement:org.apache.spark.sql.Column):org.apache.spark.sql.Column
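regexp_replace itself operates on string columns, so for an array column one option (a sketch of mine, not part of the original answer, and requiring Spark 2.4+ for the transform higher-order function) is:

import org.apache.spark.sql.functions.expr

// apply regexp_replace to every element of the filtered array
filteredDF
  .withColumn("noURL", expr("""transform(filtered, x -> regexp_replace(x, 'https?://\\S+', ''))"""))
  .show(false)

Like the UDF above, this blanks out the URLs inside the array rather than removing the elements.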