
Question:
I am using spark 2.1 and have a dataframe column contain value like AB|12|XY|4
.
I want to create a new column by removing the last element, so it should show like AB|12|XY
.
I tried to split, rsplit did not work, so need some suggestion to get the desired output.
Answer1:Use the Spark SQL <a href="https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.split" rel="nofollow">split
function</a> as follows:
>>> from pyspark.sql.functions import split
>>> json_data = ['{"c1":"AB|12|XY|4"}','{"c1":"11|22|33|44|remove"}']
>>> df = spark.read.json(sc.parallelize(json_data))
>>> df.show()
+------------------+
| c1|
+------------------+
| AB|12|XY|4|
|11|22|33|44|remove|
+------------------+
>>> df2 = df.withColumn("c2", split(df.c1, '\|\w+$')[0]) # split takes a regex pattern
>>> df2.show()
+------------------+-----------+
| c1| c2|
+------------------+-----------+
| AB|12|XY|4| AB|12|XY|
|11|22|33|44|remove|11|22|33|44|
+------------------+-----------+
If you need to do something more complicated that can't be implemented using the built-in functions, you can define your own user-defined function (UDF):
<pre class="lang-python prettyprint-override">>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import *
>>> def my_func(str):
... return str.rsplit('|',1)[0]
...
>>> my_udf = udf(my_func, StringType())
>>> json_data = ['{"c1":"AB|12|XY|4"}','{"c1":"11|22|33|44|remove"}']
>>> df = spark.read.json(sc.parallelize(json_data))
>>> df2 = df.withColumn("c2", my_udf(df.c1))
>>> df2.show()
+------------------+-----------+
| c1| c2|
+------------------+-----------+
| AB|12|XY|4| AB|12|XY|
|11|22|33|44|remove|11|22|33|44|
+------------------+-----------+
Built-in <a href="http://www.cs.sfu.ca/CourseCentral/732/ggbaker/content/spark-sql.html#udf" rel="nofollow">SQL functions are preferred</a> (also <a href="http://www.cs.sfu.ca/CourseCentral/732/ggbaker/content/spark-sql.html#python-jvm" rel="nofollow">here</a>) because your data does not get passed back and forth between the JVM process and the Python process, which is what happens when you use a UDF.