remove last pipe-delimited value from dataframe column in pyspark

Question:

I am using Spark 2.1 and have a dataframe column that contains values like AB|12|XY|4. I want to create a new column by removing the last element, so that it shows AB|12|XY.

I tried split, and rsplit did not work, so I need some suggestions for getting the desired output.

Answer1:

Use the Spark SQL <a href="https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.split" rel="nofollow">split function</a> as follows:

<pre class="lang-python prettyprint-override">>>> from pyspark.sql.functions import split
>>> json_data = ['{"c1":"AB|12|XY|4"}','{"c1":"11|22|33|44|remove"}']
>>> df = spark.read.json(sc.parallelize(json_data))
>>> df.show()
+------------------+
|                c1|
+------------------+
|        AB|12|XY|4|
|11|22|33|44|remove|
+------------------+
>>> # split takes a regex pattern; the raw string keeps the backslashes intact
>>> df2 = df.withColumn("c2", split(df.c1, r'\|\w+$')[0])
>>> df2.show()
+------------------+-----------+
|                c1|         c2|
+------------------+-----------+
|        AB|12|XY|4|   AB|12|XY|
|11|22|33|44|remove|11|22|33|44|
+------------------+-----------+
</pre>

If you need to do something more complicated that can't be implemented using the built-in functions, you can define your own user-defined function (UDF):

<pre class="lang-python prettyprint-override">>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import StringType
>>> def my_func(s):
...     # drop everything after the last '|'; pass nulls through unchanged
...     return s.rsplit('|', 1)[0] if s is not None else None
...
>>> my_udf = udf(my_func, StringType())
>>> json_data = ['{"c1":"AB|12|XY|4"}','{"c1":"11|22|33|44|remove"}']
>>> df = spark.read.json(sc.parallelize(json_data))
>>> df2 = df.withColumn("c2", my_udf(df.c1))
>>> df2.show()
+------------------+-----------+
|                c1|         c2|
+------------------+-----------+
|        AB|12|XY|4|   AB|12|XY|
|11|22|33|44|remove|11|22|33|44|
+------------------+-----------+
</pre>

Built-in <a href="http://www.cs.sfu.ca/CourseCentral/732/ggbaker/content/spark-sql.html#udf" rel="nofollow">SQL functions are preferred</a> (also <a href="http://www.cs.sfu.ca/CourseCentral/732/ggbaker/content/spark-sql.html#python-jvm" rel="nofollow">here</a>) because your data does not get passed back and forth between the JVM process and the Python process, which is what happens when you use a UDF.
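
One caveat about the regex passed to split in the first snippet: `\|\w+$` only matches when the last element consists of word characters, so a value ending in something like `4.5`, or in an empty element, is left unchanged. The sketch below checks this with Python's `re` module rather than Spark, on the assumption that Python and Java regexes agree for patterns this simple. The alternative pattern `\|[^|]*$` and the helper name `drop_last` are not from the answer; they are illustrative only.

```python
import re

# Pattern from the answer: drops a trailing "|<word characters>" only.
WORD_TAIL = re.compile(r'\|\w+$')
# Assumed alternative (not from the answer): drops everything from the last pipe onward.
ANY_TAIL = re.compile(r'\|[^|]*$')


def drop_last(value, pattern):
    """Remove the trailing pipe-delimited element matched by `pattern`.

    If the pattern does not match, the value is returned unchanged.
    """
    return pattern.sub('', value)


samples = ["AB|12|XY|4", "11|22|33|44|remove", "AB|12|XY|4.5", "AB|12|XY|"]
for s in samples:
    # Print the result of each pattern side by side for comparison.
    print(s, '->', drop_last(s, WORD_TAIL), '/', drop_last(s, ANY_TAIL))
```

If the broader pattern suits your data, it should also work with the built-in `regexp_replace(df.c1, r'\|[^|]*$', '')` from `pyspark.sql.functions`, which keeps the work in the JVM; that call has not been run against Spark 2.1 here.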
