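Select the string before/after a specific character ("-") in PySpark

Time: 2019-04-21 18:26:26

Tags: pyspark

I use substring to get the first and last values. But how do I find a specific character in a string and get the values before and after it?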

1 answer:

Answer 0 (score: 1)

Try these. They sound like what you are looking for.

Reference documentation:

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.substring_index
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.split

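from pyspark.sql import SparkSession
from pyspark.sql.functions import substring_index

spark = SparkSession.builder.getOrCreate()  # reuses the active session if there is one
df = spark.createDataFrame([('hello-there',)], ['text'])

# count=1: everything before the first '-'
df.select(substring_index(df.text, '-', 1).alias('left')).show()
# count=-1: everything after the last '-'
df.select(substring_index(df.text, '-', -1).alias('right')).show()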

+-----+
| left|
+-----+
|hello|
+-----+

+-----+
|right|
+-----+
|there|
+-----+
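
When the text contains more than one delimiter, the count argument decides which piece comes back: a positive count keeps everything to the left of the count-th delimiter (counting from the left), and a negative count keeps everything to the right of the count-th delimiter (counting from the right). The plain-Python function below is only a rough model of that documented behaviour, handy for checking expectations without a Spark session. The name substring_index_model is made up for illustration and this is not Spark's implementation.

```python
def substring_index_model(text, delim, count):
    """Rough plain-Python model of substring_index (hypothetical helper, not Spark code)."""
    # An empty delimiter or a count of 0 yields an empty string.
    if count == 0 or delim == "":
        return ""
    parts = text.split(delim)
    if count > 0:
        # Keep the first `count` pieces, i.e. everything left of the count-th delimiter.
        return delim.join(parts[:count])
    # Negative count: keep the last `abs(count)` pieces.
    return delim.join(parts[count:])


print(substring_index_model("hello-there", "-", 1))   # hello
print(substring_index_model("hello-there", "-", -1))  # there
print(substring_index_model("a-b-c", "-", 2))         # a-b
print(substring_index_model("a-b-c", "-", -1))        # c
```

If the delimiter does not occur at all, the model returns the whole string for any non-zero count.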

from pyspark.sql.functions import split

# split returns an array column; its second argument is a regular-expression pattern
split_df = df.select(split(df.text, '-').alias('split_text'))
split_df.selectExpr("split_text[0] as left").show() # left of delim
split_df.selectExpr("split_text[1] as right").show() # right of delim

+-----+
| left|
+-----+
|hello|
+-----+

+-----+
|right|
+-----+
|there|
+-----+
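
The two approaches agree only when the text contains exactly one delimiter. With a value such as 'a-b-c', element [1] of the split array is the middle piece, whereas substring_index with -1 returns the piece after the last delimiter. The plain-Python sketch below illustrates the difference. Here str.split is only a stand-in for Spark's split, which is a fair comparison for a literal '-' but not for patterns containing regex metacharacters.

```python
text = "a-b-c"
parts = text.split("-")               # stand-in for split(df.text, '-')

second_piece = parts[1]               # what split_text[1] would select
after_last = parts[-1]                # what substring_index(text, '-', -1) would select
after_first = text.partition("-")[2]  # everything after the first '-'

print(second_piece)  # b
print(after_last)    # c
print(after_first)   # b-c
```

To get everything after the first delimiter in Spark itself, one option is the optional limit argument that split accepts in Spark 3.0 and later (for example a limit of 2), assuming that version is available.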

from pyspark.sql.functions import substring_index, substring, concat, col, lit

df = spark.createDataFrame([('will-smith',)], ['text'])

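# Split the text at the delimiter
df = df\
    .withColumn("left", substring_index(df.text, '-', 1))\
    .withColumn("right", substring_index(df.text, '-', -1))

# Last two characters of the left part, first two characters of the right part
df = df\
    .withColumn("left_sub", substring(df.left, -2, 2))\
    .withColumn("right_sub", substring(df.right, 1, 2))

# Join the two pieces back together around the delimiter
df = df\
    .withColumn("concat_sub", concat(col("left_sub"), lit("-"), col("right_sub")))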

df.show()

+----------+----+-----+--------+---------+----------+
|      text|left|right|left_sub|right_sub|concat_sub|
+----------+----+-----+--------+---------+----------+
|will-smith|will|smith|      ll|       sm|     ll-sm|
+----------+----+-----+--------+---------+----------+
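
For checking the expected value of concat_sub on a few sample strings without starting Spark, the same steps can be mirrored in plain Python. This is a minimal sketch: the function name around_delim is hypothetical, and it only models the case shown above where both parts have at least two characters. Note that substring uses 1-based positions, so substring(col, 1, 2) corresponds to the slice [:2] and substring(col, -2, 2) to the slice [-2:].

```python
def around_delim(text, delim="-", width=2):
    """Plain-Python mirror of the left_sub/right_sub/concat_sub steps (hypothetical helper)."""
    pieces = text.split(delim)
    left = pieces[0]     # like substring_index(text, delim, 1)
    right = pieces[-1]   # like substring_index(text, delim, -1)
    left_sub = left[-width:]    # last `width` characters of the left part
    right_sub = right[:width]   # first `width` characters of the right part
    return left_sub + delim + right_sub


print(around_delim("will-smith"))  # ll-sm
```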