Question

I have a dataframe df that contains a list of strings like this:

+-------------+
|     Products|
+-------------+
|     Z9L57.W3|
|     H9L23.05|
|     PRL57.AF|
+-------------+

I would like to truncate each string after the '.' character, so that the result looks like this:

+--------------+
|Products_trunc|
+--------------+
|     Z9L57    |
|     H9L23    |
|     PRL57    |
+--------------+

I tried using the split function, but it only works on a single string, not on a list. I also tried

df['Products_trunc'] = df['Products'].str.split('.').str[0]

but I get the following error:

TypeError: 'Column' object is not callable

Does anyone have any insight into this? Thank you.

Answer 1

Your code looks like you are used to pandas. Truncating a string column works a bit differently in pyspark. Have a look below:

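from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Reuse the active session (e.g. in the pyspark shell) or create one.
spark = SparkSession.builder.getOrCreate()

# Each row is a one-element tuple.
rows = [
    ('Z9L57.W3',),
    ('H9L23.05',),
    ('PRL57.AF',),
]

columns = ['Products']

df = spark.createDataFrame(rows, columns)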

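With the withColumn function you can modify an existing column or create a new one. The function takes 2 parameters: the column name and a column expression. If the column name already exists, that column is modified.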

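# The pattern is a regular expression, so the dot is escaped (raw string avoids an invalid Python escape).
df = df.withColumn('Products', F.split(df.Products, r'\.').getItem(0))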
df.show()

Output:

+--------+
|Products|
+--------+
|   Z9L57|
|   H9L23|
|   PRL57|
+--------+
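The pattern passed to F.split is treated as a regular expression, which is why the dot is escaped. The sketch below illustrates the difference with Python's own re module as a stand-in; Spark uses a different regex engine, so the exact result of an unescaped pattern there may differ in detail, but in both cases an unescaped dot matches any character rather than a literal '.'. The variable names are illustrative only.

```python
import re

product = "Z9L57.W3"

# Unescaped: "." matches every character, so nothing useful is left over.
unescaped = re.split(".", product)

# Escaped: only the literal dot is used as the separator.
escaped = re.split(r"\.", product)

# The first piece is the part before the dot.
prefix = escaped[0]

print(unescaped)
print(escaped)
print(prefix)
```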

If you pick a column name that does not exist yet, a new column is created (starting again from the original df):

df = df.withColumn('Products_trunc', F.split(df.Products, r'\.').getItem(0))
df.show()

Output:

+--------+--------------+ 
|Products|Products_trunc| 
+--------+--------------+ 
|Z9L57.W3|         Z9L57| 
|H9L23.05|         H9L23| 
|PRL57.AF|         PRL57| 
+--------+--------------+

Truncate all strings in a dataframe column after a specific character using pyspark
