Question

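I am new to PySpark and want to dynamically replace the names in a PySpark DataFrame column with numbers, because my DataFrame contains more than 5 million names. How should I go about it?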

----------
| Name   |
----------
| nameone|
----------
| nametwo|
----------

should become

--------
| Name |
--------
|   1  |
--------
|   2  |
--------

Answer 1

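I can think of two options. If every name is unique, you can simply apply the monotonically_increasing_id function. It creates an ID for each row that is unique and increasing, but not consecutive.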

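import pyspark.sql.functions as F
from pyspark.ml.feature import StringIndexer
from pyspark.sql import SparkSession

# Reuse the active session (already available as `spark` in the pyspark shell)
spark = SparkSession.builder.getOrCreate()

l = [
    ('nameone', ),
    ('nametwo', ),
    ('nameone', )
]

columns = ['Name']

df = spark.createDataFrame(l, columns)
# Use 'Name' instead of 'uniqueId' to overwrite the column
df = df.withColumn('uniqueId', F.monotonically_increasing_id())
df.show()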

Output:

+-------+----------+ 
|   Name|  uniqueId| 
+-------+----------+ 
|nameone|         0| 
|nametwo|8589934592| 
|nameone|8589934593| 
+-------+----------+
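
The jump from 0 to 8589934592 follows from how the Spark documentation describes the current implementation of monotonically_increasing_id: the partition ID goes in the upper 31 bits and the record number within the partition in the lower 33 bits. A small pure-Python sketch of that layout (the helper names compose_id and decompose_id are made up for illustration and are not part of PySpark):

```python
# Bit layout documented for monotonically_increasing_id:
# upper 31 bits = partition ID, lower 33 bits = record number in that partition.
RECORD_BITS = 33


def compose_id(partition_id, record_number):
    """Hypothetical helper: build an ID from a partition ID and a record number."""
    return (partition_id << RECORD_BITS) | record_number


def decompose_id(value):
    """Hypothetical helper: split an ID into (partition_id, record_number)."""
    return value >> RECORD_BITS, value & ((1 << RECORD_BITS) - 1)


# The three IDs from the output above
for value in (0, 8589934592, 8589934593):
    print(value, decompose_id(value))
```

Read this way, the first row appears to come from partition 0 and the other two from partition 1, which is why the IDs are unique but not consecutive. The exact values depend on how the data happens to be partitioned.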

If you want to assign the same ID to rows that have the same value in Name, you have to use StringIndexer:

indexer = StringIndexer(inputCol="Name", outputCol="StringIndex")
df = indexer.fit(df).transform(df)
df.show()

Output:

+-------+----------+-----------+
|   Name|  uniqueId|StringIndex|
+-------+----------+-----------+
|nameone|         0|        0.0|
|nametwo|8589934592|        1.0|
|nameone|8589934593|        0.0|
+-------+----------+-----------+
