Create a new column based on existing column values in PySpark

Asked: 2020-11-02 17:55:13

Tags: pyspark apache-spark-sql pyspark-dataframes

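I have a DataFrame with an existing column that contains airport names, and I want to create another column with their abbreviations.

For example, the existing column has the following values: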

SEATTLE TACOMA AIRPORT, WA US
MIAMI INTERNATIONAL AIRPORT, FL US
SAN FRANCISCO INTERNATIONAL AIRPORT, CA US
MIAMI INTERNATIONAL AIRPORT, FL US
MIAMI INTERNATIONAL AIRPORT, FL US
SAN FRANCISCO INTERNATIONAL AIRPORT, CA US
SEATTLE TACOMA AIRPORT, WA US

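I want to create a new column with the corresponding abbreviations, such as SEA, MIA, and SFO. I thought I could do this with a for loop, but I am not sure how exactly to code it.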

2 answers:

Answer 0 (score: 1)

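Here are 2 example approaches:

  1. Use a dictionary and a UDF
  2. Join with a second DataFrame

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# In the pyspark shell `spark` and `sc` already exist; create them otherwise.
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext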

s = """\
SEATTLE TACOMA AIRPORT, WA US
MIAMI INTERNATIONAL AIRPORT, FL US
SAN FRANCISCO INTERNATIONAL AIRPORT, CA US
MIAMI INTERNATIONAL AIRPORT, FL US
MIAMI INTERNATIONAL AIRPORT, FL US
SAN FRANCISCO INTERNATIONAL AIRPORT, CA US
SEATTLE TACOMA AIRPORT, WA US"""

abbr = {
    "SEATTLE TACOMA AIRPORT": "SEA",
    "MIAMI INTERNATIONAL AIRPORT": "MIA",
    "SAN FRANCISCO INTERNATIONAL AIRPORT": "SFO",
}

df = spark.read.csv(sc.parallelize(s.splitlines()))

print("=== df ===")
df.show()

# =================================
#  1. using a UDF
# =================================
print("=== using a UDF ===")
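# .get() returns None (null in Spark) for airports missing from the dict
# instead of raising KeyError inside the UDF
udf_airport_to_abbr = udf(lambda airport: abbr.get(airport), StringType())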
df.withColumn("abbr", udf_airport_to_abbr("_c0")).show()

# =================================
#  2. using a join
# =================================
# you may want to create this df in some different way ;)
df_abbrs = spark.read.csv(sc.parallelize(["%s,%s" % x for x in abbr.items()]))
print("=== df_abbrs ===")
df_abbrs.show()
print("=== using a join ===")
df.join(df_abbrs, on="_c0").show()

Output:

=== df ===
+--------------------+------+
|                 _c0|   _c1|
+--------------------+------+
|SEATTLE TACOMA AI...| WA US|
|MIAMI INTERNATION...| FL US|
|SAN FRANCISCO INT...| CA US|
|MIAMI INTERNATION...| FL US|
|MIAMI INTERNATION...| FL US|
|SAN FRANCISCO INT...| CA US|
|SEATTLE TACOMA AI...| WA US|
+--------------------+------+

=== using a UDF ===
+--------------------+------+----+
|                 _c0|   _c1|abbr|
+--------------------+------+----+
|SEATTLE TACOMA AI...| WA US| SEA|
|MIAMI INTERNATION...| FL US| MIA|
|SAN FRANCISCO INT...| CA US| SFO|
|MIAMI INTERNATION...| FL US| MIA|
|MIAMI INTERNATION...| FL US| MIA|
|SAN FRANCISCO INT...| CA US| SFO|
|SEATTLE TACOMA AI...| WA US| SEA|
+--------------------+------+----+

=== df_abbrs ===
+--------------------+---+
|                 _c0|_c1|
+--------------------+---+
|SEATTLE TACOMA AI...|SEA|
|MIAMI INTERNATION...|MIA|
|SAN FRANCISCO INT...|SFO|
+--------------------+---+

=== using a join ===
+--------------------+------+---+
|                 _c0|   _c1|_c1|
+--------------------+------+---+
|SEATTLE TACOMA AI...| WA US|SEA|
|SEATTLE TACOMA AI...| WA US|SEA|
|SAN FRANCISCO INT...| CA US|SFO|
|SAN FRANCISCO INT...| CA US|SFO|
|MIAMI INTERNATION...| FL US|MIA|
|MIAMI INTERNATION...| FL US|MIA|
|MIAMI INTERNATION...| FL US|MIA|
+--------------------+------+---+

Answer 1 (score: 0)

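You can add a new column to the DataFrame, which produces a new DataFrame. You can use dataframe.withColumn(newcolumnname, <a case expression that decodes the name into its abbreviation>).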