Casting a string column to int yields null

Asked: 2017-03-10 02:47:50

Tags: apache-spark pyspark

I have a Spark DataFrame, results, with two string columns that I want to cast to numeric:

>>> results.show()
+--------------------+-----------------+------------------------+
|       Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+--------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC...|             "43"|                    "20"|
|"BAYLOR MEDICAL C...|             "32"|                    "20"|
|"GOOD SHEPHERD ME...|             "25"|                    "20"|
|"GOOD SHEPHERD ME...|             "25"|                    "20"|
|"MASONIC HOME AND...|  "Not Available"|         "Not Available"|
|"ST HELENA HOSPITAL"|             "41"|                    "20"|
|   "TOURO INFIRMARY"|             "15"|                    "18"|
|"WAHIAWA GENERAL ...|             "17"|                    "10"|
|"ANNA JAQUES HOSP...|             "27"|                    "18"|
|    "CMC-BLUE RIDGE"|             "31"|                    "18"|
|"EVANSTON REGIONA...|             "15"|                    "15"|
|"OKLAHOMA SPINE H...|             "79"|                    "20"|
|"PICKENS COUNTY M...|  "Not Available"|         "Not Available"|
|"PORTNEUF MEDICAL...|             "11"|                    "17"|
|"PRESENCE SAINT J...|             "20"|                    "17"|
|"RIVERSIDE MEDICA...|             "39"|                    "20"|
|"RIVERSIDE MEDICA...|             "39"|                    "20"|
|"RIVERSIDE MEDICA...|             "39"|                    "20"|
|"SOUTH GEORGIA ME...|    "3 out of 10"|                    "24"|
|"TAMPA GENERAL HO...|             "23"|                    "16"|
+--------------------+-----------------+------------------------+

Trying the following gives a table of nulls:

>>> from pyspark.sql.types import IntegerType
>>> results2 = results.select( results["Hospital Name"], results["HCAHPS Base Score"].cast(IntegerType()).alias("HCAHPS Base Score"), results["HCAHPS Consistency Score"].cast(IntegerType()).alias("HCAHPS Consistency Score") )
>>> results2.show()
+--------------------+-----------------+------------------------+
|       Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+--------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC...|             null|                    null|
|"BAYLOR MEDICAL C...|             null|                    null|
|"GOOD SHEPHERD ME...|             null|                    null|
|"GOOD SHEPHERD ME...|             null|                    null|
|"MASONIC HOME AND...|             null|                    null|
|"ST HELENA HOSPITAL"|             null|                    null|
|   "TOURO INFIRMARY"|             null|                    null|
|"WAHIAWA GENERAL ...|             null|                    null|
|"ANNA JAQUES HOSP...|             null|                    null|
|    "CMC-BLUE RIDGE"|             null|                    null|
|"EVANSTON REGIONA...|             null|                    null|
|"OKLAHOMA SPINE H...|             null|                    null|
|"PICKENS COUNTY M...|             null|                    null|
|"PORTNEUF MEDICAL...|             null|                    null|
|"PRESENCE SAINT J...|             null|                    null|
|"RIVERSIDE MEDICA...|             null|                    null|
|"RIVERSIDE MEDICA...|             null|                    null|
|"RIVERSIDE MEDICA...|             null|                    null|
|"SOUTH GEORGIA ME...|             null|                    null|
|"TAMPA GENERAL HO...|             null|                    null|
+--------------------+-----------------+------------------------+

only showing top 20 rows

Is it not possible to cast a string column to an integer in PySpark?

1 Answer:

Answer 0 (score: 6)

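The casts return null because the values contain literal double quotes, so a value such as "43" (quotes included) is not a valid integer string. Strip the double quotes first, and then you should be able to cast to IntegerType. You can do that with the following udf.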

>>> def stripDQ(string):
...  return string.replace('"', "")
... 
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import StringType, IntegerType
>>> udf_stripDQ = udf(stripDQ, StringType())
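One caveat with the function above: Spark hands null cells to a Python udf as None, so string.replace would raise an AttributeError if a column ever contains a null. A minimal None-safe sketch follows; the name strip_dq is hypothetical and not part of the original answer.

```python
def strip_dq(value):
    """Remove double quotes from a string; pass None (a null cell) through unchanged."""
    if value is None:
        return None
    return value.replace('"', "")
```

It can be wrapped the same way as before, for example udf(strip_dq, StringType()).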

We will use it below.

Your actual DataFrame:

>>> results.show()
+------------------+-----------------+------------------------+
|     Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC"|             "43"|                    "20"|
|"BAYLOR MEDICAL C"|             "32"|                    "20"|
|"GOOD SHEPHERD ME"|             "25"|                    "20"|
|"GOOD SHEPHERD ME"|             "25"|                    "20"|
|"MASONIC HOME AND"|  "Not Available"|         "Not Available"|
+------------------+-----------------+------------------------+

Now we use the udf to remove the double quotes from both columns.

>>> results1 = results.withColumn("HCAHPS Base Score", udf_stripDQ(results["HCAHPS Base Score"]) ).withColumn("HCAHPS Consistency Score", udf_stripDQ(results["HCAHPS Consistency Score"]) )
>>> results1.show()
+------------------+-----------------+------------------------+
|     Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC"|               43|                      20|
|"BAYLOR MEDICAL C"|               32|                      20|
|"GOOD SHEPHERD ME"|               25|                      20|
|"GOOD SHEPHERD ME"|               25|                      20|
|"MASONIC HOME AND"|    Not Available|           Not Available|
+------------------+-----------------+------------------------+

Now cast to integer. Values that are not numeric, such as Not Available, become null:

>>> results2 = results1.select( results1["Hospital Name"], results1["HCAHPS Base Score"].cast(IntegerType()).alias("HCAHPS Base Score"), results1["HCAHPS Consistency Score"].cast(IntegerType()).alias("HPS Consistency Score") )
>>> results2.show()
+------------------+-----------------+---------------------+
|     Hospital Name|HCAHPS Base Score|HPS Consistency Score|
+------------------+-----------------+---------------------+
|"ADIRONDACK MEDIC"|               43|                   20|
|"BAYLOR MEDICAL C"|               32|                   20|
|"GOOD SHEPHERD ME"|               25|                   20|
|"GOOD SHEPHERD ME"|               25|                   20|
|"MASONIC HOME AND"|             null|                 null|
+------------------+-----------------+---------------------+