I have a Spark DataFrame, results, with two string columns that I want to convert to numbers:
>>> results.show()
+--------------------+-----------------+------------------------+
| Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+--------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC...| "43"| "20"|
|"BAYLOR MEDICAL C...| "32"| "20"|
|"GOOD SHEPHERD ME...| "25"| "20"|
|"GOOD SHEPHERD ME...| "25"| "20"|
|"MASONIC HOME AND...| "Not Available"| "Not Available"|
|"ST HELENA HOSPITAL"| "41"| "20"|
| "TOURO INFIRMARY"| "15"| "18"|
|"WAHIAWA GENERAL ...| "17"| "10"|
|"ANNA JAQUES HOSP...| "27"| "18"|
| "CMC-BLUE RIDGE"| "31"| "18"|
|"EVANSTON REGIONA...| "15"| "15"|
|"OKLAHOMA SPINE H...| "79"| "20"|
|"PICKENS COUNTY M...| "Not Available"| "Not Available"|
|"PORTNEUF MEDICAL...| "11"| "17"|
|"PRESENCE SAINT J...| "20"| "17"|
|"RIVERSIDE MEDICA...| "39"| "20"|
|"RIVERSIDE MEDICA...| "39"| "20"|
|"RIVERSIDE MEDICA...| "39"| "20"|
|"SOUTH GEORGIA ME...| "3 out of 10"| "24"|
|"TAMPA GENERAL HO...| "23"| "16"|
+--------------------+-----------------+------------------------+
Trying the following gives me a table full of nulls:
>>> results2 = results.select( results["Hospital Name"], results["HCAHPS Base Score"].cast(IntegerType()).alias("HCAHPS Base Score"), results["HCAHPS Consistency Score"].cast(IntegerType()).alias("HCAHPS Consistency Score") )
>>> results2.show()
+--------------------+-----------------+------------------------+
| Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+--------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC...| null| null|
|"BAYLOR MEDICAL C...| null| null|
|"GOOD SHEPHERD ME...| null| null|
|"GOOD SHEPHERD ME...| null| null|
|"MASONIC HOME AND...| null| null|
|"ST HELENA HOSPITAL"| null| null|
| "TOURO INFIRMARY"| null| null|
|"WAHIAWA GENERAL ...| null| null|
|"ANNA JAQUES HOSP...| null| null|
| "CMC-BLUE RIDGE"| null| null|
|"EVANSTON REGIONA...| null| null|
|"OKLAHOMA SPINE H...| null| null|
|"PICKENS COUNTY M...| null| null|
|"PORTNEUF MEDICAL...| null| null|
|"PRESENCE SAINT J...| null| null|
|"RIVERSIDE MEDICA...| null| null|
|"RIVERSIDE MEDICA...| null| null|
|"RIVERSIDE MEDICA...| null| null|
|"SOUTH GEORGIA ME...| null| null|
|"TAMPA GENERAL HO...| null| null|
+--------------------+-----------------+------------------------+
only showing top 20 rows
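Is it not possible to convert a string column to an integer in PySpark?
Answer 0 (score: 6)
The values contain literal double-quote characters, which is why the cast yields null. Strip the double quotes first, and then you should be able to cast to IntegerType. You can do the stripping with the following UDF.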
>>> def stripDQ(string):
...     return string.replace('"', "")
...
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import StringType, IntegerType
>>> udf_stripDQ = udf(stripDQ, StringType())
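One caveat, assuming the columns can contain nulls: Spark hands a null cell to a Python UDF as None, so stripDQ as written would raise AttributeError on such rows. A null-safe variant might look like the sketch below; strip_dq is a name made up here, not something from the question.

```python
def strip_dq(value):
    """Remove double quotes from a string; pass None (a null cell) through unchanged."""
    if value is None:
        return None
    return value.replace('"', "")
```

It can be wrapped the same way as above, with udf(strip_dq, StringType()).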
Let's put udf_stripDQ to use.
Your actual DataFrame:
>>> results.show()
+------------------+-----------------+------------------------+
| Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC"| "43"| "20"|
|"BAYLOR MEDICAL C"| "32"| "20"|
|"GOOD SHEPHERD ME"| "25"| "20"|
|"GOOD SHEPHERD ME"| "25"| "20"|
|"MASONIC HOME AND"| "Not Available"| "Not Available"|
+------------------+-----------------+------------------------+
Now we use our UDF to remove the double quotes from both columns.
>>> results1 = results.withColumn("HCAHPS Base Score", udf_stripDQ(results["HCAHPS Base Score"]) ).withColumn("HCAHPS Consistency Score", udf_stripDQ(results["HCAHPS Consistency Score"]) )
>>> results1.show()
+------------------+-----------------+------------------------+
| Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC"| 43| 20|
|"BAYLOR MEDICAL C"| 32| 20|
|"GOOD SHEPHERD ME"| 25| 20|
|"GOOD SHEPHERD ME"| 25| 20|
|"MASONIC HOME AND"| Not Available| Not Available|
+------------------+-----------------+------------------------+
Now cast to integers:
>>> results2 = results1.select( results1["Hospital Name"], results1["HCAHPS Base Score"].cast(IntegerType()).alias("HCAHPS Base Score"), results1["HCAHPS Consistency Score"].cast(IntegerType()).alias("HCAHPS Consistency Score") )
>>> results2.show()
+------------------+-----------------+------------------------+
| Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC"| 43| 20|
|"BAYLOR MEDICAL C"| 32| 20|
|"GOOD SHEPHERD ME"| 25| 20|
|"GOOD SHEPHERD ME"| 25| 20|
|"MASONIC HOME AND"| null| null|
+------------------+-----------------+------------------------+
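As an alternative to the UDF, the same strip-then-cast step can be sketched with Spark's built-in regexp_replace SQL function through selectExpr, which keeps the work inside Spark instead of sending rows to a Python worker. The helper names below (strip_and_cast_exprs, cast_quoted_columns) are made up for illustration, and the sketch assumes that column names contain no backticks.

```python
# Spark SQL template: remove every '"' from the column, cast to INT, keep the name.
STRIP_AND_CAST = """CAST(regexp_replace({col}, '"', '') AS INT) AS {col}"""


def strip_and_cast_exprs(all_columns, int_columns):
    """Build one select expression per column, in the original column order."""
    exprs = []
    for name in all_columns:
        ident = "`" + name + "`"  # backticks let Spark SQL accept names with spaces
        if name in int_columns:
            exprs.append(STRIP_AND_CAST.format(col=ident))
        else:
            exprs.append(ident)  # untouched columns are selected as-is
    return exprs


def cast_quoted_columns(df, int_columns):
    """Apply the expressions; df is assumed to be a pyspark.sql.DataFrame."""
    return df.selectExpr(*strip_and_cast_exprs(df.columns, set(int_columns)))
```

With the DataFrame from the question, cast_quoted_columns(results, ["HCAHPS Base Score", "HCAHPS Consistency Score"]) should give the same result as the UDF route. Values that are still not integers after stripping, such as Not Available or 3 out of 10, come back as null when ANSI mode is off; with spark.sql.ansi.enabled set to true the cast raises an error instead.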