I am reading a CSV file with Pandas into a two-column DataFrame and then trying to convert it to a Spark DataFrame. My code is:
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)
sdf = sqlCtx.createDataFrame(df)
The DataFrame:
print(df)
gives:
Name Category
0 EDSJOBLIST apply at www.edsjoblist.com ['biotechnology', 'clinical', 'diagnostic', 'd...
1 Power Direct Marketing ['advertising', 'analytics', 'brand positionin...
2 CHA Hollywood Medical Center, L.P. ['general medical and surgical hospital', 'hea...
3 JING JING GOURMET [nan]
4 TRUE LIFE KINGDOM MINISTRIES ['religious organization']
5 fasterproms ['microsoft .net']
6 STEREO ZONE ['accessory', 'audio', 'car audio', 'chrome', ...
7 SAN FRANCISCO NEUROLOGICAL SOCIETY [nan]
8 Fl Advisors ['comprehensive financial planning', 'financia...
9 Fortunatus LLC ['bottle', 'bottling', 'charitable', 'dna', 'f...
10 TREADS LLC ['retail', 'wholesaling']
Can someone help me with this?
Answer (score: 1)
Spark can have trouble with the pandas object dtype. A possible workaround is to convert everything to strings first:
sdf = sqlCtx.createDataFrame(df.astype(str))
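If you want to confirm that the columns really are object dtype before converting, a quick check (assuming df is the two-column frame from the question, so both columns will likely show as object) is:

# Inspect the pandas dtypes; string- and list-valued columns show up as "object"
print(df.dtypes)
# Expected output (an assumption based on the data shown above):
# Name        object
# Category    object
# dtype: object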
One consequence of astype(str) is that everything, including nan, is converted to a string. You will need to take care to handle these conversions and cast the columns back to their appropriate types afterwards.
For example, if you have a column "colA" that should hold float values, you can use something like the following to turn the string "nan" into null and cast the remaining values back to float:
from pyspark.sql.functions import col, when
sdf = sdf.withColumn("colA", when(col("colA") != "nan", col("colA").cast("float")))