I have a Pandas df containing the following fields:
survey_id gender
0 000de5b8-651b-4a47-961c-548d80c84df2 Male
1 003e38f2-4196-4b10-9637-46774b21bcda Male
2 00595d21-4c1a-469a-9724-804b7883822b Female
3 0095bcd8-1184-41bc-8742-d14c3e3642c4 Male
4 00986df4-717c-4a6f-818b-a539a0b45fb2 Female
where survey_id is just a uuid. Below is the snippet that creates the Spark DataFrame (in PySpark) from the pandas version:
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
StructField("survey_id", StringType()),
StructField("gender", StringType()),
])
res = sql(survey_demos)[['survey_id', 'gender']]
self._survey_demographics = self._spark.createDataFrame(res, schema)
logging.info(self._survey_demographics.head(5))
logging.info(type(self._survey_demographics))
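For context, here is a quick stdlib-only check (hypothetical cell value, taken from the first row above) of what I suspect the pandas column actually holds: many database drivers return uuid.UUID objects rather than strings, and a uuid.UUID is not a str even though it prints like one:

```python
import uuid

# Hypothetical: what one cell of the pandas survey_id column may contain.
cell = uuid.UUID("000de5b8-651b-4a47-961c-548d80c84df2")

print(type(cell))    # <class 'uuid.UUID'>
print(cell.int)      # the big integer, like the __class__=uuid.UUID, int=... in my log
print(str(cell))     # 000de5b8-651b-4a47-961c-548d80c84df2
```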
sql(survey_demos) queries the database and returns a Pandas df. Now, look at the logging output:
2016-08-18 09:39:11,795 root INFO [Row(survey_id='{__class__=uuid.UUID, int=72159140233069381076719472719973874}', gender='Male'), Row(survey_id='{__class__=uuid.UUID, int=323077413680511253270271409764023514}', gender='Male'), Row(survey_id='{__class__=uuid.UUID, int=464003322584728470087346523203666475}', gender='Female'), Row(survey_id='{__class__=uuid.UUID, int=777482443631412718495518514480562884}', gender='Male'), Row(survey_id='{__class__=uuid.UUID, int=791459271937809735598486155808759730}', gender='Female')]
2016-08-18 09:39:11,795 root INFO <class 'pyspark.sql.dataframe.DataFrame'>

What is Spark doing here, and why isn't it just giving me a string, as I specified in the schema?
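A workaround sketch I am considering (assuming res is the pandas frame returned by sql(survey_demos), and using made-up data here): cast the uuid column to plain strings before calling createDataFrame, so the StringType field receives str values instead of pickled UUID objects:

```python
import uuid
import pandas as pd

# Toy stand-in for the frame sql(survey_demos) returns (hypothetical data).
res = pd.DataFrame({
    "survey_id": [uuid.UUID("000de5b8-651b-4a47-961c-548d80c84df2")],
    "gender": ["Male"],
})

# Cast uuid.UUID objects to plain strings before handing the frame to Spark,
# so the StringType field in the schema sees str values.
res["survey_id"] = res["survey_id"].astype(str)

print(res["survey_id"].iloc[0])   # 000de5b8-651b-4a47-961c-548d80c84df2
```

I have not confirmed this is the intended way to handle it, but astype(str) calls str() on each cell, which for a uuid.UUID yields the canonical hyphenated form.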