Pandas to Spark DataFrame conversion turns uuid strings into integers

Time: 2016-08-18 16:49:50

Tags: pandas apache-spark dataframe pyspark

I have a Pandas df with the following fields:

                                  survey_id  gender
    0  000de5b8-651b-4a47-961c-548d80c84df2    Male
    1  003e38f2-4196-4b10-9637-46774b21bcda    Male
    2  00595d21-4c1a-469a-9724-804b7883822b  Female
    3  0095bcd8-1184-41bc-8742-d14c3e3642c4    Male
    4  00986df4-717c-4a6f-818b-a539a0b45fb2  Female

survey_id is just a uuid. Here is the code snippet that creates the Spark DataFrame (in PySpark) from the pandas version:

    # Imports needed by the schema definition
    from pyspark.sql.types import StructType, StructField, StringType

    # Both columns are declared as strings
    schema = StructType([
        StructField("survey_id", StringType()),
        StructField("gender", StringType()),
    ])
    # sql(...) returns a Pandas df; keep only the two columns of interest
    res = sql(survey_demos)[['survey_id', 'gender']]
    self._survey_demographics = self._spark.createDataFrame(res, schema)
    logging.info(self._survey_demographics.head(log))
    logging.info(type(self._survey_demographics))

The function call sql(survey_demos) queries the database and returns a Pandas df. Now, looking at the logging output:

    2016-08-18 09:39:11,795 root INFO [Row(survey_id='{__class__=uuid.UUID, int=72159140233069381076719472719973874}', gender='Male'),
     Row(survey_id='{__class__=uuid.UUID, int=323077413680511253270271409764023514}', gender='Male'),
     Row(survey_id='{__class__=uuid.UUID, int=464003322584728470087346523203666475}', gender='Female'),
     Row(survey_id='{__class__=uuid.UUID, int=777482443631412718495518514480562884}', gender='Male'),
     Row(survey_id='{__class__=uuid.UUID, int=791459271937809735598486155808759730}', gender='Female')]
    2016-08-18 09:39:11,795 root INFO <class 'pyspark.sql.dataframe.DataFrame'>

What is Spark doing here, and why doesn't it just give me a plain string, as I specified in the schema?
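One possible reading of the log, not confirmed anywhere in this thread: the survey_id column may hold uuid.UUID objects rather than Python strings, which would explain the "__class__=uuid.UUID, int=..." text. A pandas df prints such objects in the same hyphenated form as a string, so the difference is easy to miss. The sketch below uses only the standard library and a hypothetical helper, stringify_uuids, to show the integer a UUID object carries and how the values could be turned into real strings first. That the database driver returns uuid.UUID objects is an assumption.

```python
import uuid


def stringify_uuids(values):
    """Return values with every uuid.UUID replaced by its canonical string form."""
    return [str(v) if isinstance(v, uuid.UUID) else v for v in values]


# First survey_id from the table, held as a uuid.UUID object
# (assumption: this is what the database driver hands to pandas).
raw = uuid.UUID("000de5b8-651b-4a47-961c-548d80c84df2")

# A UUID object exposes its 128-bit value as an integer, the kind of
# number that shows up after "int=" in the log output.
as_int = raw.int

# Values that are already strings pass through unchanged.
cleaned = stringify_uuids([raw, "already-a-string"])

print(type(raw).__name__, as_int)
print(cleaned)
```

If that assumption holds, the analogous step on the pandas side would presumably be res['survey_id'] = res['survey_id'].astype(str) before calling createDataFrame. This has not been tested against the setup in the question.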

0 answers:

No answers