我想将我的词典列表转换为DataFrame。这是列表:
mylist =
[
{"type_activity_id":1,"type_activity_name":"xxx"},
{"type_activity_id":2,"type_activity_name":"yyy"},
{"type_activity_id":3,"type_activity_name":"zzz"}
]
这是我的代码:
from pyspark.sql.types import StringType
df = spark.createDataFrame(mylist, StringType())
df.show(2,False)
+-----------------------------------------+
| value|
+-----------------------------------------+
|{type_activity_id=1,type_activity_id=xxx}|
|{type_activity_id=2,type_activity_id=yyy}|
|{type_activity_id=3,type_activity_id=zzz}|
+-----------------------------------------+
我假设我应该为每列提供一些映射和类型,但是我不知道该怎么做。
更新:
我也尝试过:
schema = ArrayType(
StructType([StructField("type_activity_id", IntegerType()),
StructField("type_activity_name", StringType())
]))
df = spark.createDataFrame(mylist, StringType())
df = df.withColumn("value", from_json(df.value, schema))
但是我得到null
值:
+-----+
|value|
+-----+
| null|
| null|
+-----+
答案 0 :(得分:4)
过去,您可以简单地将字典传递给spark.createDataFrame()
,但现在不推荐使用此方法:
mylist = [
{"type_activity_id":1,"type_activity_name":"xxx"},
{"type_activity_id":2,"type_activity_name":"yyy"},
{"type_activity_id":3,"type_activity_name":"zzz"}
]
df = spark.createDataFrame(mylist)
#UserWarning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead
# warnings.warn("inferring schema from dict is deprecated,"
如该警告消息所述,您应该改用pyspark.sql.Row
。
from pyspark.sql import Row
spark.createDataFrame(Row(**x) for x in mylist).show(truncate=False)
#+----------------+------------------+
#|type_activity_id|type_activity_name|
#+----------------+------------------+
#|1 |xxx |
#|2 |yyy |
#|3 |zzz |
#+----------------+------------------+
在这里,我使用**
(keyword argument unpacking)将字典传递给Row
构造函数。
答案 1 :(得分:3)
您可以这样做。您将获得一个包含2列的数据框。
mylist = [
{"type_activity_id":1,"type_activity_name":"xxx"},
{"type_activity_id":2,"type_activity_name":"yyy"},
{"type_activity_id":3,"type_activity_name":"zzz"}
]
myJson = sc.parallelize(mylist)
myDf = sqlContext.read.json(mylist)
输出:
+----------------+------------------+
|type_activity_id|type_activity_name|
+----------------+------------------+
| 1| xxx|
| 2| yyy|
| 3| zzz|
+----------------+------------------+
答案 2 :(得分:1)
在Spark 2.4版中,可以直接使用 df = spark.createDataFrame(mylist)
>>> mylist = [
... {"type_activity_id":1,"type_activity_name":"xxx"},
... {"type_activity_id":2,"type_activity_name":"yyy"},
... {"type_activity_id":3,"type_activity_name":"zzz"}
... ]
>>> df1=spark.createDataFrame(mylist)
>>> df1.show()
+----------------+------------------+
|type_activity_id|type_activity_name|
+----------------+------------------+
| 1| xxx|
| 2| yyy|
| 3| zzz|
+----------------+------------------+
答案 3 :(得分:0)
从词典列表创建dataframe
时,我也面临着同样的问题。
我已经使用namedtuple
解决了这个问题。
下面是我使用提供的数据的代码。
from collections import namedtuple
final_list = []
mylist = [{"type_activity_id":1,"type_activity_name":"xxx"},
{"type_activity_id":2,"type_activity_name":"yyy"},
{"type_activity_id":3,"type_activity_name":"zzz"}
]
ExampleTuple = namedtuple('ExampleTuple', ['type_activity_id', 'type_activity_name'])
for my_dict in mylist:
namedtupleobj = ExampleTuple(**my_dict)
final_list.append(namedtupleobj)
sqlContext.createDataFrame(final_list).show(truncate=False)
输出
+----------------+------------------+
|type_activity_id|type_activity_name|
+----------------+------------------+
|1 |xxx |
|2 |yyy |
|3 |zzz |
+----------------+------------------+
我的版本信息如下
spark: 2.4.0
python: 3.6
没有必要使用my_list
变量。因为它可用,所以我用它来创建namedtuple对象,否则可以直接创建namedtuple
对象。