I have a list of lists, where each inner list holds the values of one attribute: 'A1', 'A2', and 'A3'.
I want to turn it into a Spark DataFrame with one column per attribute and one row per position in the lists.
How can I do this?
Answer 0 (score: 1)
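You can create a Row class with the headers as its fields, then use zip to iterate over the lists position by position and construct a Row object for each row: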
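from pyspark.sql import Row
lst = [[1, 2, 3], ['A', 'B', 'C'], ['aa', 'bb', 'cc']]
# Row class whose fields are the column names
R = Row("A1", "A2", "A3")
# zip(*lst) transposes the column lists into row tuples; sc is the active SparkContext
sc.parallelize([R(*r) for r in zip(*lst)]).toDF().show()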
+---+---+---+
| A1| A2| A3|
+---+---+---+
|  1|  A| aa|
|  2|  B| bb|
|  3|  C| cc|
+---+---+---+
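The transposition step can be tried without a Spark session. The following is a minimal sketch, assuming `collections.namedtuple` as a stand-in for `pyspark.sql.Row` (the name `Record` is hypothetical and not part of the answer). It only illustrates how `zip(*lst)` turns the three column lists into three rows.

```python
from collections import namedtuple

lst = [[1, 2, 3], ['A', 'B', 'C'], ['aa', 'bb', 'cc']]

# Hypothetical stand-in for Row("A1", "A2", "A3"), used only so the
# sketch runs without Spark installed.
Record = namedtuple("Record", ["A1", "A2", "A3"])

# zip(*lst) groups the i-th element of every column list together,
# so each resulting tuple is one row of the table.
rows = [Record(*r) for r in zip(*lst)]

for row in rows:
    print(row)
```

Note that `zip` stops at the shortest input, so if the inner lists differ in length the extra values are dropped silently.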
Alternatively, if you have pandas installed, build a pandas DataFrame first; you can then convert it to a Spark DataFrame with spark.createDataFrame:
import pandas as pd
headers = ['A1', 'A2', 'A3']
pdf = pd.DataFrame.from_dict(dict(zip(headers, lst)))
spark.createDataFrame(pdf).show()
+---+---+---+
| A1| A2| A3|
+---+---+---+
|  1|  A| aa|
|  2|  B| bb|
|  3|  C| cc|
+---+---+---+
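The pandas route relies on pairing each header with its column list before the DataFrame is built. Here is a small sketch of just that pairing step in plain Python, with no pandas or Spark required (the variable name `columns` is an assumption made for illustration):

```python
headers = ['A1', 'A2', 'A3']
lst = [[1, 2, 3], ['A', 'B', 'C'], ['aa', 'bb', 'cc']]

# Pair each header with the list at the same position, giving a
# column-oriented mapping: one key per column, one list of values each.
columns = dict(zip(headers, lst))

print(columns)
# {'A1': [1, 2, 3], 'A2': ['A', 'B', 'C'], 'A3': ['aa', 'bb', 'cc']}
```

Because the data is already organized by column, this mapping can be handed to pandas as is, with no transposition needed.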
Answer 1 (score: 0)
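from pyspark.sql import Row
names = ['A1', 'A2', 'A3']
lst = [[1, 2, 3], ['A', 'B', 'C'], ['aa', 'bb', 'cc']]
# Transpose the column lists into rows, then build one Row per tuple from name/value pairs
data = sc.parallelize(list(zip(*lst))).map(
    lambda x: Row(**{names[i]: elt for i, elt in enumerate(x)})).toDF()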
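This answer comes without an explanation, so here is a Spark-free sketch of what the lambda does for each row: it builds a dictionary of field names to values, which is then unpacked into `Row` as keyword arguments. The helper name `to_fields` is hypothetical and introduced only for illustration.

```python
names = ['A1', 'A2', 'A3']
lst = [[1, 2, 3], ['A', 'B', 'C'], ['aa', 'bb', 'cc']]

def to_fields(values):
    """Hypothetical helper: pair each column name with the value at the same index."""
    return {names[i]: elt for i, elt in enumerate(values)}

# One dictionary per row, after transposing the column lists.
records = [to_fields(x) for x in zip(*lst)]

# dict(zip(names, x)) builds the same mapping more compactly.
compact = [dict(zip(names, x)) for x in zip(*lst)]

print(records[0])  # {'A1': 1, 'A2': 'A', 'A3': 'aa'}
```

Both forms produce the same dictionaries, so the choice between them is a matter of readability.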