Question

Spark: 1.4.0

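I have a flat file from Amazon S3 that I loaded into HDFS (on the master node of my EC2 Spark cluster). The flat file is Hive output. Note: I cannot change the context that is already defined. I am using the following code in the pyspark shell:

Each 'row' corresponds to one line of data: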

row = sc.textFile("/data/file")
row.first()

u'E8B98\x01John\x01Smith\x01Male\x01Gold\x0125'

I then used flatMap() to split each line, because for some reason map() did not seem to separate it (using '\x01' as the delimiter):

elements = row.flatMap(lambda x: x.split('\x01'))
elements.take(8)

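[u'E8B98', u'John', u'Smith', u'Male', u'Gold', u'25', u'E8B99', u'Alice']

Since I know the data has 6 columns per row, how do I get it into a DataFrame? I intend to sort by attribute, compute sums, and so on.

I tried the following, but it did not work: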

id = row.flatMap(lambda x: x.split('\x01')[0])
id.first()

u'E'

Answer 1

There are several ways to convert an RDD into a DataFrame in Python.

Consider the following RDD:

rdd = sc.parallelize(["E8B98\x01John\x01Smith\x01Male\x01Gold\x0125", "E8B2\x01Joe\x01Smith\x01Female\x01Gold\x0125"])
rdd.first()

Output:

'E8B98\x01John\x01Smith\x01Male\x01Gold\x0125'

Now let's create an RDD of records by splitting each line (each record is a list of its six fields):

rdd2 = rdd.map(lambda x : x.split("\x01"))
rdd2.first()

Output:

['E8B98', 'John', 'Smith', 'Male', 'Gold', '25']
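
The flatMap() attempts in the question behave differently from map() because flatMap() flattens whatever each call returns. The following is a minimal pure-Python sketch of the two semantics, not Spark code: the helper names rdd_map and rdd_flat_map are hypothetical stand-ins for illustration. It suggests why the fields of different lines ran together, and why the id attempt returned a single character.

```python
from itertools import chain

lines = [
    "E8B98\x01John\x01Smith\x01Male\x01Gold\x0125",
    "E8B2\x01Joe\x01Smith\x01Female\x01Gold\x0125",
]

def rdd_map(func, records):
    # Hypothetical stand-in for RDD.map: one output element per input element.
    return [func(r) for r in records]

def rdd_flat_map(func, records):
    # Hypothetical stand-in for RDD.flatMap: each result is iterated and
    # its items are concatenated into one flat sequence.
    return list(chain.from_iterable(func(r) for r in records))

# map keeps one list of fields per line.
rows = rdd_map(lambda x: x.split("\x01"), lines)

# flatMap merges the fields of all lines into a single flat list.
flat = rdd_flat_map(lambda x: x.split("\x01"), lines)

# flatMap over a string iterates its characters, so the id is broken up.
chars = rdd_flat_map(lambda x: x.split("\x01")[0], lines)

# map returns the id of each line intact.
ids = rdd_map(lambda x: x.split("\x01")[0], lines)
```

Under this reading, replacing flatMap() with map() in the question's last attempt would return the whole id for each line.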

We can now create the DataFrame in any of the following ways.

Create it directly from the split RDD (rdd2):

sqlContext.createDataFrame(rdd2).collect()

Output:

[Row(_1=u'E8B98', _2=u'John', _3=u'Smith', _4=u'Male', _5=u'Gold', _6=u'25'), Row(_1=u'E8B2', _2=u'Joe', _3=u'Smith', _4=u'Female', _5=u'Gold', _6=u'25')]

Or create it from the same RDD, specifying the column names:

df = sqlContext.createDataFrame(rdd2, ['id', 'name', 'surname', 'gender', 'description', 'age'])
df.collect()

Output:

[Row(id=u'E8B98', name=u'John', surname=u'Smith', gender=u'Male', description=u'Gold', age=u'25'), Row(id=u'E8B2', name=u'Joe', surname=u'Smith', gender=u'Female', description=u'Gold', age=u'25')]

Or create it with an explicitly defined schema:

from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([
    StructField("id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("surname", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("description", StringType(), True),
    StructField("age", StringType(), True)])
df2 = sqlContext.createDataFrame(rdd2, schema)
df2.collect()

Output:

[Row(id=u'E8B98', name=u'John', surname=u'Smith', gender=u'Male', description=u'Gold', age=u'25'), Row(id=u'E8B2', name=u'Joe', surname=u'Smith', gender=u'Female', description=u'Gold', age=u'25')]
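
Every column in the schema above is a StringType, so sorting or summing age would treat it as text. One option, sketched here under the assumption that the sixth field is always an integer, is to convert it while parsing each line. parse_line is a hypothetical helper, not part of the original answer. It could be passed to rdd.map() in place of the plain split, with the age field then declared as IntegerType() in the schema. The sketch below runs in plain Python so the parsing logic can be checked without a cluster.

```python
def parse_line(line, sep="\x01"):
    # Hypothetical helper: split one line into its six fields and convert
    # the last one (age) to int. Assumes age is always numeric.
    fields = line.split(sep)
    if len(fields) != 6:
        raise ValueError("expected 6 fields, got %d" % len(fields))
    return tuple(fields[:5]) + (int(fields[5]),)

lines = [
    "E8B98\x01John\x01Smith\x01Male\x01Gold\x0125",
    "E8B2\x01Joe\x01Smith\x01Female\x01Gold\x0125",
]

# In Spark this step would be rdd.map(parse_line); here it is a plain list.
records = [parse_line(l) for l in lines]

# With age stored as int, numeric aggregation and sorting behave as expected.
total_age = sum(r[5] for r in records)
by_id = sorted(records, key=lambda r: r[0])
```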

Or define the schema with a Row class, as follows:

from pyspark.sql import Row
Person = Row('id', 'name', 'surname', 'gender', 'description', 'age')
person = rdd2.map(lambda r: Person(*r))
df3 = sqlContext.createDataFrame(person)
df3.collect()

Output:

[Row(id=u'E8B98', name=u'John', surname=u'Smith', gender=u'Male', description=u'Gold', age=u'25'), Row(id=u'E8B2', name=u'Joe', surname=u'Smith', gender=u'Female', description=u'Gold', age=u'25')]

I hope this helps!

NB: Spark version >= 1.3.0

PySpark: Converting a '\x01'-delimited file from S3 into a DataFrame
