SparkContext parallelize splits in the wrong place

Asked: 2017-08-21 12:40:26

Tags: python apache-spark pyspark

I downloaded a file and am now trying to write it to HDFS as a dataframe.

import requests
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName('Write Data').setMaster('local')
sc = SparkContext(conf=conf)

file = requests.get('https://data.nasa.gov/resource/y77d-th95.csv')

data = sc.parallelize(file)

When I print the file's contents, I see the following output:

print(file.text)
":@computed_region_cbhk_fwbd",":@computed_region_nnqa_25f4","fall","geolocation","geolocation_address","geolocation_city","geolocation_state","geolocation_zip","id","mass","name","nametype","recclass","reclat","reclong","year"
,,"Fell","POINT (6.08333 50.775)",,,,,"1","21","Aachen","Valid","L5","50.775000","6.083330","1880-01-01T00:00:00.000"
,,"Fell","POINT (10.23333 56.18333)",,,,,"2","720","Aarhus","Valid","H6","56.183330","10.233330","1951-01-01T00:00:00.000"

This is exactly what I want to see. Now I'm trying to get the header from the RDD I created with data = sc.parallelize(file):
print(data.first())
":@computed_region_cbhk_fwbd",":@computed_region_nnqa_25f4","fall","geolocation","geolocation_address","geolocation_city","geolo

Why don't I get the whole first line, as in the print above? It stops somewhere partway through, and I don't see the rest of my header.

2 Answers:

Answer 0 (score: 1)

It doesn't work because Response.__iter__ is not aware of the data format. It simply iterates over fixed-size chunks of the response body, regardless of line boundaries.
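
You can verify this directly. A minimal sketch (same URL as the question; by default requests yields 128-byte chunks when you iterate a Response, which is why the truncated first() output above is exactly 128 characters):

import requests

resp = requests.get('https://data.nasa.gov/resource/y77d-th95.csv')

# Iterating a Response yields raw byte chunks of the body
# (128 bytes each by default), with no regard for line boundaries.
for i, chunk in enumerate(resp):
    print(len(chunk), repr(chunk))
    if i == 1:
        break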

If you really need to read the data like this, use text.splitlines:

sc.parallelize(file.text.splitlines())

Or better:

import csv
import io

sc.parallelize(csv.reader(io.StringIO(file.text)))
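
With the csv variant, each element of the RDD is already a parsed list of fields, so a comma embedded in a quoted value can no longer break a record in two. A quick check, reusing file, sc, csv and io from above (expected output shortened):

data = sc.parallelize(csv.reader(io.StringIO(file.text)))

# first() now returns a list of fields rather than a raw string
print(data.first()[:4])
# [':@computed_region_cbhk_fwbd', ':@computed_region_nnqa_25f4', 'fall', 'geolocation']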

Answer 1 (score: 1)

The answer is simple: to parallelize a Python object, you need to give Spark a list (or another collection of records). In this case, you are giving it the Response object itself:

>>> file = requests.get('https://data.nasa.gov/resource/y77d-th95.csv')
>>> file
<Response [200]>

Spark will understand it if you extract the data and help it by doing the splitting yourself:

import requests
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName('Write Data').setMaster('local')
sc = SparkContext(conf=conf)

file = requests.get('https://data.nasa.gov/resource/y77d-th95.csv').text.split('\n')

data = sc.parallelize(file)
>>> data.first()
u'":@computed_region_cbhk_fwbd",":@computed_region_nnqa_25f4","fall","geolocation","geolocation_address","geolocation_city","geolocation_state","geolocation_zip","id","mass","name","nametype","recclass","reclat","reclong","year"'

When the file lives on a file system like HDFS, Hadoop does the splitting for you: the input format aligns record boundaries with line breaks, even when a line spans HDFS blocks.
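
Once the file is on HDFS, reading it back with textFile therefore gives one record per line, with no manual splitting needed. A minimal sketch (the HDFS path is hypothetical):

# textFile splits its input on line boundaries, even when a line
# spans HDFS blocks. The path below is made up for illustration.
data = sc.textFile('hdfs:///tmp/y77d-th95.csv')
print(data.first())  # the complete header line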

Hope this helps.

Cheers, Fokko