I downloaded a file and now I am trying to write it to HDFS as a dataframe.
import requests
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName('Write Data').setMaster('local')
sc = SparkContext(conf=conf)
file = requests.get('https://data.nasa.gov/resource/y77d-th95.csv')
data = sc.parallelize(file)
When I print the contents of the file, I see the following output:
print(file.text)
":@computed_region_cbhk_fwbd",":@computed_region_nnqa_25f4","fall","geolocation","geolocation_address","geolocation_city","geolocation_state","geolocation_zip","id","mass","name","nametype","recclass","reclat","reclong","year"
,,"Fell","POINT (6.08333 50.775)",,,,,"1","21","Aachen","Valid","L5","50.775000","6.083330","1880-01-01T00:00:00.000"
,,"Fell","POINT (10.23333 56.18333)",,,,,"2","720","Aarhus","Valid","H6","56.183330","10.233330","1951-01-01T00:00:00.000"
This is exactly what I want to see. Now I am trying to look at the first element of the data I created with data = sc.parallelize(file):
print(data.first())
":@computed_region_cbhk_fwbd",":@computed_region_nnqa_25f4","fall","geolocation","geolocation_address","geolocation_city","geolo
Why am I not getting the whole first line, like in my earlier print? It cuts off partway through, and I don't see the rest of my header.
Answer 0 (score: 1)
It doesn't work because Response.__iter__ doesn't recognize the format; it just iterates over fixed-size chunks.
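You can see the chunking directly (a minimal sketch; in the requests library, Response.__iter__ delegates to iter_content with a small fixed chunk size, so each element is a block of bytes rather than a line):

import requests

resp = requests.get('https://data.nasa.gov/resource/y77d-th95.csv')
chunk = next(iter(resp))  # first fixed-size chunk of the body, as bytes
print(len(chunk))         # the chunk size, not the length of a line
print(chunk[:60])         # starts like the header but is cut off mid-line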
If you really need to read the data this way, use text.splitlines:
sc.parallelize(file.text.splitlines())
or better:
import csv
import io
sc.parallelize(csv.reader(io.StringIO(file.text)))
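With this approach each element of the RDD is already a parsed row, i.e. a list of fields with the CSV quoting stripped, for example (output abbreviated):

data = sc.parallelize(csv.reader(io.StringIO(file.text)))
print(data.first())
# [':@computed_region_cbhk_fwbd', ':@computed_region_nnqa_25f4', 'fall', ...]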
Answer 1 (score: 1)
The answer is simple. To parallelize a Python object, you need to give Spark a list. In this case you are giving it the Response object:
>>> file = requests.get('https://data.nasa.gov/resource/y77d-th95.csv')
>>> file
<Response [200]>
If you extract the data and help Spark along by splitting it yourself, Spark will understand it:
import requests
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName('Write Data').setMaster('local')
sc = SparkContext(conf=conf)
file = requests.get('https://data.nasa.gov/resource/y77d-th95.csv').text.split('\n')
data = sc.parallelize(file)
data.first()
u'":@computed_region_cbhk_fwbd",":@computed_region_nnqa_25f4","fall","geolocation","geolocation_address","geolocation_city","geolocation_state","geolocation_zip","id","mass","name","nametype","recclass","reclat","reclong","year"'
When the file lives on a filesystem like Hadoop, Hadoop does the splitting for you and lays out the HDFS blocks so that the data is split on line breaks.
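Since the original goal was a dataframe on HDFS, here is a minimal sketch of the remaining step, assuming Spark 2.2+ (where spark.read.csv also accepts an RDD of strings); the HDFS output path is hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Write Data').getOrCreate()
# `file` is the list of lines from above; header=True makes the first row the column names
df = spark.read.csv(sc.parallelize(file), header=True)
df.write.csv('hdfs:///tmp/meteorites', header=True)  # hypothetical output path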
Hope this helps.

Cheers, Fokko