I want to read a large text file with Spark and change textinputformat.record.delimiter to "H". This code does not work:
appName = "My Test"
fname = "myfile.txt"

from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    conf = SparkConf().setAppName(appName)
    conf.set("textinputformat.record.delimiter", "H")
    sc = SparkContext(conf=conf)
    sc.setLogLevel("ERROR")
    rdd1 = sc.textFile(fname)  # still splits on newlines; the delimiter setting is ignored
    print("normal:", rdd1.collect())
It seems that the only way to do this is with the sc.newAPIHadoopFile API call. However, I cannot figure out how to call it from Python. This does not work:
conf2 = SparkConf()
conf2.set("textinputformat.record.delimiter", "H")
rdd2 = sc.newAPIHadoopFile(fname,
                           "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
                           "org.apache.hadoop.io.LongWritable",
                           "org.apache.hadoop.io.Text",
                           conf2)
print("improved:", rdd2.collect())
I get this error:
AttributeError: 'SparkConf' object has no attribute '_get_object_id'
Although this uses the same API as the question creating spark data structure from multiline record, it is a different use of that API. In that question, a function was written to read several lines at a time as one record; here, we instead change textinputformat.record.delimiter.
Answer 0 (score: 0)
Here is how to set the configuration and call newAPIHadoopFile from Python. The AttributeError above arises because newAPIHadoopFile expects its Hadoop configuration as a plain Python dict, not as a SparkConf object:
appName = "My Test"
fname = "myfile.txt"

from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    conf = SparkConf().setAppName(appName)
    conf.set("textinputformat.record.delimiter", "H")
    sc = SparkContext(conf=conf)
    sc.setLogLevel("ERROR")

    rdd1 = sc.textFile(fname)
    print("normal:", rdd1.collect())

    # The Hadoop configuration goes in as a plain dict, not a SparkConf.
    rconf = {"textinputformat.record.delimiter": "H"}
    rdd2 = sc.newAPIHadoopFile(
        fname,
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",  # inputFormatClass
        "org.apache.hadoop.io.LongWritable",  # keyClass: byte offset of each record
        "org.apache.hadoop.io.Text",          # valueClass: the record text
        conf=rconf,
    ).map(lambda a: "H" + a[1])  # keep the value only and re-prepend the delimiter
    print("improved:", rdd2.collect())