PySpark DataFrame throws the following error when reading a TXT file, and it is driving me crazy

Posted: 2016-09-08 06:05:31

Tags: dataframe pyspark

Here is my program:

from pyspark import SparkConf, SparkContext, SQLContext

def main():
    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    sc.setLogLevel("OFF")
    sqlContext = SQLContext(sc)

    rdd = sc.textFile('samples.txt').map(lambda line: line.strip().split(' '))
    print rdd.take(5)
    df = sqlContext.createDataFrame(rdd, ['raw'])
    df.show(False)

if __name__ == "__main__":
    main()

The file sample.txt looks like this:

I need your help
Why it has this error
I can not handle it

The program produces the following output:

[[u'I', u'need', u'your', u'help'], [u'Why', u'it', u'has', u'this', u'error'], [u'I', u'can', u'not', u'handle', u'it']]
Traceback (most recent call last):
  File "/mypath/getIndex.py", line 35, in <module>
    main()
  File "/mypath/getIndex.py", line 20, in main
    df.show(False)
  File "/Users/lyj/Programs/Apache/spark2/python/pyspark/sql/dataframe.py", line 287, in show
    print(self._jdf.showString(n, truncate))
  File "/Library/Python/2.7/site-packages/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/Users/lyj/Programs/Apache/spark2/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/Library/Python/2.7/site-packages/py4j/protocol.py", line 323, in get_return_value
    format(target_id, ".", name, value))
py4j.protocol.Py4JError: An error occurred while calling o65.showString. Trace:
py4j.Py4JException: Method showString([class java.lang.Boolean, class java.lang.Boolean]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)



Process finished with exit code 1

What is wrong here? I have never run into this error before, and I suspect it has something to do with the PySpark DataFrame.

1 answer:

Answer 0 (score: 1)

A DataFrame needs an RDD in which every row has the same number of columns. Replace the contents of your sample.txt with:

Please I need your help
Why it has this error
I can not handle it
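
If editing the file is not practical, one alternative is to pad the shorter rows in code so that every row has the same number of fields. The sketch below is a minimal, pure-Python illustration of that idea. The helper pad_rows and the column names word1 to word5 are hypothetical names introduced here, not part of PySpark or of this answer, and the createDataFrame call is left as a comment because it needs a running Spark context.

```python
def pad_rows(rows, fill=""):
    """Pad each row with `fill` up to the length of the longest row."""
    rows = [list(row) for row in rows]
    width = max([len(row) for row in rows] or [0])
    return [row + [fill] * (width - len(row)) for row in rows]


# The three lines of sample.txt from the question.
lines = ["I need your help", "Why it has this error", "I can not handle it"]
rows = pad_rows(line.strip().split(" ") for line in lines)

# One name per column, so that no column falls back to a default name like _2.
names = ["word%d" % i for i in range(1, len(rows[0]) + 1)]

print(rows[0])
print(names)

# With a live context this could then become (not run here):
# df = sqlContext.createDataFrame(rows, names)
```

An empty string is used as the filler so that every column still holds strings; whether a blank or a null is the better placeholder depends on what is done with the data afterwards.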

Then call:

df.show()

Output:

+------+---+----+------+-----+
|   raw| _2|  _3|    _4|   _5|
+------+---+----+------+-----+
|Please|  I|need|  your| help|
|   Why| it| has|  this|error|
|     I|can| not|handle|   it|
+------+---+----+------+-----+

Note that you only renamed the first column; the remaining columns keep their default names (_2 through _5).
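
The traceback in the question also points at something independent of the file contents: showString was called with two Booleans. In the Spark 2.0 Python API, show is declared as show(n=20, truncate=True), so df.show(False) passes False as the row count n rather than as truncate. The sketch below uses a hypothetical stand-in function (not PySpark itself) with the same parameter order, only to illustrate where the argument lands.

```python
def show(n=20, truncate=True):
    """Stand-in with the same parameter order as DataFrame.show(n=20, truncate=True).

    It only reports how the arguments were bound; it does not print a table.
    """
    return {"n": n, "truncate": truncate}


# Positional call, as in df.show(False): False is bound to n.
positional = show(False)

# Keyword call, as in df.show(truncate=False): n keeps its default.
keyword = show(truncate=False)

print(positional)
print(keyword)
```

If that reading of the traceback is right, calling df.show(truncate=False) should print the untruncated columns, which would also explain why the plain df.show() above runs without the error.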