Apache Spark with SQLContext :: IndexError

Time: 2016-06-28 06:18:34

Tags: apache-spark pyspark apache-spark-sql pyspark-sql

I am trying to run the basic example given in the "Inferring the Schema Using Reflection" section of the Apache Spark documentation.

I am running this on the Cloudera Quickstart VM (CDH5).

The example I am trying to run is shown below:

# sc is an existing SparkContext.
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)

# Load a text file and convert each line to a Row.
lines = sc.textFile("/user/cloudera/analytics/book6_sample.csv")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))

# Infer the schema, and register the DataFrame as a table.
schemaPeople = sqlContext.createDataFrame(people)
schemaPeople.registerTempTable("people")

# SQL can be run over DataFrames that have been registered as a table.
teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

# The results of SQL queries are RDDs and support all the normal RDD operations.
teenNames = teenagers.map(lambda p: "Name: " + p.name)
for teenName in teenNames.collect():
  print(teenName)

I ran the code exactly as shown above, but I always get the error "IndexError: list index out of range" when I execute the last command (the for loop).

The input file book6_sample is available at book6_sample.csv

Please suggest where I am going wrong.

Thanks in advance.

Regards, 斯

1 Answer:

Answer 0: (score: 0)

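Your file has a blank line at the end, and that is what causes this error. A blank line splits into a list containing a single empty string, so p[1] does not exist. Because Spark transformations are lazy, the failure only surfaces when collect() runs in the for loop. Open the file in a text editor and delete that line; hopefully it will then work.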
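If editing the file is not an option, the blank line can also be skipped in code. The following is only a minimal sketch of that idea, not part of the original answer: parse_person is a hypothetical helper name, and the sample lines are made up for illustration rather than taken from book6_sample.csv. The helper is plain Python, so it can be tried without a Spark cluster.

```python
def parse_person(line):
    """Return (name, age) for a well-formed line, or None for a line to skip.

    Hypothetical helper for illustration; not part of the Spark API.
    """
    fields = line.split(",")
    # A blank line splits into [''], so fields[1] would raise IndexError.
    if len(fields) < 2:
        return None
    try:
        return fields[0].strip(), int(fields[1])
    except ValueError:
        # The second field is not an integer (for example, a header row).
        return None


# Made-up sample input, ending with a blank line like the one in the answer.
sample_lines = ["Michael,29", "Andy,30", "Justin,19", ""]
parsed = [p for p in map(parse_person, sample_lines) if p is not None]
print(parsed)

# In the Spark job from the question, the same helper could replace the
# two map steps (assuming `lines` and `Row` as defined there):
# people = (lines.map(parse_person)
#                .filter(lambda p: p is not None)
#                .map(lambda p: Row(name=p[0], age=p[1])))
```

Returning None and filtering afterwards keeps the parsing rule in one place; a simpler alternative, if blank lines are the only problem, would be to call lines.filter(lambda l: l.strip()) before splitting.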