PySpark groupBy IndexError: list index out of range

Time: 2018-03-22 05:26:03

Tags: python python-3.x apache-spark pyspark apache-spark-sql

I have a CSV file (including a header row) on my local system and I am trying to run a groupBy, i.e. group by purpose and sum the amount for each purpose. The commands I entered in the pyspark console are as follows:

from pyspark import SparkContext, SparkConf
from pyspark.sql.types import *
from pyspark.sql import Row
csv_data=sc.textFile("/project/sample.csv").map(lambda p: p.split(",")) 
header = csv_data.first()
csv_data = csv_data.filter(lambda p:p != header)
df_csv = csv_data.map(lambda p: Row(checkin_acc=p[0], duration=int(p[1]),
    credit_history=p[2], purpose=p[3], amount=int(p[4]), svaing_acc=p[5],
    present_emp_since=p[6], inst_rate=int(p[7]), personal_status=p[8],
    other_debtors=p[9], residing_since=int(p[10]), property=p[11],
    age=int(p[12]), inst_plans=p[13], housing=p[14], num_credits=int(p[15]),
    job=p[16], dependents=int(p[17]), telephone=p[18], foreign_worker=p[19],
    status=p[20])).toDF()

grouped = df_csv.groupBy('purpose').sum('amount')
grouped.show()
[Stage 9:>                                                          (0 + 2) / 2]18/03/22 10:34:52 ERROR executor.Executor: Exception in task 1.0 in stage 9.0 (TID 10)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 177, in main
    process()
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 172, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 268, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "<stdin>", line 1, in <lambda>
IndexError: list index out of range

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.next(PythonRDD.scala:156)
    at org.apache.spark.api.python.PythonRunner$$anon$1.next(PythonRDD.scala:152)
    at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:4

How do I resolve this error?

2 Answers:

Answer 0 (score: 0)

If you are using PySpark 2+, you can use spark.read.csv:

df = spark.read.csv("/project/sample.csv", header=True)

If you want to set the column names and header yourself, you can also define the schema using StructType and pass it in via the schema kwarg.
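
For illustration, a rough sketch of what that could look like; the column names and types here are just taken from the Row fields in the question, and only the first few columns are shown:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("checkin_acc", StringType(), True),
    StructField("duration", IntegerType(), True),
    StructField("credit_history", StringType(), True),
    StructField("purpose", StringType(), True),
    StructField("amount", IntegerType(), True),
    # ... remaining columns follow the same pattern
])

df = spark.read.csv("/project/sample.csv", schema=schema, header=True)
df.groupBy('purpose').sum('amount').show()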

Answer 1 (score: 0)

    IndexError: list index out of range

The above error simply means that splitting some lines of the text file on "," does not produce as many fields as the Row constructor expects (it indexes p[0] through p[20], i.e. 21 fields).
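
The effect is easy to reproduce in plain Python. This is only an illustration with a made-up short line, not actual data from your file:

p = "A11,6,A34,A43".split(",")   # hypothetical line containing only 4 fields
p[3]     # 'A43' -- fine
p[20]    # IndexError: list index out of range, same as the Spark worker raises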

Filtering solution

One solution is to filter out every row whose number of fields does not match the header:

csv_data=sc.textFile("/project/sample.csv").map(lambda p: p.split(","))
header = csv_data.first()
csv_data = csv_data.filter(lambda p:p != header)\
    .filter(lambda x: len(x) == len(header))    #filter added 
df_csv  =  csv_data.map(lambda p: Row(checkin_acc=p[0],
                                      duration=int(p[1]),
                                      credit_history=p[2],
                                      purpose=p[3],
                                      amount=int(p[4]),
                                      ..... # the rest of the code is the same as in the question
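
Before relying on the filter, it can be worth checking how many rows are actually malformed. A quick sketch, reusing the same file path and RDD operations as above:

raw = sc.textFile("/project/sample.csv").map(lambda p: p.split(","))
header = raw.first()
malformed = raw.filter(lambda x: x != header and len(x) != len(header))
print(malformed.count())   # number of rows the length filter would drop
print(malformed.take(5))   # inspect a few of them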

Adding dummy data

Another solution is to pad rows that have fewer fields than the header with dummy values (and trim rows that have more):

# Helper: pad rows that have fewer fields than the header with dummy strings,
# and trim rows that have more fields than the header.
def addDummy(arr, header):
    headerLength = len(header)
    arrayLength = len(arr)
    if arrayLength > headerLength:
        return arr[:headerLength]   # keep exactly len(header) fields (slicing to headerLength-1 would drop one field too many)
    elif arrayLength < headerLength:
        return arr + ["dummy" for x in range(0, headerLength - arrayLength)]
    else:
        return arr

csv_data=sc.textFile("/project/sample.csv").map(lambda p: p.split(","))
header = csv_data.first()
csv_data = csv_data.filter(lambda p:p != header)\
    .map(lambda p: addDummy(p, header))   # pad/trim each row to the header length before building Rows
df_csv  =  csv_data.map(lambda p: Row(checkin_acc=p[0],
                                      duration=int(p[1]),
                                      credit_history=p[2],
                                      purpose=p[3],
                                      amount=int(p[4]),
                                      svaing_acc=p[5],
                                      .... # the rest of the code is the same as in the question
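
With either approach every row ends up with the number of fields the Row constructor expects, so the aggregation from the question should run without the IndexError:

grouped = df_csv.groupBy('purpose').sum('amount')
grouped.show()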