I have a CSV file (with a header) on my local system, and I am trying to run a groupBy: group by purpose and sum the amount for each purpose. The commands I entered in the pyspark console are as follows:
from pyspark import SparkContext, SparkConf
from pyspark.sql.types import *
from pyspark.sql import Row
csv_data=sc.textFile("/project/sample.csv").map(lambda p: p.split(","))
header = csv_data.first()
csv_data = csv_data.filter(lambda p:p != header)
df_csv = csv_data.map(lambda p: Row(checkin_acc = p[0], duration = int(p[1]),
    credit_history = p[2], purpose = p[3], amount = int(p[4]),
    svaing_acc = p[5], present_emp_since = p[6], inst_rate = int(p[7]),
    personal_status = p[8], other_debtors = p[9], residing_since = int(p[10]),
    property = p[11], age = int(p[12]), inst_plans = p[13], housing = p[14],
    num_credits = int(p[15]), job = p[16], dependents = int(p[17]),
    telephone = p[18], foreign_worker = p[19], status = p[20])).toDF()
grouped = df_csv.groupBy('purpose').sum('amount')
grouped.show()
[Stage 9:> (0 + 2) / 2]18/03/22 10:34:52 ERROR executor.Executor: Exception in task 1.0 in stage 9.0 (TID 10)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 177, in main
process()
File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 172, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 268, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "<stdin>", line 1, in <lambda>
IndexError: list index out of range
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.next(PythonRDD.scala:156)
at org.apache.spark.api.python.PythonRunner$$anon$1.next(PythonRDD.scala:152)
at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:4
How do I fix this error?
Answer 0 (score: 0)
If you are using pyspark 2+, you can use spark.read.csv:
df = spark.read.csv("project/sample.csv", header=True)
If you want to set the column names yourself, you can also define the schema with StructType and pass it via the schema kwarg.
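A minimal sketch of that approach, assuming the column order from the question (only the first five columns are spelled out here, and the names simply mirror the Row fields above; in the pyspark shell the spark session already exists):
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()  # already available as `spark` in the pyspark shell

schema = StructType([
    StructField("checkin_acc", StringType()),
    StructField("duration", IntegerType()),
    StructField("credit_history", StringType()),
    StructField("purpose", StringType()),
    StructField("amount", IntegerType()),
    # ... StructFields for the remaining columns, following the same pattern
])

df = spark.read.csv("/project/sample.csv", header=True, schema=schema)
# mode="DROPMALFORMED" can also be passed to the reader to drop rows that do not match the schema

grouped = df.groupBy("purpose").sum("amount")
grouped.show()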
Answer 1 (score: 0)
IndexError: list index out of range
The error above simply means that splitting some lines of the text file does not produce as many fields as expected (the Row mapping reads p[0] through p[20], so every line needs as many fields as the header).
Filtering solution
One approach is to filter out every row whose field count does not match the header's:
csv_data = sc.textFile("/project/sample.csv").map(lambda p: p.split(","))
header = csv_data.first()
csv_data = csv_data.filter(lambda p: p != header)\
    .filter(lambda x: len(x) == len(header))  # filter added: keep only rows with as many fields as the header
df_csv = csv_data.map(lambda p: Row(checkin_acc=p[0],
                                    duration=int(p[1]),
                                    credit_history=p[2],
                                    purpose=p[3],
                                    amount=int(p[4]),
                                    .....  # the rest of the code is the same as in the question
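Not part of the original answer, but if you want to see how many malformed lines that extra filter drops, a quick count along these lines could be run (it reuses the header computed above):
bad_rows = sc.textFile("/project/sample.csv").map(lambda p: p.split(","))\
    .filter(lambda p: p != header)\
    .filter(lambda x: len(x) != len(header))
print(bad_rows.count())  # number of lines that would have caused the IndexError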
Adding dummy data
Another solution is to pad rows that have fewer fields than the header with dummy values:
# Function definition for adding dummy strings when a row has fewer fields than the header
def addDummy(arr, header):
    headerLength = len(header)
    arrayLength = len(arr)
    if arrayLength > headerLength:
        # Too many fields: truncate to the header length
        return arr[:headerLength]
    elif arrayLength < headerLength:
        # Too few fields: pad with "dummy" strings up to the header length
        return arr + ["dummy" for x in range(0, headerLength - arrayLength)]
    else:
        return arr
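A quick, hypothetical illustration of what addDummy does, using a made-up three-column header rather than the real one:
hdr = ["a", "b", "c"]
print(addDummy(["1"], hdr))                 # ['1', 'dummy', 'dummy'] -- padded
print(addDummy(["1", "2", "3", "4"], hdr))  # ['1', '2', '3'] -- truncated to the header length
print(addDummy(["1", "2", "3"], hdr))       # ['1', '2', '3'] -- returned unchanged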
csv_data = sc.textFile("/project/sample.csv").map(lambda p: p.split(","))
header = csv_data.first()
csv_data = csv_data.filter(lambda p: p != header)\
    .map(lambda p: addDummy(p, header))  # map added: check the length and pad with dummy strings when fields are missing
df_csv = csv_data.map(lambda p: Row(checkin_acc=p[0],
                                    duration=int(p[1]),
                                    credit_history=p[2],
                                    purpose=p[3],
                                    amount=int(p[4]),
                                    svaing_acc=p[5],
                                    ....  # the rest of the code is the same as in the question