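I am trying to recreate this analysis: https://rstudio-pubs-static.s3.amazonaws.com/203258_d20c1a34bc094151a0a1e4f4180c5f6f.html
I could not get the shell script to run on my computer, so I wrote Python code that does essentially the same thing: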
import sys
input_file = sys.argv[1]
output_file = sys.argv[2]
in_fp = open(input_file,"r")
out_fp = open(output_file,"w")
count = 0
for line in in_fp:
    if count == 1:
        out_fp.write(line+"\n")
    elif count>1:
        elems = line.split(",")
        loan = elems[16].upper()
        if loan == "FULLY PAID" or loan == "LATE (31-120 DAYS)" or loan == "DEFAULT" or loan == "CHARGED OFF":
            out_fp.write(line+"\n")
    count+=1
in_fp.close()
out_fp.close()
This code works for the 2015 data, but when I run it on the 2012-2013 data I get this error:
File "ShellScript.py", line 16, in <module>
loan = elems[16].upper()
IndexError: list index out of range
Can someone tell me how to fix this error so that I can filter the data? Thanks.
Answer 0 (score: 0)
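At least one of your lines does not have 17 elements, so elems[16] fails. This is usually caused by blank lines in the data. It can also be caused by quoted fields that contain embedded newlines: reading the file line by line splits such a record into several short pieces. If that is the cause, you need to use the csv module.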
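To illustrate the embedded-newline case, here is a minimal sketch on made-up data (the three-column sample below is hypothetical, not taken from the loan files). It compares splitting each physical line on commas with parsing the same text through csv.reader:

```python
import csv
import io

# Hypothetical sample: one record whose quoted "desc" field contains a newline.
sample = 'id,desc,loan_status\n1,"first line\nsecond line",Fully Paid\n'

# Naive approach: treat every physical line as a record and split on commas.
naive_rows = [line.rstrip("\n").split(",") for line in io.StringIO(sample, newline="")]

# csv module: honors the quotes, so the embedded newline stays inside one field.
csv_rows = list(csv.reader(io.StringIO(sample, newline="")))

print([len(r) for r in naive_rows])  # [3, 2, 2]
print([len(r) for r in csv_rows])    # [3, 3]
```

The naive split turns the single data record into two short rows, which is the kind of row that makes a fixed index such as elems[16] raise IndexError.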
Here is a rewrite that uses the csv module. It reports and skips short rows. I also made it more Pythonic.
import sys
import csv
input_file = sys.argv[1]
output_file = sys.argv[2]
ncolumns = 17 # IS THIS RIGHT?
keep_loans = {"FULLY PAID", "LATE (31-120 DAYS)", "DEFAULT", "CHARGED OFF"}
# the with statement automatically closes the files after the block
# newline="" lets the csv module handle line endings itself (Python 3)
with open(input_file, "r", newline="") as in_fp, open(output_file, "w", newline="") as out_fp:
    reader = csv.reader(in_fp)
    writer = csv.writer(out_fp)
    # you are currently skipping line 0
    next(reader)
    # copy headers
    writer.writerow(next(reader))
    # you are currently adding an extra newline to headers
    # writer.writerow([])  # uncomment if you want that extra newline
    for row_num, row in enumerate(reader, start=2):
        if len(row) < ncolumns:
            # report and skip short rows
            print("row %s shorter than expected. skipping row. row: %s" % (row_num, row))
            continue
        # use `in` rather than multiple == comparisons
        if row[16].upper() in keep_loans:
            writer.writerow(row)
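Since the column count above is flagged as uncertain, one alternative is to look up the status column by its header name instead of hard-coding index 16. The sketch below assumes the header is called loan_status, which the post does not state; the function name filter_loans and the sample data are hypothetical as well.

```python
import csv
import io

KEEP_LOANS = {"FULLY PAID", "LATE (31-120 DAYS)", "DEFAULT", "CHARGED OFF"}

def filter_loans(in_fp, out_fp, status_column="loan_status"):
    """Copy the header plus the rows whose status is in KEEP_LOANS.

    status_column is an assumed header name; adjust it to match the file.
    Returns the source line numbers of rows too short to hold the status field.
    """
    reader = csv.reader(in_fp)
    writer = csv.writer(out_fp, lineterminator="\n")
    next(reader)                               # skip line 0, as the original script does
    header = next(reader)
    writer.writerow(header)
    status_idx = header.index(status_column)   # raises ValueError if the name is absent
    skipped = []
    for row in reader:
        if len(row) <= status_idx:
            skipped.append(reader.line_num)    # line number of the short row in the source
            continue
        if row[status_idx].upper() in KEEP_LOANS:
            writer.writerow(row)
    return skipped

# Made-up input: a notes line, a header, three data rows and one blank line.
sample = "Notes line\nid,loan_status\n1,Fully Paid\n\n2,Current\n3,Charged Off\n"
out = io.StringIO()
skipped = filter_loans(io.StringIO(sample, newline=""), out)
print(skipped)
print(out.getvalue())
```

With real files, the two StringIO objects would be replaced by files opened with newline="", as in the rewrite above.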