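I have two files: one has a single column (call it pred) and no header; the other has two columns, ID and IsClick, and does have a header. My goal is to use the ID column as an index into pred.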
import pandas as pd
import numpy as np

def LinesInFile(path):
    # enumerate is zero-based, so linecount ends up as the index of the
    # last line, i.e. one less than the actual number of lines.
    with open(path) as f:
        for linecount, line in enumerate(f):
            pass
    print('Found ' + str(linecount) + ' lines')
    return linecount

path = '/Users/mas/Documents/workspace/Avito/input/'  # path to testing file
submission = path + 'submission1234.csv'
lines = LinesInFile(submission)
lines = LinesInFile(path + 'sampleSubmission.csv')
sample = pd.read_csv(path + 'sampleSubmission.csv')
preds = np.array(pd.read_csv(submission, header=None))
index = sample.ID.values - 1
print(index)
print(len(index))
sample['IsClick'] = preds[index]
sample.to_csv('submission.csv', index=False)
The output is:
Found 7816360 lines
Found 7816361 lines
[ 0 4 5 ..., 15961507 15961508 15961511]
7816361
Traceback (most recent call last):
  File "/Users/mas/Documents/workspace/Avito/July3b.py", line 23, in <module>
    sample['IsClick'] = preds[index]
IndexError: index 7816362 is out of bounds for axis 0 with size 7816361
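Something seems to be wrong, because my file has 7816361 lines counting the header, while my list has one extra element (the length of the list is 7816361).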
Answer 0 (score: 0)
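I don't have your csv files to recreate the problem, but it looks like it is caused by how you use index.
index = sample.ID.values - 1
takes each sample ID and subtracts 1. These values are not valid positions in pred, because pred has only 7816361 rows (valid positions 0 through 7816360), as the traceback shows. Each of the last 3 items in your index array (based on your printed output) is out of bounds because it is > 7816360. I suspect the error is showing you the first ID-1 value that is out of bounds.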
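The failure can be reproduced on a small scale. The following is a minimal sketch with made-up data (the ID values and predictions are hypothetical, not taken from the question's files): the IDs are sparse, so ID - 1 is not a row position in the predictions array, while plain row positions always fit when both sides have the same length.

```python
import numpy as np

# Hypothetical stand-ins: 4 sparse IDs and 4 predictions, one per row.
ids = np.array([1, 5, 6, 12])
preds = np.array([0.1, 0.2, 0.3, 0.4])

# ID-based values [0, 4, 5, 11]; most exceed the last valid position (3).
index = ids - 1

try:
    preds[index]
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(error_message)

# Row positions 0..len-1 are always in bounds for an equally long array.
positions = np.arange(len(ids))
aligned = preds[positions]
print(aligned)
```

The point of the sketch is only that the size of the ID values is unrelated to the number of rows, so ID - 1 cannot be used as a positional index unless the IDs happen to be contiguous and start at 1.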
Assuming you just want to join the files based on their line number, you can do the following:
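# Read only the ID column so the result does not end up with two IsClick columns.
sample = pd.concat(
    (pd.read_csv(path + 'sampleSubmission.csv', usecols=['ID']),
     pd.read_csv(submission, header=None).rename(columns={0: 'IsClick'})),
    axis=1)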
Otherwise, you will need to perform a join or merge on the two dataframes.
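A merge only works if both tables share a key. As a rough sketch, assuming the predictions file also carried an ID column (it does not in the question, so the `ID` column in `preds_df` below is hypothetical, as is all of the sample data), `DataFrame.merge` could align the rows by ID rather than by position:

```python
import pandas as pd

# Hypothetical sample file contents: sparse IDs with placeholder IsClick values.
sample = pd.DataFrame({'ID': [1, 5, 6, 12], 'IsClick': [0, 0, 0, 0]})

# Hypothetical predictions that DO carry an ID column, in a different row order.
preds_df = pd.DataFrame({'ID': [12, 1, 6, 5],
                         'IsClick': [0.4, 0.1, 0.3, 0.2]})

# Keep only the ID column from sample, then pull in IsClick by matching IDs.
# how='left' keeps every sample ID, in the sample's original order.
merged = sample[['ID']].merge(preds_df, on='ID', how='left')
print(merged)
```

With a left merge, any sample ID that has no matching prediction would get NaN in IsClick, which makes missing rows easy to spot before writing the submission file.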