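Using a list as an index

Time: 2015-07-13 13:52:47

Tags: python-2.7 pandas

I have two files: one has a single column (call it pred) and no header; the other has two columns, ID and IsClick, and does have a header. My goal is to use the ID column as an index into pred.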

import pandas as pd
import numpy as np

def LinesInFile(path):
    with open(path) as f:
        for linecount, line in enumerate(f):
            pass
    f.close()
    print 'Found ' + str(linecount) + ' lines' 
    return linecount

path ='/Users/mas/Documents/workspace/Avito/input/'                          # path to testing file
submission = path + 'submission1234.csv' 

lines = LinesInFile(submission)
lines = LinesInFile(path + 'sampleSubmission.csv')


sample = pd.read_csv(path + 'sampleSubmission.csv')
preds = np.array(pd.read_csv(submission, header = None))
index = sample.ID.values - 1
print index
print len(index)
sample['IsClick'] = preds[index]
sample.to_csv('submission.csv', index=False)

The output is:

Found 7816360 lines
Found 7816361 lines
[       0        4        5 ..., 15961507 15961508 15961511]
7816361
Traceback (most recent call last):
  File "/Users/mas/Documents/workspace/Avito/July3b.py", line 23, in <module>
    sample['IsClick'] = preds[index]
IndexError: index 7816362 is out of bounds for axis 0 with size 7816361

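Something seems to be wrong, because my file has 7816361 lines counting the header, while my list has an extra element (length of the list is 7816361).

1 answer:

Answer 0 (score: 0)

I don't have your CSV files to reproduce the problem, but the issue appears to be caused by the way you use index.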

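index = sample.ID.values - 1 takes each sample ID and subtracts 1. These values are not valid positions in pred, because pred has only 7816361 rows according to the traceback, so its valid positions run from 0 to 7816360. Each of the last three items in the index array (based on your printout) is out of range because it is greater than 7816360. I suspect the error is showing you the first ID - 1 value that is out of bounds.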
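The mismatch between ID values and row positions can be reproduced at a small scale. The sketch below uses made-up arrays (four predictions and four sparse IDs, both hypothetical) rather than the real files, and only illustrates why ID - 1 fails as a positional index.

```python
import numpy as np

# Hypothetical miniature of the data: one prediction per row of the sample file.
preds = np.array([0.1, 0.2, 0.3, 0.4])
# IDs are sparse labels, not row numbers (assumed values for illustration).
ids = np.array([1, 5, 6, 9])

index = ids - 1  # positions 0, 4, 5, 8

# Valid positions in preds are 0 .. len(preds) - 1, so anything >= len(preds) fails.
out_of_bounds = index[index >= len(preds)]
print(out_of_bounds)

try:
    preds[index]
    error_message = None
except IndexError as exc:
    error_message = str(exc)
    print(error_message)
```

Running the same `index >= len(preds)` check on the real arrays would show how many IDs fall outside the prediction array.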

Assuming you just want to join the files by row number, you can do the following:

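# Read only ID from the sample file; it already has an IsClick column, which would otherwise be duplicated.
sample = pd.concat((pd.read_csv(path + 'sampleSubmission.csv', usecols=['ID']), pd.read_csv(submission, header=None).rename(columns={0: 'IsClick'})), axis=1)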
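Since the original files are not available, here is a minimal sketch of the same row-number join on tiny in-memory stand-ins. The CSV contents are invented for illustration, and the length check is an added assumption: pairing by row number is only meaningful when both files have the same number of data rows.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-ins for sampleSubmission.csv (with header) and the
# headerless prediction file.
sample_csv = StringIO(u"ID,IsClick\n1,0\n5,0\n6,0\n")
pred_csv = StringIO(u"0.25\n0.50\n0.75\n")

# Keep only ID from the sample; name the single prediction column IsClick.
ids = pd.read_csv(sample_csv, usecols=['ID'])
preds = pd.read_csv(pred_csv, header=None, names=['IsClick'])

# Row-number pairing assumes both files hold the same number of data rows.
assert len(ids) == len(preds)

# Both frames have the default 0..n-1 index, so axis=1 pairs them row by row.
joined = pd.concat([ids, preds], axis=1)
print(joined)
```

Given equal lengths, assigning the column directly with `ids['IsClick'] = preds['IsClick'].values` should give the same result.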

Otherwise, you need to perform a join or merge on the two dataframes.
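A merge needs a key that is present in both frames. The prediction file in the question has no ID column, so the sketch below assumes a hypothetical version of it that does carry one; the frames and values are invented for illustration.

```python
import pandas as pd

# Hypothetical frames: here the predictions carry their own ID column,
# which the file in the question does not.
sample = pd.DataFrame({'ID': [1, 5, 6]})
preds = pd.DataFrame({'ID': [6, 1, 5], 'IsClick': [0.75, 0.25, 0.50]})

# A left merge keeps every row of sample, in sample's order,
# and looks up IsClick by matching ID rather than by row position.
merged = sample.merge(preds, on='ID', how='left')
print(merged)
```

With `how='left'`, any ID missing from the predictions would show up as NaN in IsClick instead of raising an error.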