Question

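I have two files: one has a single column (call it pred) and no header; the other has two columns, ID and IsClick, and does have a header. My goal is to use the ID column as the index into pred.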

import pandas as pd
import numpy as np

def LinesInFile(path):
    # linecount ends up as the zero-based index of the last line,
    # i.e. one less than the number of lines in the file.
    with open(path) as f:
        for linecount, line in enumerate(f):
            pass
    print('Found ' + str(linecount) + ' lines')
    return linecount

path = '/Users/mas/Documents/workspace/Avito/input/'  # path to testing file
submission = path + 'submission1234.csv'

lines = LinesInFile(submission)
lines = LinesInFile(path + 'sampleSubmission.csv')

sample = pd.read_csv(path + 'sampleSubmission.csv')
preds = np.array(pd.read_csv(submission, header=None))
index = sample.ID.values - 1
print(index)
print(len(index))
sample['IsClick'] = preds[index]  # raises the IndexError shown below
sample.to_csv('submission.csv', index=False)

The output is:

Found 7816360 lines
Found 7816361 lines
[       0        4        5 ..., 15961507 15961508 15961511]
7816361
Traceback (most recent call last):
  File "/Users/mas/Documents/workspace/Avito/July3b.py", line 23, in <module>
    sample['IsClick'] = preds[index]
IndexError: index 7816362 is out of bounds for axis 0 with size 7816361

Something seems to be off: my file has 7816361 lines counting the header, while my list has one extra element (its length is 7816361).

Answer 1

I don't have your CSV files to reproduce the problem, but it appears to be caused by how you use index.

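The line index = sample.ID.values - 1 takes each sample ID and subtracts 1. These values are not valid positions in pred, because pred has only 7816361 rows (valid positions 0 through 7816360, per the traceback). Each of the last three items in the index array (based on your printout) is out of range because it is greater than 7816360. I suspect the error is showing you the first ID - 1 value that is out of range.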
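To make the failure concrete, here is a minimal sketch with toy values (the arrays ids and preds below are made up for illustration and are not the real Avito data). It shows that subtracting 1 from sparse IDs produces positions the prediction array does not have:

```python
import numpy as np

# Toy stand-ins (assumed values, not the real data)
ids = np.array([1, 5, 6, 9])              # sparse IDs, like sample.ID.values
preds = np.array([0.1, 0.2, 0.3, 0.4])    # one prediction per row

index = ids - 1                            # what the question's code computes
out_of_range = index[index >= len(preds)]  # positions that preds does not have
print(out_of_range)

try:
    preds[index]                           # fancy indexing with invalid positions
except IndexError as exc:
    error_message = str(exc)
    print(error_message)
```

Running the same out_of_range check on the real index array should show how many IDs fall outside pred before the assignment is attempted.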

Assuming you only want to join the files by row number, you could do the following:

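# usecols keeps only ID, so the result does not end up with two IsClick columns
sample = pd.concat(
    (pd.read_csv(path + 'sampleSubmission.csv', usecols=['ID']),
     pd.read_csv(submission, header=None).rename(columns={0: 'IsClick'})),
    axis=1)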
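An equivalent sketch of the row-number approach, assuming both files list their rows in the same order. The in-memory CSV strings are hypothetical stand-ins for sampleSubmission.csv and the predictions file; the length check is an added safeguard, not something from the question:

```python
import io
import pandas as pd

# Toy in-memory files standing in for the two CSVs (assumed contents)
sample_csv = io.StringIO("ID,IsClick\n1,0\n5,0\n6,0\n9,0\n")
preds_csv = io.StringIO("0.1\n0.2\n0.3\n0.4\n")

sample = pd.read_csv(sample_csv)
preds = pd.read_csv(preds_csv, header=None)  # single column labelled 0

# Guard against silently misaligned files
assert len(sample) == len(preds), "row counts differ"

# .values drops the pandas index, so the assignment is purely positional
sample['IsClick'] = preds[0].values
print(sample)
```

This overwrites the placeholder IsClick column in place rather than building a new frame, which avoids any ID arithmetic altogether.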

Otherwise, you will need to perform a join or merge on the two dataframes.
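A merge could look roughly like the sketch below. It assumes the predictions also carry an ID column, which is not the case for the single-column file in the question, so the preds frame here is hypothetical:

```python
import pandas as pd

# Hypothetical toy frames; the ID column in preds is an assumption
sample = pd.DataFrame({'ID': [1, 5, 6, 9], 'IsClick': [0, 0, 0, 0]})
preds = pd.DataFrame({'ID': [9, 1, 6, 5], 'IsClick': [0.4, 0.1, 0.3, 0.2]})

# Keep only ID from sample, then pull in IsClick by matching IDs.
# how='left' keeps every row of sample, in sample's order.
merged = sample[['ID']].merge(preds, on='ID', how='left')
print(merged)
```

With a left merge, any ID missing from preds would get NaN in IsClick instead of raising an error, which makes mismatches easy to spot.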
