Question

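I have a script that reads a spreadsheet of measurements into a pandas.DataFrame and adds measurements of the same sound files from a second source. I can therefore match the file names that appear in the spreadsheet against the names of the secondary measurement files.

Example: I have a file named 100.txt containing some measurements that I want to add to the spreadsheet, in the row whose 'filenumber' column contains 100.txt. There is only one such row.

Problem: I don't have a file for every row in the spreadsheet, i.e. some data points will have an entry like NA. So I suppose I can't simply iterate over all my input files?

This is what I have tried so far (the code is simplified, since it is part of a larger script):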

import pandas

inputfilis = ['3-1.txt', '2-1.txt', '1-1.txt']
inputspread = pandas.DataFrame({'col1': ['2', '2', '4'], 'filename': ['5-1.txt', '2-1.txt', '1-1.txt'], 'col3': ['pre', 'pre1', 'pre2']})

for fili in inputfilis:
    # spreadsheet entries whose file name equals the current input file
    matches = [line for line in inputspread['filename'] if line == fili]
    # what now???

The output I want would look like this, assuming the external measurement 12 is to be added for 2-1.txt and 20 for 1-1.txt:

outputspread = pandas.DataFrame({'col1': ['2', '2', '4'], 'filename': ['5-1.txt', '2-1.txt', '1-1.txt'], 'col3': ['pre', 'pre1', 'pre2'], 'added_column': [pandas.NA, 12, 20]})

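But now I am stuck, because I don't know:

How should I store the data: add it to the DataFrame somehow, or put it in a dictionary or something similar?
Or is it better to just create a new column of NAs and then replace them with actual values where needed?
And how would I do that?

Any suggestions are much appreciated!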

Possible solution

Thanks to Psidom's help, this is what I came up with!

# add the column, filled with the placeholder string "NA"
inputspread['added_column'] = "NA"

# replace the placeholder with actual data wherever a match is found
for fili in inputfilis:
    mask = inputspread['filename'] == fili
    if mask.any():  # skip input files that have no row in the spreadsheet
        with open(fili, "r") as handle:
            inputspread.loc[mask, 'added_column'] = handle.read()
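
As an alternative to the loop above, here is a minimal sketch that stores the external values in a dictionary keyed by file name and lets Series.map look each row up. The values 12 and 20 are the ones from the example in the question; the name `measurements` and the value 7 for 3-1.txt are made up for illustration, and in the real script the dictionary would presumably be filled from the files rather than typed in.

```python
import pandas

inputspread = pandas.DataFrame({
    'col1': ['2', '2', '4'],
    'filename': ['5-1.txt', '2-1.txt', '1-1.txt'],
    'col3': ['pre', 'pre1', 'pre2'],
})

# hypothetical values; '3-1.txt' has no row in the spreadsheet and is simply ignored
measurements = {'3-1.txt': 7, '2-1.txt': 12, '1-1.txt': 20}

# look up each file name; names missing from the dict become NaN
inputspread['added_column'] = inputspread['filename'].map(measurements)
print(inputspread)
```

With this approach no placeholder column is needed: rows without a matching file end up as NaN on their own, and input files without a matching row never touch the DataFrame.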

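If a file contains several items, for example separated by tabs, we can end with open(fili, "r").read().split("\t") instead. In that case, we can add the right number of columns at the start by running: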

# one placeholder column per tab-separated item in the file
with open(fili, "r") as handle:
    for h in handle.read().split("\t"):
        inputspread[h] = "NA"
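
For the multi-value case, one possible sketch is to read every file once, build a second DataFrame with one row per file, and left-merge it onto the spreadsheet. Everything below that is not in the post is an assumption: the file contents, the temporary directory, and the column names `added_1` and `added_2` are invented so that the example runs on its own, and it assumes every file holds the same number of tab-separated values.

```python
import os
import tempfile

import pandas

inputspread = pandas.DataFrame({
    'col1': ['2', '2', '4'],
    'filename': ['5-1.txt', '2-1.txt', '1-1.txt'],
    'col3': ['pre', 'pre1', 'pre2'],
})

# made-up file contents, written to a temporary directory so the sketch runs on its own
contents = {'3-1.txt': '1\t2', '2-1.txt': '12\t13', '1-1.txt': '20\t21'}
workdir = tempfile.mkdtemp()
for name, text in contents.items():
    with open(os.path.join(workdir, name), 'w') as handle:
        handle.write(text)

# read every file once and split its tab-separated values
rows = {}
for name in contents:
    with open(os.path.join(workdir, name)) as handle:
        rows[name] = handle.read().strip().split('\t')

# one row per file, one column per value; the column names are placeholders
extra = pandas.DataFrame.from_dict(rows, orient='index', columns=['added_1', 'added_2'])

# a left merge keeps every spreadsheet row; rows without a file get NaN
result = inputspread.merge(extra, how='left', left_on='filename', right_index=True)
print(result)
```

The values stay strings here, just as they do with read().split("\t"); converting them to numbers would be a separate step.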
