Is there a faster way to search a pandas column against a list and read files?

Time: 2020-10-27 12:03:19

Tags: python pandas numpy

I have a dataframe named df1 that contains a lot of rows:

AA    ID      info
H5R   SSF43   up
V53P  FG46Z   up
X1M   HJ44-2  down
P324N 2HUVG   up
L2F   SSF43   down
G223J FG46Z   up

and a list containing many file names:

['SSF43_354635.csv', 'HJ44-2_GF6453.csv', 'FG46Z_45362.csv', '2HUVG_223IU.csv', 
'SSF43_00202E.csv', 'FG46Z_01873GF.csv']

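I am looking for a fast way to go through the ID column and, whenever an ID appears in one of the file names, read that file and look for the value from the AA column in it.

So far I have tried: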

import pandas as pd
from os.path import isfile, join
from os import listdir
import numpy as np

df1 = pd.read_csv('data_info.csv', sep = '\t') 
file_names = [i for i in listdir('/content/data_files') if isfile(join('/content/data_files', i))]

df1["In_List"] = np.where(df1["ID"].isin([i.split('_', 1)[0] for i in file_names]), "True", "False")


# This part is slowing me down as it takes too long to run

for i in df1.iloc[:,1]:
  if i in [i.split('_', 1)[0] for i in file_names]:
    pass  # DO Something
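
One likely reason the loop above is slow is that the list of prefixes is rebuilt for every row and then scanned linearly. Below is a minimal sketch, not a definitive fix, that builds the prefixes once as a set and filters the dataframe with a boolean mask. The data is copied from the question, except for the last row with ID `ZZZ99`, which is made up here to show a non-matching case.

```python
import pandas as pd

file_names = ['SSF43_354635.csv', 'HJ44-2_GF6453.csv', 'FG46Z_45362.csv',
              '2HUVG_223IU.csv', 'SSF43_00202E.csv', 'FG46Z_01873GF.csv']

df1 = pd.DataFrame({
    'AA': ['H5R', 'V53P', 'X1M', 'P324N', 'L2F', 'G223J', 'A1B'],
    # 'ZZZ99' is a hypothetical ID with no matching file
    'ID': ['SSF43', 'FG46Z', 'HJ44-2', '2HUVG', 'SSF43', 'FG46Z', 'ZZZ99'],
    'info': ['up', 'up', 'down', 'up', 'down', 'up', 'up'],
})

# Build the prefix set once; set membership is O(1) on average.
prefixes = {name.split('_', 1)[0] for name in file_names}

# Vectorised membership test instead of a Python-level loop over rows.
mask = df1['ID'].isin(prefixes)
matching = df1[mask]
```

Only the rows in `matching` then need any further per-file work.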

2 Answers:

Answer 0: (score: 0)

I'm not sure exactly what the problem is, but something like this might help:

import pandas as pd
from collections import defaultdict as dd

f_names = ['SSF43_354635.csv', 'HJ44-2_GF6453.csv', 'FG46Z_45362.csv', '2HUVG_223IU.csv', 'SSF43_00202E.csv', 'FG46Z_01873GF.csv']

# Map each ID prefix to the file-name suffixes that follow it
file_name_dict = dd(list)

for current_file_name in f_names:
    splited = current_file_name.split('_', 1)
    file_name_dict[splited[0]].append(splited[1])

id_dict = dd(list)

# to avoid multiple use of df.loc, iterate one time and create a dict
# (df1 is the dataframe from the question)
for index, row in df1.iterrows():
    id_dict[row['ID']].append(row['AA'])

for ID in file_name_dict:
    if ID not in id_dict:
        continue
    AA_set = set(id_dict[ID])

    for suffix in file_name_dict[ID]:
        # Rebuild the full file name; prepend the directory if the
        # files are not in the current working directory
        tmp_df = pd.read_csv(f'{ID}_{suffix}')

        for index, row in tmp_df.iterrows():
            if row['column_name_you_need'] in AA_set:
                pass  # do what you need


It is probably more reasonable to open each file only once and check it for all the wanted values while it is open.
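
The idea of opening each file once and checking all wanted values in one go could be sketched roughly as follows. This is only an illustration under assumptions: the helper names `group_files_by_id` and `find_matches`, the column name `mutation`, and the file contents written to the temporary directory are all made up for the demo and are not part of the question.

```python
import os
import tempfile
from collections import defaultdict

import pandas as pd


def group_files_by_id(file_names):
    """Map each ID prefix (text before the first '_') to its file names."""
    files_by_id = defaultdict(list)
    for name in file_names:
        files_by_id[name.split('_', 1)[0]].append(name)
    return files_by_id


def find_matches(df1, file_names, data_dir, column):
    """Read each file once; return (file name, AA values found) pairs."""
    files_by_id = group_files_by_id(file_names)
    # One set of AA values per ID, built in a single groupby pass.
    aa_by_id = df1.groupby('ID')['AA'].agg(set)
    results = []
    for file_id, wanted in aa_by_id.items():
        for name in files_by_id.get(file_id, []):
            content = pd.read_csv(os.path.join(data_dir, name))
            # Check every wanted value against the column in one call.
            found = content.loc[content[column].isin(wanted), column]
            results.append((name, sorted(found.unique())))
    return results


# Tiny demonstration with made-up file contents.
df1 = pd.DataFrame({'AA': ['H5R', 'L2F', 'X1M'],
                    'ID': ['SSF43', 'SSF43', 'HJ44-2'],
                    'info': ['up', 'down', 'down']})
with tempfile.TemporaryDirectory() as data_dir:
    pd.DataFrame({'mutation': ['H5R', 'Q9Z']}).to_csv(
        os.path.join(data_dir, 'SSF43_354635.csv'), index=False)
    pd.DataFrame({'mutation': ['A1B']}).to_csv(
        os.path.join(data_dir, 'HJ44-2_GF6453.csv'), index=False)
    matches = find_matches(df1, os.listdir(data_dir), data_dir, 'mutation')
```

Each file is parsed exactly once, and `isin` replaces the row-by-row `iterrows` check, which is usually the more expensive part.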

Answer 1: (score: 0)

It seems there is still no way around this problem.

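# file_name_dict is the ID -> file-name-suffix mapping built in the previous answer
tmp = df1.loc[df1['ID'].isin(file_name_dict.keys())]

for index, row in tmp.iterrows():
  for suffix in file_name_dict[row['ID']]:
    # Rebuild the full file name from the ID and the stored suffix
    with open(f"{row['ID']}_{suffix}", 'r') as f:
      # Read the contents so that `in` performs a substring search
      if row['AA'] in f.read():
        print(row)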
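
A note on the membership test when searching a file for a value: applying `in` directly to an open file object iterates over its lines and compares the value against each whole line (including the trailing newline), whereas applying it to the string returned by `f.read()` does a substring search. A small sketch with a made-up file (the `position,mutation` contents are hypothetical) shows the difference:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'SSF43_354635.csv')
    with open(path, 'w') as f:
        f.write('position,mutation\n5,H5R\n')

    with open(path) as f:
        # Compares 'H5R' with each full line, e.g. '5,H5R\n'
        whole_line_match = 'H5R' in f

    with open(path) as f:
        # Substring search over the entire file contents
        substring_match = 'H5R' in f.read()
```

Note that a substring search can also hit unintended text (for example `H5R` inside `H5RX`), so parsing the file and comparing a specific column is likely the safer choice when the layout is known.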