I have a DataFrame named df1 that contains many rows:
AA ID info
H5R SSF43 up
V53P FG46Z up
X1M HJ44-2 down
P324N 2HUVG up
L2F SSF43 down
G223J FG46Z up
and a list containing many file names:
['SSF43_354635.csv', 'HJ44-2_GF6453.csv', 'FG46Z_45362.csv', '2HUVG_223IU.csv',
'SSF43_00202E.csv', 'FG46Z_01873GF.csv']
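I am looking for a fast way to go through the ID column and, if an ID appears in any of the file names, read that file and look for the value from the AA column in it.
So far I have tried: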
import pandas as pd
from os.path import isfile, join
from os import listdir
import numpy as np

df1 = pd.read_csv('data_info.csv', sep='\t')
file_names = [i for i in listdir('/content/data_files') if isfile(join('/content/data_files', i))]
df1["In_List"] = np.where(df1["ID"].isin([i.split('_', 1)[0] for i in file_names]), "True", "False")

# This part is slowing me down as it takes too long to run
for i in df1.iloc[:, 1]:
    if i in [i.split('_', 1)[0] for i in file_names]:
        # DO Something
Answer 0 (score: 0)
I'm not sure exactly what the problem is.
import pandas as pd
from collections import defaultdict as dd

f_names = ['SSF43_354635.csv', 'HJ44-2_GF6453.csv', 'FG46Z_45362.csv', '2HUVG_223IU.csv', 'SSF43_00202E.csv', 'FG46Z_01873GF.csv']

# Group the file names by the ID prefix before the first underscore.
# The full file name is stored so it can be passed to pd.read_csv later.
file_name_dict = dd(list)
for current_file_name in f_names:
    file_id = current_file_name.split('_', 1)[0]
    file_name_dict[file_id].append(current_file_name)

# To avoid repeated use of df1.loc, iterate once and build a dict of ID -> AA values
id_dict = dd(list)
for index, row in df1.iterrows():
    id_dict[row['ID']].append(row['AA'])

for ID in file_name_dict:
    if ID not in id_dict:
        continue
    AA_set = set(id_dict[ID])
    for current_fname in file_name_dict[ID]:
        tmp_df = pd.read_csv(current_fname)
        for index, row in tmp_df.iterrows():
            if row['column_name_you_need'] in AA_set:
                pass  # do what you need
Once a file is open, it probably makes more sense to go through it a single time for all the wanted values.
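A minimal sketch of that idea, under stated assumptions: the helper name `find_matches`, the in-memory demo files, and the choice of `"AA"` as the column to search are all made up for illustration, since the question does not show what the data files contain. It collects the wanted AA values per ID once, reads each matching file once, and filters it with `isin` instead of looping over rows.

```python
import io

import pandas as pd


def find_matches(df1, files, column):
    """Return {file_name: rows of that file whose `column` holds a wanted AA}.

    `files` maps a file name to anything pd.read_csv accepts
    (a path or a file-like object). Hypothetical helper, not from the thread.
    """
    # Collect the wanted AA values for every ID in a single pass over df1.
    wanted = df1.groupby("ID")["AA"].apply(set).to_dict()

    matches = {}
    for name, source in files.items():
        file_id = name.split("_", 1)[0]  # ID prefix before the first underscore
        if file_id not in wanted:
            continue  # no row in df1 refers to this file
        data = pd.read_csv(source)  # each file is read exactly once
        matches[name] = data[data[column].isin(wanted[file_id])]
    return matches


# Tiny in-memory demo; the file contents below are invented.
df1 = pd.DataFrame(
    {
        "AA": ["H5R", "L2F", "X1M"],
        "ID": ["SSF43", "SSF43", "HJ44-2"],
        "info": ["up", "down", "down"],
    }
)
files = {
    "SSF43_354635.csv": io.StringIO("AA,value\nH5R,1\nQ9Z,2\nL2F,3\n"),
    "HJ44-2_GF6453.csv": io.StringIO("AA,value\nA1B,4\n"),
    "ZZZ99_000.csv": io.StringIO("AA,value\nH5R,5\n"),
}
result = find_matches(df1, files, "AA")
```

With real files, `files` could be built from the directory listing, mapping each file name to its full path.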
Answer 1 (score: 0)
I'm still not sure this solves the problem.
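# Keep only the rows whose ID has at least one matching file
tmp = df1.loc[df1['ID'].isin(list(file_name_dict))]
for index, row in tmp.iterrows():
    for fn in file_name_dict[row['ID']]:  # assumes the dict stores full file names
        with open(fn, 'r') as f:
            # `x in f` compares x against whole lines, so search the file's text instead
            if row['AA'] in f.read():
                print(row)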
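The loop above reopens the same file for every row that shares an ID, and a plain substring test can also match inside longer values. A possible variant, sketched here with hypothetical helper names (`values_in_text`, `scan`) and invented file contents, groups the AA values by ID first and then checks each file's text once against whole tokens:

```python
import re
from collections import defaultdict


def values_in_text(text, wanted):
    """Return the wanted values that occur as whole tokens in text."""
    # Split on commas and whitespace so "H5R" does not match inside "XH5R".
    tokens = set(re.split(r"[,\s]+", text))
    return wanted & tokens


def scan(pairs, texts):
    """pairs: (AA, ID) tuples; texts: {file_name: file content as a string}."""
    wanted_by_id = defaultdict(set)
    for aa, file_id in pairs:
        wanted_by_id[file_id].add(aa)

    found = {}
    for name, text in texts.items():
        file_id = name.split("_", 1)[0]
        hits = values_in_text(text, wanted_by_id.get(file_id, set()))
        if hits:
            found[name] = hits
    return found


# Invented contents; with real files, each text would come from open(path).read().
pairs = [("H5R", "SSF43"), ("L2F", "SSF43"), ("X1M", "HJ44-2")]
texts = {
    "SSF43_354635.csv": "AA,value\nH5R,1\nXH5R,2\n",
    "HJ44-2_GF6453.csv": "AA,value\nA1B,4\n",
}
found = scan(pairs, texts)
```

Splitting on commas and whitespace assumes simple, unquoted CSV content; quoted fields would need a real CSV parser.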