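I have a fairly large DataFrame, df2 (~50,000 rows x 2,000 columns). The column headers are sample names. I also have a DataFrame, df1, whose index lists the samples to include in the analysis. I want to use the sample list from the df1 index to select only the matching columns of df2 and discard the rest. I also want to keep the samples in the order they have in the df1 index.
Sample data: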
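import pandas as pd
# df1
data1 = {'Sample': ['Sample_A', 'Sample_D', 'Sample_E'],
         'Location': ['Bangladesh', 'Myanmar', 'Thailand'],
         'Year': [2012, 2014, 2015]}
df1 = pd.DataFrame(data1)
# set_index returns a new frame, so the result has to be assigned back;
# drop=False also keeps 'Sample' as a regular column, so df1['Sample'] still works
df1 = df1.set_index('Sample', drop=False)
# df2
data2 = {'Num': ['Value_1', 'Value_2', 'Value_3', 'Value_4', 'Value_5'],
         'Sample_A': [0, 1, 0, 0, 1],
         'Sample_B': [0, 0, 1, 0, 0],
         'Sample_C': [1, 0, 0, 0, 1],
         'Sample_D': [0, 0, 1, 1, 0]}
df2 = pd.DataFrame(data2)
df2 = df2.set_index('Num')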
First, I build the list of wanted samples from df1 (the 'Sample' values used as its index), e.g.
samples = df1['Sample'].tolist()
The resulting samples list is
['Sample_A', 'Sample_D', 'Sample_E']
Using 'samples', the output DataFrame I want, df3, should look like this:
index     Sample_A  Sample_D
Value_1          0         0
Value_2          1         0
Value_3          0         1
Value_4          0         1
Value_5          1         0
But if I use
df3 = df2[samples]
I get the error message:
"['Sample_E'] not in index"
So how can I skip the samples that are not found in df2 and avoid this error message?
Update: a working solution
# 1. Define samples to use from df1
samples = df1['Sample'].tolist()
# 2. Only include samples that are found in df2 as well
#    (note: a set intersection does not keep the order of samples)
final_samples = list(set(df2.columns) & set(samples))
# 3. Make new df with columns corresponding to final_samples
df3 = df2.loc[:, final_samples]
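Since the question asks to keep the sample order from df1, and a Python set has no defined order, the set intersection in the update may return the columns in a different order. A minimal sketch of an order-preserving variant follows; it assumes the example data from the question, and the wanted order is deliberately shuffled here to make the effect visible.

```python
import pandas as pd

df2 = pd.DataFrame({
    'Num': ['Value_1', 'Value_2', 'Value_3', 'Value_4', 'Value_5'],
    'Sample_A': [0, 1, 0, 0, 1],
    'Sample_B': [0, 0, 1, 0, 0],
    'Sample_C': [1, 0, 0, 0, 1],
    'Sample_D': [0, 0, 1, 1, 0],
}).set_index('Num')

# Wanted samples in the order they should appear (Sample_E is not in df2)
samples = ['Sample_D', 'Sample_A', 'Sample_E']

# Walk the wanted list in order and keep only names that are columns of df2
final_samples = [s for s in samples if s in df2.columns]

df3 = df2[final_samples]
print(df3.columns.tolist())
```

The membership test against df2.columns is a hash lookup, so this should stay cheap even with a couple of thousand columns.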
Answer 0 (score: 2)
You can do it like this. The list of columns is given in the order you actually want.
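import pandas as pd
data = {'index': ['Value_1', 'Value_2', 'Value_3', 'Value_4', 'Value_5'],
        'Sample_A': [0, 1, 0, 0, 1],
        'Sample_B': [0, 0, 1, 0, 0],
        'Sample_C': [1, 0, 0, 0, 1],
        'Sample_D': [0, 0, 1, 1, 0]}
df = pd.DataFrame(data)
# 'index' stays a regular column here, so it can be selected by name
# together with the samples, listed in the order you want
df1 = df[['index'] + ['Sample_A', 'Sample_D']]
print(df1)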
Output:
     index  Sample_A  Sample_D
0  Value_1         0         0
1  Value_2         1         0
2  Value_3         0         1
3  Value_4         0         1
4  Value_5         1         0
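But to ignore missing columns, pass only the columns that actually belong to the df you want to analyze.
samples = ['index', 'Sample_A', 'Sample_D', 'Extra_Sample']
# intersect with the columns of the frame being selected from (df2)
final_samples = list(set(df2.columns) & set(samples))
Now you can pass final_samples, which contains only columns that exist in df2.
df3 = df2[final_samples]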
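As an alternative sketch, pandas also offers DataFrame.filter, whose items argument keeps only the labels that exist on the chosen axis and skips the rest without raising. The frame below simply mirrors the example data from the question; on the versions I am aware of the result follows the order of items, but that is worth checking on your own pandas version if order matters.

```python
import pandas as pd

df2 = pd.DataFrame({
    'Num': ['Value_1', 'Value_2', 'Value_3', 'Value_4', 'Value_5'],
    'Sample_A': [0, 1, 0, 0, 1],
    'Sample_B': [0, 0, 1, 0, 0],
    'Sample_C': [1, 0, 0, 0, 1],
    'Sample_D': [0, 0, 1, 1, 0],
}).set_index('Num')

samples = ['Sample_A', 'Sample_D', 'Sample_E']

# filter(items=...) selects the listed column labels and ignores missing ones
df3 = df2.filter(items=samples, axis='columns')
print(df3)
```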
Answer 1 (score: 2)
Try it like this:
df = pd.read_csv("data.csv", usecols=['Sample_A','Sample_D']).fillna('')
print(df)
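When a list is passed to usecols and one of the names is absent from the file, read_csv raises an error, so this has the same problem as the missing Sample_E. A hedged sketch of one way around it is to pass a callable, which pandas evaluates against each column name in the file. The CSV text below is made up to match the example data and is read from an in-memory buffer rather than data.csv.

```python
import io

import pandas as pd

# Stand-in for data.csv, built from the example values in the question
csv_text = """Num,Sample_A,Sample_B,Sample_C,Sample_D
Value_1,0,0,1,0
Value_2,1,0,0,0
Value_3,0,1,0,1
Value_4,0,0,0,1
Value_5,1,0,1,0
"""

# Names we would like to load; Sample_E does not exist in the file
wanted = {'Num', 'Sample_A', 'Sample_D', 'Sample_E'}

# The callable is applied to every column name; only names returning True are read
df = pd.read_csv(io.StringIO(csv_text), usecols=lambda name: name in wanted)
df = df.set_index('Num')
print(df)
```

Columns loaded this way come back in file order, so a reordering step is still needed afterwards if the order of the wanted list matters.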
To select all rows and only some of the columns, use a single colon for the rows.
>>> df.loc[:, ['Sample_A','Sample_D']]
The result with the dataset you provided:
>>> data2 = {'Num': ['Value_1','Value_2','Value_3','Value_4','Value_5'],
... 'Sample_A': [0,1,0,0,1],
... 'Sample_B':[0,0,1,0,0],
... 'Sample_C':[1,0,0,0,1],
... 'Sample_D':[0,0,1,1,0]}
>>> df2 = pd.DataFrame(data2)
>>> df2.set_index('Num').loc[:, ['Sample_A','Sample_D']]
         Sample_A  Sample_D
Num
Value_1         0         0
Value_2         1         0
Value_3         0         1
Value_4         0         1
Value_5         1         0
====================================
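>>> # Note: on pandas 1.0 and later, .loc raises a KeyError when a listed label
>>> # is missing; only older versions return NaN columns as shown here.
>>> df3 = df2.loc[:, samples]
>>> df3
   Sample_A  Sample_D  Sample_E
0         0         0       NaN
1         1         0       NaN
2         0         1       NaN
3         0         1       NaN
4         1         0       NaN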
OR
>>> df3 = df2.reindex(columns=samples)
>>> df3
   Sample_A  Sample_D  Sample_E
0         0         0       NaN
1         1         0       NaN
2         0         1       NaN
3         0         1       NaN
4         1         0       NaN
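Note that reindex keeps Sample_E as an all-NaN column instead of discarding it. If the goal is to drop samples that df2 does not have, one possible follow-up is to work out which names are missing and remove them again. This is only a sketch, assuming df2 is the example frame from the question.

```python
import pandas as pd

df2 = pd.DataFrame({
    'Num': ['Value_1', 'Value_2', 'Value_3', 'Value_4', 'Value_5'],
    'Sample_A': [0, 1, 0, 0, 1],
    'Sample_B': [0, 0, 1, 0, 0],
    'Sample_C': [1, 0, 0, 0, 1],
    'Sample_D': [0, 0, 1, 1, 0],
}).set_index('Num')

samples = ['Sample_A', 'Sample_D', 'Sample_E']

# Names that were requested but are absent from df2
missing = [s for s in samples if s not in df2.columns]

# reindex does not raise on unknown labels; it adds them as NaN columns
df3 = df2.reindex(columns=samples)

# Dropping the known-missing names leaves real data only, in the requested order
df3 = df3.drop(columns=missing)
print(missing)
print(df3)
```

Keeping the missing list around is also a convenient way to report which samples were left out of the analysis.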