How to select and order columns in a DataFrame using an array in Python

Date: 2018-10-03 03:42:02

Tags: python arrays pandas dataframe

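I have a fairly large DataFrame df2 (~50,000 rows x 2,000 columns). The column headers are sample names. I also have a DataFrame df1 whose index is the list of samples to include in the analysis. I want to use the sample list from df1's index to select only the columns of df2 for those samples and discard the rest. I also want to keep the samples in the order they appear in df1's index.

Example data: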

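import pandas as pd

# df1
data1 = {'Sample': ['Sample_A','Sample_D', 'Sample_E'], 
        'Location': ['Bangladesh', 'Myanmar', 'Thailand'],
        'Year':[2012, 2014, 2015]}
df1 = pd.DataFrame(data1)
# set_index returns a new DataFrame, so the result must be assigned;
# drop=False keeps 'Sample' available as a regular column as well
df1 = df1.set_index('Sample', drop=False)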

# df2
data2 = {'Num': ['Value_1','Value_2','Value_3','Value_4','Value_5'], 
        'Sample_A': [0,1,0,0,1],
        'Sample_B':[0,0,1,0,0],
        'Sample_C':[1,0,0,0,1],
        'Sample_D':[0,0,1,1,0]}
df2 = pd.DataFrame(data2)
# set_index returns a new DataFrame, so the result must be assigned
df2 = df2.set_index('Num')

First, I generate the list of wanted samples from df1, for example:

samples = df1['Sample'].tolist()

'samples' is:

['Sample_A', 'Sample_D', 'Sample_E']

Using 'samples', the output DataFrame df3 that I want should look like this:

index    Sample_A  Sample_D
Value_1         0         0
Value_2         1         0
Value_3         0         1
Value_4         0         1
Value_5         1         0

But if I use

df3 = df2[samples]

then I get the error message:

"['Sample_E'] not in index"

So how can I skip the samples that are not found in df2 and avoid this error?

Update: a working solution

# 1. Define samples to use from df1
samples = df1['Sample'].tolist()
# Only include samples that are found in df2 as well.
# A list comprehension (instead of a set intersection) keeps the order of samples.
final_samples = [s for s in samples if s in df2.columns]
# Make new df with columns corresponding to final_samples
df3 = df2.loc[:, final_samples]
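
The filtering step can also be written as a small helper that keeps the order of the requested samples and reports which ones were dropped. This is only a sketch: the function name `select_existing` is hypothetical (not part of pandas), and it assumes the column labels in df2 are unique.

```python
import pandas as pd

def select_existing(df, wanted):
    """Return (subset, missing): the columns of df named in wanted, in wanted's order."""
    present = [name for name in wanted if name in df.columns]
    missing = [name for name in wanted if name not in df.columns]
    return df.loc[:, present], missing

data2 = {'Num': ['Value_1', 'Value_2', 'Value_3', 'Value_4', 'Value_5'],
         'Sample_A': [0, 1, 0, 0, 1],
         'Sample_B': [0, 0, 1, 0, 0],
         'Sample_C': [1, 0, 0, 0, 1],
         'Sample_D': [0, 0, 1, 1, 0]}
df2 = pd.DataFrame(data2).set_index('Num')

# Deliberately not in df2's column order, and with one absent sample
samples = ['Sample_D', 'Sample_A', 'Sample_E']
df3, missing = select_existing(df2, samples)
print(df3.columns.tolist())
print(missing)
```

Returning the missing names alongside the subset makes it easy to log or check which samples were silently skipped.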

2 Answers:

Answer 0 (score: 2)

You can do it like this. The list of column names is given in the order you actually want.

import pandas as pd

data = {'index': ['Value_1','Value_2','Value_3','Value_4','Value_5'], 
        'Sample_A': [0,1,0,0,1],
        'Sample_B':[0,0,1,0,0],
        'Sample_C':[1,0,0,0,1],
        'Sample_D':[0,0,1,1,0]}
df = pd.DataFrame(data)
# 'index' stays a regular column here, so it can be selected by name
# together with the sample columns, in the order listed
df1 = df[['index', 'Sample_A', 'Sample_D']]
print(df1)

Output:

     index  Sample_A  Sample_D
0  Value_1         0         0
1  Value_2         1         0
2  Value_3         0         1
3  Value_4         0         1
4  Value_5         1         0

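To ignore missing columns, keep only the names that actually exist in the DataFrame you want to analyze (df2 here):

samples = ['index', 'Sample_A', 'Sample_D','Extra_Sample']
# Set intersection with df2's columns; the order of the result is not guaranteed
final_samples = list(set(df2.columns) & set(samples))

Now you can pass final_samples, which contains only columns present in df2:

df3 = df2[final_samples]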
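
A set intersection does not guarantee any particular order, so if the order of the requested samples matters, the result can be sorted back by each name's position in the original list. A minimal sketch, assuming the names in `samples` are unique; the `columns` list here is a stand-in for `df2.columns`.

```python
samples = ['Sample_D', 'Sample_A', 'Extra_Sample']
columns = ['Sample_A', 'Sample_B', 'Sample_C', 'Sample_D']  # stand-in for df2.columns

# Names present in both; a set has no defined order
common = set(columns) & set(samples)
# Sort by position in samples to recover the requested order
final_samples = sorted(common, key=samples.index)
print(final_samples)
```

For very long sample lists, `samples.index` is a linear search on each call, so a list comprehension over `samples` with a set membership test may be the cheaper way to get the same result.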

Answer 1 (score: 2)

Try it like this:

df = pd.read_csv("data.csv", usecols=['Sample_A','Sample_D']).fillna('')
print(df)
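
If some of the wanted columns might be absent from the file, `usecols` also accepts a callable that is evaluated against each column name, so names that do not exist are simply never matched. The sketch below reads from an in-memory string instead of `data.csv` so that it is self-contained; the CSV content and the `wanted` list are made up for illustration. Because `read_csv` returns columns in file order regardless of `usecols`, the result is reordered afterwards.

```python
import io
import pandas as pd

# Illustrative CSV content standing in for data.csv
csv_text = (
    "Num,Sample_A,Sample_B,Sample_D\n"
    "Value_1,0,0,0\n"
    "Value_2,1,0,0\n"
)
wanted = ['Sample_D', 'Sample_A', 'Sample_E']

# Keep the 'Num' column plus any wanted column that exists in the file
df = pd.read_csv(io.StringIO(csv_text),
                 usecols=lambda name: name == 'Num' or name in wanted)
df = df.set_index('Num')
# Reorder to follow wanted, skipping names the file did not contain
df = df[[name for name in wanted if name in df.columns]]
print(df)
```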

To select all rows and only some columns, use a single colon for the rows.

>>> df.loc[:, ['Sample_A','Sample_D']]

The result with the dataset you provided:

>>> data2 = {'Num': ['Value_1','Value_2','Value_3','Value_4','Value_5'],
...         'Sample_A': [0,1,0,0,1],
...         'Sample_B':[0,0,1,0,0],
...         'Sample_C':[1,0,0,0,1],
...         'Sample_D':[0,0,1,1,0]}
>>> df2 = pd.DataFrame(data2)
>>> df2.set_index('Num').loc[:, ['Sample_A','Sample_D']]
         Sample_A  Sample_D
Num
Value_1         0         0
Value_2         1         0
Value_3         0         1
Value_4         0         1
Value_5         1         0

====================================

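>>> # Note: this output comes from older pandas; since pandas 1.0,
>>> # .loc with a missing label ('Sample_E') raises KeyError instead
>>> df3 = df2.loc[:, samples]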
>>> df3
   Sample_A  Sample_D  Sample_E
0         0         0       NaN
1         1         0       NaN
2         0         1       NaN
3         0         1       NaN
4         1         0       NaN

OR

>>> df3 = df2.reindex(columns=samples)
>>> df3
   Sample_A  Sample_D  Sample_E
0         0         0       NaN
1         1         0       NaN
2         0         1       NaN
3         0         1       NaN
4         1         0       NaN
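
If the placeholder columns are wanted but `NaN` is not, `reindex` takes a `fill_value`, and the names that were absent can be listed separately. A sketch using the question's data; whether 0 is a sensible placeholder depends on the analysis and is only an assumption here.

```python
import pandas as pd

data2 = {'Num': ['Value_1', 'Value_2', 'Value_3', 'Value_4', 'Value_5'],
         'Sample_A': [0, 1, 0, 0, 1],
         'Sample_B': [0, 0, 1, 0, 0],
         'Sample_C': [1, 0, 0, 0, 1],
         'Sample_D': [0, 0, 1, 1, 0]}
df2 = pd.DataFrame(data2).set_index('Num')
samples = ['Sample_A', 'Sample_D', 'Sample_E']

# Columns follow the order of samples; absent ones are created and filled with 0
df3 = df2.reindex(columns=samples, fill_value=0)
# Requested names that df2 does not contain
missing = [name for name in samples if name not in df2.columns]
print(df3)
print(missing)
```

To drop the absent samples entirely rather than fill them, filter `samples` against `df2.columns` before selecting.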