Question

I have a dictionary of DataFrames, where each key is an integer 0, ..., 999 and each value is a DataFrame like this:

     A         B
1    10010001  17
2    10020001  5
3    10020002  11
4    10020003  2
5    10030001  86
...

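I need to iterate over the whole dictionary and collect into a new DataFrame every row where the 3rd and 4th characters of column A equal 02. In my example, only rows 2, 3 and 4 would form the new DataFrame. All values in column A are strings.

What is the most efficient way to do this in pandas?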

Answer 1

Something like the following, where d is your dictionary:

pd.concat(v[v.A.str[2:4] == '02'] for v in d.values())

Taking a dict made of your sample DataFrame repeated 3 times, with keys 0-2:

d = dict(zip(range(3), [df]*3))

this yields:

          A   B
2  10020001   5
3  10020002  11
4  10020003   2
2  10020001   5
3  10020002  11
4  10020003   2
2  10020001   5
3  10020002  11
4  10020003   2

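This should be more memory-efficient than building a list of rows or using a list comprehension, because it uses a generator expression. Because it indexes the string directly (assuming your data values are normalized), it should also be faster than using a regular expression.

If you don't like the index of the combined frame, you can always call reset_index(). For example: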
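The speed claim is easy to check on your own data. The following is a rough timing sketch, not a benchmark: the Series `s` is made-up test data (100,000 eight-digit strings), and the function names `by_slice` and `by_regex` are introduced here for illustration only. Actual timings will depend on the pandas version and the data.

```python
import timeit

import pandas as pd

# Hypothetical test data: 100,000 eight-digit strings (not from the question).
s = pd.Series([str(10000000 + i) for i in range(100_000)])


def by_slice():
    # Direct slicing of characters 3-4, as in this answer.
    return s.str[2:4] == '02'


def by_regex():
    # Regex equivalent: two digits at the start, then "02".
    return s.str.contains(r'^\d\d02', regex=True)


# Both masks must select the same rows before the timings mean anything.
assert (by_slice() == by_regex()).all()

slice_time = timeit.timeit(by_slice, number=5)
regex_time = timeit.timeit(by_regex, number=5)
print(f"slice: {slice_time:.3f}s  regex: {regex_time:.3f}s")
```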

y = pd.concat(v[v.A.str[2:4] == '02'] for v in d.values())
y.reset_index(drop=True)  # drop=True discards the old index instead of keeping it as a column

          A   B
0  10020001   5
1  10020002  11
2  10020003   2
3  10020001   5
4  10020002  11
5  10020003   2
6  10020001   5
7  10020002  11
8  10020003   2
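For reference, here is a self-contained sketch of the whole approach on the question's sample data. It assumes the sample frame is rebuilt by hand as `df`, and it passes `ignore_index=True` to `pd.concat` as an alternative way to get a fresh 0..n-1 index in one step.

```python
import pandas as pd

# The sample frame from the question, rebuilt by hand (index 1..5, A as strings).
df = pd.DataFrame(
    {
        "A": ["10010001", "10020001", "10020002", "10020003", "10030001"],
        "B": [17, 5, 11, 2, 86],
    },
    index=range(1, 6),
)

# Same frame under keys 0, 1, 2, as in the answer.
d = {k: df for k in range(3)}

# Filter each frame, then concatenate; ignore_index=True renumbers the rows.
result = pd.concat(
    (v[v["A"].str[2:4] == "02"] for v in d.values()),
    ignore_index=True,
)
print(result)
```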

Answer 2

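The first line creates an indexer that checks the 3rd and 4th characters of column A and returns a boolean Series that is True wherever they equal "02".

The second line creates a new DataFrame by applying that indexer to the original DataFrame.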

indexer = df['A'].apply(lambda x: x[2:4] == '02')
results = df.loc[indexer]

Edit: the solution above, applied to a dictionary of DataFrames:

frames = []
for df in dictionary.values():
    # Boolean mask: True where characters 3-4 of column A are '02'
    indexer = df['A'].apply(lambda x: x[2:4] == '02')
    results = df.loc[indexer]
    frames.append(results)
output = pd.concat(frames)
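If you also need to know which dictionary key each row came from, one option is to pass a dict to `pd.concat`, which uses the keys as an outer index level. This is a sketch on made-up data: the two small frames and the level names "key" and "row" are assumptions for illustration.

```python
import pandas as pd

# Hypothetical two-entry dictionary standing in for the real 1000-entry one.
dictionary = {
    0: pd.DataFrame({"A": ["10010001", "10020001"], "B": [17, 5]}),
    1: pd.DataFrame({"A": ["10020002", "10030001"], "B": [11, 86]}),
}

# Filter each frame, keeping its key; concat turns the keys into an index level.
filtered = {
    k: df.loc[df["A"].str[2:4] == "02"] for k, df in dictionary.items()
}
output = pd.concat(filtered, names=["key", "row"])
print(output)
```

Selecting `output.loc[1]` then returns only the matching rows that came from key 1.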

Answer 3

Try this:

keep = []  # hold all the rows you want to keep
for frame in frame_dict.values():
    keep.append(
        # raw string so the \d escapes reach the regex engine unchanged
        frame[frame['A'].astype(str).str.contains(r'^\d\d02', regex=True)].copy()
    )  # append the rows matching regex for start of string (^), digit (\d), digit (\d), 02
final = pd.concat(keep)  # concatenate the matching rows
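A small variation, in case it is useful: `Series.str.match` anchors the pattern at the start of each string, so the leading `^` is not needed. The sketch below uses a made-up frame whose column A holds integers, to show why the `astype(str)` step matters when the values are not already strings.

```python
import pandas as pd

# Hypothetical frame where A is stored as integers rather than strings.
frame = pd.DataFrame({"A": [10010001, 10020001, 10020002], "B": [17, 5, 11]})

as_text = frame["A"].astype(str)  # the .str accessor needs string values

by_contains = as_text.str.contains(r"^\d\d02", regex=True)
by_match = as_text.str.match(r"\d\d02")  # match() starts at position 0

# Both masks select the same rows.
assert by_contains.tolist() == by_match.tolist()
print(frame[by_match])
```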

Pandas: select rows based on a condition applied to a string
