Question

I have an input string that looks like this:

ms = 'hello stack overflow friends'

I also have a pandas DataFrame with the following structure:

      string  priority  value
0         hi         1      2
1  astronaut        10      3
2   overflow         3     -1
3     varfoo         4      1
4      hello         2      0

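I am then trying to perform the following simple algorithm:

Sort the pandas DataFrame by the df['priority'] column in ascending order.
Check whether the string ms contains the word in df['string'].
If it does, return the corresponding df['value'].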

This is my current approach:

import pandas as pd

ms = 'hello stack overflow friends'

df = pd.DataFrame({'string': ['hi', 'astronaut', 'overflow', 'varfoo', 'hello'],
                   'priority': [1, 10, 3, 4, 2],
                   'value': [2, 3, -1, 1, 0]})

final_val = None

for _, row in df.sort_values('priority').iterrows():
    # print the current row for debugging purposes
    print(row['string'], row['priority'], row['value'])

    if ms.find(row['string']) > -1:
        final_val = row['value']
        break

print()
print("The final value for '", ms, "' is ", final_val)

Which returns the following:

hi 1 2
hello 2 0

The final value for ' hello stack overflow friends ' is  0

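This code works fine, but my df has 20K rows and I need to perform this kind of search more than 1K times.

This drastically reduces the performance of my process. Is there a better (simpler) way than my approach, one that uses pure pandas and avoids unnecessary loops?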

Answer 1

Write a function that can be applied to the DataFrame instead of using iterrows:

match_set = set(ms.split())
def check_matches(row):
    return row['value'] if row['string'] in match_set else None

df['matched'] = df.apply(check_matches, axis=1)

Which gives you:

   priority     string  value  matched
0         1         hi      2      NaN
1        10  astronaut      3      NaN
2         3   overflow     -1     -1.0
3         4     varfoo      1      NaN
4         2      hello      0      0.0

You can then sort by priority and take the first non-NaN value of df.matched to get what you call final_val.

df.sort_values('priority').matched.dropna().iloc[0]
0.0
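
The same lookup can also be sketched without a row-wise apply. This is only a minimal sketch, assuming whole-word matching is acceptable (as with match_set above); first_match_value is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({'string': ['hi', 'astronaut', 'overflow', 'varfoo', 'hello'],
                   'priority': [1, 10, 3, 4, 2],
                   'value': [2, 3, -1, 1, 0]})

def first_match_value(df, ms):
    """Return the value of the matching row with the smallest priority number.

    Hypothetical helper: matches whole words of ms, not substrings.
    """
    words = set(ms.split())
    # Series.isin is vectorized, so no Python-level loop over rows is needed
    matched = df[df['string'].isin(words)]
    if matched.empty:
        return None
    # idxmin gives the index label of the smallest priority among the matches
    return matched.loc[matched['priority'].idxmin(), 'value']

result = first_match_value(df, 'hello stack overflow friends')
print(result)
```

Filtering first and then taking the minimum priority avoids sorting the whole DataFrame for every search.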

Alternatively, you can sort the df and convert it to a list of tuples:

l = df.sort_values('priority').apply(lambda r: (r['string'], r['value']), axis=1).tolist()

Giving:

l
[('hi', 2), ('hello', 0), ('overflow', -1), ('varfoo', 1), ('astronaut', 3)]

And write a function that stops at the first match:

def check_matches(l):
    for (k, v) in l:
        if k in match_set:
            return v
check_matches(l)
0
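
Since the question mentions running this search more than 1K times, the sort probably only needs to happen once. The following is a minimal sketch, assuming the DataFrame does not change between searches; build_lookup and find_value are hypothetical names, and queries is made-up sample input. It uses a substring test to mirror the ms.find check in the question rather than whole-word matching.

```python
import pandas as pd

df = pd.DataFrame({'string': ['hi', 'astronaut', 'overflow', 'varfoo', 'hello'],
                   'priority': [1, 10, 3, 4, 2],
                   'value': [2, 3, -1, 1, 0]})

def build_lookup(df):
    """Sort once and return (string, value) pairs in ascending priority order."""
    ordered = df.sort_values('priority')
    return list(zip(ordered['string'], ordered['value']))

def find_value(lookup, ms):
    """Return the value of the first pair whose string occurs in ms, else None."""
    # next() stops at the first match, like the break in the original loop
    return next((v for s, v in lookup if s in ms), None)

lookup = build_lookup(df)  # built once, reused for every search

# made-up sample inputs for illustration
queries = ['hello stack overflow friends', 'an astronaut overflow', 'no match']
results = [find_value(lookup, q) for q in queries]
print(results)
```

Each search still scans the list in the worst case, but the per-row overhead of iterrows and the repeated sort are gone.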

How to perform a sorted search in a pandas DataFrame?
