Question

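I have a DataFrame with 4 columns, 2 of which contain string values. Is there a way to select rows based on a partial string match against a particular column?

In other words, a function or lambda function that would do something like

re.search(pattern, cell_in_question)

returning a boolean. I am familiar with the syntax of df[df['A'] == "hello world"], but I can't seem to find a way to do the same with a partial string match, say 'hello'.

Would someone be able to point me in the right direction?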

Answer 1

Based on GitHub issue #620, it looks like you'll soon be able to do the following:

df[df['A'].str.contains("hello")]

Update: vectorized string methods (i.e., Series.str) are available in pandas 0.8.1 and up.

Answer 2

I am using pandas 0.14.1 on macOS in an IPython notebook. I tried the line proposed above:

df[df['A'].str.contains("Hello|Britain")]

and got an error:

"cannot index with vector containing NA / NaN values"

but it worked perfectly once an "== True" condition was added, like this:

df[df['A'].str.contains("Hello|Britain")==True]
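The error above comes from missing values: str.contains returns NaN for NaN cells, and a mask containing NaN cannot be used for boolean indexing. A minimal sketch of an alternative to "== True", using the na parameter of str.contains (the sample data is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a missing value in column 'A'.
df = pd.DataFrame({'A': ['Hello there', np.nan, 'Great Britain', 'France']})

# na=False makes missing cells count as "no match", so the mask is purely boolean.
mask = df['A'].str.contains("Hello|Britain", na=False)
result = df[mask]
print(result)
```

This keeps the intent explicit: rows with a missing value in 'A' are simply treated as non-matching.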

Answer 3

If anyone wonders how to solve a related problem: "Select column by partial string"

Use:

df.filter(like='hello')  # select columns which contain the word hello

To select rows by partial string matching, pass axis=0 to filter:

# selects rows which contain the word hello in their index label
df.filter(like='hello', axis=0)
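Note that filter looks at labels (column names or index labels), not at cell values. A small sketch on a made-up frame shows both directions:

```python
import pandas as pd

# Hypothetical frame: some column labels and some index labels contain 'hello'.
df = pd.DataFrame(
    {'hello_a': [1, 2, 3], 'other': [4, 5, 6], 'b_hello': [7, 8, 9]},
    index=['hello world', 'goodbye', 'say hello'],
)

cols = df.filter(like='hello')          # keeps columns 'hello_a' and 'b_hello'
rows = df.filter(like='hello', axis=0)  # keeps rows 'hello world' and 'say hello'
print(cols)
print(rows)
```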

Answer 4

Quick note: if you want to select based on a partial string contained in the index, try the following:

df['stridx']=df.index
df[df['stridx'].str.contains("Hello|Britain")]
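If adding a helper column is undesirable, the same selection can probably be done on the index directly, since Index objects expose the .str accessor too. A sketch with made-up data:

```python
import pandas as pd

# Hypothetical frame with string index labels.
df = pd.DataFrame({'value': [1, 2, 3]},
                  index=['Hello world', 'France', 'Great Britain'])

# Index.str.contains returns a boolean array that can be used as a row mask.
result = df[df.index.str.contains("Hello|Britain")]
print(result)
```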

Answer 5

Say you have the following DataFrame:

>>> df = pd.DataFrame([['hello', 'hello world'], ['abcd', 'defg']], columns=['a','b'])
>>> df
       a            b
0  hello  hello world
1   abcd         defg

You can always use the in operator in a lambda expression to create your filter.

>>> df.apply(lambda x: x['a'] in x['b'], axis=1)
0     True
1    False
dtype: bool

The trick here is to use the axis=1 option of apply to pass elements to the lambda function row by row, as opposed to column by column.
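The same row-by-row test can also be written without apply, by zipping the two columns in a list comprehension. This is only a sketch of an equivalent formulation, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame([['hello', 'hello world'], ['abcd', 'defg']], columns=['a', 'b'])

# Pair up the two columns and test containment one row at a time.
mask = [a in b for a, b in zip(df['a'], df['b'])]
result = df[mask]
print(result)
```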

Answer 6

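How do I select by partial string from a pandas DataFrame?

This post is meant for readers who want to

search for a substring in a string column (the simplest case)
search for multiple substrings (similar to isin)
match a whole word from text (e.g., "blue" should match "the sky is blue" but not "bluejay")
match multiple whole words

...and would like to know more about which methods should be preferred over others.

(P.S.: I've seen a lot of questions on similar topics, so I thought it would be good to leave this here.)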

Basic Substring Search

df1 = pd.DataFrame({'col': ['foo', 'foobar', 'bar', 'baz']})
df1

      col
0     foo
1  foobar
2     bar
3     baz

To select all rows containing "foo", use str.contains:

df1[df1['col'].str.contains('foo')]

      col
0     foo
1  foobar

Note that this is a pure substring search, so you can safely disable regex-based matching.

df1[df1['col'].str.contains('foo', regex=False)]

      col
0     foo
1  foobar

Performance-wise, this does make a difference.

df2 = pd.concat([df1] * 1000, ignore_index=True)

%timeit df2[df2['col'].str.contains('foo')]
%timeit df2[df2['col'].str.contains('foo', regex=False)]

6.31 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.8 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

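Avoid using regex-based search if you don't need it.

Note
  Partial substring searches anchored at the start or end of a string can be done using str.startswith or str.endswith, respectively.

Further, for regex-based searches anchored at the start of a string, use str.match.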
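As a quick illustration of the anchored variants mentioned in the note, here is a sketch on the same df1 (the pattern 'ba[rz]' is just an example chosen for this illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'col': ['foo', 'foobar', 'bar', 'baz']})

starts = df1[df1['col'].str.startswith('foo')]   # values beginning with 'foo'
ends = df1[df1['col'].str.endswith('bar')]       # values ending with 'bar'
# str.match anchors the regex at the start of each string (like re.match).
matched = df1[df1['col'].str.match(r'ba[rz]')]
print(starts, ends, matched, sep='\n\n')
```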

Regex-based Search
Most str methods support regular expressions. For example, to find rows in df1 which contain "foo" followed by something else, we can use

df1[df1['col'].str.contains(r'foo(?!$)')]

      col
1  foobar

Multiple Substring Search

This is most easily achieved through a regex search using the regex OR pipe.

# Slightly modified example.
df4 = pd.DataFrame({'col': ['foo abc', 'foobar xyz', 'bar32', 'baz 45']})
df4

          col
0     foo abc
1  foobar xyz
2       bar32
3      baz 45

df4[df4['col'].str.contains(r'foo|baz')]

          col
0     foo abc
1  foobar xyz
3      baz 45

You can also create a list of terms, then join them:

terms = ['foo', 'baz']
df4[df4['col'].str.contains('|'.join(terms))]

          col
0     foo abc
1  foobar xyz
3      baz 45

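Sometimes it is wise to escape your terms, in case they contain characters that can be interpreted as regex metacharacters. If your terms contain any of the following characters...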

. ^ $ * + ? { } [ ] \ | ( )

then you'll need to use re.escape to escape them:

import re
df4[df4['col'].str.contains('|'.join(map(re.escape, terms)))]

          col
0     foo abc
1  foobar xyz
3      baz 45

re.escape has the effect of escaping the special characters so that they are treated literally.

re.escape(r'.foo^')
# '\\.foo\\^'
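To see why escaping matters, consider terms that actually contain metacharacters. The data and terms below are made up for illustration; the unescaped pattern treats '+' and '.' as regex operators and selects different rows:

```python
import re
import pandas as pd

# Hypothetical data where the search terms contain regex metacharacters.
df = pd.DataFrame({'col': ['1+1=2', '111', 'a.c', 'abc']})
terms = ['1+1', 'a.c']

# Unescaped: '1+1' means "one or more 1s, then a 1"; 'a.c' means "a, any char, c".
unescaped = df[df['col'].str.contains('|'.join(terms))]

# Escaped: both terms are matched as literal text.
escaped = df[df['col'].str.contains('|'.join(map(re.escape, terms)))]

print(unescaped)
print(escaped)
```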

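Matching Entire Word(s)

By default, the substring search looks for the specified substring/pattern regardless of whether it is a full word. To match only full words, we need to use regular expressions here; in particular, our pattern will need to specify word boundaries (\b).

For example,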

df3 = pd.DataFrame({'col': ['the sky is blue', 'bluejay by the window']})
df3

                     col
0        the sky is blue
1  bluejay by the window

Now consider,

df3[df3['col'].str.contains('blue')]

                     col
0        the sky is blue
1  bluejay by the window

versus

df3[df3['col'].str.contains(r'\bblue\b')]

               col
0  the sky is blue

Multiple Whole Word Search

Similar to the above, except we add a word boundary (\b) to the joined pattern.

p = r'\b(?:{})\b'.format('|'.join(map(re.escape, terms)))
df4[df4['col'].str.contains(p)]

       col
0  foo abc
3   baz 45

where p looks like this,

p
# '\\b(?:foo|baz)\\b'

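A Great Alternative: Use List Comprehensions!

Because you can! And you should! They are usually a little bit faster than string methods, because string methods are hard to vectorize and usually have loopy implementations.

Instead of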

df1[df1['col'].str.contains('foo', regex=False)]

use the in operator inside a list comprehension,

df1[['foo' in x for x in df1['col']]]

      col
0     foo
1  foobar

Instead of

regex_pattern = r'foo(?!$)'
df1[df1['col'].str.contains(regex_pattern)]

use re.compile (to cache your regex) + Pattern.search inside a list comprehension,

p = re.compile(regex_pattern, flags=re.IGNORECASE)
df1[[bool(p.search(x)) for x in df1['col']]]

      col
1  foobar

If "col" has NaNs, then instead of

df1[df1['col'].str.contains(regex_pattern, na=False)]

use

def try_search(p, x):
    try:
        return bool(p.search(x))
    except TypeError:
        return False

p = re.compile(regex_pattern)
df1[[try_search(p, x) for x in df1['col']]]

      col
1  foobar

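More Options for Partial String Matching: np.char.find, np.vectorize, DataFrame.query

In addition to str.contains and list comprehensions, you can also use the following alternatives.

np.char.find
Supports substring searches (read: no regex) only.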

df4[np.char.find(df4['col'].values.astype(str), 'foo') > -1]

          col
0     foo abc
1  foobar xyz

np.vectorize
This is a wrapper around a loop, but with less overhead than most pandas str methods.

f = np.vectorize(lambda haystack, needle: needle in haystack)
f(df1['col'], 'foo')
# array([ True,  True, False, False])

df1[f(df1['col'], 'foo')]

      col
0     foo
1  foobar

Regex solutions are possible as well:

regex_pattern = r'foo(?!$)'
p = re.compile(regex_pattern)
f = np.vectorize(lambda x: pd.notna(x) and bool(p.search(x)))
df1[f(df1['col'])]

      col
1  foobar

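DataFrame.query
Supports string methods through the python engine. This offers no visible performance benefit, but is nonetheless useful to know if you need to generate your queries dynamically.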

df1.query('col.str.contains("foo")', engine='python')

      col
0     foo
1  foobar

More information on the query and eval family of methods can be found at Dynamic Expression Evaluation in pandas using pd.eval().
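One way to read "dynamically generate your queries" is to build the query string at runtime. The sketch below assumes the search term comes from a variable (the name term is hypothetical) and contains no double quotes; otherwise it would need escaping before being interpolated:

```python
import pandas as pd

df1 = pd.DataFrame({'col': ['foo', 'foobar', 'bar', 'baz']})

term = 'bar'  # hypothetical value, e.g. taken from user input
# Build the query string at runtime; engine='python' is needed for .str methods.
result = df1.query(f'col.str.contains("{term}")', engine='python')
print(result)
```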

Recommended Usage Precedence

(First) str.contains, for its simplicity
List comprehensions, for their performance
np.vectorize
(Last) df.query

Answer 7

Here's what I ended up doing for partial string matches. If anyone has a more efficient way of doing this, please let me know.

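import re
import pandas as pd

def stringSearchColumn_DataFrame(df, colName, regex):
    newdf = pd.DataFrame()
    # Series.iteritems() was removed in pandas 2.0; Series.items() replaces it.
    for idx, record in df[colName].items():
        if re.search(regex, record):
            # Prepend every row whose value equals the matching record.
            newdf = pd.concat([df[df[colName] == record], newdf], ignore_index=True)
    return newdf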

Answer 8

Should you need to do a case-insensitive search for a string in a pandas DataFrame column:

df[df['A'].str.contains("hello", case=False)]

Answer 9

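Suppose we have a column named "ENTITY" in the DataFrame df. We can filter df to obtain the entire DataFrame in which the rows of the "ENTITY" column do not contain "DM", using a mask as follows: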

mask = df['ENTITY'].str.contains('DM')

df = df.loc[~(mask)].copy(deep=True)
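If the "ENTITY" column can contain missing values, the mask itself will contain NaN and the negation may not behave as intended. A hedged sketch, with made-up data, that treats missing values as "does not contain DM":

```python
import numpy as np
import pandas as pd

# Hypothetical data; one ENTITY value is missing.
df = pd.DataFrame({'ENTITY': ['DM-01', 'XY-02', np.nan, 'ABDM']})

# na=False yields a purely boolean mask, so ~mask is well defined
# and rows with a missing ENTITY are kept.
mask = df['ENTITY'].str.contains('DM', na=False)
filtered = df.loc[~mask].copy()
print(filtered)
```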

Answer 10

You can try treating them as strings:

df[df['A'].astype(str).str.contains("Hello|Britain")]

Answer 11

Using contains didn't work well for my strings with special characters. find worked, though.

df[df['A'].str.find("hello") != -1]
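A likely reason is that contains interprets its pattern as a regular expression by default, whereas find searches for literal text. Assuming that is the cause, passing regex=False to contains gives the same literal behaviour; the sample strings below are made up:

```python
import pandas as pd

# Hypothetical strings containing regex metacharacters.
df = pd.DataFrame({'A': ['price (USD)', 'price USD', 'a+b', 'ab']})

# With regex=False the pattern is matched literally, so '(' and '+' need no escaping.
paren = df[df['A'].str.contains('(USD)', regex=False)]
plus = df[df['A'].str.contains('a+b', regex=False)]

# str.find gives the same selection for the literal search.
plus_find = df[df['A'].str.find('a+b') != -1]
print(paren, plus, sep='\n\n')
```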

Answer 12

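Maybe you want to search for some text in all columns of the pandas DataFrame, and not just in a subset of them. In this case, the following code will help.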

df[df.apply(lambda row: row.astype(str).str.contains('String To Find').any(), axis=1)]

Warning: this method is relatively slow, albeit convenient.
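A possible variation, sketched here on made-up data, applies str.contains once per column instead of once per row and then combines the per-column masks. Whether it is actually faster depends on the shape of the data, so treat it as an option to measure rather than a guaranteed improvement:

```python
import pandas as pd

# Hypothetical frame with mixed string and numeric columns.
df = pd.DataFrame({'name': ['alice', 'bob', 'carol'],
                   'city': ['Paris', 'Lyon', 'Bobtown'],
                   'age': [30, 41, 25]})

# Convert everything to strings, test each column, then OR the results across columns.
mask = df.astype(str).apply(lambda col: col.str.contains('ob', case=False)).any(axis=1)
result = df[mask]
print(result)
```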

Answer 13

A more generalised example, if looking for parts of a word OR specific words in a string:

df = pd.DataFrame([('cat andhat', 1000.0), ('hat', 2000000.0), ('the small dog', 1000.0), ('fog', 330000.0),('pet', 330000.0)], columns=['col1', 'col2'])

Specific parts of a sentence or word:

searchfor = '.*cat.*hat.*|.*the.*dog.*'

Create a column showing the affected rows (they can always be filtered out as necessary):

df["TrueFalse"]=df['col1'].str.contains(searchfor, regex=True)

    col1             col2           TrueFalse
0   cat andhat       1000.0         True
1   hat              2000000.0      False
2   the small dog    1000.0         True
3   fog              330000.0       False
4   pet              330000.0       False

Answer 14

There are answers before this one that accomplish the requested feature; still, I would like to show the most general way:

df.filter(regex=".*STRING_YOU_LOOK_FOR.*")

This way you get the column you are looking for, however its name is written.

(Obviously, you have to write the proper regex expression for each case.)

Answer 15

My 2c worth:

I did the following:

sale_method = pd.DataFrame(model_data['Sale Method'].str.upper())
sale_method['sale_classification'] = np.where(
    sale_method['Sale Method'].isin(['PRIVATE']), 'private',
    np.where(sale_method['Sale Method'].str.contains('AUCTION'), 'auction', 'other'))
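The nested np.where calls can arguably be flattened with np.select, which evaluates a list of conditions in order and falls back to a default. This is a sketch of that alternative, not the answer's own method; model_data below is a made-up stand-in for the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for model_data.
model_data = pd.DataFrame(
    {'Sale Method': ['Private', 'Online auction', 'Tender', 'AUCTION']}
)

method = model_data['Sale Method'].str.upper()
conditions = [method.isin(['PRIVATE']), method.str.contains('AUCTION')]
choices = ['private', 'auction']

# np.select picks the choice for the first condition that holds, else the default.
model_data['sale_classification'] = np.select(conditions, choices, default='other')
print(model_data)
```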
