Dropping rows by matching a list of substrings

Time: 2017-02-25 12:35:06

Tags: python string pandas dataframe pattern-matching

I have a list of strings as follows:

drop_list = [ "www.stackoverflow", "www.youtube."]

I have a Pandas dataframe df with a column named column_name1, whose values may or may not contain the substrings in drop_list. A sample of df:

columnname1
------------
https://stackoverflow.com/python-pandas-drop-rows-string-match-from-list
https://stackoverflow.com/deleting-dataframe-row-in-pandas-column-value
https://textblob.readthedocs.io/en/dev/quickstart.html#create-a-textblob
https://textblob.readthedocs.io/en/dev/
https://textblob.readthedocs.io/en/stackoverflow
https://textblob.readthedocs.io/en/youtube
https://www.youtube.com/watch?v=sHkjdkjsk
https://www.youtube.com/watch?v=_jksjdkjklI
https://www.youtube.com/watch?v=jkjskjkjkjkw

So I want to drop every row of df whose value contains a substring from drop_list.

If I understand correctly, when drop_list holds the exact values I want to match, I can use the following code, as described in an SO question.

df[~df['column_name1'].isin(to_drop)]

Or, if there is only a single value, I can use the str.contains method suggested in this answer:

df[~df['column_name1'].str.contains("XYZ")]

Now, how do I combine the two approaches to drop the rows? The desired output drops all rows containing stackoverflow or youtube URLs from my dataframe:

columnname1
------------
https://textblob.readthedocs.io/en/dev/quickstart.html#create-a-textblob
https://textblob.readthedocs.io/en/dev/
https://textblob.readthedocs.io/en/stackoverflow
https://textblob.readthedocs.io/en/youtube

If I run df = df[~df['col1'].str.contains('|'.join(to_drop))] with to_drop as is, it keeps the stackoverflow URLs but drops the youtube URLs.

If I make the list more generic, e.g. to_drop = ["stackoverflow", "youtube"], it also drops these rows, which I want to keep:

https://textblob.readthedocs.io/en/stackoverflow
https://textblob.readthedocs.io/en/youtube

So all I want is to drop every row that contains a stackoverflow or youtube URL. I am trying to avoid the urlparse library!

Here is an MWE.
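
For reference, the two attempts described above can be reproduced with a short sketch like the one below. It is not the linked MWE. It assumes the column is named col1 (as in the str.contains call above) and uses the sample URLs from the question. The names attempt1, attempt2 and generic are made up for illustration.

```python
import pandas as pd

# Sample data from the question; the column name "col1" is an assumption
df = pd.DataFrame({"col1": [
    "https://stackoverflow.com/python-pandas-drop-rows-string-match-from-list",
    "https://stackoverflow.com/deleting-dataframe-row-in-pandas-column-value",
    "https://textblob.readthedocs.io/en/dev/quickstart.html#create-a-textblob",
    "https://textblob.readthedocs.io/en/dev/",
    "https://textblob.readthedocs.io/en/stackoverflow",
    "https://textblob.readthedocs.io/en/youtube",
    "https://www.youtube.com/watch?v=sHkjdkjsk",
    "https://www.youtube.com/watch?v=_jksjdkjklI",
    "https://www.youtube.com/watch?v=jkjskjkjkjkw",
]})

# Attempt 1: the list as given. The stackoverflow URLs have no "www." prefix,
# so "www.stackoverflow" never matches and those rows survive.
to_drop = ["www.stackoverflow", "www.youtube."]
attempt1 = df[~df["col1"].str.contains("|".join(to_drop))]

# Attempt 2: a more generic list. Now "stackoverflow" and "youtube" also match
# inside the path of the textblob URLs, so too many rows are dropped.
generic = ["stackoverflow", "youtube"]
attempt2 = df[~df["col1"].str.contains("|".join(generic))]

print(attempt1.index.tolist())  # stackoverflow rows 0 and 1 are still present
print(attempt2.index.tolist())  # textblob rows 4 and 5 are gone as well
```

The first attempt is too specific and the second too broad, because both match against the whole URL rather than only its host part.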

1 answer:

Answer 0: (score: 0)

Try this:

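import re

# Reduce each entry to its bare site name, e.g. "www.youtube." -> "youtube"
drp = [re.sub(r'www\.|\.$|\.com', '', x) for x in to_drop]
# Extract the host part of each URL, then keep rows whose host matches no name in drp
df[~df.col1.str.extract(r'http[s]*://([^/]*).*', expand=False)
  .str.contains('|'.join(drp))]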

Yields:

                                                                       col1
2  https://textblob.readthedocs.io/en/dev/quickstart.html#create-a-textblob
3                                   https://textblob.readthedocs.io/en/dev/
4                          https://textblob.readthedocs.io/en/stackoverflow
5                                https://textblob.readthedocs.io/en/youtube

Explanation:

In [38]: drp
Out[38]: ['stackoverflow', 'youtube']

In [41]: df.col1.str.extract(r'http[s]*://([^/]*).*', expand=False)
Out[41]:
0          stackoverflow.com
1          stackoverflow.com
2    textblob.readthedocs.io
3    textblob.readthedocs.io
4    textblob.readthedocs.io
5    textblob.readthedocs.io
6            www.youtube.com
7            www.youtube.com
8            www.youtube.com
Name: col1, dtype: object

In [42]: df.col1.str.extract(r'http[s]*://([^/]*).*', expand=False).str.contains('|'.join(drp))
Out[42]:
0     True
1     True
2    False
3    False
4    False
5    False
6     True
7     True
8     True
Name: col1, dtype: bool
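
The steps above can be put together into one self-contained sketch. This is a variation on the answer, not its exact code. It assumes the column is named col1 and rebuilds df from the question's sample. It also adds three things that are not in the answer: re.escape on each name, so that characters such as "." are matched literally; na=False, so that values that are not http(s) URLs count as non-matches; and https? in place of http[s]*. The names pattern, host, mask and result are made up for illustration.

```python
import re

import pandas as pd

# Sample data from the question; the column name "col1" is an assumption
df = pd.DataFrame({"col1": [
    "https://stackoverflow.com/python-pandas-drop-rows-string-match-from-list",
    "https://stackoverflow.com/deleting-dataframe-row-in-pandas-column-value",
    "https://textblob.readthedocs.io/en/dev/quickstart.html#create-a-textblob",
    "https://textblob.readthedocs.io/en/dev/",
    "https://textblob.readthedocs.io/en/stackoverflow",
    "https://textblob.readthedocs.io/en/youtube",
    "https://www.youtube.com/watch?v=sHkjdkjsk",
    "https://www.youtube.com/watch?v=_jksjdkjklI",
    "https://www.youtube.com/watch?v=jkjskjkjkjkw",
]})
to_drop = ["www.stackoverflow", "www.youtube."]

# Same normalisation as in the answer: strip "www.", a trailing "." and ".com"
drp = [re.sub(r'www\.|\.$|\.com', '', x) for x in to_drop]

# Escape each name so regex metacharacters are treated literally (an addition)
pattern = "|".join(re.escape(x) for x in drp)

# Host part of each URL: everything between "://" and the next "/"
host = df["col1"].str.extract(r'https?://([^/]*)', expand=False)

# na=False treats rows with no extractable host as "no match" (an addition)
mask = host.str.contains(pattern, na=False)
result = df[~mask]

print(result)
```

Because only the host is matched, the textblob URLs that merely mention stackoverflow or youtube in their path are kept.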