Question

I have a list of strings as follows:

drop_list = [ "www.stackoverflow", "www.youtube."]

I have a Pandas DataFrame df with a column named column_name1 whose values may or may not contain substrings from drop_list. A sample df looks like this:

columnname1
------------
https://stackoverflow.com/python-pandas-drop-rows-string-match-from-list
https://stackoverflow.com/deleting-dataframe-row-in-pandas-column-value
https://textblob.readthedocs.io/en/dev/quickstart.html#create-a-textblob
https://textblob.readthedocs.io/en/dev/
https://textblob.readthedocs.io/en/stackoverflow
https://textblob.readthedocs.io/en/youtube
https://www.youtube.com/watch?v=sHkjdkjsk
https://www.youtube.com/watch?v=_jksjdkjklI
https://www.youtube.com/watch?v=jkjskjkjkjkw

So, I want to drop every row of df that contains a substring from drop_list.

If I understand correctly, if drop_list held the exact values I want to match, I could use the following code, as per the SO question.

df[~df['column_name1'].isin(to_drop)]

Or I could use the str.contains method, as suggested in this answer, if it is just a single value:

df[~df['column_name1'].str.contains("XYZ")]
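For reference, a minimal sketch of how the two methods differ. The Series s and the names exact and partial are made up for illustration and are not part of the original data:

```python
import pandas as pd

# Hypothetical values, chosen only to contrast the two methods.
s = pd.Series(["abc", "XYZ", "fooXYZbar"])

# isin matches whole values only.
exact = s.isin(["XYZ"])

# str.contains matches a substring (interpreted as a regex by default).
partial = s.str.contains("XYZ")

print(exact.tolist())    # [False, True, False]
print(partial.tolist())  # [False, True, True]
```

So isin cannot catch a URL that merely contains one of the listed strings, while str.contains can.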

Now, how do I combine the two approaches to drop rows? The desired output is my DataFrame with all stackoverflow and youtube URLs dropped:

columnname1
------------
https://textblob.readthedocs.io/en/dev/quickstart.html#create-a-textblob
https://textblob.readthedocs.io/en/dev/
https://textblob.readthedocs.io/en/stackoverflow
https://textblob.readthedocs.io/en/youtube

If I run df = df[~df['col1'].str.contains('|'.join(to_drop))] with to_drop as it is, it keeps the stackoverflow URLs but drops the youtube URLs.

If I make the list more generic, as in to_drop = ["stackoverflow", "youtube"], it also drops:

https://textblob.readthedocs.io/en/stackoverflow
https://textblob.readthedocs.io/en/youtube
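
Both behaviours can be reproduced with a sketch like the one below. The DataFrame is rebuilt from the sample table above under the column name col1, and kept_specific and kept_generic are illustrative names. The first list fails on stackoverflow because those URLs have no "www." prefix; the second is too broad because it matches anywhere in the URL, including the path.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({"col1": [
    "https://stackoverflow.com/python-pandas-drop-rows-string-match-from-list",
    "https://stackoverflow.com/deleting-dataframe-row-in-pandas-column-value",
    "https://textblob.readthedocs.io/en/dev/quickstart.html#create-a-textblob",
    "https://textblob.readthedocs.io/en/dev/",
    "https://textblob.readthedocs.io/en/stackoverflow",
    "https://textblob.readthedocs.io/en/youtube",
    "https://www.youtube.com/watch?v=sHkjdkjsk",
    "https://www.youtube.com/watch?v=_jksjdkjklI",
    "https://www.youtube.com/watch?v=jkjskjkjkjkw",
]})

# Attempt 1: the list as given. Only the youtube rows match.
to_drop = ["www.stackoverflow", "www.youtube."]
kept_specific = df[~df["col1"].str.contains("|".join(to_drop))]
print(kept_specific.index.tolist())  # [0, 1, 2, 3, 4, 5]

# Attempt 2: generic names. Rows 4 and 5 are dropped too,
# because the names also appear in their paths.
to_drop = ["stackoverflow", "youtube"]
kept_generic = df[~df["col1"].str.contains("|".join(to_drop))]
print(kept_generic.index.tolist())  # [2, 3]
```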

So all I am trying to do is drop every row containing a stackoverflow or youtube URL. I am avoiding the urlparse library!

Here is an MWE.

Answer 1

Try this:

import re

drp = [re.sub(r'www\.|\.$|\.com', '', x) for x in to_drop]
df[~df.col1.str.extract(r'http[s]*://([^/]*).*', expand=False)
  .str.contains('|'.join(drp))]

Yields:

                                                                       col1
2  https://textblob.readthedocs.io/en/dev/quickstart.html#create-a-textblob
3                                   https://textblob.readthedocs.io/en/dev/
4                          https://textblob.readthedocs.io/en/stackoverflow
5                                https://textblob.readthedocs.io/en/youtube

Explanation:

In [38]: drp
Out[38]: ['stackoverflow', 'youtube']

In [41]: df.col1.str.extract(r'http[s]*://([^/]*).*', expand=False)
Out[41]:
0          stackoverflow.com
1          stackoverflow.com
2    textblob.readthedocs.io
3    textblob.readthedocs.io
4    textblob.readthedocs.io
5    textblob.readthedocs.io
6            www.youtube.com
7            www.youtube.com
8            www.youtube.com
Name: col1, dtype: object

In [42]: df.col1.str.extract(r'http[s]*://([^/]*).*', expand=False).str.contains('|'.join(drp))
Out[42]:
0     True
1     True
2    False
3    False
4    False
5    False
6     True
7     True
8     True
Name: col1, dtype: bool
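
Put together, the steps above can be run end to end with a sketch like this one. The DataFrame is rebuilt from the question's sample data, and hosts, mask and df_kept are illustrative names that do not appear in the answer:

```python
import re

import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({"col1": [
    "https://stackoverflow.com/python-pandas-drop-rows-string-match-from-list",
    "https://stackoverflow.com/deleting-dataframe-row-in-pandas-column-value",
    "https://textblob.readthedocs.io/en/dev/quickstart.html#create-a-textblob",
    "https://textblob.readthedocs.io/en/dev/",
    "https://textblob.readthedocs.io/en/stackoverflow",
    "https://textblob.readthedocs.io/en/youtube",
    "https://www.youtube.com/watch?v=sHkjdkjsk",
    "https://www.youtube.com/watch?v=_jksjdkjklI",
    "https://www.youtube.com/watch?v=jkjskjkjkjkw",
]})
to_drop = ["www.stackoverflow", "www.youtube."]

# Strip "www.", a trailing "." and ".com" to leave bare site names.
drp = [re.sub(r'www\.|\.$|\.com', '', x) for x in to_drop]

# Extract only the host part of each URL, so path segments are ignored.
hosts = df.col1.str.extract(r'http[s]*://([^/]*).*', expand=False)

# Flag rows whose host contains any of the site names, then keep the rest.
mask = hosts.str.contains('|'.join(drp))
df_kept = df[~mask]
print(df_kept.index.tolist())  # [2, 3, 4, 5]
```

Matching against the host rather than the full URL is what keeps rows 4 and 5, whose paths merely end in stackoverflow and youtube.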
