I have a list of strings as follows:
drop_list = [ "www.stackoverflow", "www.youtube."]
I have a Pandas dataframe df with a column named column_name1, whose values may or may not contain substrings from drop_list. A sample df looks like this:
columnname1
------------
https://stackoverflow.com/python-pandas-drop-rows-string-match-from-list
https://stackoverflow.com/deleting-dataframe-row-in-pandas-column-value
https://textblob.readthedocs.io/en/dev/quickstart.html#create-a-textblob
https://textblob.readthedocs.io/en/dev/
https://textblob.readthedocs.io/en/stackoverflow
https://textblob.readthedocs.io/en/youtube
https://www.youtube.com/watch?v=sHkjdkjsk
https://www.youtube.com/watch?v=_jksjdkjklI
https://www.youtube.com/watch?v=jkjskjkjkjkw
So I want to delete from df every row that contains a substring from drop_list.
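If I understand correctly, if drop_list held the exact values I wanted to match, I could use the following code, as suggested in this SO question.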
df[~df['column_name1'].isin(to_drop)]
Or, if it were only a single value, I could use the str.contains method suggested in this answer:
df[~df['column_name1'].str.contains("XYZ")]
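To make the difference between the two calls concrete, here is a minimal sketch on a three-row frame. The URLs come from the sample above; the frame itself and the column name col1 are assumptions for illustration. isin compares whole cell values, while str.contains looks for a substring and, by default, treats its argument as a regular expression.

```python
import pandas as pd

# Hypothetical three-row frame built from URLs in the sample data
df = pd.DataFrame({"col1": [
    "https://stackoverflow.com/deleting-dataframe-row-in-pandas-column-value",
    "https://textblob.readthedocs.io/en/dev/",
    "https://www.youtube.com/watch?v=sHkjdkjsk",
]})

# isin: drops a row only when the whole cell equals one of the listed values
exact = df[~df["col1"].isin(["https://textblob.readthedocs.io/en/dev/"])]

# str.contains: drops a row when the pattern occurs anywhere in the cell
substr = df[~df["col1"].str.contains("youtube")]

print(exact.index.tolist())
print(substr.index.tolist())
```

Passing a bare fragment such as "youtube" to isin would drop nothing here, because no cell is exactly equal to that fragment.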
Now, how do I combine the two approaches to drop the rows? The desired output is my dataframe with every row containing a stackoverflow or youtube URL removed:
columnname1
------------
https://textblob.readthedocs.io/en/dev/quickstart.html#create-a-textblob
https://textblob.readthedocs.io/en/dev/
https://textblob.readthedocs.io/en/stackoverflow
https://textblob.readthedocs.io/en/youtube
If I run df = df[~df['col1'].str.contains('|'.join(to_drop))] with to_drop as it is, it keeps the stackoverflow URLs but removes the youtube URLs.
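A likely reason, sketched below on three of the sample URLs: the joined pattern is 'www.stackoverflow|www.youtube.', and the stackoverflow URLs in the sample have no "www." prefix, so only the youtube URLs match. The Series urls is an assumption for illustration.

```python
import pandas as pd

to_drop = ["www.stackoverflow", "www.youtube."]
pattern = "|".join(to_drop)  # 'www.stackoverflow|www.youtube.'

# Three URLs taken from the sample data
urls = pd.Series([
    "https://stackoverflow.com/python-pandas-drop-rows-string-match-from-list",
    "https://textblob.readthedocs.io/en/stackoverflow",
    "https://www.youtube.com/watch?v=sHkjdkjsk",
])

# The pattern is interpreted as a regular expression by default
mask = urls.str.contains(pattern)
print(mask.tolist())  # [False, False, True]
```

Note that the dots in the pattern are regex wildcards here, so "www.youtube." would also match strings such as "wwwXyoutubeY".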
If I make the list more generic, as in to_drop = ["stackoverflow", "youtube"], it also removes:
https://textblob.readthedocs.io/en/stackoverflow
https://textblob.readthedocs.io/en/youtube
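The over-matching can be reproduced with a short sketch: a bare name matches anywhere in the URL, including the path, so a readthedocs page that merely ends in "stackoverflow" is flagged too. The Series urls is an assumption for illustration, using URLs from the sample.

```python
import pandas as pd

to_drop = ["stackoverflow", "youtube"]

urls = pd.Series([
    "https://stackoverflow.com/deleting-dataframe-row-in-pandas-column-value",
    "https://textblob.readthedocs.io/en/stackoverflow",
    "https://textblob.readthedocs.io/en/dev/",
])

# The second URL matches because of its path, not its host
mask = urls.str.contains("|".join(to_drop))
print(mask.tolist())  # [True, True, False]
```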
So all I am trying to do is remove every row that contains a stackoverflow or youtube URL. I am avoiding the urlparse library!
Here is an MWE.
Answer 0 (score: 0)
Try this:
import re
drp = [re.sub(r'www\.|\.$|\.com', '', x) for x in to_drop]
df[~df.col1.str.extract(r'http[s]*://([^/]*).*', expand=False)
.str.contains('|'.join(drp))]
Yields:
col1
2 https://textblob.readthedocs.io/en/dev/quickstart.html#create-a-textblob
3 https://textblob.readthedocs.io/en/dev/
4 https://textblob.readthedocs.io/en/stackoverflow
5 https://textblob.readthedocs.io/en/youtube
Explanation:
In [38]: drp
Out[38]: ['stackoverflow', 'youtube']
In [41]: df.col1.str.extract(r'http[s]*://([^/]*).*', expand=False)
Out[41]:
0 stackoverflow.com
1 stackoverflow.com
2 textblob.readthedocs.io
3 textblob.readthedocs.io
4 textblob.readthedocs.io
5 textblob.readthedocs.io
6 www.youtube.com
7 www.youtube.com
8 www.youtube.com
Name: col1, dtype: object
In [42]: df.col1.str.extract(r'http[s]*://([^/]*).*', expand=False).str.contains('|'.join(drp))
Out[42]:
0 True
1 True
2 False
3 False
4 False
5 False
6 True
7 True
8 True
Name: col1, dtype: bool
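The steps above can be put together as a self-contained script. This is only a sketch: the frame is rebuilt from the sample data in the question, and three details are my own assumptions rather than part of the answer: re.escape on the cleaned names, https? in place of http[s]*, and na=False so that rows without a recognizable host are kept rather than raising an error.

```python
import re
import pandas as pd

to_drop = ["www.stackoverflow", "www.youtube."]

# Sample frame rebuilt from the question
df = pd.DataFrame({"col1": [
    "https://stackoverflow.com/python-pandas-drop-rows-string-match-from-list",
    "https://stackoverflow.com/deleting-dataframe-row-in-pandas-column-value",
    "https://textblob.readthedocs.io/en/dev/quickstart.html#create-a-textblob",
    "https://textblob.readthedocs.io/en/dev/",
    "https://textblob.readthedocs.io/en/stackoverflow",
    "https://textblob.readthedocs.io/en/youtube",
    "https://www.youtube.com/watch?v=sHkjdkjsk",
    "https://www.youtube.com/watch?v=_jksjdkjklI",
    "https://www.youtube.com/watch?v=jkjskjkjkjkw",
]})

# Strip 'www.', a trailing '.', and '.com', as in the answer
names = [re.sub(r"www\.|\.$|\.com", "", s) for s in to_drop]

# Escape each name so it is matched literally (an added precaution)
pattern = "|".join(re.escape(n) for n in names)

# Extract only the host part, so names in the path are ignored
host = df["col1"].str.extract(r"https?://([^/]*)", expand=False)

# Keep rows whose host does not contain any of the names
result = df[~host.str.contains(pattern, na=False)]
print(result.index.tolist())  # [2, 3, 4, 5]
```

Matching against the host alone is what keeps the two readthedocs URLs ending in "stackoverflow" and "youtube", without resorting to urlparse.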