有没有办法检查字符串的任何部分是否与python中的另一个字符串匹配?
例如:我的网址看起来像这样
url = pd.DataFrame({'urls' : ['www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA', 'www.ulta.com/beautyservices/benefitbrowbar/']})
我的字符串看起来像:
string_list = ['Benefit Cosmetics', 'Anastasia Beverly Hills']
string = '|'.join(string_list)
我想将string
与url
相匹配。
Anastasia Beverly Hills
www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA
和
www.ulta.com/beautyservices/benefitbrowbar/
与Benefit Cosmetics
。
我一直在尝试url['urls'].str.contains('('+string+')', case = False)
,但这不匹配。
什么是正确的方法?
答案 0 :(得分:1)
我不能在一行中使用正则表达式,但这是我尝试使用itertools和任何:
import pandas as pd
from itertools import product
url = pd.DataFrame({'urls' : ['www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA', 'www.ulta.com/beautyservices/benefitbrowbar/']})
string_list = ['Benefit Cosmetics', 'Anastasia Beverly Hills']
"""
For each of Cartesian product (the different combinations) of
string_list and urls.
"""
for x in list(product(string_list, url['urls'])):
"""
If any of the words in the string (x[0]) are present in
the URL (x[1]) disregarding case.
"""
if any (word.lower() in x[1].lower() for word in x[0].split()):
"""
Show the match.
"""
print ("Match String: %s URL: %s" % (x[0], x[1]))
输出:
Match String: Benefit Cosmetics URL: www.ulta.com/beautyservices/benefitbrowbar/
Match String: Anastasia Beverly Hills URL: www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA
<强>更新强>
你看待它的方式你也可以使用:
import pandas as pd
import warnings
pd.set_option('display.width', 100)
"""
Supress the warning it will give on a match.
"""
warnings.filterwarnings("ignore", 'This pattern has match groups')
string_list = ['Benefit Cosmetics', 'Anastasia Beverly Hills']
"""
Create a pandas DataFrame.
"""
url = pd.DataFrame({'urls' : ['www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA', 'www.ulta.com/beautyservices/benefitbrowbar/']})
"""
Using one string at a time.
"""
for string in string_list:
"""
Get the individual words in the string and concatenate them
using a pipe to create a regex pattern.
"""
s = "|".join(string.split())
"""
Update the DataFrame with True or False where the regex
matches the URL.
"""
url[string] = url['urls'].str.contains('('+s+')', case = False)
"""
Show the result
"""
print (url)
将输出:
urls Benefit Cosmetics Anastasia Beverly Hills
0 www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00... False True
1 www.ulta.com/beautyservices/benefitbrowbar/ True False
我想,如果你想在DataFrame中使用它,可能会更好,但我更喜欢第一种方式。