Checking whether part of a string is contained in another string (regex, Python)

Time: 2016-12-14 20:42:22

Tags: regex string python-2.7

Is there a way to check whether any part of a string matches another string in Python?

For example, my URLs look like this:

url = pd.DataFrame({'urls' : ['www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA', 'www.ulta.com/beautyservices/benefitbrowbar/']})

My strings look like this:

string_list = ['Benefit Cosmetics', 'Anastasia Beverly Hills']
string = '|'.join(string_list)

I want to match string against url, so that I get:

Anastasia Beverly Hills with www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA

Benefit Cosmetics with www.ulta.com/beautyservices/benefitbrowbar/

I have been trying url['urls'].str.contains('('+string+')', case = False), but it does not match anything.

What is the correct way to do this?
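For reference, the attempt above finds nothing because the joined pattern contains the full phrases, spaces included, and neither URL contains such a phrase verbatim. The sketch below reproduces this with the standard re module instead of pandas (an assumption made only to keep it dependency-free), and compares it with a word-level pattern. The names phrase_pattern and word_pattern are illustrative only.

```python
import re

urls = [
    'www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA',
    'www.ulta.com/beautyservices/benefitbrowbar/',
]
string_list = ['Benefit Cosmetics', 'Anastasia Beverly Hills']

# Same pattern as in the attempt: whole phrases joined by "|".
phrase_pattern = re.compile('(' + '|'.join(string_list) + ')', re.IGNORECASE)
phrase_hits = [bool(phrase_pattern.search(u)) for u in urls]

# Word-level pattern: every individual word becomes an alternative.
words = [w for s in string_list for w in s.split()]
word_pattern = re.compile('|'.join(re.escape(w) for w in words), re.IGNORECASE)
word_hits = [bool(word_pattern.search(u)) for u in urls]

print(phrase_hits)  # [False, False]
print(word_hits)    # [True, True]
```

The word-level pattern only says that some word matched somewhere; it does not say which of the original strings a URL belongs to.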

1 Answer:

Answer 0: (score: 1)

I could not do it with a regex in one line, but here is my attempt using itertools and any:

import pandas as pd
from itertools import product

url = pd.DataFrame({'urls' : ['www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA', 'www.ulta.com/beautyservices/benefitbrowbar/']})
string_list = ['Benefit Cosmetics', 'Anastasia Beverly Hills']

# Iterate over the Cartesian product (every combination) of
# string_list and urls.
for x in product(string_list, url['urls']):
    # Check whether any word of the string (x[0]) is present in
    # the URL (x[1]), ignoring case.
    if any(word.lower() in x[1].lower() for word in x[0].split()):
        # Show the match.
        print("Match String: %s URL: %s" % (x[0], x[1]))

Output:

Match String: Benefit Cosmetics URL: www.ulta.com/beautyservices/benefitbrowbar/
Match String: Anastasia Beverly Hills URL: www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA
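If the matches are needed as data rather than printed lines, the same idea can be wrapped in a small function. This is only a sketch under the same assumptions as the loop above (a match means any single word of the string occurs in the URL, ignoring case); the function name match_names_to_urls is hypothetical, and plain lists are used in place of the DataFrame column.

```python
from itertools import product


def match_names_to_urls(names, urls):
    """Return (name, url) pairs where any word of name occurs in url, ignoring case."""
    pairs = []
    for name, url in product(names, urls):
        url_lower = url.lower()
        if any(word.lower() in url_lower for word in name.split()):
            pairs.append((name, url))
    return pairs


string_list = ['Benefit Cosmetics', 'Anastasia Beverly Hills']
urls = [
    'www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA',
    'www.ulta.com/beautyservices/benefitbrowbar/',
]

matches = match_names_to_urls(string_list, urls)
for name, matched_url in matches:
    print("Match String: %s URL: %s" % (name, matched_url))
```

Because a single word is enough for a match, short or common words in a string can produce false positives on other URLs.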

Update

Following the approach you were taking, you could also use:

import pandas as pd
import warnings
pd.set_option('display.width', 100)
# Suppress the warning pandas gives about match groups in the pattern.
warnings.filterwarnings("ignore", 'This pattern has match groups')
string_list = ['Benefit Cosmetics', 'Anastasia Beverly Hills']
# Create a pandas DataFrame.
url = pd.DataFrame({'urls' : ['www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA', 'www.ulta.com/beautyservices/benefitbrowbar/']})
# Process one string at a time.
for string in string_list:
    # Split the string into its individual words and join them
    # with a pipe to create a regex pattern.
    s = "|".join(string.split())
    # Add a column that is True where the regex matches the URL
    # and False otherwise.
    url[string] = url['urls'].str.contains('('+s+')', case=False)
# Show the result.
print(url)

This will output:

                                                urls Benefit Cosmetics Anastasia Beverly Hills
0  www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00...             False                    True
1        www.ulta.com/beautyservices/benefitbrowbar/              True                   False

I guess this might be the better option if you want to use it in a DataFrame, but I prefer the first way.
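As a possible variation on the DataFrame approach, the pattern can be built with a non-capturing group, (?:...), so that it contains no match groups and the warning filter should not be needed. The sketch below also passes each word through re.escape, which is an added precaution (not part of the code above) in case a string contains regex metacharacters. The variable names name and pattern are illustrative only.

```python
import re

import pandas as pd

url = pd.DataFrame({'urls': [
    'www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA',
    'www.ulta.com/beautyservices/benefitbrowbar/',
]})
string_list = ['Benefit Cosmetics', 'Anastasia Beverly Hills']

for name in string_list:
    # Escape each word, then join the words into a non-capturing alternation.
    pattern = '(?:' + '|'.join(re.escape(w) for w in name.split()) + ')'
    # One boolean column per string: True where any word occurs in the URL.
    url[name] = url['urls'].str.contains(pattern, case=False)

print(url)
```

The resulting columns hold the same True/False values as in the table above.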