Question

I have an array of https links that looks like this:

list1 = ['https://wvva.com/news/top-stories/2018/12/10/w-va-gov-appoints-former-beckley-council-member-to-parole-board/','https://www.starbreeze.com/2018/12/starbreeze-appoints-claes-wenthzel-as-acting-cfo/','https://www.streetinsider.com/corporate+news/perkinelmer+%28pki%29+appoints+prahlad+singh+as+president+%26+coo/']

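I want to keep only the links that contain "appoints" as a mandatory keyword and, in addition, at least one of 'chief-operating-officer','ceo','chief-executive-officer','coo','cfo','chief-financial-officer','chief-marketing-officer','cmo','chief-technology-officer','cto'. In other words, a link should be selected if it contains "appoints" together with any one of the words listed above (cto, ceo, coo, etc.).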

My expected output would be:

['https://www.starbreeze.com/2018/12/starbreeze-appoints-claes-wenthzel-as-acting-cfo/','https://www.streetinsider.com/corporate+news/perkinelmer+%28pki%29+appoints+prahlad+singh+as+president+%26+coo/']

I would really appreciate a regular expression that solves this.

Answer 1

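No regular expression is needed here. You can check directly whether each URL contains the mandatory word and at least one of the title keywords, and keep the URL if it does: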

list1 = ['https://wvva.com/news/top-stories/2018/12/10/w-va-gov-appoints-former-beckley-council-member-to-parole-board/','https://www.starbreeze.com/2018/12/starbreeze-appoints-claes-wenthzel-as-acting-cfo/','https://www.streetinsider.com/corporate+news/perkinelmer+%28pki%29+appoints+prahlad+singh+as+president+%26+coo/']

list2 = ['appoints','chief-operating-officer','ceo','chief-executive-officer','coo','cfo','chief-financial-officer','chief-marketing-officer','cmo','chief-technology-officer','cto']

# list2[0] ('appoints') is mandatory; any() accepts one or more of the remaining keywords
print([x for x in list1 if list2[0] in x and any(y in x for y in list2[1:])])
# ['https://www.starbreeze.com/2018/12/starbreeze-appoints-claes-wenthzel-as-acting-cfo/', 'https://www.streetinsider.com/corporate+news/perkinelmer+%28pki%29+appoints+prahlad+singh+as+president+%26+coo/']
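One caveat with plain substring checks: short abbreviations such as 'cto' or 'coo' also occur inside longer words ('director', 'cooper'). Below is a minimal sketch of a stricter variant, assuming the keywords should only count when they appear as whole slug tokens. The helper names `_whole_token` and `filter_links`, and the example.com URL, are hypothetical and not part of the question.

```python
import re
from urllib.parse import unquote_plus

REQUIRED = 'appoints'
TITLES = ['chief-operating-officer', 'ceo', 'chief-executive-officer', 'coo', 'cfo',
          'chief-financial-officer', 'chief-marketing-officer', 'cmo',
          'chief-technology-officer', 'cto']


def _whole_token(words):
    # Match any of the words only when it is not glued to other letters or digits.
    alternation = '|'.join(re.escape(w) for w in words)
    return re.compile(r'(?<![a-z0-9])(?:' + alternation + r')(?![a-z0-9])',
                      re.IGNORECASE)


REQUIRED_RE = _whole_token([REQUIRED])
TITLES_RE = _whole_token(TITLES)


def filter_links(links):
    kept = []
    for link in links:
        # Decode %28, %26 and '+' so encoded characters do not hide token edges.
        text = unquote_plus(link)
        if REQUIRED_RE.search(text) and TITLES_RE.search(text):
            kept.append(link)
    return kept


links = [
    'https://wvva.com/news/top-stories/2018/12/10/w-va-gov-appoints-former-beckley-council-member-to-parole-board/',
    'https://www.starbreeze.com/2018/12/starbreeze-appoints-claes-wenthzel-as-acting-cfo/',
    'https://www.streetinsider.com/corporate+news/perkinelmer+%28pki%29+appoints+prahlad+singh+as+president+%26+coo/',
    'https://example.com/news/acme-appoints-new-director/',  # hypothetical: 'cto' hides in 'director'
]
print(filter_links(links))
```

With these inputs the starbreeze and streetinsider links are kept. The hypothetical 'director' link, which a plain `in` check would accept because of the embedded 'cto', is dropped.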

Answer 2

If you really want a regular expression, you can use this (list1 is the list from the question):

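import re
# Alternation of the title keywords; 'appoints' is checked separately because it is mandatory
titles = re.compile('chief-operating-officer|ceo|chief-executive-officer|coo|cfo|chief-financial-officer|chief-marketing-officer|cmo|chief-technology-officer|cto', re.I)
result = [url for url in list1 if re.search('appoints', url, re.I) and titles.search(url)]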
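If a single pattern is preferred, the mandatory word and the title alternation can be combined with two lookaheads. This is a rough sketch rather than a definitive solution; the names `titles`, `pattern` and `matches` are chosen here for illustration only.

```python
import re

titles = ['chief-operating-officer', 'ceo', 'chief-executive-officer', 'coo', 'cfo',
          'chief-financial-officer', 'chief-marketing-officer', 'cmo',
          'chief-technology-officer', 'cto']

# (?=.*appoints) requires the mandatory word somewhere in the URL;
# the second lookahead requires at least one of the title keywords.
pattern = re.compile(
    r'(?=.*appoints)(?=.*(?:' + '|'.join(map(re.escape, titles)) + r'))',
    re.IGNORECASE,
)


def matches(url):
    # Both lookaheads are evaluated from the start of the string.
    return pattern.match(url) is not None


list1 = [
    'https://wvva.com/news/top-stories/2018/12/10/w-va-gov-appoints-former-beckley-council-member-to-parole-board/',
    'https://www.starbreeze.com/2018/12/starbreeze-appoints-claes-wenthzel-as-acting-cfo/',
    'https://www.streetinsider.com/corporate+news/perkinelmer+%28pki%29+appoints+prahlad+singh+as+president+%26+coo/',
]
result = [url for url in list1 if matches(url)]
print(result)
```

Building the alternation from a list with `re.escape` keeps the pattern in sync with the keyword list and guards against keywords that contain regex metacharacters.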

Answer 3

You can loop over the keywords and look for matches in each of the provided links:

import re
from pprint import pprint

keywords = [
    'appoints',
    'chief-operating-officer',
    'ceo',
    'chief-executive-officer',
    'coo',
    'cfo',
    'chief-financial-officer',
    'chief-marketing-officer',
    'cmo',
    'chief-technology-officer',
    'cto',
]

links = [
    'https://wvva.com/news/top-stories/2018/12/10/w-va-gov-appoints-former-beckley-council-member-to-parole-board/',
    'https://www.starbreeze.com/2018/12/starbreeze-appoints-claes-wenthzel-as-acting-cfo/',
    'https://www.streetinsider.com/corporate+news/perkinelmer+%28pki%29+appoints+prahlad+singh+as+president+%26+coo/',
]

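new_links = []

for link in links:
    # keywords[0] ('appoints') is mandatory
    has_required = re.search(re.escape(keywords[0]), link, flags=re.IGNORECASE)
    # at least one of the remaining title keywords must match as well
    has_title = any(
        re.search(re.escape(keyword), link, flags=re.IGNORECASE)
        for keyword in keywords[1:]
    )
    if has_required and has_title and link not in new_links:
        new_links.append(link)

pprint(new_links)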
