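I am trying to exclude certain links from my Google API search results. I build a regular expression from each entry in the links_to_exclude list, but this approach still outputs links I don't want.
Some of the returned links: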
https://money.cnn.com/2018/08/21/technology/facebook-disinformation-iran-russia/index.html
How can I use a regular expression to exclude these links?
links_to_exclude = ['cnn.com', 'nytimes.com']

for item in search_terms:
    results = google_search(item, api_key, cse_id, num=1)
    for result in results:
        rtn_link = result.get('link')
        for link in links_to_exclude:
            regex = '((http[s]?|ftp):\/)?\/?([^:\/\s]+)?({})\/([^\/]+)'.format(link)
            if re.search(regex, rtn_link):
                continue
            else:
                pprint.pprint(result.get('link'))
Answer 0 (score: 1)
Your regular expression looks correct. I think your script is just missing the import re statement.
See here: https://ideone.com/Uzcf1K
import re

links_to_exclude = ['cnn.com', 'nytimes.com']
results = [
    'https://foo.bar',
    'https://money.cnn.com/2018/08/21/technology/facebook-disinformation-iran-russia/index.html',
    'https://www.cnn.com/videos/politics/2018/08/22/carl-bernstein-worse-than-watergate-egregious-trump-newday-sot-vpx.cnn',
    'https://www.nytimes.com/2018/08/13/us/politics/peter-strzok-fired-fbi.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=first-column-region&region=top-news&WT.nav=top-news',
]

for result in results:
    print("URL: " + result)
    for link in links_to_exclude:
        # Same pattern as in the question, written as a raw string so the backslashes reach the regex engine untouched.
        regex = r'((http[s]?|ftp):\/)?\/?([^:\/\s]+)?({})\/([^\/]+)'.format(link)
        if re.search(regex, result):
            print(' Matches: ' + link)
        else:
            print(' Does not match: ' + link)
Output:
URL: https://foo.bar
 Does not match: cnn.com
 Does not match: nytimes.com
URL: https://money.cnn.com/2018/08/21/technology/facebook-disinformation-iran-russia/index.html
 Matches: cnn.com
 Does not match: nytimes.com
URL: https://www.cnn.com/videos/politics/2018/08/22/carl-bernstein-worse-than-watergate-egregious-trump-newday-sot-vpx.cnn
 Matches: cnn.com
 Does not match: nytimes.com
URL: https://www.nytimes.com/2018/08/13/us/politics/peter-strzok-fired-fbi.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=first-column-region&region=top-news&WT.nav=top-news
 Does not match: cnn.com
 Matches: nytimes.com
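
The output above also hints at why the loop in the question may still print unwanted links. Assuming the loops are nested the way the code reads, continue only moves on to the next entry in links_to_exclude, so a CNN URL is still printed when it is compared against nytimes.com and falls into the else branch. A minimal sketch of one way around this is to decide once per URL whether any excluded domain matches, and only then keep or drop it. The helper names build_exclusion_pattern and filter_links are hypothetical, not part of the original script, and the pattern is a simplified stand-in for the one in the question, with re.escape added so the dot in cnn.com is matched literally.

```python
import re

links_to_exclude = ['cnn.com', 'nytimes.com']


def build_exclusion_pattern(domains):
    """Compile one regex matching URLs hosted on any of the given domains.

    Hypothetical helper: it is not part of the original script.
    """
    # Escape each domain so that '.' is matched literally, then join with '|'.
    alternatives = '|'.join(re.escape(domain) for domain in domains)
    # Optional scheme, optional subdomain prefix, one excluded domain, and then
    # a port, path, query, fragment, or the end of the string.
    return re.compile(
        r'^(?:[a-z][a-z0-9+.-]*://)?(?:[^/\s:@]+\.)?(?:' + alternatives + r')(?=[:/?#]|$)',
        re.IGNORECASE,
    )


def filter_links(links, domains):
    """Return only the links that match none of the excluded domains."""
    pattern = build_exclusion_pattern(domains)
    # One decision per link: it is kept only if the combined pattern does not match.
    return [link for link in links if not pattern.search(link)]


results = [
    'https://foo.bar',
    'https://money.cnn.com/2018/08/21/technology/facebook-disinformation-iran-russia/index.html',
    'https://www.nytimes.com/2018/08/13/us/politics/peter-strzok-fired-fbi.html',
]

kept = filter_links(results, links_to_exclude)
print(kept)  # ['https://foo.bar']
```

In the question's loop, the same idea would mean checking rtn_link against the combined pattern once and calling pprint only when it does not match, instead of printing inside the loop over links_to_exclude.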