Question

I have a list of links, such as:

['http://www.nytimes.com/2016/12/31/us/politics/house-republicans-health-care-suit.html?partner=rss&amp;emc=rss" rel="standout"></atom:link>',
 'http://www.nytimes.com/2016/12/31/nyregion/bronx-murder-40th-precinct-police-residents.html</guid>',
 'http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.html?partner=rss&amp;emc=rss',
 'http://www.nytimes.com/2016/12/30/obituaries/among-deaths-in-2016-a-heavy-toll-in-pop-music.html</guid>',
 'http://www.nytimes.com/video/world/100000004830728/daybreak-around-the-world.html?partner=rss&amp;emc=rss']

I'm trying to iterate over them to remove everything after html. So I have:

import re

cleanitems = []

for item in links:
    cleanitems.append(re.sub(r'html(.*)', '', item))

This returns:

['http://www.nytimes.com/2016/12/31/us/politics/house-republicans-health-care-suit.',
 'http://www.nytimes.com/2016/12/31/nyregion/bronx-murder-40th-precinct-police-residents.',
 'http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.',
 'http://www.nytimes.com/2016/12/30/obituaries/among-deaths-in-2016-a-heavy-toll-in-pop-music.',
 'http://www.nytimes.com/video/world/100000004830728/daybreak-around-the-world.']

I'm confused as to why html is being included in the capture group. Thanks for any help.

Answer 1

html is part of the matched text too, not just the (...) group. re.sub() replaces all of the matched text.

Include the literal text html in the replacement:

cleanitems.append(re.sub(r'html(.*)', 'html', item))

Or, alternatively, capture that part in a group:

cleanitems.append(re.sub(r'(html).*', r'\1', item))
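
To see the difference between the whole match and the group, it may help to inspect the match object directly. This is a minimal sketch using one of the URLs from the question; the variable names are only for illustration:

```python
import re

url = ("http://www.nytimes.com/2016/12/30/movies/"
       "tyrus-wong-dies-bambi-disney.html?partner=rss&amp;emc=rss")

match = re.search(r"html(.*)", url)
whole_match = match.group(0)  # everything the pattern matched, including "html"
group_one = match.group(1)    # only what the parentheses captured

print(whole_match)  # html?partner=rss&amp;emc=rss
print(group_one)    # ?partner=rss&amp;emc=rss

# re.sub() replaces whole_match, so "html" disappears along with the rest.
stripped = re.sub(r"html(.*)", "", url)

# Either fix puts "html" back into the result.
fixed_literal = re.sub(r"html(.*)", "html", url)
fixed_group = re.sub(r"(html).*", r"\1", url)
print(fixed_literal == fixed_group)  # True
```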

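You may want to consider using a non-greedy match and the $ end-of-string anchor, and including the . dot to make sure you are really matching only the .html extension, so that URLs containing html elsewhere in the path are not cut short: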

cleanitems.append(re.sub(r'\.html.*?$', r'.html', item))
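
As a rough illustration of why the dot matters, compare the two patterns on a URL where html also appears earlier in the path. The example.com URL is made up for this sketch and is not from the question:

```python
import re

# Hypothetical URL: "html" shows up in the path before the real extension.
url = "http://example.com/html-basics/intro.html?ref=rss"

# Without the dot, the first "html" in the path is where the cut happens.
loose = re.sub(r"html(.*)", "html", url)

# With the escaped dot, only the ".html" extension is matched.
strict = re.sub(r"\.html.*?$", ".html", url)

print(loose)   # http://example.com/html
print(strict)  # http://example.com/html-basics/intro.html
```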

However, if your goal is to remove the query string from the URL, consider parsing the URL with urllib.parse.urlparse() and rebuilding it without the query string or fragment identifier:

from urllib.parse import urlparse

cleanitems.append(urlparse(item)._replace(query='', fragment='').geturl())
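
Put together as a self-contained sketch, using two of the links from the question (the list comprehension is just one way to write the loop):

```python
from urllib.parse import urlparse

links = [
    "http://www.nytimes.com/2016/12/30/movies/"
    "tyrus-wong-dies-bambi-disney.html?partner=rss&amp;emc=rss",
    "http://www.nytimes.com/2016/12/31/nyregion/"
    "bronx-murder-40th-precinct-police-residents.html</guid>",
]

# Drop the query string and fragment, keep scheme, host and path.
cleanitems = [
    urlparse(item)._replace(query="", fragment="").geturl()
    for item in links
]

for cleaned in cleanitems:
    print(cleaned)
```

The first link comes back ending in .html. The second has no query string, so the trailing </guid> is part of the path as far as urlparse() is concerned and is left in place.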

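This won't remove the stray HTML chunks, however; if you are parsing these URLs out of an HTML document, consider using a real HTML parser rather than regular expressions.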
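
As one possible sketch of the parser route, the standard library's html.parser can pull href attributes out of markup. The LinkCollector class and the sample snippet below are hypothetical, since the question does not show the document the links came from; a third-party parser would work along similar lines:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (illustrative helper only)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


# Hypothetical markup; the real input format is an assumption.
sample = ('<p><a href="http://example.com/story.html?partner=rss&amp;emc=rss" '
          'rel="standout">Story</a></p>')

collector = LinkCollector()
collector.feed(sample)
print(collector.hrefs)
```

The parser hands back only the attribute value, with no trailing tags, and decodes &amp; to & along the way.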

Answer 2

Just an addition to Martijn's answer.

You can also use a lookbehind assertion to match only the text that follows html:

cleanitems.append(re.sub(r'(?<=html).*', '', item))

Or use a replacement string to keep the initial part:

cleanitems.append(re.sub(r'(html).*', r'\1', item))
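
A quick check, using one of the URLs from the question, suggests the two forms give the same result here (variable names are just for this sketch):

```python
import re

url = ("http://www.nytimes.com/video/world/100000004830728/"
       "daybreak-around-the-world.html?partner=rss&amp;emc=rss")

# The lookbehind requires "html" before the match but does not consume it.
with_lookbehind = re.sub(r"(?<=html).*", "", url)

# The group consumes "html" and the backreference writes it back.
with_backreference = re.sub(r"(html).*", r"\1", url)

print(with_lookbehind)
print(with_lookbehind == with_backreference)  # True
```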

But as Martijn has already said, you are better off using the urllib module to parse the URL properly.

re.sub replaces too much text
