Question

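I have a dataset of links to newspaper articles that I want to do some research on. However, the links in the dataset end with the .ece extension, which is a problem for me because of some API restrictions.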

http://www.telegraaf.nl/telesport/voetbal/buitenlands/article22178882.ece

and

http://www.telegraaf.nl/telesport/voetbal/buitenlands/22178882/__Wenger_vreest_het_ergste__.html

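are links to the same page. I now need to convert all the .ece links into .html links. I could not find an easier way than parsing the page and finding the original .html link. The problem is that the link is buried inside an HTML meta element, and I can't get to it using tree.xpath.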

<meta content="http://www.telegraaf.nl/telesport/voetbal/buitenlands/22178882/__Wenger_vreest_het_ergste__.html"

Unfortunately, I am not familiar with regex and don't know how to extract a link with it. Basically, every link I need starts with:

<meta content="http://www.telegraaf.nl/

I need the full link (i.e. http://www.telegraaf.nl/THE_REST_OF_THE_LINK). Also, I'm using BeautifulSoup for parsing. Thanks.

Answer 1

Here's a really simple regex to get you started.

This one extracts all links:

\<meta content="(http:\/\/www\.telegraaf\.nl.*)"

This one matches only the .html links:

\<meta content="(http:\/\/www\.telegraaf\.nl.*\.html)"

To use it, you can do something like this:

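import re
import urllib.request

# ece_url_list is your own list of .ece URLs.
# Map each .ece URL to the .html link found in its page source.
replacements = {}
for url in ece_url_list:
    with urllib.request.urlopen(url) as response:
        # Assumes the page is UTF-8; undecodable bytes are replaced.
        html = response.read().decode('utf-8', errors='replace')
    # [^"]* keeps the match inside a single attribute value.
    replacements[url] = re.findall(r'<meta content="(http://www\.telegraaf\.nl[^"]*\.html)"', html)[0]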

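Note: this assumes that every page source always contains the .html link in this meta tag. It only needs one; if there is none, the [0] lookup raises an IndexError.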
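If some pages might lack the tag, a slightly more defensive variant of the same regex idea is sketched below. The function name `extract_html_link` is made up for illustration, and the pattern still assumes that `content` is the first attribute of the meta tag, as in the snippet from the question.

```python
import re

# Assumption: the tag is written as <meta content="..." ...>, with content first.
META_LINK = re.compile(r'<meta\s+content="(http://www\.telegraaf\.nl/[^"]*\.html)"')


def extract_html_link(page_source):
    """Return the first telegraaf.nl .html link found in a meta tag, or None."""
    match = META_LINK.search(page_source)
    return match.group(1) if match else None
```

Returning None instead of raising lets the calling loop skip or log the pages that have no usable meta tag.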

Answer 2

Use BeautifulSoup to find the matching content attributes, then replace them like this:

from bs4 import BeautifulSoup
import re

html = """
    <meta content="http://www.telegraaf.nl/telesport/voetbal/buitenlands/article22178882.ece" />
    <meta content="http://www.telegraaf.nl/telesport/voetbal/buitenlands/22178882/__Wenger_vreest_het_ergste__.html" />
"""

# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')
# reference table of url prefixes to full html link
html_links = {
    el['content'].rpartition('/')[0]: el['content']
    for el in soup.find_all('meta', content=re.compile(r'\.html$'))
}
# find all ece links, strip the end to match the prefixes above, then
# replace the meta content with the looked-up html link
for el in soup.find_all('meta', content=re.compile(r'\.ece$')):
    url = re.sub(r'article(\d+)\.ece$', r'\1', el['content'])
    el['content'] = html_links[url]

print(soup)
# both <meta> tags now carry the .html link:
# <meta content="http://www.telegraaf.nl/telesport/voetbal/buitenlands/22178882/__Wenger_vreest_het_ergste__.html"/>
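For comparison, the lookup of the .html link can also be sketched with only the standard library's `html.parser`, which avoids depending on attribute order inside the meta tag. This is an alternative to BeautifulSoup, not what the answer above uses; the names `MetaLinkParser` and `find_html_links` are hypothetical.

```python
from html.parser import HTMLParser

PREFIX = "http://www.telegraaf.nl/"


class MetaLinkParser(HTMLParser):
    """Collect content attributes of <meta> tags that look like .html article links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        content = dict(attrs).get("content") or ""
        if content.startswith(PREFIX) and content.endswith(".html"):
            self.links.append(content)


def find_html_links(page_source):
    """Return all matching meta content links in document order."""
    parser = MetaLinkParser()
    parser.feed(page_source)
    return parser.links
```

Calling `find_html_links(page_source)` on a fetched article page returns a list, so an empty list signals that the page had no matching meta tag.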

Answer 3

(.*?)(http:\/\/.*\/.*?\.)(ece)

Try this. Replace with $2html.

See the demo.

http://regex101.com/r/nA6hN9/24
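In Python, the `$2html` replacement from the demo would be written as `\g<2>html`. A minimal sketch of applying the pattern above with `re.sub` follows; the function name `swap_extension` is made up for illustration.

```python
import re

# Same pattern as in the answer; group 2 is the URL up to and including the last dot.
ECE_PATTERN = re.compile(r'(.*?)(http:\/\/.*\/.*?\.)(ece)')


def swap_extension(url):
    """Replace a trailing 'ece' extension with 'html', keeping the rest of the URL."""
    # \g<2> is used instead of \2 so the group number is not read as "\2h...".
    return ECE_PATTERN.sub(r'\g<2>html', url)


print(swap_extension("http://www.telegraaf.nl/telesport/voetbal/buitenlands/article22178882.ece"))
# http://www.telegraaf.nl/telesport/voetbal/buitenlands/article22178882.html
```

Note that this only swaps the extension: the result still contains `article22178882` in the path, so it is not the same string as the .html link quoted in the question, and whether the site serves that URL has not been checked here.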
