Question

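I have an HTML file from Wikipedia and would like to find every link on the page, such as /wiki/Absinthe, and replace it with the current directory prepended, such as /home/fergus/wikiget/wiki/Absinthe, so that: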

<a href="/wiki/Absinthe">Absinthe</a>

becomes:

<a href="/home/fergus/wikiget/wiki/Absinthe">Absinthe</a>

This should happen throughout the whole file.

Do you have any ideas? I'm happy to use BeautifulSoup or regex!

Answer 1

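If that is really all you need to do, you can use sed with its -i option to rewrite the file in place (the trailing g replaces every match on a line, not just the first):

sed -i -e 's,href="/wiki,href="/home/fergus/wikiget/wiki,g' wiki-file.html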

However, here is a Python solution using the lovely lxml API, in case you need to do anything more complex or you might have badly formed HTML:

from lxml import etree

parser = etree.HTMLParser()

with open("wiki-file.html", "rb") as fp:
    tree = etree.parse(fp, parser)

# Prefix every link that starts with /wiki with the local directory.
for e in tree.xpath("//a[@href]"):
    link = e.attrib['href']
    if link.startswith('/wiki'):
        e.attrib['href'] = '/home/fergus/wikiget' + link

# Or you can just specify the same filename to overwrite it.
# etree.tostring() returns bytes, so the file is opened in binary mode.
with open("wiki-file-rewritten.html", "wb") as fp:
    fp.write(etree.tostring(tree, method="html"))

Note that lxml is probably a better choice than BeautifulSoup for this kind of task, for the reasons given by BeautifulSoup's author.

Answer 2

You can pass a function to re.sub:

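import re

def match(m):
    # Rebuild the tag with the directory prepended to the captured href.
    return '<a href="/home/fergus/wikiget' + m.group(1) + '">'

r = re.compile(r'<a\shref="([^"]+)">')
r.sub(match, yourtext)  # yourtext is the HTML source as a string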

An example:

>>> s = '<a href="/wiki/Absinthe">Absinthe</a>'
>>> r.sub(match, s)
'<a href="/home/fergus/wikiget/wiki/Absinthe">Absinthe</a>'
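Note that the pattern above prepends the directory to every href it matches, including absolute URLs, and it only matches anchors whose sole attribute is href. A minimal sketch of a narrower variant follows, assuming that only links beginning with /wiki/ should change and that attributes are double-quoted. The names PREFIX, link_re and add_prefix are illustrative and not part of the answer:

```python
import re

PREFIX = '/home/fergus/wikiget'  # assumed target directory, taken from the question

# Match the href attribute of an <a> tag even when other attributes come first,
# but only when the link starts with /wiki/.
link_re = re.compile(r'(<a\b[^>]*?\bhref=")(/wiki/[^"]*")')

def add_prefix(text):
    # Keep everything up to the opening quote (group 1) and insert PREFIX before the path.
    return link_re.sub(lambda m: m.group(1) + PREFIX + m.group(2), text)

print(add_prefix('<a class="x" href="/wiki/Absinthe" title="Absinthe">Absinthe</a>'))
print(add_prefix('<a href="http://example.com/">external</a>'))
```

The first call rewrites the link; the second leaves the external link untouched because it does not start with /wiki/.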

Answer 3

Here is a solution using the re module:

#!/usr/bin/env python
import re

# Prefix every href that starts with /wiki with the local directory.
with open('file.html') as src, open('output.html', 'w') as dst:
    dst.write(re.sub(r'href="/wiki', 'href="/home/fergus/wikiget/wiki', src.read()))

Here is another one that does not use re:

#!/usr/bin/env python
# Plain string replacement is enough because the pattern is a fixed string.
with open('file.html') as src, open('output.html', 'w') as dst:
    dst.write(src.read().replace('href="/wiki', 'href="/home/fergus/wikiget/wiki'))
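Both scripts rely on the platform's default text encoding when reading and writing. Wikipedia pages are served as UTF-8, so stating the encoding explicitly may be safer. The sketch below assumes the /wiki pattern from the question and wraps the replacement in a hypothetical helper, rewrite_file, which is not part of the answer:

```python
from pathlib import Path

def rewrite_file(src, dst, prefix='/home/fergus/wikiget'):
    # Read and write as UTF-8 regardless of the platform default.
    text = Path(src).read_text(encoding='utf-8')
    new_text = text.replace('href="/wiki', 'href="' + prefix + '/wiki')
    Path(dst).write_text(new_text, encoding='utf-8')

# Example call, using the file names from the answer:
# rewrite_file('file.html', 'output.html')
```

Passing the same path for src and dst overwrites the file in place, since the whole file is read before anything is written.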

Answer 4

I would do:

import re

ch = '<a href="/wiki/Absinthe">Absinthe</a>'

r = re.compile(r'(<a\s+href=")(/wiki/[^"]+">[^<]+</a>)')

print(ch)
print()
print(r.sub(r'\1/home/fergus/wikiget\2', ch))

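Edit:

It has been pointed out that this solution will not catch tags with additional attributes. I had assumed a narrow string pattern such as <a href="/wiki/WORD">WORD</a>.

If that is not the case, no problem: a solution with a simpler regex is easy to write: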

r = re.compile(r'(<a\s+href="/)([^>]+">)')

ch = '<a href="/wiki/Aide:Homonymie" title="Aide:Homonymie">'
print(ch)
print(r.sub(r'\1home/fergus/wikiget/\2', ch))

Or why not:

r = re.compile(r'(<a\s+href="/)')

ch = '<a href="/wiki/Aide:Homonymie" title="Aide:Homonymie">'
print(ch)
print(r.sub(r'\1home/fergus/wikiget/', ch))
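One caveat with the last two patterns: they rewrite every href that starts with a slash, and that includes protocol-relative URLs of the form //host/path, which Wikipedia pages may contain. A sketch that skips those with a negative lookahead follows, assuming double-quoted attributes. The helper name rewrite and the second sample string are made up for illustration:

```python
import re

# Match href=" only when the link starts with a single slash,
# so protocol-relative //host/... links are left alone.
r = re.compile(r'(<a\s+href=")(?=/(?!/))')

def rewrite(ch):
    # Insert the directory right after the opening quote of the href value.
    return r.sub(r'\1/home/fergus/wikiget', ch)

for ch in ('<a href="/wiki/Aide:Homonymie" title="Aide:Homonymie">',
           '<a href="//upload.wikimedia.org/x.png">'):
    print(rewrite(ch))
```

The first string gains the prefix; the second is printed unchanged.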

Answer 5

from lxml import html

el = html.fromstring('<a href="/wiki/word">word</a>')
# or `el = html.parse(file_or_url).getroot()`

def repl(link):
    if link.startswith('/'):
        link = '/home/fergus/wikiget' + link
    return link

# encoding='unicode' makes tostring() return str rather than bytes
print(html.tostring(el, encoding='unicode'))
el.rewrite_links(repl)
print(html.tostring(el, encoding='unicode'))

Output:

<a href="/wiki/word">word</a>
<a href="/home/fergus/wikiget/wiki/word">word</a>

You can also use the function lxml.html.rewrite_links() directly:

from lxml import html

def repl(link):
    if link.startswith('/'):
        link = '/home/fergus/wikiget' + link
    return link

htmlstr = '<a href="/wiki/word">word</a>'  # sample input
print(html.rewrite_links(htmlstr, repl))
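The question asks for the current directory to be prepended, while the examples hard-code /home/fergus/wikiget. A sketch of building the callback from the working directory instead follows, assuming the script is run from the directory that holds the downloaded pages and that paths are POSIX-style. make_repl is a hypothetical helper, not part of lxml:

```python
import os

def make_repl(prefix=None):
    # Default to the directory the script is run from.
    if prefix is None:
        prefix = os.getcwd()
    prefix = prefix.rstrip('/')

    def repl(link):
        # Only root-relative links are rewritten; //host/... and absolute URLs are returned as-is.
        if link.startswith('/') and not link.startswith('//'):
            return prefix + link
        return link

    return repl

repl = make_repl('/home/fergus/wikiget')
print(repl('/wiki/word'))
print(repl('http://example.com/'))
```

The function returned by make_repl() takes a link and returns a link, so it can be passed to rewrite_links() in place of the repl defined above.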
