Parsing an HTML file

Date: 2014-09-12 15:28:42

Tags: python html parsing html-parsing beautifulsoup

I want to do this with a Python script: turn a plain-text link into a clickable image link. For example, change:

<td>mylink.com</td>

to:

<td><a href="mylink.com"><img src="myimage.jpg"></a></td>

I am using the BeautifulSoup library.

I tried this:

from bs4 import BeautifulSoup
soup = BeautifulSoup("<td>mylink.html</td>")
soup.td.string.wrap(soup.new_tag("a"))
text = soup.a.string
soup.a.clear()
soup.find('a')['href'] = text
image = soup.new_tag('img')
soup.a.append(image)
soup.find('img')['src'] = "images/world_link.png"

It works fine, but I also want to add a target attribute to the link, as in <a href="" target="">. How can I do that?

Now I want to loop over every td in the file. I tried this:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open(r"C:\Users\Will\Desktop\htm.html"))
td = soup.find_all('td')
for s in td:
    a = soup.new_tag("a", href=s.td.text, target='_blank')
    img = soup.new_tag('img', src="images/world_link.png")
    a.append(img)
    s.td.string.replace_with(a)

But it does not work. I get this error: AttributeError: 'NoneType' object has no attribute 'text'

2 answers:

Answer 0 (score: 0)

new_tag() accepts attributes as keyword arguments, so pass target as one of them.

Also, replace_with() gets you the same result more simply than wrap() followed by clear():

from bs4 import BeautifulSoup


soup = BeautifulSoup("<td>mylink.html</td>")
td = soup.td

a = soup.new_tag("a", href=td.text, target='_blank')
img = soup.new_tag('img', src="images/world_link.png")
a.append(img)

td.string.replace_with(a)

print(soup.prettify())

This prints:

<td>
    <a href="mylink.html" target="_blank">
        <img src="images/world_link.png"/>
    </a>
</td>
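
If you would rather keep the wrap()-based code from the question, attributes can also be set after the tag exists, using the same dictionary-style assignment the question already uses for href. This is only a minimal sketch; the explicit "html.parser" argument is my assumption and is not part of the original code.

```python
from bs4 import BeautifulSoup

# "html.parser" is an assumption here; the original code names no parser.
soup = BeautifulSoup("<td>mylink.html</td>", "html.parser")

# Wrap the cell text in an empty <a>, as in the question.
soup.td.string.wrap(soup.new_tag("a"))
a = soup.a

# Remember the link text, then empty the <a>.
text = a.get_text()
a.clear()

# Attributes behave like dictionary keys on a tag.
a["href"] = text
a["target"] = "_blank"

# Put the image inside the link.
a.append(soup.new_tag("img", src="images/world_link.png"))

print(soup)
```

Both styles should give the same markup, so which one to use is mostly a matter of taste.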

Answer 1 (score: -1)

I found the solution:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open(r"C:\Users\Will\Desktop\htm.html"))
td = soup.find_all('td')
for s in td:
    a = soup.new_tag("a", href=s.text, target='_blank')
    img = soup.new_tag('img', src="images/world_link.png")
    a.append(img)
    s.string.replace_with(a)
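
For what it's worth, the AttributeError in the question comes from the fact that each s in the loop is already a <td>, so s.td looks for another <td> nested inside it and returns None. Also note that s.string is None for an empty cell or a cell with more than one child, so the loop above can still fail on such cells. Below is a hedged sketch that skips those cells. The linkify_cells helper, the sample HTML, and the choice of "html.parser" are all hypothetical, made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input; the real file from the question is not shown.
HTML = """
<table>
  <tr><td>first.html</td><td>second.html</td></tr>
  <tr><td></td><td><b>bold</b> text</td></tr>
</table>
"""


def linkify_cells(html, image_src="images/world_link.png"):
    """Replace the text of each simple <td> with an image link to that text."""
    soup = BeautifulSoup(html, "html.parser")
    for cell in soup.find_all("td"):
        # cell.string is None when the cell is empty or has several children.
        if cell.string is None:
            continue
        url = cell.string.strip()
        if not url:
            continue
        a = soup.new_tag("a", href=url, target="_blank")
        a.append(soup.new_tag("img", src=image_src))
        cell.string.replace_with(a)
    return soup


result = linkify_cells(HTML)
print(result.prettify())
```

Cells that are skipped are left untouched, which seems safer than guessing what their link target should be.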