Question

How can I use Python to get the telephone number, fax number, and address from my HTML source? I need to store each of them in a variable.

<h2>My name</h2>
<img src="images/logos/" style="float:right" />
<p>Adress  37/41 Portbell</p>
<p>P.O.Box 12339, Kampala</p>
<p>Tel: +41 414220702</p>
<p>Fax: +41 414220929</p>

I can't use pyquery in this case :(

Answer 1

A solution that uses Beautiful Soup for the HTML parsing:

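from bs4 import BeautifulSoup
import re

html = ...  # your html goes here
# Name the parser explicitly; otherwise bs4 guesses one and emits a warning.
soup = BeautifulSoup(html, 'html.parser')

# Text nodes containing the "Tel:" / "Fax:" labels (string= replaces the older text=).
telephone_p = soup.find_all(string=re.compile(r'Tel:'))
telephone = telephone_p[0].replace('Tel:', '').strip()
fax_p = soup.find_all(string=re.compile(r'Fax:'))
fax = fax_p[0].replace('Fax:', '').strip()
# The first two <p> elements hold the address lines.
address_ps = soup.find_all('p')[:2]
address = '\n'.join(p.text for p in address_ps)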

print(telephone)
print(fax)
print(address)

Result:

+41 414220702
+41 414220929
Adress  37/41 Portbell
P.O.Box 12339, Kampala

An alternative solution that uses only the standard library:

import re

html = ...  # your html goes here

# Raw strings keep \d and \s from being treated as string escapes.
telephone = re.search(r'Tel: ([+\d\s]+)', html).group(1)
fax = re.search(r'Fax: ([+\d\s]+)', html).group(1)
# Assumes each <p> element sits on its own line with no leading whitespace.
paragraphs = [line for line in html.split('\n') if line.startswith('<p>')]
address = '\n'.join(p.replace('<p>', '').replace('</p>', '')
                    for p in paragraphs[0:2])

print(telephone)
print(fax)
print(address)

Result: same as above.
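
Note that re.search returns None when a pattern does not match, so the regex lines above raise AttributeError if a field is missing from the page. A minimal sketch of a guard follows. The helper name find_field and the 'Mobile' label are hypothetical, and the sample HTML is copied from the question:

```python
import re

# Sample HTML copied from the question.
html = """<h2>My name</h2>
<img src="images/logos/" style="float:right" />
<p>Adress  37/41 Portbell</p>
<p>P.O.Box 12339, Kampala</p>
<p>Tel: +41 414220702</p>
<p>Fax: +41 414220929</p>"""


def find_field(label, text):
    """Return the number that follows `label:`, or None if it is absent.

    find_field is a hypothetical helper, not part of any library.
    """
    match = re.search(re.escape(label) + r':\s*([+\d][+\d\s]*)', text)
    return match.group(1).strip() if match else None


telephone = find_field('Tel', html)
fax = find_field('Fax', html)
mobile = find_field('Mobile', html)  # label not in the sample, so None
```

Callers can then test each variable for None instead of wrapping every lookup in try/except.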

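Both solutions are fragile: they depend on the exact layout of the markup and may break (possibly spectacularly) if the format of your HTML changes.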
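
One way to reduce the dependence on line layout while staying in the standard library is html.parser.HTMLParser. The sketch below is only an illustration under stated assumptions: the class name ParagraphCollector is made up here, and the code assumes that every <p> that is not labelled Tel or Fax belongs to the address, which holds for the sample in the question but may not hold for other pages.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element, in document order.

    ParagraphCollector is a hypothetical name used for this sketch.
    """

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self._in_p = True
            self.paragraphs.append('')

    def handle_endtag(self, tag):
        if tag == 'p':
            self._in_p = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the open paragraph.
        if self._in_p:
            self.paragraphs[-1] += data


# Sample HTML copied from the question.
html = """<h2>My name</h2>
<img src="images/logos/" style="float:right" />
<p>Adress  37/41 Portbell</p>
<p>P.O.Box 12339, Kampala</p>
<p>Tel: +41 414220702</p>
<p>Fax: +41 414220929</p>"""

collector = ParagraphCollector()
collector.feed(html)
collector.close()

fields = {}
address_lines = []
for text in collector.paragraphs:
    text = text.strip()
    label, sep, value = text.partition(':')
    if sep and label in ('Tel', 'Fax'):
        fields[label] = value.strip()
    else:
        # Assumption: unlabelled paragraphs are address lines.
        address_lines.append(text)

telephone = fields.get('Tel')  # None if the page has no "Tel:" paragraph
fax = fields.get('Fax')
address = '\n'.join(address_lines)
```

Because the parser works on tags rather than on lines, this version does not care whether the <p> elements are indented or share a line, but it still relies on the Tel: and Fax: labels being spelled exactly as in the sample.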
