from bs4 import BeautifulSoup
import re
data = open(r'C:\folder')  # raw string, so '\f' is not parsed as a form-feed escape
soup = BeautifulSoup(data, 'html.parser')
emails = soup.find_all('td', text = re.compile('@'))
for line in emails:
    print(line)
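The script above works perfectly in Python 2.7 with BeautifulSoup: it extracts the content between several <td> tags in an HTML file. However, when I run the same script in Python 3.6.4, I get the following result: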
<td>xxx@xxx.com</td>
<td>xxx@xxx.com</td>
I want the content without the surrounding TD tags...
Why does this happen in Python 3?
Answer 0: (score: 0)
I found the answer...
from bs4 import BeautifulSoup
import re
data = open(r'C:\folder')  # raw string, so '\f' is not parsed as a form-feed escape
soup = BeautifulSoup(data, 'html.parser')  # added html.parser
emails = soup.find_all('td', text = re.compile('@'))
for td in emails:
    print(td.get_text())  # only the text inside the cell, without the <td> tags
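Look closely at the last two lines :) find_all('td', ...) returns Tag objects, and printing a Tag renders its full markup. Calling get_text() on each Tag returns only the text inside it.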
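To make the difference concrete, here is a minimal sketch that runs without the original file. The SAMPLE_HTML string and the extract_emails helper are hypothetical stand-ins, not part of the original script. The sketch also assumes Beautiful Soup 4.4.0 or later, where the string= keyword is the current name for the older text= argument.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the HTML file from the question.
SAMPLE_HTML = """
<table>
  <tr><td>alice@example.com</td><td>Alice</td></tr>
  <tr><td>bob@example.com</td><td>Bob</td></tr>
</table>
"""

def extract_emails(html):
    """Return the text of every <td> whose string contains '@'."""
    soup = BeautifulSoup(html, 'html.parser')
    # Assumes bs4 >= 4.4.0, where string= replaces the older text= keyword.
    cells = soup.find_all('td', string=re.compile('@'))
    # get_text() drops the enclosing <td> markup and keeps only the text.
    return [td.get_text() for td in cells]

# Compare printing a Tag directly with printing its text.
cells = BeautifulSoup(SAMPLE_HTML, 'html.parser').find_all(
    'td', string=re.compile('@'))
print(cells[0])             # the Tag itself, rendered with its <td> markup
print(cells[0].get_text())  # only the text inside the cell
print(extract_emails(SAMPLE_HTML))
```

If the Python 2.7 environment had an older BeautifulSoup release installed, that may also explain the different output, since older releases could return the matched strings rather than the enclosing tags. This is an assumption worth checking against the installed versions.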