Question

from bs4 import BeautifulSoup
import re

data = open(r'C:\folder')  # raw string so that \f is not read as a form-feed escape
soup = BeautifulSoup(data, 'html.parser')
emails = soup.find_all('td', text = re.compile('@'))

for line in emails:
   print(line)

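My script above worked perfectly in Python 2.7 with BeautifulSoup, extracting the content between several <td> tags in an HTML file. However, when I run the same script in Python 3.6.4, I get the following result: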

<td>xxx@xxx.com</td>
<td>xxx@xxx.com</td>

I want the content without the <td> tags...

Why does this happen in Python 3?

Answer 1

I found the answer...

from bs4 import BeautifulSoup
import re

data = open(r'C:\folder')  # raw string so that \f is not read as a form-feed escape
soup = BeautifulSoup(data, 'html.parser')  # added html.parser
emails = soup.find_all('td', text = re.compile('@'))

for td in emails:
   print(td.get_text())

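Look closely at the last two lines :) find_all() returns the matching <td> Tag objects, so printing one shows the whole tag; calling get_text() on it returns only the text inside.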
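
For anyone who wants to try this without the original file, here is a minimal self-contained sketch. The inline HTML and the addresses in it are made up for illustration, and it uses `string=`, which the Beautiful Soup documentation describes as the newer name for the `text=` argument (available from 4.4.0), so treat it as an assumption about your installed version rather than a required change.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for the file from the question.
html = """
<table>
  <tr><td>Alice</td><td>alice@example.com</td></tr>
  <tr><td>Bob</td><td>bob@example.com</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# A tag name combined with string= returns the matching <td> Tag objects,
# not the bare strings inside them.
cells = soup.find_all("td", string=re.compile("@"))

# Printing a Tag shows its markup; get_text() returns only the text inside it.
emails = [td.get_text() for td in cells]
for email in emails:
    print(email)
```

Printing `cells[0]` directly would show the full `<td>...</td>` markup, which is the behaviour described in the question.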

Python 3 - Extract content between <td> tags
