Question

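I'm experimenting with the BeautifulSoup library. I'm trying to extract email addresses from a website, but I'm getting unexpected results. Here is my code: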

from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

from bs4 import BeautifulSoup
import re
from urllib.parse import quote 

startUrl = "http://getrocketbook.com/pages/returns"
try:
    html = urlopen(quote((startUrl).encode('utf8'), ':/?%#_'))
    bsObj = BeautifulSoup(html, "html.parser")
    alls = bsObj.body.findAll(text=re.compile(r'[A-Za-z0-9\._+-]+@[A-Za-z0-9\.-]+'))
    for al in alls:
        print(al)
except HTTPError:
    pass
except URLError:
    pass

I expected to get just the email address, but what I actually get is the whole sentence:

If you’ve done all of this and you still have not received your refund yet, please contact us at hello@getrocketbook.com.

Any idea what I'm doing wrong?

Answer 1

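That's because findAll() looks for whole elements or text nodes, not individual words. The regular expression only decides which text nodes are returned; each result is still the complete text node that contains a match.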

What you need to do is apply the same compiled regular expression to each result:

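# Raw string so the backslashes reach the regex engine unchanged
pattern = re.compile(r'[A-Za-z0-9\._+-]+@[A-Za-z0-9\.-]+')
# find_all() returns every text node that contains a match
alls = bsObj.body.find_all(text=pattern)
for al in alls:
    # search() pulls the matched substring out of the text node
    print(pattern.search(al).group(0))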

Also, since there is only one email address here, see whether you can use the find() method instead.
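
One thing to watch for: the domain part of the pattern allows '.', so for the sentence in the question the match would likely come back as hello@getrocketbook.com. with the sentence-ending period attached. Below is a minimal sketch of one way to handle that, and to pick up more than one address per text node. The helper name extract_emails and the sample strings are made up for illustration; the strings stand in for the text nodes that find_all() would return, so the sketch runs without fetching the page.

```python
import re

# Same character classes as the pattern above, written as a raw string.
EMAIL_PATTERN = re.compile(r'[A-Za-z0-9._+-]+@[A-Za-z0-9.-]+')


def extract_emails(text_nodes):
    """Return every address found in an iterable of strings (hypothetical helper)."""
    emails = []
    for node in text_nodes:
        # findall() returns all matches in the node, not just the first one
        for match in EMAIL_PATTERN.findall(node):
            # The domain class allows '.', so a trailing sentence period
            # gets captured too; strip it off.
            emails.append(match.rstrip('.'))
    return emails


# Plain strings standing in for the text nodes returned by find_all()
sample_nodes = [
    "If you've done all of this and you still have not received your "
    "refund yet, please contact us at hello@getrocketbook.com.",
    "No address in this node.",
]
print(extract_emails(sample_nodes))
```

With real BeautifulSoup results you could pass alls straight to the helper, since each text node behaves like a string.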
