Scraping text that isn't contained in any element

Date: 2015-03-08 01:13:44

Tags: python beautifulsoup

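I'm using Beautiful Soup 4 to scrape a poorly written website. I've got everything except the user's email address, which isn't inside any containing element that would distinguish it. Any ideas on how to scrape it? As I expected, the next_sibling of the strong element skips right over it.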

<div class="fieldset-wrapper">
 <strong>
  E-mail address:
 </strong>
 useremail@yahoo.com
 <div class="field field-name-ds-user-picture field-type-ds field-label-hidden">
  <div class="field-items">

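2 answers:

Answer 0 (score: 2)

I'm not sure this is the best approach, but you can grab the parent element, then iterate over its children and look at the ones that are not tags: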

from bs4 import BeautifulSoup
import bs4

html='''
<div class="fieldset-wrapper">
 <strong>
  E-mail address:
 </strong>
 useremail@yahoo.com
 <div class="field field-name-ds-user-picture field-type-ds field-label-hidden">
  <div class="field-items">
'''


def print_if_email(s):
    # Crude heuristic: anything containing '@' is treated as an email address
    if '@' in s:
        print(s)

# Name the parser explicitly so the result doesn't depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')

# Iterate over all divs, you could narrow this down if you had more information
for div in soup.find_all('div'):
    # Iterate over the children of each matching div
    for c in div.children:
        # If it wasn't parsed as a tag, it may be a NavigableString
        if isinstance(c, bs4.element.NavigableString):
            # Some heuristic to identify email addresses if other non-tags exist
            print_if_email(c.strip())

This prints:

useremail@yahoo.com

Of course, the inner for loop and the if statement can be combined into:

for c in filter(lambda c: isinstance(c, bs4.element.NavigableString), div.children):
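
As a side note on the question's premise: in Beautiful Soup 4, the .next_sibling attribute returns the next node of any kind, bare text included, while find_next_sibling() only returns tags. So it may be worth double-checking which one is being called. A minimal sketch, assuming the built-in html.parser backend and a trimmed, closed-off copy of the question's markup (the sample string and variable names are just for illustration):

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed, well-formed copy of the markup from the question (illustrative).
sample = '''
<div class="fieldset-wrapper">
 <strong>
  E-mail address:
 </strong>
 useremail@yahoo.com
 <div class="field-items"></div>
</div>
'''

soup = BeautifulSoup(sample, 'html.parser')
label = soup.find('strong')

# .next_sibling walks to the very next node, which here is the bare text.
text_node = label.next_sibling
email = text_node.strip() if isinstance(text_node, NavigableString) else None

# find_next_sibling() skips text nodes and returns the next tag instead.
next_tag = label.find_next_sibling()

print(email)          # useremail@yahoo.com
print(next_tag.name)  # div
```

If the text really is the node right after the label, this avoids looping over every div on the page.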

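Answer 1 (score: 0)

I can't answer your question directly, since I've never used Beautiful Soup (so don't accept this answer!), but I just wanted to point out that if the pages are all very simple, another option might be to write your own parser using .split().

It's fairly clumsy, but worth considering if the pages are simple/predictable...

That is, if you know something about the overall layout of the page (e.g., the user's email is always the first email mentioned), you could write your own parser that finds the bits before and after the '@' sign: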

# html = the entire document as a string

# return the entire document up to the '@' sign
bit_before_at_sign = html.split('@')[0]
# only useful if you know first email is the one you care about

# you could then cut out everything before username with something like this
b = bit_before_at_sign
# a very long string, we just want the last bit right before the @ sign
username = b.split(' ')[-1].split('\n')[-1].split('\r')[-1].split(';')[-1]
# add more if required, depending on how the html looks to you 
# (I've just guessed some html elements that might precede the username)

# you could similarly parse the bit after the @ sign, 
# html.split('@')[1]  
# e.g., checking the first few characters of this
# against a known list of .tlds like '.com', '.co.uk', etc  
# (remember some TLDs have more than one period, so don't just parse by '.')
# and combine with the username you already know
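
Putting those pieces together, here is a rough sketch of the whole split-based idea. The function name first_email_by_split, the page string and the delimiter lists are all made up for illustration, and instead of checking against a list of TLDs it simply cuts the domain at the first whitespace, ';' or '<', which is an assumption about how the page looks:

```python
def first_email_by_split(html):
    """Return the first 'user@domain' token in html, or None (naive sketch)."""
    before, at, after = html.partition('@')
    if not at:
        return None  # no '@' anywhere in the document

    # Username: whatever follows the last delimiter before the '@'.
    username = before
    for sep in (' ', '\n', '\r', '\t', ';', '>'):
        username = username.split(sep)[-1]

    # Domain: whatever precedes the first delimiter after the '@'.
    domain = after
    for sep in (' ', '\n', '\r', '\t', ';', '<'):
        domain = domain.split(sep)[0]

    if not username or not domain:
        return None
    return username + '@' + domain


page = '''
<div class="fieldset-wrapper">
 <strong>
  E-mail address:
 </strong>
 useremail@yahoo.com
 <div class="field-items">
'''

print(first_email_by_split(page))  # useremail@yahoo.com
```

This only ever looks at the first '@' in the document, so it is only as good as the assumption that the first address is the one you want.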

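A few other things that may come in handy if you want to narrow down which part of the document you're looking at:

If you want to make sure the word "email" is also in the string you're parsing: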

b_lower = b.lower()  # str has a .lower() method; there is no built-in lower()
if 'email' in b_lower or 'e-mail' in b_lower:
    pass  # do something...

To check where the @ sign first occurs in the document:

html.index('@')
# e.g., if you want to see how near this '@' symbol is to some other element you know about 
# such as the word 'e-mail', or a particular div element or '</strong>'

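To limit your email search to the 300 characters after another element you know about (slice backwards from the index instead to look before it):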

startfrom = html.index('</strong>')
html_i_will_search = html[startfrom:startfrom+300]
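
A sketch of how that window could be searched, assuming a deliberately loose regular expression rather than a real address validator. The names EMAIL_RE and email_near and the default marker are hypothetical, and find() is used instead of index() so that a missing marker returns None rather than raising ValueError:

```python
import re

# Loose pattern for illustration only; it is not a full address validator.
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')


def email_near(html, marker='</strong>', window=300):
    """Return the first address within `window` characters after `marker`."""
    start = html.find(marker)
    if start == -1:
        return None  # marker not present on this page
    match = EMAIL_RE.search(html[start:start + window])
    return match.group(0) if match else None


page = ('Contact support@site.com for help. '
        '<strong>E-mail address:</strong>\n useremail@yahoo.com\n<div>')

print(email_near(page))  # useremail@yahoo.com
```

Because the search starts at the marker, the support address earlier on the page is never considered.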

I imagine a few more minutes on Google might prove useful; your task sounds unusual :)

And make sure you account for pages with more than one email address on them (e.g., so you don't end up assigning support@site.com to every user!)
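
One way to handle that case, sketched under the assumption that you know the site-wide addresses in advance: collect every address-like token and drop the known ones. SITE_ADDRESSES, EMAIL_RE and user_emails are made-up names, and the pattern is intentionally loose:

```python
import re

# Loose pattern for illustration only; it is not a full address validator.
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')

# Hypothetical list of addresses that belong to the site, not to a user.
SITE_ADDRESSES = {'support@site.com'}


def user_emails(html, ignored=SITE_ADDRESSES):
    """Return the distinct addresses in html, in order, minus ignored ones."""
    found = []
    for address in EMAIL_RE.findall(html):
        if address.lower() not in ignored and address not in found:
            found.append(address)
    return found


page = ('Contact support@site.com. '
        'E-mail address: useremail@yahoo.com, useremail@yahoo.com')

print(user_emails(page))  # ['useremail@yahoo.com']
```

If this returns more than one address for a page, that is a sign the page needs a closer look rather than a guess.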

Whichever approach you take, if you're in any doubt it may be worth checking your result with email.utils.parseaddr() or someone else's regular-expression check. See previous question
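
A small sketch of such a check. Note that parseaddr() parses rather than validates and is quite lenient, so the extra tests on the '@' and on a dot in the domain are my own assumption about what counts as plausible; looks_like_email is a hypothetical helper name:

```python
from email.utils import parseaddr


def looks_like_email(candidate):
    """Rough sanity check for a scraped string (not a real validator)."""
    _name, address = parseaddr(candidate)
    # parseaddr should hand back the candidate unchanged if it is a bare address.
    if address != candidate.strip():
        return False
    local, at, domain = address.partition('@')
    # Require something before the '@' and a dot inside the domain part.
    return bool(local and at and '.' in domain.strip('.'))


print(looks_like_email('useremail@yahoo.com'))  # True
print(looks_like_email('E-mail address:'))      # False
```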