BeautifulSoup: Combine consecutive NavigableStrings into a single NavigableString

Time: 2017-03-01 05:56:47

Tags: python-2.7 beautifulsoup

<html>
<body>
<p>A <span>die</span> is thrown \(x = {-b \pm\sqrt{b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from both the throws?</p>
<p> Test </p>
</body>
</html>

I am trying to wrap \(x = {-b \pm\sqrt{b^2-4ac} \over 2a}\) inside a span tag. I can do this when " is thrown \(x = {-b \pm\sqrt{b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from both the throws?" is a single NavigableString, but in some cases that text is split into three NavigableStrings. Is there a way to merge consecutive NavigableStrings into a single NavigableString using BeautifulSoup?

This is the code I use to wrap the formula in a span tag when \(x = {-b \pm\sqrt{b^2-4ac} \over 2a}\) lies within a single NavigableString:

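import re
from bs4 import NavigableString

# Non-greedy match of everything between \( and \), including newlines.
mathml_regex = re.compile(r'\\\(.*?\\\)', re.DOTALL)

def mathml_wrap(soup):
    for p_tag in soup.find_all('p'):
        # Iterate over a copy, because the loop body changes the children.
        for p_child in list(p_tag.children):
            # Only strings are searched; Tag children are skipped.
            if not isinstance(p_child, NavigableString):
                continue
            text = p_child
            match = mathml_regex.search(text)
            while match:
                start, end = match.span()
                new_str = NavigableString(text[:start])
                text.replace_with(new_str)
                new_str1 = NavigableString(text[end:])
                span_tag = soup.new_tag("span", **{'class': 'math-tex'})
                span_tag.string = text[start:end]
                new_str.insert_after(span_tag)
                span_tag.insert_after(new_str1)
                # Keep searching the remainder for further formulas.
                text = new_str1
                match = mathml_regex.search(text)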

Edit:

from bs4 import BeautifulSoup, Tag
import re

html = r"""<p>
     A 
     <span>die</span> 
      is thrown \(x = {-b \pm 
      <span>\sqrt</span>
      {b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from
    both the throws?
    </p> <p> Test </p>"""

soup = BeautifulSoup(html, 'html.parser')
mathml_start_regex = re.compile(r'\\\(')
mathml_end_regex = re.compile(r'\\\)')

for p_tag in soup.find_all('p'):
    inside_math = False  # True after '\(' is found, False again after '\)'.
    for p_child in list(p_tag.children):
        is_tag = isinstance(p_child, Tag)
        # Text of the child, whether it is a Tag or a NavigableString.
        text = p_child.get_text() if is_tag else p_child
        if mathml_start_regex.search(text):
            inside_math = True
        # Replace a Tag inside the formula with its text; strings stay as they are.
        if inside_math and is_tag:
            p_child.replace_with(text)
        if mathml_end_regex.search(text):
            inside_math = False

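After processing my soup with the code above, which removes the span tags between \( and \), the text " is thrown \(x = {-b \pm\sqrt{b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from both the throws?" is split into 3 NavigableStrings in my soup object.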
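The split can be reproduced on a much smaller input. The sketch below uses shortened, hypothetical markup (not the exact paragraph above) and assumes that replace_with() simply puts a new string node where the tag was, without joining it to its neighbours:

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical, shortened markup: one tag sits inside the \( ... \) formula.
html = r"<p>thrown \(x = <span>\sqrt</span>{b}\) twice</p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# Replace the inner tag with its own text, as the loop above does.
p.span.replace_with(p.span.get_text())

# The paragraph now holds only strings, but they are still separate nodes.
pieces = [child for child in p.children if isinstance(child, NavigableString)]
print(len(pieces))
print(p.get_text())
```

The text of the paragraph reads as one continuous string, yet a regex applied to each child separately never sees \( and \) together.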

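1 Answer:

Answer 0 (score: 0)

I'm not sure whether I have understood your question correctly, but as you said, you want to concatenate the strings inside these <p> tags.

I used this as input: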

mystr = r"""<html>
<body>
<p>A <span>die</span> is thrown \(x = {-b \pm\sqrt{b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from both the throws?</p>
<p> Test </p>
</body>
</html>"""

So this is what I did:

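from bs4 import BeautifulSoup

soup = BeautifulSoup(mystr, "lxml")
my_p = soup.find_all("p")
for p in my_p:
    # p.text joins every string inside the tag, including nested tags.
    print(p.text)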

This extracts the full text of each <p> tag. Let me know if your problem is something else.
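If the goal is to merge the adjacent strings inside the tree rather than to extract the text, a minimal sketch could look like the following. The helper name merge_adjacent_strings and the shortened markup are assumptions made for illustration; they are not part of the question or of the BeautifulSoup API.

```python
from bs4 import BeautifulSoup, NavigableString

def merge_adjacent_strings(tag):
    """Merge runs of consecutive plain NavigableString children of `tag` in place."""
    previous = None  # last plain string seen with no tag in between
    for child in list(tag.contents):
        # type(...) is used so that comments and other subclasses are left alone.
        if type(child) is NavigableString:
            if previous is None:
                previous = child
            else:
                merged = NavigableString(previous + child)
                previous.replace_with(merged)
                child.extract()
                previous = merged
        else:
            previous = None

# Hypothetical, shortened markup with one tag inside the formula.
html = r"<p>thrown \(x = <span>\sqrt</span>{b}\) twice</p>"
soup = BeautifulSoup(html, "html.parser")
soup.p.span.replace_with(soup.p.span.get_text())  # leaves three adjacent strings
merge_adjacent_strings(soup.p)
print(len(soup.p.contents))
```

After the merge, the paragraph has a single string child, so a regex for \( ... \) can match across what used to be three nodes. As far as I know, Beautiful Soup 4.8.0 and later also ship a smooth() method on tags that consolidates adjacent strings, which may make a helper like this unnecessary; on older releases, re-parsing str(soup) is another option.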