Question

\(x = {-b \pm <span>\sqrt</span> {b^2-4ac} \over 2a}\)

In the HTML above, I only need to remove the tags that appear inside "\(tags\)", i.e. {{1}}. I have just started using BeautifulSoup. Is there a way to do this with BeautifulSoup?

Answer 1

I figured out a solution to my problem. I hope it helps someone else. Feel free to suggest improvements to the code.

from bs4 import BeautifulSoup
from bs4.element import Tag
import re

# Raw string so the LaTeX backslashes (\(, \pm, \sqrt, ...) are kept literally.
html = r"""<p>
     A 
     <span>die</span> 
      is thrown \(x = {-b \pm 
      <span>\sqrt</span>
      {b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from
    both the throws?
    </p> <p> Test </p>"""

soup = BeautifulSoup(html, 'html.parser')
mathml_start_regex = re.compile(r'\\\(')
mathml_end_regex = re.compile(r'\\\)')

for p_tag in soup.find_all('p'):
    match = 0  # Flag set to 1 when '\(' is found and back to 0 when '\)' is found.
    # Iterate over a copy because children are replaced inside the loop.
    for p_child in list(p_tag.children):
        is_tag = isinstance(p_child, Tag)
        # A Tag exposes its text via .text; a NavigableString is already a string.
        child_text = p_child.text if is_tag else str(p_child)
        if mathml_start_regex.search(child_text):
            match = 1
        if match == 1 and is_tag:
            # Replace the Tag with its text; plain strings need no change.
            p_child.replace_with(child_text)
        if mathml_end_regex.search(child_text):
            match = 0

Output:

<p>
     A 
     <span>die</span> 
      is thrown \(x = {-b \pm 
      \sqrt
      {b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from
    both the throws?
    </p>
<p> Test
</p>

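In the code above, I search for all 'p' tags, which returns a bs4.element.ResultSet. The first for loop iterates over the result set to get each individual 'p' tag, and the second for loop uses the .children generator to iterate over that tag's children (which include both navigable strings and tags). Each child is searched for '\('; if it is found, match is set to 1. While match is 1, any child that is a tag is replaced by its own text using replace_with. Finally, match is set back to zero when '\)' is found.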
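As a possible alternative (not part of the answer above), the same cleanup can be sketched as a regex pre-pass over the raw HTML string before it is parsed. The names `strip_tags_in_math`, `MATH_SPAN`, and `ANY_TAG` are hypothetical, and the sketch assumes that the delimiters are not nested and that the math itself contains no literal `<...>` sequences.

```python
import re

# Assumption: a math span runs from a literal \( to the next literal \).
MATH_SPAN = re.compile(r'\\\(.*?\\\)', re.DOTALL)
# Assumption: anything shaped like <...> inside a math span is markup.
ANY_TAG = re.compile(r'<[^>]+>')


def strip_tags_in_math(html):
    # Remove markup only inside each math span; tags elsewhere are untouched.
    return MATH_SPAN.sub(lambda m: ANY_TAG.sub('', m.group(0)), html)


html = (r'<p>A <span>die</span> is thrown '
        r'\(x = {-b \pm <span>\sqrt</span> {b^2-4ac} \over 2a}\) twice.</p>')
cleaned = strip_tags_in_math(html)
print(cleaned)
# <p>A <span>die</span> is thrown \(x = {-b \pm \sqrt {b^2-4ac} \over 2a}\) twice.</p>
```

The cleaned string can then be handed to `BeautifulSoup(cleaned, 'html.parser')` as usual. The tree-walking version may be the safer choice when the markup inside the formula is more complex than a simple wrapper tag.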

Answer 2

BeautifulSoup alone cannot extract a substring. You can combine it with a regular expression.

from bs4 import BeautifulSoup
import re

html = r"""<p>
     A 
     <span>die</span> 
      is thrown \(x = {-b \pm 
      <span>\sqrt</span>
      {b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from
    both the throws?
    </p>"""

soup = BeautifulSoup(html, 'html.parser')

print(re.findall(r'\\\(.*?\)', soup.text, re.DOTALL))

Output:

['\\(x = {-b \\pm \n  \\sqrt\n  {b^2-4ac} \\over 2a}\\)']

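Regular expression:

\\\(.*?\) - Matches a literal \( and then, lazily, everything up to the first ). With re.DOTALL, the dot also matches newlines.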
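To illustrate how the pattern behaves, here is a small sketch on a made-up input string (the sample text and the variable names are assumptions, not from the answer). It compares the pattern above with a stricter variant that ends only at a full `\)` delimiter, and with a greedy variant.

```python
import re

# Hypothetical input with two formulas that contain ordinary parentheses.
text = r'Let \(f(x) = x^2\) and \(g(x) = 2x\) be given.'

# Pattern from the answer: ends at the first ')', delimiter or not.
loose = re.findall(r'\\\(.*?\)', text, re.DOTALL)
# Stricter variant: ends only at a literal '\)'.
strict = re.findall(r'\\\(.*?\\\)', text, re.DOTALL)
# Greedy variant: runs from the first '\(' to the last '\)'.
greedy = re.findall(r'\\\(.*\\\)', text, re.DOTALL)

print(loose)   # ['\\(f(x)', '\\(g(x)']
print(strict)  # ['\\(f(x) = x^2\\)', '\\(g(x) = 2x\\)']
print(greedy)  # ['\\(f(x) = x^2\\) and \\(g(x) = 2x\\)']
```

For the quadratic formula in the question both the loose and the strict pattern give the same result, because the only `)` in the formula is the one in the closing `\)`.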

If you want to remove the newlines and extra spaces, you can do this:

res = re.findall(r'\\\(.*?\)', soup.text, re.DOTALL)[0]
print(' '.join(res.split()))

Output:

\(x = {-b \pm \sqrt {b^2-4ac} \over 2a}\)

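To wrap the string in an HTML document:

# The 'lxml' parser (requires the lxml package) adds the <html><body><p> wrapper.
print(BeautifulSoup(' '.join(res.split()), 'lxml'))

Output: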

<html><body><p>\(x = {-b \pm \sqrt {b^2-4ac} \over 2a}\)</p></body></html>

How to remove any HTML tags inside a specific pattern with BeautifulSoup

2 answers: