\(x = {-b \pm <span>\sqrt</span> {b^2-4ac} \over 2a}\)
In the HTML above, I only need to remove the tags that sit inside "\(tags\)", i.e. {{1}}. I have just started using BeautifulSoup. Is there any way to do this with BeautifulSoup?
Answer 0 (score: 2)
I worked out a solution to my own problem. I hope it helps someone else. Feel free to suggest improvements to the code.
from bs4 import BeautifulSoup
from bs4.element import Tag
import re

# Raw string so the LaTeX backslashes are kept literally.
html = r"""<p>
A
<span>die</span>
is thrown \(x = {-b \pm
<span>\sqrt</span>
{b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from
both the throws?
</p> <p> Test </p>"""

soup = BeautifulSoup(html, 'html.parser')
mathml_start_regex = re.compile(r'\\\(')
mathml_end_regex = re.compile(r'\\\)')

for p_tags in soup.find_all('p'):
    match = 0  # Flag: 1 after '\(' has been seen, back to 0 after '\)'.
    # Iterate over a copy of the children because replace_with() changes the tree.
    for p_child in list(p_tags.children):
        # A child is either a Tag or a NavigableString (plain text).
        is_tag = isinstance(p_child, Tag)
        text = p_child.get_text() if is_tag else str(p_child)
        if mathml_start_regex.search(text):
            match = 1
        # Only Tags need replacing; NavigableStrings carry no markup.
        if match == 1 and is_tag:
            p_child.replace_with(text)
        if mathml_end_regex.search(text):
            match = 0

print(soup)
Output:
<p>
A
<span>die</span>
is thrown \(x = {-b \pm
\sqrt
{b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from
both the throws?
</p> <p> Test </p>
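In the code above, I search for all 'p' tags, which returns a bs4.element.ResultSet. The outer for loop iterates over the result set to get each individual 'p' tag, and the inner for loop uses the .children generator to iterate over that tag's children (which include both NavigableStrings and Tags). Each child is searched for '\('; if it is found, match is set to 1. While match is 1, any child that is a Tag is replaced with its own text using replace_with, which removes the tag but keeps its content. Finally, match is set back to zero when '\)' is found.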
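To make the description above concrete, here is a minimal sketch of what `.children` yields and what `replace_with` does to a single tag. The one-line HTML sample and the variable names `kinds` and `span` are illustrative assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Illustrative sample: one <span> sitting inside a \( ... \) formula.
soup = BeautifulSoup(r'<p>thrown \(x = <span>\sqrt</span>{2}\)</p>', 'html.parser')

# .children mixes plain text nodes (NavigableString) with Tag objects.
kinds = [type(child).__name__ for child in soup.p.children]
print(kinds)  # ['NavigableString', 'Tag', 'NavigableString']

# Replacing a tag with its own text drops the markup but keeps the content.
span = soup.p.find('span')
span.replace_with(span.get_text())
print(soup)  # <p>thrown \(x = \sqrt{2}\)</p>
```

Checking the node type with `isinstance` is one way to avoid relying on a bare `except` to tell the two kinds of children apart.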
Answer 1 (score: 0)
BeautifulSoup on its own cannot extract a substring. You can combine it with a regular expression.
from bs4 import BeautifulSoup
import re

# Raw string so the LaTeX backslashes are kept literally.
html = r"""<p>
A
<span>die</span>
is thrown \(x = {-b \pm
<span>\sqrt</span>
{b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from
both the throws?
</p>"""

soup = BeautifulSoup(html, 'html.parser')
# soup.text drops all tags; re.DOTALL lets '.' match across newlines.
print(re.findall(r'\\\(.*?\)', soup.text, re.DOTALL))
Output:
['\\(x = {-b \\pm\n\\sqrt\n{b^2-4ac} \\over 2a}\\)']
Regular expression:
\\\(.*?\) - Matches lazily from a literal \( up to the first ) that follows. With re.DOTALL, the dot also matches newlines.
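Because the pattern stops at the first `)`, a formula that itself contains parentheses would be cut short. The sketch below compares it with a stricter variant that also requires the backslash before the closing parenthesis. The sample text and the names `loose` and `strict` are illustrative assumptions.

```python
import re

# Illustrative text: the formula itself contains parentheses.
text = r'Let \(f(x) = 1\) hold.'

# Pattern from the answer: ends at the first ')' after '\('.
loose = re.findall(r'\\\(.*?\)', text, re.DOTALL)

# Stricter variant: ends only at the literal '\)' delimiter.
strict = re.findall(r'\\\(.*?\\\)', text, re.DOTALL)

print(loose)   # ['\\(f(x)']
print(strict)  # ['\\(f(x) = 1\\)']
```

For the quadratic formula in the question both patterns return the same match, since it contains no parentheses of its own.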
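If you want to remove the newlines and extra spaces, you can do this:
res = re.findall(r'\\\(.*?\)', soup.text, re.DOTALL)[0]
# split() with no arguments splits on any run of whitespace,
# so joining with ' ' collapses newlines and repeated spaces.
print(' '.join(res.split()))
Output: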
\(x = {-b \pm \sqrt {b^2-4ac} \over 2a}\)
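Wrapping the string in HTML:
# The lxml parser (requires the lxml package) adds the <html><body><p> wrapper;
# html.parser would return the bare string instead.
print(BeautifulSoup(' '.join(res.split()), 'lxml'))
Output: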
<html><body><p>\(x = {-b \pm \sqrt {b^2-4ac} \over 2a}\)</p></body></html>
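If lxml is not installed, the wrapper can also be built explicitly. This is a minimal sketch assuming only a `<p>` wrapper is wanted; the `formula` literal stands in for the cleaned-up string from the previous step.

```python
from bs4 import BeautifulSoup

# Stand-in for the whitespace-normalised formula extracted earlier.
formula = r'\(x = {-b \pm \sqrt {b^2-4ac} \over 2a}\)'

# Start from an empty document and attach a new <p> tag holding the formula.
soup = BeautifulSoup('', 'html.parser')
p = soup.new_tag('p')
p.string = formula
soup.append(p)

print(soup)  # <p>\(x = {-b \pm \sqrt {b^2-4ac} \over 2a}\)</p>
```

Building the tag with `new_tag` keeps the result independent of which parser happens to be available.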