Question

I need to find all PHP tags, but I run into a problem with classes that call methods using "->". The parser picks up the ">" as the closing tag.

PHP tag: <html><body> Blah Blah Blah... <h2>Section Heading <?php $playFrom->time("09:58"); ?></h2>Blah blah blah </body></html>

My code:

taglist = soup.findAll("?php")
for tag in taglist:
    tag.replaceWith("")

The result after the replacement is: <h2>Section Heading time("09:58"); ?>

Can BeautifulSoup do this? If so, what is the correct way to do it?

Edit (1): As Ryan pointed out:

"PHP is not HTML, so you can't really parse it with an HTML parser."

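I found that the soup parser automatically strips the PHP and leaves fragments behind in the text of every <h2> tag. So my solution was to use findAll('h2') and clean up that text ... text.replace('badstuff', 'good stuff') ... My new question is: since lxml is the default parser (according to this link: Set lxml as default BeautifulSoup parser), shouldn't there still be a way to remove the PHP cleanly with BS4?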
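Following Ryan's remark, one possible workaround is to remove the PHP blocks from the raw text before the HTML parser ever sees them. The snippet below is only a minimal sketch using the standard `re` module; the names `PHP_BLOCK` and `strip_php` are hypothetical, and it assumes that no "?>" occurs inside a PHP string literal.

```python
import re

# Matches one PHP block, from "<?php" to the first "?>" (non-greedy).
# re.DOTALL lets the block span several lines.
# Assumption: no "?>" appears inside a PHP string literal.
PHP_BLOCK = re.compile(r"<\?php.*?\?>", re.DOTALL)


def strip_php(markup: str) -> str:
    """Return the markup with every <?php ... ?> block removed."""
    return PHP_BLOCK.sub("", markup)


html_text = (
    '<html><body> Blah Blah Blah... <h2>Section Heading '
    '<?php $playFrom->time("09:58"); ?></h2>Blah blah blah </body></html>'
)

cleaned = strip_php(html_text)
print(cleaned)
```

The cleaned string could then be passed to BeautifulSoup as usual, so the "->" inside the PHP call never reaches the parser.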

Note (my solution): by dropping the findAll("?php")... code above and simply letting BS4 parse the HTML, I get the following result for the <h2> tag.

<h2>Section Heading <?php $playFrom->time("09:58"); ?></h2>

becomes this:

<h2>Section Heading time("09:58"); ?></h2>

The result above comes from:

soup = BeautifulSoup(html.read(),'lxml')
print(soup.body.h2)
html.close()

The following version of the code cleans it up:

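# assumes `from bs4 import BeautifulSoup` and an open file object `html`
soup = BeautifulSoup(html.read(), 'lxml')

h2list = soup.findAll("h2")
for tag in h2list:
    text = tag.get_text()  # start from the tag's current text
    text = text.replace('time("', '(')
    text = text.replace('"); ?>', ')')
    tag.string = text  # write the cleaned text back into the tag

print(soup.body.h2)
html.close()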

This produces:

<h2>Section Heading (09:58)</h2>
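The two replace calls above are tied to this exact string. A slightly more general variant, sketched here with the standard `re` module, handles the whole leftover fragment in one pattern. The names `LEFTOVER` and `clean_heading` are hypothetical, and the pattern assumes the fragment always has the form time("...") followed by "; ?>".

```python
import re

# Hypothetical pattern for the fragment the parser leaves behind,
# e.g.  time("09:58"); ?>   The quoted value is captured as group 1.
LEFTOVER = re.compile(r'time\("([^"]*)"\);\s*\?>')


def clean_heading(text: str) -> str:
    """Rewrite time("09:58"); ?> as (09:58) and keep the rest of the text."""
    return LEFTOVER.sub(r"(\1)", text)


print(clean_heading('Section Heading time("09:58"); ?>'))
# prints: Section Heading (09:58)
```

Inside the loop, this would replace the two text.replace(...) lines with a single call on the tag's text.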

Python BeautifulSoup - findall("?php") (running into a problem with the end tag because of class->method)

0 answers: