Question

I'm using Beautiful Soup to read the HTML of a page.

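import urllib.request
from bs4 import BeautifulSoup
# `headers` is a dict of request headers defined earlier (not shown here)
req = urllib.request.Request('https://en.wikipedia.org/wiki/Barack_Obama', headers=headers)
html = urllib.request.urlopen(req)
page = BeautifulSoup(html, 'html.parser')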

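I want to split the HTML code on periods, with the condition that it does not split when the period is inside a tag other than the p tag. For example, if the HTML code is: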

<p>In June 2015, the Court ruled 6–3 in <i><a href="/wiki/King_v._Burwell" 
title="King v. Burwell">King v. Burwell</a></i> that subsidies to help individuals 
and families purchase health insurance were authorized for those doing so on both 
the federal exchange and state exchanges, not only those purchasing plans 
"established by the State", as the statute reads.</p>

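I don't mind splitting at a period in the p tag, while ignoring periods inside the tags themselves or in any other tags. Converting the HTML to a string and then splitting obviously doesn't work. The main reason I don't want to use Beautiful Soup's get_text() method and then split is that I want the split to happen on the original HTML. Does Beautiful Soup have a built-in split function that lets me check whether it is splitting at the right tag? Or is there some other way to do this? Thanks in advance :)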

So the output I need is the code split in two:

<p>In June 2015, the Court ruled 6–3 in <i><a href="/wiki/King_v._Burwell" 
title="King v. Burwell">King v


 . Burwell</a></i> that subsidies to help individuals and families purchase health insurance were authorized for those doing so on both the federal exchange and state exchanges, not only those purchasing plans "established by the State", as the statute reads.</p>
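
For illustration, here is a minimal sketch of one way to split raw HTML on periods while leaving tag markup alone. It is not a Beautiful Soup feature; it uses only the standard `re` module. The function name `split_html_on_periods`, the `TAG_RE` pattern, and the shortened `sample` string are hypothetical, and the sketch assumes that attribute values never contain a literal `>` and that every period in text (including link text) is a valid split point, which may or may not match the rule intended above.

```python
import re

# Matches one complete tag, e.g. <a href="..." title="King v. Burwell"> or </a>.
# Assumption: attribute values never contain a literal ">" character.
TAG_RE = re.compile(r'(<[^>]*>)')


def split_html_on_periods(html):
    """Split raw HTML after each period found in text, never inside tag markup."""
    chunks = []
    current = []
    # With one capture group, re.split alternates text, tag, text, tag, ...
    for index, token in enumerate(TAG_RE.split(html)):
        if index % 2 == 1:
            # Tag markup: copy it through untouched, so periods in
            # attributes such as title="King v. Burwell" are ignored.
            current.append(token)
            continue
        # Text between tags: cut after every period.
        parts = token.split('.')
        for part in parts[:-1]:
            current.append(part + '.')
            chunks.append(''.join(current))
            current = []
        current.append(parts[-1])
    tail = ''.join(current)
    if tail:
        chunks.append(tail)
    return chunks


# Shortened, hypothetical version of the paragraph from the question.
sample = ('<p>The Court ruled in <i><a href="/wiki/King_v._Burwell" '
          'title="King v. Burwell">King v. Burwell</a></i> '
          'as the statute reads.</p>')
pieces = split_html_on_periods(sample)
for piece in pieces:
    print(repr(piece))
```

Unlike the two-part output shown above, this sketch keeps each period at the end of its piece and also cuts at the final period, so the sample yields three pieces, the last being the closing `</p>`. Joining the pieces reproduces the original HTML exactly.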

Splitting HTML on a given character

0 answers: