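I'm using Beautiful Soup to load XML. I only need the text and want to ignore the tags, and the text attribute works well for that.
However, I want to completely exclude anything inside <table></table> tags. I had the idea of replacing everything in between with a regular expression, but I'm wondering whether there is a cleaner solution, partly because of "Don't parse [X]HTML with regex!". For example: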
s =""" <content><p>Hasselt ( ) is a <link target="Belgium">Belgian</link> <link target="city">city</link> and <link target="Municipalities in Belgium">municipality</link>.
<table><cell>Passenger growth
<cell>Year</cell><cell>Passengers</cell><cell>Percentage </cell></cell>
<cell>1996</cell><cell>360 000</cell><cell>100%</cell>
<cell>1997</cell><cell>1 498 088</cell><cell>428%</cell>
</table>"""
from bs4 import BeautifulSoup as Soup  # assumed alias; the question does not show its import
clean = Soup(s)
print(clean.text)
This gives:
Hasselt ( ) is a Belgian city and municipality.
Passenger growth
YearPassengersPercentage
1996360 000100%
19971 498 088428%
whereas I only want:
Hasselt ( ) is a Belgian city and municipality.
Answer 0 (score: 1)
You can find the content element, remove all table elements from it, and then get the text:
from bs4 import BeautifulSoup
s =""" <content><p>Hasselt ( ) is a <link target="Belgium">Belgian</link> <link target="city">city</link> and <link target="Municipalities in Belgium">municipality</link>.
<table><cell>Passenger growth
<cell>Year</cell><cell>Passengers</cell><cell>Percentage </cell></cell>
<cell>1996</cell><cell>360 000</cell><cell>100%</cell>
<cell>1997</cell><cell>1 498 088</cell><cell>428%</cell>
</table>"""
soup = BeautifulSoup(s, "xml")
content = soup.content
# Calling a tag is shorthand for find_all(); extract() detaches each table from the tree
for table in content("table"):
    table.extract()
print(content.get_text().strip())
Prints:
Hasselt ( ) is a Belgian city and municipality.
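
As a variation on the above, here is a minimal sketch that wraps the same idea in a reusable helper. The function name text_without and the shortened sample string are made up for illustration and are not part of Beautiful Soup or of the original question. It also assumes the built-in "html.parser" backend so that lxml is not required; passing "xml" as in the answer above may suit real XML data better.

```python
from bs4 import BeautifulSoup


def text_without(markup, tag_names, parser="html.parser"):
    """Return the text of `markup`, skipping everything inside `tag_names`.

    Hypothetical helper for illustration; not a Beautiful Soup API.
    """
    soup = BeautifulSoup(markup, parser)
    # find_all() accepts a list of tag names; extract() detaches each
    # matching element (together with its children) from the tree.
    for tag in soup.find_all(tag_names):
        tag.extract()
    # Join the remaining text nodes with a space, stripping each one.
    return soup.get_text(" ", strip=True)


# Shortened, made-up sample in the spirit of the question's data.
sample = (
    "<content><p>Hasselt is a Belgian city.</p>"
    "<table><cell>1996</cell><cell>360 000</cell></table>"
    "<p>Text after the table.</p></content>"
)
print(text_without(sample, ["table"]))
```

Passing several names, for example ["table", "cell"], should drop all of those elements in one pass. If the removed elements are never needed again, decompose() is a commonly used alternative to extract().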