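I want to extract the text of an HTML file that comes after the second occurrence of a specific tag.
I have already tried regex and bs4, but I can't figure out what is going wrong. The regex always gives me only the match itself, without the rest of the HTML file, and bs4 doesn't work for me because I don't know how to specify the end of the file.
Simplified example: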
<html>
<veryspecific tag>
abc
</veryspecific tag>
<stuff that comes before>
</stuff that comes before>
<...
<veryspecific tag>
abc
</veryspecific tag>
<other tags that come after>
something
</other tags that come after>
</...>
<other tags that come after2>
something
</other tags that come after2>
</html>
# I tried splitting it, so I can take the last part which should contain the end of the file, starting from the latest occurrence, but it did not work:
htmltxt.split(r'abc.*$')
# I also tried to get the last tag and try to "while" over the 2 to get the text:
last_tag = html_parsed.findall('a')[-1]
while specific_tag != last_tag:
text = ...
specific_tag = specific_tag.next
I can find the tag I need and extract it, but I also need the rest of the file after it. Is there a simple and pythonic way to do this?
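For reference, one likely reason the split attempt above did not work: `str.split` treats its argument as a literal string, not a regular expression, so `r'abc.*$'` is searched for verbatim. Below is a minimal sketch of a plain-string approach, assuming the marker can be written as an exact substring. The helper name `text_after_nth` and the `sample` string are made up for illustration; if the marker needs a pattern, `re.split` with `maxsplit` works along the same lines.

```python
def text_after_nth(htmltxt: str, marker: str, n: int = 2) -> str:
    """Return everything after the n-th occurrence of `marker`, or '' if absent."""
    # Split at most n times, so the last part holds the untouched rest of the input.
    parts = htmltxt.split(marker, n)
    # There are n + 1 parts only if the marker occurred at least n times.
    return parts[n] if len(parts) == n + 1 else ""


# Hypothetical, simplified input; <m> stands in for the "very specific" tag.
sample = "<a>x</a><m>abc</m><b>y</b><m>abc</m><c>z</c>"
rest = text_after_nth(sample, "<m>abc</m>")
print(rest)  # <c>z</c>
```

This returns raw markup rather than text, so the result would still need to be passed through a parser to strip the tags.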
Answer 0 (score: 1)
Here is a suggestion using BeautifulSoup:
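from bs4 import BeautifulSoup
soup = BeautifulSoup(htmltxt, 'html.parser')  # htmltxt: the HTML string from the question; parser choice is an assumption
# Find the second <veryspecific> tag, then collect every tag that follows it.
mark = soup.find('veryspecific').find_next('veryspecific')
all_other_tags = mark.find_all_next(name=True)
print(''.join(i.text for i in all_other_tags))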
It gives me this output:
something
something
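One caveat with `find_all_next(name=True)`: it returns nested tags as well as their parents, so text inside a nested tag can appear more than once in the joined result. A possible variant, sketched below, walks the text nodes instead of the tags. The tag names (`marker`, `after`, `after2`) and the sample markup are invented for the example, and `html.parser` is just one parser choice.

```python
from bs4 import BeautifulSoup

# Hypothetical sample; <marker> stands in for the "very specific" tag.
html = """
<html>
<marker>abc</marker>
<before>skip me</before>
<div>
<marker>abc</marker>
<after><b>nested</b> text</after>
</div>
<after2>tail</after2>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
second = soup.find_all("marker")[1]  # IndexError if fewer than two markers exist

# Visit every text node after the second marker in document order.
# Each node is visited once, so nested tags do not duplicate text.
pieces = [
    s.strip()
    for s in second.find_all_next(string=True)
    if s.strip() and not any(p is second for p in s.parents)  # skip the marker's own text
]
rest_text = " ".join(pieces)
print(rest_text)  # nested text tail
```

With the same sample, joining `.text` over `find_all_next(name=True)` would include "nested" twice, once from `<after>` and once from `<b>`.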