This is the HTML:
<div class="body">
<p>this is the<br />
text that i want to<br />
.<br />
.<br />
get from html file<br />
.<br />
.</p>
<div class="sender">someone</div>
</div>
I want only the text inside the <p> tag, without the <br/> tags. I also need the periods between the lines!
I am using lxml, and my code looks like this:
jokes = tree.xpath("//div[contains(@class,'body')]/p/text()")
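It returns each line as a separate item in the list, but I need all of the text in the <p> tag as a single item in the list.
Is there a way to add the whole p tag, without the br tags, to the list as one item?
Something like this: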
this is the
text that i want to
.
.
get from html file
.
.
When I save the list to a file with this code:
with open('c:\\f.txt','w') as f:
    for l in jokes:
        f.write(l+'**************')
This is what I see in the file:
this is the************
text that i want to************
.************
.************
get from html file************
.************
.************
Answer 0 (score: 3)
It may be overkill depending on the scope of your scraping, but give BeautifulSoup a try:
from bs4 import BeautifulSoup

HTML = """<div class="body">
<p>this is the<br />
text that i want to<br />
.<br />
.<br />
get from html file<br />
.<br />
.</p>
<div class="sender">someone</div>
</div>
"""
soup = BeautifulSoup(HTML, "html.parser")
print(soup.p.get_text())
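Since the question already uses lxml, here is a minimal sketch of staying with it, assuming the page is parsed with lxml.html: select the <p> elements themselves instead of their text() nodes, then call text_content() on each one. The names html_source, tree and jokes are only illustrative and mirror the question's code.

```python
import lxml.html

# Same markup as in the question.
html_source = """<div class="body">
<p>this is the<br />
text that i want to<br />
.<br />
.<br />
get from html file<br />
.<br />
.</p>
<div class="sender">someone</div>
</div>"""

tree = lxml.html.fromstring(html_source)

# Select the <p> elements (no trailing /text()), so each <p> yields one item.
paragraphs = tree.xpath("//div[contains(@class,'body')]/p")

# text_content() joins all text inside the element and drops the <br /> tags;
# the newlines that followed each <br /> in the source are kept.
jokes = [p.text_content() for p in paragraphs]

print(len(jokes))
print(jokes[0])
```

With this, the write loop from the question appends the separator once per <p> rather than once per line.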
Answer 1 (score: 0)
@Pete is right, Beautiful Soup will help here. For what it's worth, you can also strip the tags with the following function:
def stripTags(in_text):
    # convert in_text to a mutable object (e.g. list)
    s_list = list(in_text)
    i = 0
    while i < len(s_list):
        # iterate until a left-angle bracket is found
        if s_list[i] == '<':
            # pop everything from the left-angle bracket up to the right-angle bracket;
            # the bounds check avoids an IndexError on an unclosed tag
            while i < len(s_list) and s_list[i] != '>':
                s_list.pop(i)
            # pop the right-angle bracket, too
            if i < len(s_list):
                s_list.pop(i)
        else:
            i = i + 1
    # convert the list back into text
    return ''.join(s_list)
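If you would rather not scan characters by hand, the standard library's html.parser can do the same job. This is only a sketch of an alternative, not part of the function above; TextExtractor and strip_tags_with_parser are made-up names for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes; tags such as <br /> are ignored."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags.
        self.chunks.append(data)


def strip_tags_with_parser(markup):
    # Hypothetical helper: feed the markup and join whatever text was seen.
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


sample = "<p>this is the<br />\ntext that i want to<br />\n.</p>"
stripped = strip_tags_with_parser(sample)
print(stripped)
```

Unlike the character loop, the parser also decodes entities, so "a &amp; b" comes back as "a & b".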