Question

Here is the HTML:

<div class="body">
    <p>this is the<br />
    text that i want to<br />
    .<br />
    .<br />
    get from html file<br />
    .<br />
    .</p>
    <div class="sender">someone</div>
</div>

I only want the text inside the <p> tag, without the <br/> tags. I also need the periods between the lines!
I am using lxml, and my code looks like this:
jokes = tree.xpath("//div[contains(@class,'body')]/p/text()")
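It returns each line as a separate item in the list. But I need all of the text of the <p> tag as a single item in the list. Is there a way to add the whole <p> tag, without the br tags, to the list as one item?
Something like this: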

this is the
text that i want to
.
.
get from html file
.
.

When I save the list to a file with this code:

with open('c:\\f.txt','w') as f:
    for l in jokes:
        f.write(l+'**************')

This is what I see in the file:

this is the************
    text that i want to************
    .************
    .************
    get from html file************
    .************
    .************

Answer 1

Depending on the scope of your scraping this may be overkill, but give BeautifulSoup a try:

from bs4 import BeautifulSoup

HTML = """<div class="body">
    <p>this is the<br />
    text that i want to<br />
    .<br />
    .<br />
    get from html file<br />
    .<br />
    .</p>
    <div class="sender">someone</div>
</div>
"""
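# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(HTML, "html.parser")
# get_text() returns all the text inside <p> as one string, without the <br /> tags
print(soup.p.get_text())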
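
Since the question already uses lxml, the same result may be reachable without switching libraries. The following is only a minimal sketch: it assumes the document is parsed with lxml.html, and it selects the <p> elements themselves rather than their text() nodes, then calls text_content() on each one. The line-stripping step at the end is an optional extra, not something the question asked for.

```python
from lxml import html

HTML = """<div class="body">
    <p>this is the<br />
    text that i want to<br />
    .<br />
    .<br />
    get from html file<br />
    .<br />
    .</p>
    <div class="sender">someone</div>
</div>
"""

tree = html.fromstring(HTML)

# Select the <p> elements (not their text() nodes). text_content() joins
# every text node inside an element, so each <p> becomes one list item.
paragraphs = tree.xpath("//div[contains(@class,'body')]/p")
jokes = [p.text_content() for p in paragraphs]

# Optional: drop the source indentation at the start of each line.
jokes = ["\n".join(line.strip() for line in joke.splitlines()) for joke in jokes]
```

With this, the loop that writes the list to a file would add the row of asterisks once per <p> rather than once per line.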

Answer 2

@Pete is right, Beautiful Soup will help here. For what it's worth, you can also strip the tags with the following function:

def stripTags(in_text):
    # convert in_text to a mutable object (a list of characters)
    s_list = list(in_text)
    i = 0
    while i < len(s_list):
        # iterate until a left-angle bracket is found
        if s_list[i] == '<':
            # pop everything from the left-angle bracket up to the right-angle bracket;
            # the length check avoids an IndexError when a tag is never closed
            while i < len(s_list) and s_list[i] != '>':
                s_list.pop(i)
            # pop the right-angle bracket, too
            if i < len(s_list):
                s_list.pop(i)
        else:
            i += 1
    # convert the list back into text
    return ''.join(s_list)
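
If you would rather not scan characters by hand, the standard library's html.parser can do the tokenizing. This is only a sketch of that alternative: the names TextExtractor and strip_tags_stdlib are made up for illustration, and it simply keeps the text the parser reports between tags.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text found between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text outside of tags.
        self.chunks.append(data)


def strip_tags_stdlib(in_text):
    parser = TextExtractor()
    parser.feed(in_text)
    parser.close()  # flush any text the parser is still buffering
    return "".join(parser.chunks)
```

For example, strip_tags_stdlib("<p>this is the<br />\ntext</p>") gives back the text with the line break kept and the tags removed.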

How do I get the text of <p> without the <br> tags using lxml in Python?
