Question

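I am currently writing a scraper and I am stuck on the very last step, which ironically should be the simplest. The HTML is a kind of pop-up with the following structure.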

<html lang="en">
    <head>
        <title>Website Title</title>
    </head>

    <body>
        <h2>Full Development Description</h2><br/>
        <input type="hidden" name="saaa" value="000" />
        <input type="hidden" name="saaa" value="000" />

        This is the text I would like to to extract                  

        <input type="hidden" name="saa" value="This is the text 
                       I would like to to extract" size="7" />

        <input type="hidden" name="saaa" value="000" />
        <input type="hidden" name="saa" value="000" />

     </body>
</html>

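I want to extract the text "This is the text I would like to to extract". It appears twice: once as plain text in the body, and once as the value of a hidden input. That hidden input cannot be reliably told apart from the other hidden inputs, so I think the easiest approach is to extract the plain text.

My plan is to extract the body, which I know how to do. What I don't know is how to exclude tags, so that I can drop the h2 tag and the input tags together with all the data inside them.

I use the following code to extract the body: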

 body = response.css('body').extract()

Answer 1

I think this will do it for you, assuming the input field you want is the only one with a size of 7.

>>> from bs4 import BeautifulSoup
>>> page = open('temp.htm').read()
>>> soup = BeautifulSoup(page,'lxml')
>>> theInput = soup.findAll('input', attrs={'type': 'hidden', 'name': 'saa', 'size': '7'})
>>> len(theInput)
1
>>> theInput[0].attrs
{'type': 'hidden', 'size': '7', 'value': 'This is the text \n                       I would like to to extract', 'name': 'saa'}
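
The value returned above still contains the line break and indentation from the page source. As a rough sketch of one way to clean it up, the snippet below reads the `value` attribute and collapses the whitespace. It assumes the same markup as in the question, inlines it as a hypothetical `HTML` string instead of reading `temp.htm`, and uses the built-in `html.parser` backend rather than `lxml` so that it has no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical inline copy of the relevant part of the page.
HTML = """<body>
<input type="hidden" name="saaa" value="000" />
<input type="hidden" name="saa" value="This is the text
                       I would like to to extract" size="7" />
<input type="hidden" name="saa" value="000" />
</body>"""

soup = BeautifulSoup(HTML, "html.parser")

# Same filter as in the answer: hidden inputs named "saa" with size 7.
matches = soup.find_all("input", attrs={"type": "hidden", "name": "saa", "size": "7"})

# Split on any whitespace and rejoin with single spaces to remove
# the embedded newline and indentation.
value = " ".join(matches[0]["value"].split())
print(value)
```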
