Question

<div id="some id" class="some class">
    <table id="some other id" class="a different class">...</table>


        I want this text,


    <br>

        this text,


    <br>


        along with this text


    </div>

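I am trying to use Python to web-scrape multiple pages whose markup is similar to the above. I tried to grab the text with basic CSS selectors in Python, but could not work it out. Mainly, I want to know whether there is a selector I can pass to Beautiful Soup's select() method that selects what is inside the <div> without selecting what is inside the <table>. I tried selecting <br> (without knowing what it does), but that did not work.

I know very little about HTML, so I apologize for any errors or confusion caused by the code sample above.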

Answer 1

It may be easier to simply remove the child table tag:

from bs4 import BeautifulSoup as bs

html = '''
<div id="some id" class="some class">
    <table id="some other id" class="a different class">not this</table>


        I want this text,


    <br>

        this text,


    <br>


        along with this text


    </div>
'''

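# Parse the markup (the 'lxml' parser requires the lxml package to be installed)
soup = bs(html, 'lxml')
# Remove the child <table> from the tree so its text is no longer part of the <div>
soup.select_one('[id="some other id"]').extract()
# Print the text that remains in the outer <div>
print(soup.select_one('[id="some id"]').text)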
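
As a follow-up to the approach above, here is a minimal sketch of how the remaining text could be tidied up, since .text keeps all of the original line breaks and indentation. It assumes the same sample markup, uses the built-in 'html.parser' so that lxml is not required, and the name clean_text is only illustrative.

```python
from bs4 import BeautifulSoup

html = '''
<div id="some id" class="some class">
    <table id="some other id" class="a different class">not this</table>
        I want this text,
    <br>
        this text,
    <br>
        along with this text
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
div = soup.select_one('[id="some id"]')

# Remove every <table> inside the <div> so its text is not collected
for table in div.select('table'):
    table.decompose()

# Join the remaining text fragments with single spaces, trimming whitespace
clean_text = div.get_text(separator=' ', strip=True)
print(clean_text)
```

Note that decompose() modifies the parsed tree, so this only suits cases where the table is not needed afterwards.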

Answer 2

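The solution is actually quite simple. After some experimentation, I found that you can use the following code to get the text from the HTML above.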

import requests, bs4

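# Download the page and create a BeautifulSoup object
url = 'https://url.thisisthewebsitecontainingthehtml.com'
res = requests.get(url)
res.raise_for_status()
# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = bs4.BeautifulSoup(res.text, 'html.parser')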

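# Create a list of all <div> elements whose id is "some id"
divElems = soup.select('div[id="some id"]')
# Create an empty list to collect the text
trueText = []
for i in divElems:
    # list() yields the direct children of the <div>: tags and text nodes
    text = list(i)
    # The three text nodes sit at positions -5, -3 and -1, between the <br> tags
    trueText.append((text[-5].strip(), text[-3].strip(), text[-1].strip()))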

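Python's list() function splits the selected HTML into separate "chunks": everything under the <table> tag, the first bit of text, a <br> tag, the second bit of text, another <br> tag, and the last line of text. Because we only want the chunks that contain text, we add the elements at indices -5, -3 and -1 of the text list to our trueText list.

Running this code creates a list, trueText, that holds the desired text from the HTML above (one tuple of three strings per matching <div>).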
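
To make the "chunks" idea concrete, the sketch below prints each direct child of the <div> and then keeps the text nodes by type rather than by position, which may hold up better on pages where the number of <br> tags differs. CSS selectors only match elements, so the text nodes are filtered in Python instead. It assumes the sample markup from the question parsed with 'html.parser'; the name true_text is illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = '''
<div id="some id" class="some class">
    <table id="some other id" class="a different class">not this</table>
        I want this text,
    <br>
        this text,
    <br>
        along with this text
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
div = soup.select_one('div[id="some id"]')

# Show each direct child of the <div>: tags by name, text nodes as 'text'
for index, child in enumerate(div.children):
    kind = child.name if isinstance(child, Tag) else 'text'
    print(index, kind, repr(str(child).strip()))

# Keep only the non-empty text nodes, wherever they sit among the children
true_text = [
    str(child).strip()
    for child in div.children
    if isinstance(child, NavigableString) and child.strip()
]
print(true_text)
```

Because only direct children are inspected, text nested inside the <table> is skipped without modifying the parsed tree.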

Is there a selector (in Python) that can be used to select elements without tags?
