Question

I want to exclude specific text inside HTML span tags. In the example below, I only want to extract the test2 text from the spans with class a-list-item.

My HTML:

<span class="a-list-item">test1</span>
<span class="a-list-item">test2</span>
<span class="a-list-item">test2</span>

My code, tag = tag.find_all("span", {"class" : "a-list-item"}), only fetches all of the spans. Thanks for your reply.

Answer 1

It looks like you are using Beautiful Soup. In Beautiful Soup 4.7+, this is easy to do by using select instead of find_all. You can wrap :contains() in :not() to exclude spans that contain specific text:

from bs4 import BeautifulSoup
markup = '''
<span class="a-list-item">test1</span> 
<span class="a-list-item">test2</span> 
<span class="a-list-item">test2</span>
'''
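# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(markup, "html.parser")
# Note: newer soupsieve releases prefer :-soup-contains() over the deprecated :contains()
print(soup.select("span.a-list-item:not(:contains(test1))"))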
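
If only the text is needed rather than the tags, the same selector can feed a list comprehension. This is a minimal sketch, assuming a Beautiful Soup install that ships soupsieve 2.1 or later, where the pseudo-class is spelled :-soup-contains(); the variable names are illustrative.

```python
from bs4 import BeautifulSoup

markup = '''
<span class="a-list-item">test1</span>
<span class="a-list-item">test2</span>
<span class="a-list-item">test2</span>
'''

soup = BeautifulSoup(markup, "html.parser")

# Select every matching span whose text does not contain "test1"
spans = soup.select('span.a-list-item:not(:-soup-contains("test1"))')

# Keep only the text of each remaining span
texts = [span.get_text(strip=True) for span in spans]
print(texts)  # ['test2', 'test2']
```

On older installs the :contains() spelling used in the answer plays the same role.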

Answer 2

You can apply an XPath expression that excludes the spans containing test1:

//span[@class='a-list-item' and not(contains(text(), 'test1'))]

For example:

from lxml.html import fromstring
# To parse a live page instead (requires `import requests`):
# url = ''
# tree = fromstring(requests.get(url).content)
h = '''
<html>
 <head></head>
 <body>
  <span class="a-list-item">test1</span> 
  <span class="a-list-item">test2</span> 
  <span class="a-list-item">test2</span>
 </body>
</html>
'''
tree = fromstring(h)
items = [item.text for item in tree.xpath("//span[@class='a-list-item' and not(contains(text(), 'test1'))]")]
print(items)
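
One caveat with the expression above: contains(text(), ...) only inspects the first direct text node of each span, so text held in a child element is not seen. The sketch below, using made-up markup with a nested <b> tag, compares it with contains(., ...), which tests the span's whole string value. It assumes lxml is installed.

```python
from lxml.html import fromstring

# Hypothetical markup: the first span keeps its text inside a child <b> element
markup = '''
<div>
  <span class="a-list-item"><b>test1</b></span>
  <span class="a-list-item">test2</span>
</div>
'''
tree = fromstring(markup)

# text() sees no direct text node in the first span, so that span is NOT excluded
by_text_node = [
    s.text_content()
    for s in tree.xpath("//span[@class='a-list-item' and not(contains(text(), 'test1'))]")
]

# "." is the full string value of the span, including descendant text
by_string_value = [
    s.text_content()
    for s in tree.xpath("//span[@class='a-list-item' and not(contains(., 'test1'))]")
]

print(by_text_node)     # ['test1', 'test2']
print(by_string_value)  # ['test2']
```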

Or test the text value of each node matched by the CSS selector (matching on tag and class):

from bs4 import BeautifulSoup as bs

h = '''
<html>
 <head></head>
 <body>
  <span class="a-list-item">test1</span> 
  <span class="a-list-item">test2</span> 
  <span class="a-list-item">test2</span>
 </body>
</html>
'''
soup = bs(h, 'lxml')
items = [item.text for item in soup.select('span.a-list-item') if 'test1' not in item.text]
print(items)
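
Note that 'test1' not in item.text is a substring test, so it would also drop a span such as test10. If only an exact match should be excluded, a comparison on the stripped text is one option. The helper name texts_excluding and the extra test10 span are illustrative assumptions, not part of the question.

```python
from bs4 import BeautifulSoup


def texts_excluding(markup, css, unwanted):
    """Return the stripped text of nodes matching css, skipping exact matches of unwanted."""
    soup = BeautifulSoup(markup, "html.parser")
    texts = (node.get_text(strip=True) for node in soup.select(css))
    return [text for text in texts if text != unwanted]


# Hypothetical markup with a span whose text merely starts with "test1"
sample = '''
<span class="a-list-item">test1</span>
<span class="a-list-item">test10</span>
<span class="a-list-item">test2</span>
'''

result = texts_excluding(sample, "span.a-list-item", "test1")
print(result)  # ['test10', 'test2']
```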

Answer 3

Use a regular expression (the re module) to find the specific text.

from bs4 import BeautifulSoup
import re
html = '''
<span class="a-list-item">test1</span> 
<span class="a-list-item">test2</span> 
<span class="a-list-item">test2</span>
'''
soup = BeautifulSoup(html,'html.parser')
# `string=` is the current name of the older `text=` argument (Beautiful Soup 4.4.0+)
items = soup.find_all('span', string=re.compile("test2"))
for item in items:
    print(item.text)

Output:

test2
test2
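
The pattern above selects the wanted text. To phrase it as an exclusion instead, a negative lookahead is one possibility. This is a sketch under the assumption that each span holds a single string, since find_all matches the pattern against the tag's .string; the name not_test1 is illustrative.

```python
import re

from bs4 import BeautifulSoup

html = '''
<span class="a-list-item">test1</span>
<span class="a-list-item">test2</span>
<span class="a-list-item">test2</span>
'''
soup = BeautifulSoup(html, 'html.parser')

# Matches only strings that do not contain "test1" anywhere
not_test1 = re.compile(r"^(?!.*test1)", re.DOTALL)

items = soup.find_all('span', class_='a-list-item', string=not_test1)
kept = [item.text for item in items]
print(kept)  # ['test2', 'test2']
```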

Exclude data from tags

3 answers: