Exclude data from a tag

Time: 2019-04-04 05:07:48

Tags: python beautifulsoup

I want to exclude specific text inside HTML span tags. In the example below, I only want to extract the test2 text from the spans with class a-list-item.

My HTML:

<span class="a-list-item">test1</span> <span class="a-list-item">test2</span> <span class="a-list-item">test2</span>

My code:

tag = tag.find_all("span", {"class" : "a-list-item"})

How can I get only the test2 spans? Thanks for your replies.

3 answers:

Answer 0: (score: 2)

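It looks like you are using Beautiful Soup. In Beautiful Soup 4.7+, this is easy to do with select instead of find_all. You can wrap :contains() in :not() to exclude the spans that contain specific text.

from bs4 import BeautifulSoup
markup = '''
<span class="a-list-item">test1</span>
<span class="a-list-item">test2</span>
<span class="a-list-item">test2</span>
'''
# Name the parser explicitly to avoid the "no parser specified" warning.
soup = BeautifulSoup(markup, "html.parser")
# Soup Sieve 2.1+ deprecates :contains() in favor of :-soup-contains().
print(soup.select("span.a-list-item:not(:contains(test1))"))

Output

[<span class="a-list-item">test2</span>, <span class="a-list-item">test2</span>]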

Answer 1: (score: 0)

You can use an XPath expression to exclude the spans whose text contains test1:

//span[@class='a-list-item' and not(contains(text(), 'test1'))]

For example:

from lxml.html import fromstring
# To parse a live page instead (requires: import requests):
# url = ''
# tree = fromstring(requests.get(url).content)
h = '''
<html>
 <head></head>
 <body>
  <span class="a-list-item">test1</span> 
  <span class="a-list-item">test2</span> 
  <span class="a-list-item">test2</span>
 </body>
</html>
'''
tree = fromstring(h)
items = [item.text for item in tree.xpath("//span[@class='a-list-item' and not(contains(text(), 'test1'))]")]
print(items)

Or select the nodes with a CSS selector (based on tag and class) and test the text of each one:

from bs4 import BeautifulSoup as bs

h = '''
<html>
 <head></head>
 <body>
  <span class="a-list-item">test1</span> 
  <span class="a-list-item">test2</span> 
  <span class="a-list-item">test2</span>
 </body>
</html>
'''
soup = bs(h, 'lxml')
items = [item.text for item in soup.select('span.a-list-item') if 'test1' not in item.text]
print(items)
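
The same "test each node's text" idea can also be sketched with only the standard library, for cases where neither lxml nor Beautiful Soup is installed. This is a minimal illustration, not part of the original answer: the class name SpanTextCollector is hypothetical, and the sketch assumes the matching spans are not nested inside one another.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Hypothetical helper: collect the text of <span> tags with a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.inside = False  # True while we are inside a matching span
        self.buffer = []     # text chunks of the current span
        self.texts = []      # finished span texts

    def handle_starttag(self, tag, attrs):
        # The class attribute may hold several space-separated names.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and self.wanted_class in classes:
            self.inside = True
            self.buffer = []

    def handle_data(self, data):
        if self.inside:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        # Assumes matching spans are not nested.
        if tag == "span" and self.inside:
            self.texts.append("".join(self.buffer))
            self.inside = False


h = '''
<span class="a-list-item">test1</span>
<span class="a-list-item">test2</span>
<span class="a-list-item">test2</span>
'''
parser = SpanTextCollector("a-list-item")
parser.feed(h)

# Keep only the spans whose text does not contain "test1".
items = [text for text in parser.texts if "test1" not in text]
print(items)  # ['test2', 'test2']
```

Like the list comprehension above, this is a substring test, so a span containing test10 would be excluded as well.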

Answer 2: (score: 0)

Use the regular expression module re to find the specific text.

from bs4 import BeautifulSoup
import re
html = '''
<span class="a-list-item">test1</span> 
<span class="a-list-item">test2</span> 
<span class="a-list-item">test2</span>
'''
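soup = BeautifulSoup(html, 'html.parser')
# string= is the current name of this argument (Beautiful Soup 4.4+); text= is the older spelling.
items = soup.find_all('span', string=re.compile("test2"))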
for item in items:
    print(item.text)

Output:

test2
test2
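
One caveat about the pattern used in this answer: Beautiful Soup applies a compiled pattern with its search() method, so re.compile("test2") matches any string that contains test2, not only strings equal to it. The sketch below uses plain strings as stand-ins for span texts; the value test20 is a hypothetical addition to show the over-matching case.

```python
import re

# Hypothetical span texts; "test20" is not in the question's markup.
texts = ["test1", "test2", "test20"]

loose = re.compile("test2")     # matches anywhere inside the string
exact = re.compile(r"^test2$")  # matches only when the whole string is "test2"

loose_hits = [t for t in texts if loose.search(t)]
exact_hits = [t for t in texts if exact.search(t)]

print(loose_hits)  # ['test2', 'test20']
print(exact_hits)  # ['test2']
```

If only exact matches are wanted, the anchored pattern can be passed to find_all in place of the unanchored one.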