I want to exclude specific text inside HTML span tags. In the example below, I only want to extract the test2 text from the spans with class a-list-item.
My HTML:
<span class="a-list-item">test1</span>
<span class="a-list-item">test2</span>
<span class="a-list-item">test2</span>
My code:
tag = tag.find_all("span", {"class" : "a-list-item"})
How can I get only the test2 spans? Thanks for your replies.
Answer 0 (score: 2)
It looks like you are using Beautiful Soup. In Beautiful Soup 4.7+, this is easy to do by using select instead of find_all. You can use :contains() wrapped in :not() to exclude spans that contain specific text:
from bs4 import BeautifulSoup
markup = '''
<span class="a-list-item">test1</span>
<span class="a-list-item">test2</span>
<span class="a-list-item">test2</span>
'''
soup = BeautifulSoup(markup, "html.parser")  # name the parser explicitly to avoid the "no parser specified" warning
print(soup.select("span.a-list-item:not(:contains(test1))"))
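As a side note, newer Soup Sieve releases (2.1 and later, as far as I know) spell this pseudo-class :-soup-contains() and treat the bare :contains() as deprecated. Here is a minimal sketch of the same idea with the newer spelling. The helper name spans_without is made up for illustration and is not part of Beautiful Soup:

```python
from bs4 import BeautifulSoup

markup = '''
<span class="a-list-item">test1</span>
<span class="a-list-item">test2</span>
<span class="a-list-item">test2</span>
'''

def spans_without(markup, unwanted):
    """Return the text of every span.a-list-item that does not contain `unwanted`.

    Hypothetical helper. Assumes `unwanted` has no double quotes in it,
    because it is pasted into the selector string as-is.
    """
    soup = BeautifulSoup(markup, "html.parser")
    selector = f'span.a-list-item:not(:-soup-contains("{unwanted}"))'
    return [span.get_text() for span in soup.select(selector)]

print(spans_without(markup, "test1"))  # ['test2', 'test2']
```

If you are pinned to an older Soup Sieve, the original :contains() spelling is the one to use.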
Answer 1 (score: 0)
You can apply an XPath expression that excludes the spans containing test1:
//span[@class='a-list-item' and not(contains(text(), 'test1'))]
For example:
from lxml.html import fromstring
# To parse a live page instead (requires `import requests`):
# url = ''
# tree = fromstring(requests.get(url).content)
h = '''
<html>
<head></head>
<body>
<span class="a-list-item">test1</span>
<span class="a-list-item">test2</span>
<span class="a-list-item">test2</span>
</body>
</html>
'''
tree = fromstring(h)
items = [item.text for item in tree.xpath("//span[@class='a-list-item' and not(contains(text(), 'test1'))]")]
print(items)
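One thing to watch with the expression above: in XPath 1.0, contains(text(), ...) only looks at the first direct text node of the span, so text sitting inside a child element is not seen. Using . instead compares against the whole string value of the span. A small sketch, assuming made-up markup where test1 is wrapped in a <b> tag:

```python
from lxml.html import fromstring

# Hypothetical markup: the first span keeps its text inside a child <b>.
nested = '''
<html><body>
<span class="a-list-item"><b>test1</b></span>
<span class="a-list-item">test2</span>
</body></html>
'''
tree = fromstring(nested)

# text() finds no direct text node in the first span, so that span is NOT excluded.
by_text_node = tree.xpath(
    "//span[@class='a-list-item' and not(contains(text(), 'test1'))]")

# '.' uses the full string value (including descendants), so that span IS excluded.
by_string_value = tree.xpath(
    "//span[@class='a-list-item' and not(contains(., 'test1'))]")

texts_text_node = [s.text_content() for s in by_text_node]
texts_string_value = [s.text_content() for s in by_string_value]
print(texts_text_node)
print(texts_string_value)
```

For the flat markup in the question both forms give the same result, so this only matters if the real page nests tags inside the spans.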
Or test the text value of each node matched by the CSS selector (tag and class):
from bs4 import BeautifulSoup as bs
h = '''
<html>
<head></head>
<body>
<span class="a-list-item">test1</span>
<span class="a-list-item">test2</span>
<span class="a-list-item">test2</span>
</body>
</html>
'''
soup = bs(h, 'lxml')
items = [item.text for item in soup.select('span.a-list-item') if 'test1' not in item.text]
print(items)
Answer 2 (score: 0)
Use a regular expression (the re module) to find the specific text.
from bs4 import BeautifulSoup
import re
html = '''
<span class="a-list-item">test1</span>
<span class="a-list-item">test2</span>
<span class="a-list-item">test2</span>
'''
soup = BeautifulSoup(html,'html.parser')
items = soup.find_all('span', string=re.compile("test2"))  # `string=` is the current name for the older `text=` argument
for item in items:
    print(item.text)
Output:
test2
test2
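The snippet above matches the text you want to keep. If you would rather state what to exclude, as the question asks, a negative lookahead can probably do it while staying with find_all. This is only a sketch: the pattern and the variable names not_test1 and kept are my own, and it only considers spans whose content is a single string.

```python
import re
from bs4 import BeautifulSoup

html = '''
<span class="a-list-item">test1</span>
<span class="a-list-item">test2</span>
<span class="a-list-item">test2</span>
'''
soup = BeautifulSoup(html, 'html.parser')

# Matches only strings that do NOT contain "test1" anywhere.
not_test1 = re.compile(r'^(?!.*test1)', re.DOTALL)

# Filter on tag name, class and string at the same time.
items = soup.find_all('span', class_='a-list-item', string=not_test1)
kept = [item.text for item in items]
print(kept)  # ['test2', 'test2']
```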