This is the output of my code:
<h1 class="it-ttl" id="itemTitle" itemprop="name"><span class="g-hdn">Details about </span>item name goes here</h1>
I want to get only the item name, without the "Details about " part.
My Python code selects the element by its id:
for content in soup.select('#itemTitle'):
    print(content.text)
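Answer 0 (score: 3)
You can use decompose(), clear(), or extract(). According to the documentation:
Tag.decompose() removes a tag from the tree, then completely destroys it and its contents.
Tag.clear() removes the contents of a tag.
PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted.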
from bs4 import BeautifulSoup

html = '''<h1 class="it-ttl" id="itemTitle" itemprop="name"><span class="g-hdn">Details about </span>item name goes here</h1>'''
soup = BeautifulSoup(html, 'lxml')

for content in soup.select('#itemTitle'):
    # Remove the hidden <span> so that only the item name remains
    content.span.decompose()
    print(content.text)
Output:
item name goes here
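To make the difference between the three methods concrete, here is a minimal sketch that applies each one to a fresh copy of the same heading. The helper fresh_h1() and the *_result names are made up for this illustration, and 'html.parser' is used only so that the snippet does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup

HTML = '<h1 id="itemTitle"><span class="g-hdn">Details about </span>item name goes here</h1>'

def fresh_h1():
    # Hypothetical helper: parse a new copy so each method starts from the same tree
    return BeautifulSoup(HTML, 'html.parser').h1

# extract(): detaches the <span> from the tree and hands it back to the caller
h1 = fresh_h1()
removed = h1.span.extract()
extract_result = (removed.get_text(), h1.get_text())

# clear(): empties the <span> but leaves the (now empty) tag in the tree
h1 = fresh_h1()
h1.span.clear()
clear_result = (str(h1.span), h1.get_text())

# decompose(): removes the <span> and destroys it, so h1.span no longer exists
h1 = fresh_h1()
h1.span.decompose()
decompose_result = (h1.span, h1.get_text())

print(extract_result)
print(clear_result)
print(decompose_result)
```

All three leave the heading text as just the item name. extract() is the one to reach for if the removed span is still needed afterwards.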
Answer 1 (score: 2)
My answer is inspired by the accepted answer.
Code:
from bs4 import BeautifulSoup, NavigableString
data = '''
<h1 class="it-ttl" id="itemTitle" itemprop="name"><span class="g-hdn">Details about </span>item name goes here</h1>
'''
soup = BeautifulSoup(data, 'html.parser')
inner_text = [element for element in soup.h1 if isinstance(element, NavigableString)]
print(inner_text)
Output:
['item name goes here']
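A possible variation on the same idea, sketched here rather than taken from the answer: find_all(string=True, recursive=False) asks Beautiful Soup for only the strings that are direct children of the heading, which gives the same filtering without an explicit isinstance check and without modifying the tree. The variable names direct_strings and item_name are made up for this example.

```python
from bs4 import BeautifulSoup

data = '<h1 class="it-ttl" id="itemTitle" itemprop="name"><span class="g-hdn">Details about </span>item name goes here</h1>'
soup = BeautifulSoup(data, 'html.parser')
h1 = soup.select_one('#itemTitle')

# string=True matches text nodes only; recursive=False limits the search to
# direct children, so the text inside the <span> is skipped
direct_strings = h1.find_all(string=True, recursive=False)

# Join the pieces in case the heading has text on both sides of a child tag
item_name = ''.join(direct_strings).strip()
print(item_name)
```

Because nothing is removed, the span is still available in the tree afterwards if another part of the scraper needs it.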
Answer 2 (score: 1)
How about this:
from bs4 import BeautifulSoup
html= """<h1 class="it-ttl" id="itemTitle" itemprop="name"><span class="g-hdn">Details about </span>item name goes here</h1>"""
soup = BeautifulSoup(html, "lxml")
text = soup.find('h1', attrs={"id":"itemTitle"}).text
span = soup.find('span', attrs={"class":"g-hdn"}).text
final_text = text[len(span):]
print(final_text)
This results in:
item name goes here
Answer 3 (score: 0)
See if this works:
from bs4 import BeautifulSoup
soup = BeautifulSoup("""<h1 class="it-ttl" id="itemTitle" itemprop="name">
<span class="g-hdn">Details about </span>
item name goes here</h1>""", 'html.parser')
print(soup.find('h1', {'class': 'it-ttl'}).contents[-1].strip())