This is my first time using BeautifulSoup, and I can't figure out how to extract the text from specific nodes. Here is my code.
HTML:
...
<p class="dsm">...</p>
<ul class="also">
<li>once as the adjective <i class="ab">abdrea</i> (<span class="at">groups</span>)</li>
<li>twice as the noun <i class="ab">shokdia</i> (<span class="at">techs</span>)</li>
</ul>
...
Python:
from urllib.request import urlopen  # Python 3
from bs4 import BeautifulSoup

current_page = urlopen(url)
current_soup = BeautifulSoup(current_page, 'html.parser')
derivative_list = current_soup.select('p.dsm + ul.also li')
for li in derivative_list:
    print(li)
Output:
<li>once as the adjective <i class="ab">abdrea</i> (<span class="at">groups</span>)</li>
<li>twice as the noun <i class="ab">shokdia</i> (<span class="at">techs</span>)</li>
It prints the correct list items, but what I want is the text of i.ab and span.at, like this:
Desired output:
abdrea, groups
shokdia, techs
Answer 0 (score: 1)
After getting the list of all <li> tags, simply iterate over it and read the text of the <i class="ab"> and <span class="at"> tags in each one.
for li in soup.select('p.dsm + ul.also li'):
    print(li.i.text, li.span.text)
# abdrea groups
# shokdia techs
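For reference, here is a minimal self-contained sketch of the loop above that runs without fetching a page. It assumes Python 3 and the bs4 package; the html string is copied from the question, and the comma-separated format is chosen only to match the desired output.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (the "..." in <p> is left as-is).
html = """
<p class="dsm">...</p>
<ul class="also">
<li>once as the adjective <i class="ab">abdrea</i> (<span class="at">groups</span>)</li>
<li>twice as the noun <i class="ab">shokdia</i> (<span class="at">techs</span>)</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

lines = []
for li in soup.select("p.dsm + ul.also li"):
    # li.i and li.span return the first <i> and <span> inside this <li>.
    lines.append(f"{li.i.text}, {li.span.text}")

print("\n".join(lines))
# abdrea, groups
# shokdia, techs
```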
If the <li> tags contain other <i> and <span> tags, you can use find() on the li variable.
for li in soup.select('p.dsm + ul.also li'):
    print(li.find('i', class_='ab').text, li.find('span', class_='at').text)
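One thing to watch with find(): it returns None when nothing matches, so calling .text on the result raises AttributeError for any <li> that lacks one of the tags. A hedged sketch that skips such items follows; the second and third <li> elements are hypothetical additions for illustration and are not part of the question's page.

```python
from bs4 import BeautifulSoup

# The first <li> is from the question; the other two are hypothetical
# test cases (one with no matching tags, one with an extra plain <i>).
html = """
<ul class="also">
<li>once as the adjective <i class="ab">abdrea</i> (<span class="at">groups</span>)</li>
<li>a list item with no matching tags</li>
<li>twice as the noun <i>extra</i> <i class="ab">shokdia</i> (<span class="at">techs</span>)</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

pairs = []
for li in soup.select("ul.also li"):
    word = li.find("i", class_="ab")
    gloss = li.find("span", class_="at")
    if word is None or gloss is None:
        # Skip items that do not have both tags instead of crashing.
        continue
    pairs.append((word.get_text(strip=True), gloss.get_text(strip=True)))

print(pairs)
# [('abdrea', 'groups'), ('shokdia', 'techs')]
```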
Answer 1 (score: 0)
The exact answer you are looking for:
data = """<ul class="also">
<li>once as the adjective <i class="ab">abdrea</i> (<span class="at">groups</span>)</li>
<li>twice as the noun <i class="ab">shokdia</i> (<span class="at">techs</span>)</li>
</ul>"""
from bs4 import BeautifulSoup
page_soup = BeautifulSoup(data, "html.parser")
# Pair each <i> text with the <span> text at the same position on the page.
i_texts = [x.text for x in page_soup.find_all("i")]
span_texts = [y.text for y in page_soup.find_all("span")]
for pair in zip(i_texts, span_texts):
    print(pair)
# ('abdrea', 'groups')
# ('shokdia', 'techs')
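Note that the zip call above pairs texts by their position on the whole page, so the pairs can drift out of step if some <li> lacks one of the tags or if the page has other <i> or <span> elements. A sketch that builds the pairs per <li> instead, and then transposes them with zip(*rows) in case separate columns are wanted, might look like this (rows, i_column, and span_column are illustrative names, and Python 3 is assumed):

```python
from bs4 import BeautifulSoup

data = """<ul class="also">
<li>once as the adjective <i class="ab">abdrea</i> (<span class="at">groups</span>)</li>
<li>twice as the noun <i class="ab">shokdia</i> (<span class="at">techs</span>)</li>
</ul>"""

page_soup = BeautifulSoup(data, "html.parser")

# One (i text, span text) tuple per list item.
rows = [(li.i.text, li.span.text) for li in page_soup.select("ul.also li")]
print(rows)
# [('abdrea', 'groups'), ('shokdia', 'techs')]

# Transpose rows into columns. This unpacking fails if rows is empty,
# so check for that first when the page may have no matches.
i_column, span_column = zip(*rows)
print(i_column)
# ('abdrea', 'shokdia')
print(span_column)
# ('groups', 'techs')
```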