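How to get the text of selected nodes using BeautifulSoup

Date: 2018-05-23 07:16:37

Tags: python python-3.x beautifulsoup

This is my first time using BeautifulSoup, and I can't figure out how to extract the text from a specific node. Here is my code.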

HTML:

...
<p class="dsm">...</p>
<ul class="also">
    <li>once as the adjective <i class="ab">abdrea</i> (<span class="at">groups</span>)</li>
    <li>twice as the noun <i class="ab">shokdia</i> (<span class="at">techs</span>)</li>
</ul>
...

Python:

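from urllib.request import urlopen
from bs4 import BeautifulSoup

current_page = urlopen(url)
current_soup = BeautifulSoup(current_page, 'html.parser')
# Select every <li> inside the ul.also that immediately follows p.dsm
derivative_list = current_soup.select('p.dsm + ul.also li')
for li in derivative_list:
    print(li)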

Output:

<li>once as the adjective <i class="ab">abdrea</i> (<span class="at">groups</span>)</li>
<li>twice as the noun <i class="ab">shokdia</i> (<span class="at">techs</span>)</li>

This prints the correct list items, but what I actually want is the text of i.ab and span.at in each item, like this:

Desired output:

abdrea, groups
shokdia, techs

2 answers:

Answer 0: (score: 1)

Once you have the list of all <li> tags, simply iterate over it and read the text of the <i class="ab"> and <span class="at"> tags inside each one.

for li in soup.select('p.dsm + ul.also li'):
    print(li.i.text, li.span.text)
# abdrea groups
# shokdia techs
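The loop above prints the two values separated by a space, while the question asks for a comma. A minimal self-contained sketch of that variant follows; the inline `html` string is a copy of the question's markup (an assumption, used so the snippet runs without fetching a page), and the `lines` name is illustrative only.

```python
from bs4 import BeautifulSoup

# Inline copy of the question's markup so the snippet runs without a network call.
html = """
<p class="dsm">...</p>
<ul class="also">
    <li>once as the adjective <i class="ab">abdrea</i> (<span class="at">groups</span>)</li>
    <li>twice as the noun <i class="ab">shokdia</i> (<span class="at">techs</span>)</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Build one "word, group" line per list item.
lines = [
    f"{li.i.get_text()}, {li.span.get_text()}"
    for li in soup.select("p.dsm + ul.also li")
]
print("\n".join(lines))
```

Here `li.i` and `li.span` are shorthand for the first <i> and first <span> found inside each <li>, which is enough for this markup.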

If the <li> tag contains other <i> or <span> tags, you can call find() on the li variable and filter by class:

for li in soup.select('p.dsm + ul.also li'):
    print(li.find('i', class_='ab').text, li.find('span', class_='at').text)
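To see why the class filter matters, here is a small sketch using hypothetical markup (the `note` and `hint` classes are invented for illustration and do not appear in the question). With an extra <i> placed first, `li.i` returns that extra tag, while `find()` with `class_` still returns the intended one.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an extra <i> and <span> come before the ones we want.
html = """
<ul class="also">
    <li><i class="note">rare</i> <span class="hint">see</span>
        <i class="ab">abdrea</i> (<span class="at">groups</span>)</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
li = soup.select_one("ul.also li")

# Attribute access returns the first <i>, whatever its class.
first_i = li.i.text
# find() with class_ returns the first tag that also carries that class.
wanted_i = li.find("i", class_="ab").text
wanted_span = li.find("span", class_="at").text

print(first_i, wanted_i, wanted_span)
```

Note that `find()` returns `None` when nothing matches, so calling `.text` on the result raises `AttributeError` for list items that lack the tag.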

Answer 1: (score: 0)

This gives exactly the output you are looking for:

data = """<ul class="also">
    <li>once as the adjective <i class="ab">abdrea</i> (<span class="at">groups</span>)</li>
    <li>twice as the noun <i class="ab">shokdia</i> (<span class="at">techs</span>)</li>
</ul>"""

from bs4 import BeautifulSoup
page_soup = BeautifulSoup(data, "html.parser")
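i_texts = [x.text for x in page_soup.find_all("i")]
span_texts = [y.text for y in page_soup.find_all("span")]

# zip() pairs the n-th <i> text with the n-th <span> text.
# Iterating over the pairs works for any number of list items,
# whereas unpacking zip() into two names only works for exactly two.
for pair in zip(i_texts, span_texts):
    print(pair)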

Output:

('abdrea', 'groups')
('shokdia', 'techs')
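One caveat with zipping two document-wide lists is that the pairing is purely positional: if a list item lacks one of the two tags, every later pair shifts. A sketch of a per-item alternative follows; the `extract_pairs` helper and the `sample` markup (including the item with no span) are hypothetical and only illustrate the idea.

```python
from bs4 import BeautifulSoup


def extract_pairs(html):
    """Return (i.ab text, span.at text) tuples, one per <li> that has both tags."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for li in soup.select("ul.also li"):
        i_tag = li.select_one("i.ab")
        span_tag = li.select_one("span.at")
        # Skip items that are missing either tag instead of misaligning the pairs.
        if i_tag is None or span_tag is None:
            continue
        pairs.append((i_tag.get_text(strip=True), span_tag.get_text(strip=True)))
    return pairs


# Hypothetical sample: the middle item has no span.at.
sample = """<ul class="also">
    <li>once as the adjective <i class="ab">abdrea</i> (<span class="at">groups</span>)</li>
    <li>an item with no span <i class="ab">orphan</i></li>
    <li>twice as the noun <i class="ab">shokdia</i> (<span class="at">techs</span>)</li>
</ul>"""

for word, group in extract_pairs(sample):
    print(f"{word}, {group}")
```

Looking up both tags inside the same <li> keeps each word with its own group, at the cost of silently dropping incomplete items; whether to skip or report them is a design choice left to the caller.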