Question

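From the HTML below I want to get the number and the text separately. I can get the number, but getting the text raises the error shown below. (Note: this runs inside a for loop over several links. It works for some of them, but whenever split(b'.')[1] has nothing to return, it raises this error.)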

Error:

Traceback (most recent call last):
  File "C:/Users/Computers Zone/Google Drive/Python/SANDWICHTRY.py", line 49, in <module>
    sandwich=soup.find('h1',{'class':'headline'}).encode_contents().strip().split(b'.')[1].decode("utf-8")
IndexError: list index out of range

HTML code:

<h1 class="headline ">1. Old Oak Tap BLT</h1>

My code:

soup=BeautifulSoup(pages,'lxml').find('div',{'id':'page'})
rank=soup.find('h1',{'class':'headline'}).encode_contents().strip().split(b'.')[0].decode("utf-8")
print (rank)
sandwich=soup.find('h1',{'class':'headline'}).encode_contents().strip().split(b'.')[1].decode("utf-8")
print(sandwich)

Answer 1

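This error occurs when the headline string contains no ".", so the split returns a single-element list and the second element (index 1) does not exist.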
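A quick way to see this without the page itself is to split two plain byte strings. The second value, a headline without a rank, is made up for illustration and is not taken from the question's data:

```python
# Headline contents as bytes, as encode_contents() would return them.
with_dot = b"1. Old Oak Tap BLT"
without_dot = b"Old Oak Tap BLT"  # hypothetical headline with no rank

parts_with = with_dot.split(b".")
parts_without = without_dot.split(b".")
print(parts_with)     # two elements: rank and name
print(parts_without)  # one element, so index 1 does not exist

try:
    parts_without[1]
except IndexError as exc:
    error_message = str(exc)
    print("IndexError:", error_message)
```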

To fix this, fetch the result and split the string, but do not assume there are always two elements:

from bs4 import BeautifulSoup

pages = '<h1 class="headline">1. Old Oak Tap BLT</h1>'

soup = BeautifulSoup(pages, 'lxml')
titles = soup.find('h1', {'class': 'headline'}).encode_contents().split(b'.')

for text in titles:  # go through all existing list elements
    print(text.decode("utf-8").strip())

Or check that the list has exactly two elements before reading them, for example:

if len(titles) == 2:
    rank = titles[0].decode("utf-8").strip()
    sandwich = titles[1].decode("utf-8").strip()
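One caveat with the length check: a name that itself contains a "." splits into three or more pieces and would be skipped. A minimal sketch of an alternative, assuming the headline is available as text (for example from get_text() on the h1 tag), is to split only at the first dot with str.partition. The helper name split_headline and the "St. Louis Special" sample are hypothetical:

```python
def split_headline(text):
    """Split 'N. Name' into (rank, name); rank is None when no leading number is found."""
    head, sep, tail = text.strip().partition(".")  # split at the first dot only
    if sep and head.strip().isdigit():
        return head.strip(), tail.strip()
    return None, text.strip()

# In the scraping loop the text could come from:
#   soup.find('h1', {'class': 'headline'}).get_text()
print(split_headline("1. Old Oak Tap BLT"))
print(split_headline("Old Oak Tap BLT"))       # no rank: no IndexError, rank is None
print(split_headline("2. St. Louis Special"))  # the dot inside the name is kept
```

Returning None for the rank lets the loop decide what to do with unranked headlines instead of stopping on an exception.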
