Question

I am using the following code to retrieve all the image links on a web page:

from bs4 import BeautifulSoup
import re
import requests

def get_txt(soup, key):
    # Find the <span> whose text matches key, then read the second <span> in its parent.
    key_tag = soup.find('span', text=re.compile(key)).parent
    return key_tag.find_all('span')[1].text

urldes = "https://www.johnpyeauctions.co.uk/lot_list.asp?saleid=4709&siteid=1"

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
r = requests.get(urldes, headers=headers)
soup = BeautifulSoup(r.content, "lxml")

# Each popover link stores its image URL in the data-img attribute.
image_links = [x['data-img'] for x in soup.find_all('a', rel='popover')]
for link in image_links:
    print(link)

I would like to apply the same principle to retrieve the text description that goes with each image:

soup.find_all(width='41%')
for text in soup.find_all('h5'):
    print(text)

This code retrieves every <h5> tag on the page, not just the ones inside the parent element with width='41%'.

I tried applying the same loop that I used for the image links above:

image_text = [x['h5'] for x in soup.find_all(width='41%')]
for text in image_text:
    print(text)

But I get the following error:

`Traceback (most recent call last):
  File "C:\Users\alexa\Desktop\jpye_v2.py", line 41, in <module>
    image_text = [x['h5'] for x in soup.find_all(width='41%')]
  File "C:\Users\alexa\Desktop\jpye_v2.py", line 41, in <listcomp>
    image_text = [x['h5'] for x in soup.find_all(width='41%')]
  File "C:\Python36\lib\site-packages\beautifulsoup4-4.6.0-py3.6.egg\bs4\element.py", line 1011, in __getitem__
    return self.attrs[key]
KeyError: 'h5'`

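I don't understand why the h5 tag raises an error when the a tag does not. Why can't I use the same loop to index the text in the same way as the image links?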

Answer 1

width='41%' is an attribute, not a tag. This will get you closer to what you want:

for text in soup.find_all('td', {'width': '41%'}):
    print(text)

Answer 2

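First, writing the line soup.find_all(width='41%') on its own does nothing. The find_all() method returns a list of all matching tags, so you have to store that list in a variable and then iterate over it.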

As for your second snippet, tag['attribute'] is used to get the value of that attribute of the tag. So x['h5'] raises a KeyError, because h5 is not an attribute but a tag.
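To make the difference concrete, here is a minimal sketch. The HTML is a made-up stand-in for the real page (the actual markup is an assumption), and it uses the built-in "html.parser" so it runs without lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the cells described in the question.
html = """
<table><tr>
  <td width="41%"><h5>Lot 1: Cordless drill</h5></td>
  <td width="20%"><a rel="popover" data-img="img/lot1.jpg">view</a></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

link = soup.find("a", rel="popover")
cell = soup.find("td", width="41%")

# Square brackets look up an attribute of the tag itself.
img = link["data-img"]
cell_width = cell["width"]

# h5 is a child tag, not an attribute, so cell["h5"] raises KeyError.
try:
    cell["h5"]
    raised = False
except KeyError:
    raised = True

# Child tags are reached with find() (or find_all()).
caption = cell.find("h5").text

print(img, cell_width, raised, caption)
```

That is why x['data-img'] works on the a tags: data-img really is one of their attributes.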

Finally, to get the text you want, you can use this:

for tag in soup.find_all('td', width='41%'):
    image_text = tag.find('h5').text
    print(image_text)

Or, to show how the find_all() method works, you can look at this:

tags = soup.find_all('td', width='41%')
for tag in tags:
    image_text = tag.find('h5').text
    print(image_text)
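One caveat worth sketching: tag.find('h5') returns None when a matching cell has no h5 child, and calling .text on None raises AttributeError. The version below skips such cells. The HTML is invented for illustration, and whether the real page contains cells like that is an assumption:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second matching cell has no <h5>.
html = """
<table>
  <tr><td width="41%"><h5> Lot 1: Cordless drill </h5></td></tr>
  <tr><td width="41%">No heading here</td></tr>
  <tr><td width="41%"><h5>Lot 3: Garden hose</h5></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

captions = []
for cell in soup.find_all("td", width="41%"):
    heading = cell.find("h5")
    if heading is None:
        # No <h5> in this cell, so there is nothing to read.
        continue
    # get_text(strip=True) also trims surrounding whitespace.
    captions.append(heading.get_text(strip=True))

print(captions)
```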
