Question

I am trying to scrape a webpage to collect image names and their respective asset URLs, and write them to a CSV in two separate columns. I have not been able to separate the attrs out of the tags.

In BS4, I am able to run:

soup.find_all('a')

It successfully returns the below html (multiplied by the photo count on the page)

<a aria-label="SomeImageName" data-asset-id="10101010101" 
href="SomeWebsite">
<img alt="SomeImageName" 
src="https://SomeImageUrl"/>
</a>

I have tried running the following (and many other variations)

soup.find_all('a', attrs{"aria-label", "src"})

and they return

[]

Anyone know how to extract this data from the tag and write to a CSV?

Cheers!

Answer 1

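Welcome to StackOverflow! The two values you need live in two different elements: aria-label is on the a tag and src is on the img tag. Fortunately, the img is nested inside the a, so iterating is simple.

Store the names and links in a list of dictionaries; DictWriter() then makes it easy to write them to a CSV file.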

import csv

# One dict per link: name from <a aria-label>, URL from the nested <img src>.
img_data = []
for a_tag in soup.find_all('a'):
    data_dict = dict()
    data_dict['image_name'] = a_tag['aria-label']
    data_dict['url'] = a_tag.img['src']
    img_data.append(data_dict)

# newline='' keeps the csv module from writing blank rows on Windows.
with open('urls.csv', 'w', newline='') as csvfile:
    fieldnames = ['image_name', 'url']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for data in img_data:
        writer.writerow(data)
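
One caveat with the loop above: a_tag['aria-label'] raises KeyError on any link without that attribute, and a_tag.img is None for links that wrap no image. The following is a minimal sketch of a more defensive loop, not a definitive fix. The sample markup is copied from the question, and the extra "About" link is a hypothetical addition to show the skip.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question, plus one hypothetical plain link.
html = """
<a aria-label="SomeImageName" data-asset-id="10101010101" href="SomeWebsite">
<img alt="SomeImageName" src="https://SomeImageUrl"/>
</a>
<a href="/about">About</a>
"""

soup = BeautifulSoup(html, "html.parser")

img_data = []
for a_tag in soup.find_all("a"):
    name = a_tag.get("aria-label")  # None instead of KeyError when absent
    img = a_tag.find("img")         # None when the link wraps no image
    if name is None or img is None or not img.get("src"):
        continue                    # skip links that lack either value
    img_data.append({"image_name": name, "url": img["src"]})

print(img_data)
```

The resulting img_data has the same shape as in the answer, so the DictWriter part can stay unchanged.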

Hope this helps! Cheers!

Answer 2

Try the code below. It extracts the value of the src attribute from the <img> tag nested inside each <a> tag that has an aria-label attribute, and writes those links to a CSV file.

## To get the value of src attribute in the <img> tag
tags = soup.find_all('a')
src = []
for tag in tags:
    if tag.has_attr('aria-label'):
        src.append(tag.img['src'])

## writing to a csv file
with open('csvfile.csv', 'w') as file:
    for line in src:
        file.write(line)
        file.write('\n')

Alternatively, you can use the csv module to write the data.
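
As a rough sketch of the csv-module variant, assuming the URLs have already been collected into a list like src: the helper write_urls and the sample URLs are hypothetical, and an in-memory buffer stands in for the real file so the example runs anywhere.

```python
import csv
import io

# Hypothetical stand-in for the `src` list collected in this answer.
src = ["https://SomeImageUrl", "https://AnotherImageUrl"]


def write_urls(fileobj, urls):
    """Write a header row, then one URL per row, to an open text file."""
    writer = csv.writer(fileobj)
    writer.writerow(["url"])
    for url in urls:
        writer.writerow([url])


# In-memory buffer for demonstration; for a real file, something like
# open('csvfile.csv', 'w', newline='') would be passed instead.
buffer = io.StringIO()
write_urls(buffer, src)
print(buffer.getvalue())
```

Compared with writing the lines by hand, csv.writer also takes care of quoting if a value ever contains a comma or a quote character.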

Answer 3

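Thanks everyone for the input! I still can't pull out aria-label, and I read on other forums that this is a BS4 issue when parsing HTML.

However, I was able to work around it quite easily using @SmashGuy's solution, taking the alt text description as an equivalent of aria-label.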

img_data = []
for img_tag in soup.find_all('img'):
    data_dict = dict()
    data_dict['image_name'] = img_tag['alt']
    data_dict['image_url'] = img_tag['src']
    img_data.append(data_dict)

And to write to CSV:

import csv

with open('BCDS1.csv', 'w', newline='') as birddata:
    fieldnames = ['image_name', 'image_url']
    writer = csv.DictWriter(birddata, fieldnames=fieldnames)
    writer.writeheader()
    for data in img_data:
        writer.writerow(data)
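
For what it's worth, on the sample markup from the question the built-in html.parser does seem to expose aria-label, so the cause may lie in the real page (for instance anchors without the attribute, or markup added by JavaScript after load) rather than in BS4 itself. A minimal sketch, under that assumption, that starts from each img as above, prefers the enclosing link's aria-label, and falls back to alt. The second, standalone image in the sample is hypothetical.

```python
from bs4 import BeautifulSoup

# First block is the question's markup; the standalone <img> is hypothetical.
html = """
<a aria-label="SomeImageName" data-asset-id="10101010101" href="SomeWebsite">
<img alt="SomeImageName" src="https://SomeImageUrl"/>
</a>
<img alt="StandaloneImage" src="https://OtherImageUrl"/>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for img_tag in soup.find_all("img"):
    link = img_tag.find_parent("a")  # enclosing <a>, or None if there is none
    label = link.get("aria-label") if link is not None else None
    rows.append({
        "image_name": label or img_tag.get("alt", ""),  # fall back to alt text
        "image_url": img_tag.get("src", ""),
    })

print(rows)
```

The keys match the fieldnames used in the DictWriter code of this answer, so rows can be written out the same way as img_data.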

Thanks again everyone for the help! Cheers!

Answer 4

For images you need to find the <img> tag; <a> is the tag for links.

<a aria-label="SomeImageName" data-asset-id="10101010101" href="SomeWebsite">
    <img alt="SomeImageName" src="https://SomeImageUrl"/>
</a>

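You found the image only because, as you can see, the link tag wraps the image tag.

But that is not how dictionary syntax works: use attrs={} with a : between each key and its value (see https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-keyword-arguments)

So it is soup.find_all('a', attrs={'css': 'value'}), where 'css' and 'value' are placeholders for an attribute name and its value, rather than soup.find_all('a', attrs{"aria-label" "SomeImageName"})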
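
To make the attrs syntax concrete, here is a small sketch run against the markup from the question (the extra "About" link is a hypothetical addition). It assumes the built-in html.parser; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Markup from the question, plus one hypothetical link without aria-label.
html = """
<a aria-label="SomeImageName" data-asset-id="10101010101" href="SomeWebsite">
<img alt="SomeImageName" src="https://SomeImageUrl"/>
</a>
<a href="/about">About</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Exact match: key and value separated by a colon inside attrs={...}.
exact = soup.find_all("a", attrs={"aria-label": "SomeImageName"})

# True matches any <a> that has the attribute, whatever its value.
labelled = soup.find_all("a", attrs={"aria-label": True})

# src belongs to <img>, not <a>, so filtering <a> tags on src finds nothing.
no_match = soup.find_all("a", attrs={"src": True})

print(len(exact), len(labelled), len(no_match))
```

The attrs dictionary is needed here because aria-label contains a hyphen and so cannot be passed as a Python keyword argument.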
