Beautiful Soup | How to extract attributes from <a> tags

Time: 2018-09-21 04:25:44

Tags: python web-scraping beautifulsoup

I am trying to scrape a webpage to collect image names and their respective asset URLs, and write them to a CSV in two separate columns. I have not been able to separate the attrs out of the tags.

In BS4, I am able to run:

soup.find_all('a')

It successfully returns the HTML below (repeated once per photo on the page):

<a aria-label="SomeImageName" data-asset-id="10101010101" 
href="SomeWebsite">
<img alt="SomeImageName" 
src="https://SomeImageUrl"/>
</a>

I have tried running the following (and many other variations)

soup.find_all('a', attrs{"aria-label", "src"})

and they return

[]

Anyone know how to extract this data from the tag and write to a CSV?

Cheers!

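4 Answers:

Answer 0: (score: 1)

Welcome to StackOverflow! What you need lives in two different elements: aria-label in the a tag and src in the img tag. Fortunately, the img tag is nested inside the a tag, so iterating is straightforward.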

Store the names and links in a list of dictionaries; csv.DictWriter() then makes it easy to write them to a CSV file.

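import csv

# Assumes `soup` is the BeautifulSoup object for the page.
img_data = []
for a_tag in soup.find_all('a'):
    data_dict = dict()
    data_dict['image_name'] = a_tag['aria-label']  # name stored on the <a> tag
    data_dict['url'] = a_tag.img['src']            # URL stored on the nested <img>
    img_data.append(data_dict)

# newline='' stops the csv module from writing blank rows on Windows
with open('urls.csv', 'w', newline='') as csvfile:
    fieldnames = ['image_name', 'url']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for data in img_data:
        writer.writerow(data)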
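
The loop above raises an error if any <a> on the page lacks aria-label or a nested <img>. A minimal, self-contained sketch of a more defensive variant follows. The sample markup, the helper names extract_rows and write_rows, and the in-memory buffer are assumptions made for illustration, not part of the original answer.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical sample markup, modeled on the snippet in the question.
SAMPLE_HTML = """
<a aria-label="SomeImageName" data-asset-id="10101010101" href="SomeWebsite">
  <img alt="SomeImageName" src="https://SomeImageUrl"/>
</a>
<a href="NoImageHere">plain link</a>
"""


def extract_rows(html):
    """Return one dict per <a> that has an aria-label and a nested <img src>."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for a_tag in soup.find_all("a"):
        name = a_tag.get("aria-label")  # None when the attribute is missing
        img = a_tag.find("img")         # None when there is no nested <img>
        if name is None or img is None or not img.get("src"):
            continue                    # skip anchors that are not image links
        rows.append({"image_name": name, "url": img["src"]})
    return rows


def write_rows(rows, fileobj):
    """Write the rows as two CSV columns with a header line."""
    writer = csv.DictWriter(fileobj, fieldnames=["image_name", "url"])
    writer.writeheader()
    writer.writerows(rows)


rows = extract_rows(SAMPLE_HTML)
buffer = io.StringIO()  # stands in for open('urls.csv', 'w', newline='')
write_rows(rows, buffer)
print(buffer.getvalue())
```

To write a real file, pass an object opened with open('urls.csv', 'w', newline='') in place of the buffer.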

Hope this helps! Cheers!

Answer 1: (score: 0)

Try the code below. It extracts the value of the src attribute from the <img> tag inside each <a> tag that has an aria-label attribute, and writes those links to a CSV file.

# Get the value of the src attribute in the <img> tag
tags = soup.find_all('a')
src = []
for tag in tags:
    if tag.has_attr('aria-label'):
        src.append(tag.img['src'])

# Write the links to a CSV file
with open('csvfile.csv', 'w') as file:
    for line in src:
        file.write(line)
        file.write('\n')

Alternatively, you can use the csv module to write the data.
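
The answer does not show the csv-module variant, so here is one way it might look. This is a sketch only: the write_links helper, the "url" header, and the sample list are assumptions standing in for the src list built by the loop.

```python
import csv


def write_links(links, path):
    """Write one link per row to a CSV file, preceded by a header row."""
    # newline='' lets the csv module control line endings itself
    with open(path, "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["url"])
        writer.writerows([link] for link in links)


# Hypothetical stand-in for the src list collected from the page
src = ["https://SomeImageUrl", "https://AnotherImageUrl"]
write_links(src, "csvfile.csv")
```

Compared with calling file.write() by hand, csv.writer also quotes any value that happens to contain a comma or a quote character.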

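Answer 2: (score: 0)

Thanks everyone for the input! I still could not pull out aria-label, and I read on other forums that this is a BS4 issue when parsing HTML.

However, I was able to work around it quite easily with @SmashGuy's solution, using the alt text description instead of the aria-label.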

img_data = []
for img_tag in soup.find_all('img'):
    data_dict = dict()
    data_dict['image_name'] = img_tag['alt']
    data_dict['image_url'] = img_tag['src']
    img_data.append(data_dict)

And writing to the CSV...

with open('BCDS1.csv', 'w', newline='') as birddata:
    fieldnames = ['image_name', 'image_url']
    writer = csv.DictWriter(birddata, fieldnames=fieldnames)
    writer.writeheader()
    for data in img_data:
        writer.writerow(data)
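
For anyone who wants to check whether alt and aria-label really agree on their own page, here is a rough sketch. The sample markup and the comparison list are assumptions for illustration. With markup like the snippet in the question, html.parser does expose aria-label through .get(), so a KeyError from tag['aria-label'] may simply mean that one particular <a> on the page lacks the attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modeled on the snippet in the question.
SAMPLE_HTML = """
<a aria-label="SomeImageName" data-asset-id="10101010101" href="SomeWebsite">
  <img alt="SomeImageName" src="https://SomeImageUrl"/>
</a>
<img alt="Standalone" src="https://StandaloneUrl"/>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

comparison = []
for img_tag in soup.find_all("img"):
    link = img_tag.find_parent("a")  # None when the image is not inside a link
    label = link.get("aria-label") if link is not None else None
    comparison.append({
        "alt": img_tag.get("alt"),
        "aria_label": label,
        "match": label == img_tag.get("alt"),
    })

for row in comparison:
    print(row)
```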

Thanks again for everyone's help! Cheers!

Answer 3: (score: -1)

For the image you need to find the <img> tag; <a> is the tag for links.

<a aria-label="SomeImageName" data-asset-id="10101010101" href="SomeWebsite">
    <img alt="SomeImageName" src="https://SomeImageUrl"/>
</a>

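You found the image because, as you can see, the link tag wraps the image tag.

But that is not how dictionary syntax works: use attrs={} with a : between the key and the value (see https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-keyword-arguments).

So it should be soup.find_all('a', attrs={'css': 'value'}) rather than soup.find_all('a', attrs{"aria-label" "SomeImageName"})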
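
Applied to the attribute from the question, the dictionary form might look like the sketch below. The sample markup and the variable names exact and labelled are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with two labelled links and one unlabelled link.
SAMPLE_HTML = """
<a aria-label="SomeImageName" href="SomeWebsite"><img alt="SomeImageName" src="https://SomeImageUrl"/></a>
<a aria-label="OtherName" href="OtherWebsite"><img alt="OtherName" src="https://OtherImageUrl"/></a>
<a href="NoLabel">plain link</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Exact match on a hyphenated attribute: it has to go through the attrs dict,
# because aria-label is not a valid Python keyword argument name.
exact = soup.find_all("a", attrs={"aria-label": "SomeImageName"})

# True matches any <a> that carries the attribute, whatever its value.
labelled = soup.find_all("a", attrs={"aria-label": True})

print(len(exact), len(labelled))
print([a["aria-label"] for a in labelled])
```

Note that src cannot be matched in the same call on 'a', since it belongs to the nested <img> rather than to the link itself.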