I am trying to scrape a webpage to collect image names and their respective asset URLs, and write them to a CSV in two separate columns. I have not been able to separate the attrs out of the tags.
In BS4, I am able to run:
soup.find_all('a')
It successfully returns the below html (multiplied by the photo count on the page)
<a aria-label="SomeImageName" data-asset-id="10101010101"
href="SomeWebsite">
<img alt="SomeImageName"
src="https://SomeImageUrl"/>
</a>
I have tried running the following (and many other variations):
soup.find_all('a', attrs{"aria-label", "src"})
and they return
[]
Anyone know how to extract this data from the tag and write to a CSV?
Cheers!
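Answer 0 (score: 1)
Welcome to StackOverflow!
What you need lives in two different elements: aria-label is on the a tag and src is on the img tag. Fortunately, the img tag is nested inside the a tag, so iterating is simple.
Store the names and links in a list of dictionaries; DictWriter() then makes it easy to write them to a CSV file.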
import csv

img_data = []
for a_tag in soup.find_all('a'):
    data_dict = dict()
    data_dict['image_name'] = a_tag['aria-label']  # the name is on the <a> tag
    data_dict['url'] = a_tag.img['src']  # the URL is on the nested <img> tag
    img_data.append(data_dict)

# newline='' keeps the csv module from writing blank rows on Windows
with open('urls.csv', 'w', newline='') as csvfile:
    fieldnames = ['image_name', 'url']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for data in img_data:
        writer.writerow(data)
Hope this helps! Cheers!
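For readers who want to try this approach without access to the original page, here is a minimal, self-contained sketch. The HTML is modelled on the sample in the question with made-up names and URLs, and extract_rows and rows_to_csv_text are hypothetical helper names, not part of BeautifulSoup. It writes to an in-memory buffer instead of a file so the resulting CSV text can be inspected directly.

```python
import csv
import io

from bs4 import BeautifulSoup

# Sample markup modelled on the question; names and URLs are placeholders.
SAMPLE_HTML = """
<a aria-label="FirstImage" data-asset-id="1" href="SomeWebsite">
  <img alt="FirstImage" src="https://example.com/first.jpg"/>
</a>
<a aria-label="SecondImage" data-asset-id="2" href="SomeWebsite">
  <img alt="SecondImage" src="https://example.com/second.jpg"/>
</a>
"""


def extract_rows(html):
    """Return one dict per <a> tag: its aria-label and the nested <img> src."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for a_tag in soup.find_all("a"):
        rows.append({
            "image_name": a_tag["aria-label"],
            "url": a_tag.img["src"],
        })
    return rows


def rows_to_csv_text(rows):
    """Serialise the rows to CSV text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=["image_name", "url"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


image_rows = extract_rows(SAMPLE_HTML)
csv_text = rows_to_csv_text(image_rows)
print(csv_text)
```

To write to disk instead, the same DictWriter calls can be pointed at a file opened with open('urls.csv', 'w', newline='').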
Answer 1 (score: 0)
Try the code below. It extracts the value of the src attribute of the <img> tag nested inside each <a> tag that has an aria-label attribute, and writes those links to a CSV file.
## To get the value of the src attribute in the <img> tag
tags = soup.find_all('a')
src = []
for tag in tags:
    if tag.has_attr('aria-label'):
        src.append(tag.img['src'])
## Writing to a CSV file
with open('csvfile.csv', 'w') as file:
    for line in src:
        file.write(line)
        file.write('\n')
Alternatively, you can use the csv module to write the data.
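As a rough illustration of the csv-module alternative, the sketch below writes one URL per row with csv.writer. The URLs are placeholders standing in for the src list, write_urls is a hypothetical helper name, and the "url" header row is an addition that the answer does not mention.

```python
import csv
import io

# Placeholder URLs standing in for the `src` list of scraped links.
src = ["https://example.com/first.jpg", "https://example.com/second.jpg"]


def write_urls(urls, fileobj):
    """Write a header row, then one URL per row, to an open text file object."""
    writer = csv.writer(fileobj)
    writer.writerow(["url"])
    for url in urls:
        writer.writerow([url])


# An in-memory buffer is used here so the output can be printed and checked.
buffer = io.StringIO(newline="")
write_urls(src, buffer)
print(buffer.getvalue())
```

For a real file, pass open('csvfile.csv', 'w', newline='') in place of the buffer. Compared with file.write(), the csv module also takes care of quoting values that contain commas.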
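Answer 2 (score: 0)
Thank you all for your input! I still could not pull out the aria-label, and I read on other forums that this is a BS4 issue when parsing HTML.
However, I was able to work around it quite easily with @SmashGuy's solution, using the alt text description rather than the aria-label.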
img_data = []
for img_tag in soup.find_all('img'):
    data_dict = dict()
    data_dict['image_name'] = img_tag['alt']
    data_dict['image_url'] = img_tag['src']
    img_data.append(data_dict)
And writing to the CSV:
import csv

with open('BCDS1.csv', 'w', newline='') as birddata:
    fieldnames = ['image_name', 'image_url']
    writer = csv.DictWriter(birddata, fieldnames=fieldnames)
    writer.writeheader()
    for data in img_data:
        writer.writerow(data)
Thanks again, everyone, for the help! Cheers!
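The thread does not show why aria-label could not be read. One possible cause, and this is only an assumption, is that some <a> tags on the page carry no aria-label or no nested <img>, in which case a_tag['aria-label'] raises KeyError and a_tag.img is None. The sketch below uses Tag.get() and skips incomplete tags; the sample HTML and the helper name safe_rows are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup: one plain link, one complete entry, one label without an image.
MIXED_HTML = """
<a href="/about">About</a>
<a aria-label="BirdOne" href="SomeWebsite">
  <img alt="BirdOne" src="https://example.com/bird1.jpg"/>
</a>
<a aria-label="NoImage" href="SomeWebsite">text only</a>
"""


def safe_rows(html):
    """Collect name/URL pairs, skipping <a> tags that lack either piece."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for a_tag in soup.find_all("a"):
        name = a_tag.get("aria-label")  # None when the attribute is missing
        img = a_tag.find("img")         # None when there is no nested <img>
        if name is None or img is None or img.get("src") is None:
            continue
        rows.append({"image_name": name, "image_url": img["src"]})
    return rows


bird_rows = safe_rows(MIXED_HTML)
print(bird_rows)
```

If the attribute is still absent with this approach, it may be worth checking whether the downloaded HTML actually contains it, since pages that build their markup with JavaScript can differ from what the browser inspector shows.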
Answer 3 (score: -1)
For images, you need to find the <img> tag; <a> is the tag for links.
<a aria-label="SomeImageName" data-asset-id="10101010101" href="SomeWebsite">
<img alt="SomeImageName" src="https://SomeImageUrl"/>
</a>
You found the image because, as you can see, the link tag wraps the image tag.
But that is not how dictionary syntax works; use a : inside attrs={}
(see https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-keyword-arguments).
So it is soup.find_all('a', attrs={'css': 'value'})
and not soup.find_all('a', attrs{"aria-label" "SomeImageName"})
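To make the attrs syntax concrete, here is a small sketch against made-up HTML shaped like the sample in the question. It shows two filters: an exact value match, and True, which matches any tag that has the attribute at all. Note that src belongs to the <img> tag, so it cannot be used to filter <a> tags, and aria-label has to go through attrs because its hyphen makes it unusable as a Python keyword argument.

```python
from bs4 import BeautifulSoup

# Made-up markup shaped like the sample in the question.
ATTRS_HTML = """
<a href="/home">Home</a>
<a aria-label="SomeImageName" href="SomeWebsite">
  <img alt="SomeImageName" src="https://SomeImageUrl"/>
</a>
<a aria-label="OtherImage" href="SomeWebsite">
  <img alt="OtherImage" src="https://OtherImageUrl"/>
</a>
"""

soup = BeautifulSoup(ATTRS_HTML, "html.parser")

# Exact value match: only <a> tags whose aria-label equals this string.
exact = soup.find_all("a", attrs={"aria-label": "SomeImageName"})

# True matches every <a> tag that has an aria-label, whatever its value.
labelled = soup.find_all("a", attrs={"aria-label": True})

labels = [a_tag["aria-label"] for a_tag in labelled]
urls = [a_tag.img["src"] for a_tag in labelled]
print(labels, urls)
```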