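I am using BeautifulSoup to scrape some web data into a csv file. Some of the elements I am scraping are lists of specific items; two lists, to be exact. Here is an example of how the data comes through:
Name, Image filename, [2015, 2016, 2017], [12, 55, 74]
I need a separate row for each individual item in each list.
I have already scraped all of the data into a csv file, and I would like to avoid going through the whole sheet and cleaning it up by hand. I am not opposed to doing that, but if Python can do the job, I would rather go that route.
Here is the entire script I use to scrape the data. I am fairly new to Python and have limited experience with web scraping and browser automation. I do not know whether the formatting can be done inside this script or whether I have to write a separate one: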
from urllib.request import urlopen
from bs4 import BeautifulSoup
from datetime import date
import re
import csv

with open('hyperlinks.csv', 'r') as startFile:
    for line in startFile:
        url = urlopen(line)
        soup = BeautifulSoup(url, 'html.parser')
        data_container = soup.find('aside')
        image = data_container.find('a', attrs={'class': 'image-thumbnail'})
        image_href = image.get('href')
        img_container = data_container.find('img')
        data_image_name = img_container.get('data-image-name')
        filename = data_image_name.split('.')
        final_filename = filename[0]
        train_title = data_container.find('h2')
        title_text = train_title.get_text()
        image_filename = final_filename
        full = image_filename + '.jpg'
        series = data_container.find('div', attrs={'data-source': 'series'})
        wave_links = series.find('div')
        wave_set = []
        wave_links_sep = wave_links.find_all('a')
        for item in wave_links_sep:
            text_only = item.get_text()
            wave_set.append(text_only)
        bag = data_container.find('div', attrs={'data-source': 'bag_code'})
        bag_code = bag.find('div')
        bag_text = bag_code.get_text()
        regex = re.compile(r'\s\((2015|2016|2017|2018|2019)\)')
        bag_numbers = re.sub(regex, ",", bag_text)
        bag_list = []
        for nums in bag_numbers.split(','):
            bag_list.append(nums)
        filtered_bag_list = list(filter(None, bag_list))
        with open('train_data.csv', 'a', newline='') as myFile:
            writer = csv.writer(myFile)
            writer.writerow([title_text, full, wave_set, filtered_bag_list])
Answer 0 (score: 1)
You can zip the two lists together and write one row per pair:
for wvs, bgl in zip(wave_set, filtered_bag_list):
    writer.writerow([title_text, full, wvs, bgl])
This works if your lists have the same length and their items correspond index by index.
Full example:
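If the two lists can differ in length, zip() stops at the shorter one and silently drops the remaining items. A minimal sketch of the difference, using made-up sample values (the shortened filtered_bag_list and the empty-string fill value are assumptions for illustration, not data from the question), with itertools.zip_longest as one possible way to keep every item:

```python
from itertools import zip_longest

# Hypothetical sample data: one more year than bag numbers.
wave_set = [2015, 2016, 2017]
filtered_bag_list = [12, 55]

# zip() stops at the shorter list, so 2017 is dropped without any error.
truncated = list(zip(wave_set, filtered_bag_list))

# zip_longest() keeps every item and pads the shorter list with fillvalue.
padded = list(zip_longest(wave_set, filtered_bag_list, fillvalue=""))

print(truncated)
print(padded)
```

Whether padding with an empty cell is acceptable depends on your data; if a length mismatch means the scrape went wrong, it may be better to compare len(wave_set) and len(filtered_bag_list) and skip or log that page instead.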
import csv

wave_set = [2015, 2016, 2017]
filtered_bag_list = [12, 55, 74]

with open('train_data.csv', 'a', newline='') as myFile:
    writer = csv.writer(myFile)
    for wvs, bgl in zip(wave_set, filtered_bag_list):
        writer.writerow(["some", "text", wvs, bgl])

with open("train_data.csv") as f:
    print(f.read())
File output:
some,text,2015,12
some,text,2016,55
some,text,2017,74
zip([1, 2, 3], ["a", "b", "c"])
creates the tuples (1, "a"), (2, "b"), (3, "c")
and provides them as an iterator; see e.g. "Zip lists in Python" for more insight.
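Because zip() returns an iterator rather than a list, the pairs are produced on demand and can only be consumed once. A small sketch of that behaviour (the variable names are just for illustration):

```python
pairs = zip([1, 2, 3], ["a", "b", "c"])

# The first pass walks the iterator and collects every tuple.
first_pass = list(pairs)

# The iterator is now exhausted, so a second pass yields nothing.
second_pass = list(pairs)

print(first_pass)
print(second_pass)
```

If you need the pairs more than once, for example to print them and then write them to the csv file, store list(zip(...)) in a variable first.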