So I want to know how to scrape multiple websites/URLs and save them (the data) into one CSV file. Right now I can only save the first page. I have tried many different approaches, but none of them seem to work. How can I save 5 pages in one CSV file, instead of just one?
import requests
import csv
from bs4 import BeautifulSoup
import pandas as pd
import re
from datetime import timedelta
import datetime
import time
urls = ['https://store.steampowered.com/search/?specials=1&page=1',
        'https://store.steampowered.com/search/?specials=1&page=2',
        'https://store.steampowered.com/search/?specials=1&page=3',
        'https://store.steampowered.com/search/?specials=1&page=4',
        'https://store.steampowered.com/search/?specials=1&page=5']
for url in urls:
    my_url = requests.get(url)
    html = my_url.content
    soup = BeautifulSoup(html, 'html.parser')
    data = []
    ts = time.time()
    st = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')
    for container in soup.find_all('div', attrs={'class': 'responsive_search_name_combined'}):
        title = container.find('span', attrs={'class': 'title'}).text
        if container.find('span', attrs={'class': 'win'}):
            win = '1'
        else:
            win = '0'
        if container.find('span', attrs={'class': 'mac'}):
            mac = '1'
        else:
            mac = '0'
        if container.find('span', attrs={'class': 'linux'}):
            linux = '1'
        else:
            linux = '0'
        data.append({
            'Title': title.encode('utf-8'),
            'Time': st,
            'Win': win,
            'Mac': mac,
            'Linux': linux})

with open('data.csv', 'w', encoding='UTF-8', newline='') as f:
    fields = ['Title', 'Win', 'Mac', 'Linux', 'Time']
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(data)

testing = pd.read_csv('data.csv')
heading = testing.head(100)
discription = testing.describe()
print(heading)
Answer 0 (score: 2)
The problem is that you re-initialize `data` after each URL and only write it out after the last iteration, which means you always end up with just the data from the last URL. You need to append the data instead of overwriting it on each iteration:
import requests
import csv
from bs4 import BeautifulSoup
import pandas as pd
import re
from datetime import timedelta
import datetime
import time
urls = ['https://store.steampowered.com/search/?specials=1&page=1',
        'https://store.steampowered.com/search/?specials=1&page=2',
        'https://store.steampowered.com/search/?specials=1&page=3',
        'https://store.steampowered.com/search/?specials=1&page=4',
        'https://store.steampowered.com/search/?specials=1&page=5']
results_df = pd.DataFrame()  # <-- initialize a results dataframe to dump/store the data you collect after each iteration

for url in urls:
    my_url = requests.get(url)
    html = my_url.content
    soup = BeautifulSoup(html, 'html.parser')
    data = []  # <-- your data list is "reset" after each iteration of your urls
    ts = time.time()
    st = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')
    for container in soup.find_all('div', attrs={'class': 'responsive_search_name_combined'}):
        title = container.find('span', attrs={'class': 'title'}).text
        if container.find('span', attrs={'class': 'win'}):
            win = '1'
        else:
            win = '0'
        if container.find('span', attrs={'class': 'mac'}):
            mac = '1'
        else:
            mac = '0'
        if container.find('span', attrs={'class': 'linux'}):
            linux = '1'
        else:
            linux = '0'
        data.append({
            'Title': title,
            'Time': st,
            'Win': win,
            'Mac': mac,
            'Linux': linux})
    temp_df = pd.DataFrame(data)  # <-- temporarily store the page's data in a dataframe
    results_df = results_df.append(temp_df).reset_index(drop=True)  # <-- dump that data into the results dataframe

results_df.to_csv('data.csv', index=False)  # <-- write the results dataframe to csv

testing = pd.read_csv('data.csv')
heading = testing.head(100)
discription = testing.describe()
print(heading)
Output:
print (results_df)
Linux Mac ... Title Win
0 0 0 ... Tom Clancy's Rainbow Six® Siege 1
1 0 0 ... Tom Clancy's Rainbow Six® Siege 1
2 1 1 ... Total War: WARHAMMER II 1
3 0 0 ... Tom Clancy's Rainbow Six® Siege 1
4 1 1 ... Total War: WARHAMMER II 1
5 0 1 ... Frostpunk 1
6 0 0 ... Tom Clancy's Rainbow Six® Siege 1
7 1 1 ... Total War: WARHAMMER II 1
8 0 1 ... Frostpunk 1
9 1 1 ... Two Point Hospital 1
10 0 0 ... Tom Clancy's Rainbow Six® Siege 1
11 1 1 ... Total War: WARHAMMER II 1
12 0 1 ... Frostpunk 1
13 1 1 ... Two Point Hospital 1
14 0 0 ... Black Desert Online 1
15 0 0 ... Tom Clancy's Rainbow Six® Siege 1
16 1 1 ... Total War: WARHAMMER II 1
17 0 1 ... Frostpunk 1
18 1 1 ... Two Point Hospital 1
19 0 0 ... Black Desert Online 1
20 1 1 ... Kerbal Space Program 1
21 0 0 ... Tom Clancy's Rainbow Six® Siege 1
22 1 1 ... Total War: WARHAMMER II 1
23 0 1 ... Frostpunk 1
24 1 1 ... Two Point Hospital 1
25 0 0 ... Black Desert Online 1
26 1 1 ... Kerbal Space Program 1
27 1 1 ... BioShock Infinite 1
28 0 0 ... Tom Clancy's Rainbow Six® Siege 1
29 1 1 ... Total War: WARHAMMER II 1
... .. ... ... ..
1595 0 0 ... VEGAS Pro 14 Edit Steam Edition 1
1596 0 0 ... ABZU 1
1597 0 0 ... Sacred 2 Gold 1
1598 0 0 ... Sakura Bundle 1
1599 1 1 ... Distance 1
1600 0 0 ... LEGO® Batman™: The Videogame 1
1601 0 0 ... Sonic Forces 1
1602 0 0 ... The Stronghold Collection 1
1603 0 0 ... Miscreated 1
1604 0 0 ... Batman™: Arkham VR 1
1605 1 1 ... Shadowrun Returns 1
1606 0 0 ... Upgrade to VEGAS Pro 16 Edit 1
1607 0 0 ... Girl Hunter VS Zombie Bundle 1
1608 0 1 ... Football Manager 2019 Touch 1
1609 0 1 ... Total War: NAPOLEON - Definitive Edition 1
1610 1 1 ... SteamWorld Dig 2 1
1611 0 0 ... Condemned: Criminal Origins 1
1612 0 0 ... Company of Heroes 1
1613 0 0 ... LEGO® Batman™ 2: DC Super Heroes 1
1614 1 1 ... Euro Truck Simulator 2 Map Booster 1
1615 0 0 ... Sonic Adventure DX 1
1616 0 0 ... Worms Armageddon 1
1617 1 1 ... Unforeseen Incidents 1
1618 0 0 ... Warhammer 40,000: Space Marine Collection 1
1619 0 0 ... VEGAS Pro 14 Edit Steam Edition 1
1620 0 0 ... ABZU 1
1621 0 0 ... Sacred 2 Gold 1
1622 0 0 ... Sakura Bundle 1
1623 1 1 ... Distance 1
1624 0 0 ... Worms Revolution 1
[1625 rows x 5 columns]
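A side note on the accumulation step: `DataFrame.append` was deprecated and then removed in pandas 2.0, so the snippet above will fail on a current install. Here is a minimal sketch of the same pattern using `pd.concat` instead, collecting one DataFrame per page in a plain list and concatenating once at the end (reusing the `urls` list and the selectors from the code above):

page_frames = []  # one DataFrame per scraped page
for url in urls:
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    rows = [{'Title': c.find('span', attrs={'class': 'title'}).text,
             'Win': '1' if c.find('span', attrs={'class': 'win'}) else '0',
             'Mac': '1' if c.find('span', attrs={'class': 'mac'}) else '0',
             'Linux': '1' if c.find('span', attrs={'class': 'linux'}) else '0'}
            for c in soup.find_all('div', attrs={'class': 'responsive_search_name_combined'})]
    page_frames.append(pd.DataFrame(rows))

results_df = pd.concat(page_frames, ignore_index=True)  # a single concat also avoids growing a DataFrame row by row
results_df.to_csv('data.csv', index=False)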
Answer 1 (score: 0)
So I was apparently quite blind to my own code, which can happen when you stare at it all day. All I actually had to do was move `data = []` above the for loop so that it doesn't get reset on every pass.
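In other words, a minimal sketch of that fix, reusing the imports, `urls` list, and selectors from my original code: `data` is created once before the loop, and the CSV is written once after it.

data = []  # created once, before the URL loop, so results from every page accumulate
for url in urls:
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    st = datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')
    for container in soup.find_all('div', attrs={'class': 'responsive_search_name_combined'}):
        data.append({
            'Title': container.find('span', attrs={'class': 'title'}).text,
            'Time': st,
            'Win': '1' if container.find('span', attrs={'class': 'win'}) else '0',
            'Mac': '1' if container.find('span', attrs={'class': 'mac'}) else '0',
            'Linux': '1' if container.find('span', attrs={'class': 'linux'}) else '0'})

# written once, after the loop, so the file contains every page
with open('data.csv', 'w', encoding='UTF-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['Title', 'Win', 'Mac', 'Linux', 'Time'])
    writer.writeheader()
    writer.writerows(data)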