I created a web scraper, purely for Python programming education purposes, that scrapes a stock's Yahoo Finance summary and statistics pages. It reads from a "1stocklist.csv" file in the program directory that looks like this:
Symbols
SNAP
KO
From there, it adds each new piece of information as a new column in the dataframe. There are a lot of `for` loops in there that I'm still tweaking, because it doesn't grab some of the data correctly, but that's fine for now.
My problem is saving the dataframe to a new .csv file. As you can see from how it outputs right now:
The SNAP row should start with 14.02, after which everything is correct, and the next row should be KO, starting with 51.39.
Any ideas? Just create a 1stocklist.csv file like the one above and try it. Thanks!
# Import dependencies
from bs4 import BeautifulSoup
import re, random, time, requests, datetime, csv
import pandas as pd
import numpy as np

# Use Pandas to read the "1stocklist.csv" file. We'll use Pandas so that we can append a 'dataframe' with new
# information we get from the Zacks site to work with in the program and output to the 'data(date).csv' file later
maindf = pd.read_csv('1stocklist.csv', skiprows=1, names=[
    # The .csv header names
    "Symbols"
])  #, delimiter = ','

# Setting a time delay will help keep scraping suspicion down and server load down when scraping the Zacks site
timeDelay = random.randrange(2, 8)

# Start scraping Yahoo
print('Beginning to scrape Yahoo Finance site for information ...')
tickerlist = len(maindf['Symbols'])  # for progress bar

# Create a progress counter to display how far along in the zacks rank scraping it is
zackscounter = 1

# For every ticker in the stocklist dataframe
for ticker in maindf['Symbols']:
    # Print the progress
    print(zackscounter, ' of ', tickerlist, ' - ', ticker)  # for seeing which stock it's currently on

    # The list of URLs for the stock's different pages to scrape the information from
    summaryurl = 'https://ca.finance.yahoo.com/quote/' + ticker
    statsurl = 'https://ca.finance.yahoo.com/quote/' + ticker + '/key-statistics'

    # Define the headers to use in Beautiful Soup 4
    headers = requests.utils.default_headers()
    headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'

    # Employ random time delay now before starting with the (next) ticker
    time.sleep(timeDelay)

    # Use Beautiful Soup 4 to get the info from the first Summary URL page
    page = requests.get(summaryurl, headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')
    counter = 0  # used to tell which 'td' it's currently looking at
    table = soup.find('div', {'id': 'quote-summary'})
    for i in table.find_all('span'):
        counter += 1
        if counter % 2 == 0:  # All even td's are the metrics/numbers we want
            data_point = i.text
            #print(data_point)
            maindf[column_name] = data_point  # Add the data point to the right column
        else:  # All odd td's are the header names
            column_name = i.text
            #print(column_name)

    # Use Beautiful Soup 4 to get the info from the second stats URL page
    page = requests.get(statsurl, headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')
    time.sleep(timeDelay)

    # Get all the data in the tables
    counter = 0  # used to tell which 'td' it's currently looking at
    table = soup.find('section', {'data-test': 'qsp-statistics'})
    for i in table.find_all('td'):
        counter += 1
        if counter % 2 == 0:  # All even td's are the metrics/numbers we want
            data_point = i.text
            #print(data_point)
            maindf[column_name] = data_point  # Add the data point to the right column
        else:  # All odd td's are the header names
            column_name = i.text
            #print(column_name)

    file_name = 'data_raw.csv'
    if zackscounter == 1:
        maindf.to_csv(file_name, index=False)
    else:
        maindf.to_csv(file_name, index=False, header=False, mode='a')
    zackscounter += 1
Update:
I know this has to do with how I try to append the dataframe to the .csv file at the end. My starting dataframe is just a single column with all the tickers in it, and I then try to add each new column to the dataframe as the program runs, which fills it down to the bottom of the stock list. What I want is to add the column_name headers where they belong, then append the data specific to one ticker, and do the same for each ticker in my dataframe's 'Symbols' column. Hopefully that makes the question clearer?
I've tried using .loc in various ways without success. Thanks!
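The symptom can be reproduced without any scraping. In pandas, assigning a single scalar to a dataframe column broadcasts that scalar to every row, so each ticker scraped later overwrites the values stored for all earlier tickers. A minimal sketch (the column name and prices are made-up stand-ins for the scraped data):

```python
import pandas as pd

# Two tickers, mirroring 1stocklist.csv
maindf = pd.DataFrame({'Symbols': ['SNAP', 'KO']})

# Simulate the scraping loop: each "scraped" value is a plain scalar
for price in ['14.02', '51.39']:
    # This broadcasts the scalar to EVERY row, clobbering the
    # value written for the previous ticker
    maindf['Previous Close'] = price

print(maindf['Previous Close'].tolist())  # ['51.39', '51.39'] -- SNAP's 14.02 is gone
```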
Answer 0 (score: 0)
Updated with the answer
I was able to figure it out!
Basically, I turned the first dataframe, read from 1stocklist.csv, into its own dataframe, and then created a new blank one to use inside the first for loop. Here is the updated opening section I created:
# Use Pandas to read the "1stocklist.csv" file. We'll use Pandas so that we can append a 'dataframe' with new
# information we get from the Zacks site to work with in the program and output to the 'data(date).csv' file later
opening_dataframe = pd.read_csv('1stocklist.csv', skiprows=1, names=[
    # The .csv header names
    "Symbols"
])  #, delimiter = ','

# Setting a time delay will help keep scraping suspicion down and server load down when scraping the Zacks site
timeDelay = random.randrange(2, 8)

# Start scraping Yahoo
print('Beginning to scrape Yahoo Finance site for information ...')
tickerlist = len(opening_dataframe['Symbols'])  # for progress bar

# Create a progress counter to display how far along in the zacks rank scraping it is
zackscounter = 1

# For every ticker in the stocklist dataframe
for ticker in opening_dataframe['Symbols']:
    # Start a fresh one-row dataframe for this ticker
    maindf = pd.DataFrame(columns=['Symbols'])
    maindf.loc[len(maindf)] = ticker

    # Print the progress
    print(zackscounter, ' of ', tickerlist, ' - ', ticker)  # for seeing which stock it's currently on

    # The list of URLs for the stock's different pages to scrape the information from
    summaryurl = 'https://ca.finance.yahoo.com/quote/' + ticker
    statsurl = 'https://ca.finance.yahoo.com/quote/' + ticker + '/key-statistics'
......
......
......
Note the name change to "opening_dataframe = ...", and the

    maindf = pd.DataFrame(columns=['Symbols'])
    maindf.loc[len(maindf)] = ticker

part. I also make use of .loc to add to the next available row in the dataframe. Hope this helps someone!
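The whole fix can be sketched end to end without the scraping: build a fresh one-row dataframe per ticker, fill its columns (a scalar assignment now touches only that one row), and append it to the CSV, writing the header only on the first pass. The column name and prices below are made-up stand-ins for the scraped values, and the output path is a temp file rather than the program directory:

```python
import os
import tempfile

import pandas as pd

opening_dataframe = pd.DataFrame({'Symbols': ['SNAP', 'KO']})
fake_data = {'SNAP': '14.02', 'KO': '51.39'}  # stand-in for scraped values

file_name = os.path.join(tempfile.gettempdir(), 'data_raw.csv')
zackscounter = 1
for ticker in opening_dataframe['Symbols']:
    # Fresh one-row dataframe for this ticker only
    maindf = pd.DataFrame(columns=['Symbols'])
    maindf.loc[len(maindf)] = ticker
    # Scalar assignment now fills just this single row
    maindf['Previous Close'] = fake_data[ticker]
    # Header on the first write, bare rows appended after that
    if zackscounter == 1:
        maindf.to_csv(file_name, index=False)
    else:
        maindf.to_csv(file_name, index=False, header=False, mode='a')
    zackscounter += 1

print(open(file_name).read())
# Symbols,Previous Close
# SNAP,14.02
# KO,51.39
```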