First off, I'll say that I don't condone scraping sites whose terms of service disallow it; this is purely academic research into hypothetically gathering financial data from various websites.
Suppose this link:
https://finviz.com/screener.ashx?v=141&f=geo_usa,ind_stocksonly,sh_avgvol_o100,sh_price_o1&o=ticker
...is stored in a URLs.csv file, and I want to scrape columns 2-5 (i.e. the ticker symbol, Perf Week, Perf Month, and Perf Quarter) and export them to a CSV file. What might that code look like?
Adapting an answer from another user that I came across previously, I have something like the following so far:
from bs4 import BeautifulSoup
import requests
import csv, random, time

# Open 'URLs.csv' to read the list of URLs
with open('URLs.csv', newline='') as f_urls, open('Results.csv', 'w', newline='') as f_output:
    csv_urls = csv.reader(f_urls)
    csv_output = csv.writer(f_output, delimiter=',')

    headers = requests.utils.default_headers()
    headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'

    csv_output.writerow(['Ticker', 'Perf Week', 'Perf Month', 'Perf Quarter'])

    # Loop over each URL/row in the .csv
    for line in csv_urls:
        # Fetch the URL and look for items
        page = requests.get(line[0], headers=headers)
        soup = BeautifulSoup(page.text, 'html.parser')
        symbol = soup.findAll('a', {'class': 'screener-link-primary'})
        perfdata = soup.findAll('a', {'class': 'screener-link'})
        lines = list(zip(perfdata, symbol))

        # pair up every two items
        for perfdata1, symbol1 in zip(lines[1::2], lines[::2]):
            # extract string items
            a1, a2, a3, _ = (x.text for x in symbol1 + perfdata1)
            # reorder and write row
            row = a1, a2, a3
            print(row)
            csv_output.writerow(row)
...and I get the following output:
('1', 'A', '7.52%')
('-0.94%', 'AABA', '5.56%')
('10.92%', 'AAL', '-0.58%')
('4.33%', 'AAOI', '2.32%')
('2.96%', 'AAP', '1.80')
('2.83M', 'AAT', '0.43')
('70.38', 'AAXN', '0.69%')
...
So it's skipping some rows and not returning the data in the correct order. What I'd like to see in the final output is:
('A', '7.52%', '-0.94%', '5.56%')
('AA', '0.74%', '0.42%', '-20.83%')
('AABA', '7.08%', '0.50%', '7.65%')
('AAC', '31.18%', '-10.95%', '-65.14%')
...
I know it's the last few parts of the code that are off, but I could use some guidance. Thanks!
Answer 0 (score: 2)
The problem is that you're pulling the Ticker column and random cells (.screener-link) separately; extract whole table rows instead.
for line in csv_urls:
    # Fetch the URL and look for items
    page = requests.get(line[0], headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')

    # Select every row of the screener results table
    rows = soup.select('table[bgcolor="#d3d3d3"] tr')
    for row in rows[1:]:
        # extract the four string items from columns 2-5
        a1, a2, a3, a4 = (x.text for x in row.find_all('td')[1:5])
        record = (a1, a2, a3, a4)
        print(record)
        # write row
        csv_output.writerow(record)
Output:
('A', '7.52%', '-0.94%', '5.56%')
('AA', '0.74%', '0.42%', '-20.83%')
('AABA', '7.08%', '0.50%', '7.65%')
('AAC', '31.18%', '-10.95%', '-65.14%')
('AAL', '-0.75%', '-6.74%', '0.60%')
('AAN', '5.68%', '6.51%', '-6.55%')
('AAOI', '5.47%', '-17.10%', '-23.12%')
('AAON', '0.62%', '1.10%', '8.58%')
('AAP', '0.38%', '-3.85%', '-2.30%')
('AAPL', '2.72%', '-9.69%', '-29.61%')
('AAT', '3.26%', '-2.39%', '10.74%')
('AAWW', '15.87%', '1.55%', '-9.62%')
('AAXN', '7.48%', '11.85%', '-14.24%')
('AB', '1.32%', '6.67%', '-2.73%')
('ABBV', '-0.85%', '0.16%', '-5.12%')
('ABC', '3.15%', '-7.18%', '-15.72%')
('ABCB', '5.23%', '-3.31%', '-22.35%')
('ABEO', '1.71%', '-10.41%', '-28.81%')
('ABG', '1.71%', '8.95%', '12.70%')
('ABM', '7.09%', '26.92%', '5.90%')
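
One gap worth noting: the screener shows only 20 rows per page, so a single request captures just the first page of results. Below is a minimal sketch of how the same row extraction could walk the remaining pages, with a randomized delay between requests (the question already imports random and time for this). It assumes finviz's r URL parameter is a 1-based row offset for pagination; that assumption, and the hypothetical max_pages cap, are worth verifying before relying on it.

import random
import time

import requests
from bs4 import BeautifulSoup

def scrape_pages(base_url, headers, max_pages=5):
    """Yield (ticker, perf_week, perf_month, perf_quarter) tuples across pages."""
    offset = 1  # assumed 1-based row offset used by finviz's r parameter
    for _ in range(max_pages):  # hypothetical safety cap on page count
        page = requests.get('{}&r={}'.format(base_url, offset), headers=headers)
        soup = BeautifulSoup(page.text, 'html.parser')
        rows = soup.select('table[bgcolor="#d3d3d3"] tr')[1:]
        if not rows:
            break  # no rows returned: assume we're past the last page
        for row in rows:
            yield tuple(td.text for td in row.find_all('td')[1:5])
        offset += len(rows)
        time.sleep(random.uniform(1, 3))  # randomized pause between requests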
Answer 1 (score: 1)
This is just my preference, but for reading and writing CSVs I like to use Pandas.
I'm also assuming every link in your list points to the same table structure. If that's not the case, I'd need to see a few of the links to make this run more reliably. Otherwise, for the one link you provided, this gets the desired output.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv, random, time

# Read in the csv
csv_df = pd.read_csv('URLs.csv')

# Create a list from the column holding the urls. Change the column name to whatever you have it named in the csv file
csv_urls = list(csv_df['NAME OF COLUMN WITH URLS'])

########### delete this line below. This is for me to test ####################
csv_urls = ['https://finviz.com/screener.ashx?v=141&f=geo_usa,ind_stocksonly,sh_avgvol_o100,sh_price_o1&o=ticker']
###############################################################################

headers = requests.utils.default_headers()
headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'

result = pd.DataFrame()
for url in csv_urls:
    # pd.read_html does its own fetch and would ignore the headers above,
    # so fetch the page with requests first and parse the returned HTML
    tables = pd.read_html(requests.get(url, headers=headers).text)
    for dataframe in tables:
        # I'm assuming the tables are all the same across your URLs. Otherwise this won't work for all of them
        # The table you're interested in is the one with 16 columns
        if len(dataframe.columns) == 16:
            table = dataframe
        else:
            continue

        # Make the first row the column headers, then keep columns 2-5 of the remaining rows
        table.columns = table.iloc[0, :]
        table = table.iloc[1:, 1:5]
        result = result.append(table)

result.to_csv('Results.csv', index=False)
Output:
print(result)
0 Ticker Perf Week Perf Month Perf Quart
1 A 7.52% -0.94% 5.56%
2 AA 0.74% 0.42% -20.83%
3 AABA 7.08% 0.50% 7.65%
4 AAC 31.18% -10.95% -65.14%
5 AAL -0.75% -6.74% 0.60%
6 AAN 5.68% 6.51% -6.55%
7 AAOI 5.47% -17.10% -23.12%
8 AAON 0.62% 1.10% 8.58%
9 AAP 0.38% -3.85% -2.30%
10 AAPL 2.72% -9.69% -29.61%
11 AAT 3.26% -2.39% 10.74%
12 AAWW 15.87% 1.55% -9.62%
13 AAXN 7.48% 11.85% -14.24%
14 AB 1.32% 6.67% -2.73%
15 ABBV -0.85% 0.16% -5.12%
16 ABC 3.15% -7.18% -15.72%
17 ABCB 5.23% -3.31% -22.35%
18 ABEO 1.71% -10.41% -28.81%
19 ABG 1.71% 8.95% 12.70%
20 ABM 7.09% 26.92% 5.90%
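
A caveat if you run this today: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the accumulation loop above raises an AttributeError on current installs. Here is a minimal sketch of the same loop, collecting the trimmed tables in a list and concatenating once at the end:

frames = []
for url in csv_urls:
    for dataframe in pd.read_html(requests.get(url, headers=headers).text):
        # keep only the 16-column screener table, as above
        if len(dataframe.columns) == 16:
            dataframe.columns = dataframe.iloc[0, :]
            frames.append(dataframe.iloc[1:, 1:5])

result = pd.concat(frames, ignore_index=True)
result.to_csv('Results.csv', index=False)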