Scraping multiple web pages and writing to a CSV file

Asked: 2013-11-03 05:50:34

Tags: python csv web-scraping beautifulsoup

I am writing a program that fetches seven pieces of data from a website and writes them to a CSV file, one row for each company listed in symbols.txt, such as AAPL or NFLX. My problem seems to stem from my confusion over the indexing needed to make the script work. I am at a loss as to how it all fits together. I thought this format would work...

import urllib2
from BeautifulSoup import BeautifulSoup
import csv
import re
import urllib
# import modules

symbolfile = open("symbols.txt")
symbolslist = symbolfile.read()
newsymbolslist = symbolslist.split("\n")

i = 0

f = csv.writer(open("pe_ratio.csv","wb"))
# short cut to write

f.writerow(["Name","PE","Revenue % Quarterly","ROA% YOY","Operating Cashflow","Debt to Equity"])
#first write row statement

# define name_company as the following
while i<len(newsymbolslist):
    page = urllib2.urlopen("http://finance.yahoo.com/q/ks?s="+newsymbolslist[i] +"%20Key%20Statistics").read()
    soup = BeautifulSoup(page)
    name_company = soup.findAll("div", {"class" : "title"}) 
    for name in name_company: #add multiple iterations?     
        all_data = soup.findAll('td', "yfnc_tabledata1")
        stock_name = name.find('h2').string #find company's name in name_company with h2 tag
        f.writerow([stock_name, all_data[2].getText(),all_data[17].getText(),all_data[13].getText(), all_data[29].getText(),all_data[26].getText()]) #write down PE data
    i+=1    

When I try to run the code as-is, I get the following error:

Traceback (most recent call last):
  File "company_data_v1.py", line 28, in <module>
    f.writerow([stock_name, all_data[2].getText(),all_data[17].getText(),all_data[13].getText(), all_data[29].getText(),all_data[26].getText()]) #write down PE data
IndexError: list index out of range

Thanks in advance for your help.

2 answers:

Answer 0 (score: 2)

name_company = soup.findAll("div", {"class" : "title"})
soup = BeautifulSoup(page) #this is the first time you define soup

You define soup on the line after you try soup.findAll. Python is telling you exactly what the problem is: soup has not yet been defined at the line where you call findAll.

Flip the order of those lines.
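A minimal sketch of the corrected ordering (shown here with the modern bs4 package rather than the old BeautifulSoup 3 import, since the ordering point is the same; the HTML string is a stand-in for the downloaded Yahoo page):

```python
from bs4 import BeautifulSoup  # bs4 is the successor of the BeautifulSoup 3 module

# Stand-in for the page fetched with urllib2.urlopen(...).read()
page = '<div class="title"><h2>Apple Inc. (AAPL)</h2></div>'

soup = BeautifulSoup(page, "html.parser")               # define soup FIRST...
name_company = soup.findAll("div", {"class": "title"})  # ...then call findAll on it

for name in name_company:
    print(name.find("h2").string)
```

With soup bound before findAll is called, the NameError disappears and the loop over name_company behaves as the question intended.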

Answer 1 (score: 1)

I assume that when you say "where to place the variable to make the script work" you are referring to this 'soup' variable? The one in your error message?

If so, then I suggest declaring 'soup' before you attempt to use it in soup.findAll(). As you can see, you declare soup = BeautifulSoup(page) one line after soup.findAll(). It should go above it.