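I'm trying to scrape articles from the New York Times and keep getting a long list of errors. I was wondering if someone could point me in the right direction. Below are the URL of the article in question, my code, and the console output. Any help would be greatly appreciated.
Article: http://www.nytimes.com/2014/03/10/world/asia/malaysia-airlines-flight.html?ref=world&_r=0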
import io
import urllib2
from bs4 import BeautifulSoup

# URL of the article to scrape
url = "http://www.nytimes.com/2014/03/10/world/asia/malaysia-airlines-flight.html?ref=world&_r=0"

# Parse HTML of article, aka making soup
soup = BeautifulSoup(urllib2.urlopen(url).read(), "html.parser")

# Open txt document for output once; the with block closes it when done
with io.open('ctp_output.txt', 'w', encoding='utf-8') as txt:
    # Write the article title to the file
    title = soup.find("h1")
    txt.write(u"\nTitle: " + title.get_text() + u"\n\n")

    # Write the article date to the file
    try:
        date = soup.find("span", {'class': 'dateline'}).text
        txt.write(u"Date: " + date + u"\n\n")
    except AttributeError:
        # soup.find() returned None, so there was no .text to read
        print "Could not find the date!"

    # Write the article author to the file
    try:
        byline = soup.find("p", {'class': 'byline-author'}).text
        txt.write(u"Author: " + byline + u"\n\n")
    except AttributeError:
        print "Could not find the author!"

    # Write the article location(s) to the file
    locations = [span.get_text() for span in soup.find_all("span", {'class': 'location'})]
    txt.write(u"Location: " + u", ".join(locations) + u"\n\n")

    # Retrieve all of the paragraph tags and write their text
    for tag in soup.find_all('p'):
        txt.write(tag.text + u"\n\n")
Sample output from console:
andrews-mbp-3:CTP Andrew$ python idle_test.py
Please enter a valid URL: http://www.nytimes.com/2014/03/10/world/asia/malaysia-airlines- flight.html?ref=world&_r=0
Traceback (most recent call last):
  File "idle_test.py", line 20, in <module>
    soup = BeautifulSoup(urllib2.urlopen(url).read())
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 410, in open
    response = meth(req, response)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 523, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 442, in error
    result = self._call_chain(*args)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
Answer 0 (score: 2)
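As you can see from the list of errors (also called a traceback), the (first) error occurs on line 20, when you call urllib2.urlopen(url). So look at what you are passing to that function. Your variable url is expected to be a string, but it has no quotes around it, which makes me wonder why the code did not raise an error earlier.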
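I said first error because the code you write the first time around will contain plenty of errors (true for most programmers, doubly so for new ones). Learning to program is in many ways learning how to interpret the errors (tracebacks) the computer gives you.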
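To make that concrete: the traceback in the question is cut off before its last line, which is the line that names the exception. Below is a minimal sketch of one way to surface that information yourself. The names describe_failure and fetch are hypothetical helpers, not part of the original script, and the try/except import is only there so the sketch runs on Python 2 (urllib2) as well as Python 3 (urllib.request and urllib.error).

```python
from __future__ import print_function

try:  # Python 2, as used in the question
    from urllib2 import urlopen, HTTPError, URLError
except ImportError:  # Python 3 keeps the same names in urllib.request / urllib.error
    from urllib.request import urlopen
    from urllib.error import HTTPError, URLError


def describe_failure(exc):
    """Return a one-line summary of a urlopen failure (hypothetical helper)."""
    # HTTPError is a subclass of URLError, so it has to be checked first.
    if isinstance(exc, HTTPError):
        # The server answered, but with an error status code.
        return "HTTP error %d: %s" % (exc.code, exc.msg)
    if isinstance(exc, URLError):
        # No usable answer at all, e.g. unknown host or no network.
        return "Could not reach the server: %s" % (exc.reason,)
    return "Unexpected error: %r" % (exc,)


def fetch(url):
    """Return the page body, or None after printing why the request failed."""
    try:
        return urlopen(url).read()
    except (HTTPError, URLError) as exc:
        print(describe_failure(exc))
        return None
```

Calling fetch(url) in place of the bare urllib2.urlopen(url).read() would print the status code and message instead of a wall of library frames. Whether that alone explains the failure in the question is not something the truncated traceback can confirm.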
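You have just changed the definition of url to a call to the raw_input function. Please don't do this, as it makes the code harder to read and debug. urllib has a problem with the variable url, and hiding the value of url makes debugging that much harder. From experience, I would guess that including (or not including) http or some similar piece of syntax is what is tripping you up, but if I cannot see url, I can only guess.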
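A rough sketch of how to make the value of url visible before it reaches urllib, and how to check it for the kind of problems guessed at above. The function url_problems is a hypothetical helper and its checks are assumptions about what might go wrong (a missing http prefix, stray spaces), not a diagnosis of the actual failure. The try/except import lets it run on Python 2 or Python 3.

```python
from __future__ import print_function

try:  # Python 2
    from urlparse import urlparse
except ImportError:  # Python 3
    from urllib.parse import urlparse


def url_problems(url):
    """Return a list of human-readable problems found in url (hypothetical helper)."""
    problems = []
    stripped = url.strip()
    if url != stripped:
        problems.append("leading or trailing whitespace")
    if " " in stripped:
        problems.append("space inside the URL")
    parts = urlparse(stripped)
    if parts.scheme not in ("http", "https"):
        problems.append("missing http:// or https:// prefix")
    elif not parts.netloc:
        problems.append("missing host name")
    return problems


url = "http://www.nytimes.com/2014/03/10/world/asia/malaysia-airlines-flight.html?ref=world&_r=0"
# repr() shows the quotes and any stray whitespace, so nothing about the value is hidden.
print("url =", repr(url))
print("problems:", url_problems(url))
```

If the script reads the URL with raw_input, printing repr(url) right after the prompt shows exactly what was typed or pasted, including any space that crept in when the address was copied.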