I have a .txt file containing the full URLs of a number of pages, each of which contains a table I want to scrape data from. My code works for a single URL, but when I try to add a loop and read the URLs in from the .txt file, I get the following error
raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: ?
Here is my code
from urllib2 import urlopen
from bs4 import BeautifulSoup as soup

with open('urls.txt', 'r') as f:
    urls = f.read()

for url in urls:
    uClient = urlopen(url)
    page_html = uClient.read()
    uClient.close()
    page_soup = soup(page_html, "html.parser")
    containers = page_soup.findAll("tr", {"class":"data"})
    for container in containers:
        unform_name = container.findAll("th", {"width":"30%"})
        name = unform_name[0].text.strip()
        unform_delegate = container.findAll("td", {"id":"y000"})
        delegate = unform_delegate[0].text.strip()
        print(name)
        print(delegate)
f.close()
I checked my .txt file and all the entries look fine. They start with http and end with .html. There are no apostrophes or quotation marks around them. Did I write the for loop incorrectly?
Using
with open('urls.txt', 'r') as f:
    for url in f:
        print(url)
I get the following
??http://www.thegreenpapers.com/PCC/AL-D.html
http://www.thegreenpapers.com/PCC/AL-R.html
http://www.thegreenpapers.com/PCC/AK-D.html
and so on for 100 lines. Only the first line has the question marks. My .txt file contains those URLs with only the state and party abbreviation changing.
Answer 0 (score: 2)
The approach you have tried can be fixed by changing two different lines in your code.
Try this:
with open('urls.txt', 'r') as f:
    urls = f.readlines()  # make sure this line is properly indented
for url in urls:
    uClient = urlopen(url.strip())
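As a small illustration (assuming the same urls.txt from the question): readlines() returns one string per line, each still ending in a newline, which url.strip() removes before the URL is opened.

# Minimal sketch: readlines() yields one string per line of the file,
# each with its trailing newline still attached; strip() removes it.
with open('urls.txt', 'r') as f:
    urls = f.readlines()

for url in urls[:2]:
    print(repr(url))          # e.g. 'http://www.thegreenpapers.com/PCC/AL-D.html\n'
    print(repr(url.strip()))  # newline and surrounding whitespace removed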
Answer 1 (score: 1)
You can't read the whole file into a string with 'f.read()' and then iterate over that string. To fix it, see the changes below. I also removed your last line: when you use a 'with' statement, the file is closed automatically once the block finishes.
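To make the difference concrete, here is a minimal sketch (again assuming the same urls.txt) of what each approach actually iterates over:

# Iterating over the string returned by f.read() walks it one character
# at a time, which is why urlopen was handed fragments like '?'.
with open('urls.txt', 'r') as f:
    text = f.read()
print(list(text)[:4])  # first four characters of the file, not URLs

# Iterating over the file object itself yields one line (one URL) at a time.
with open('urls.txt', 'r') as f:
    for url in f:
        print(url)  # one full URL per iteration
        break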
The code from Greg Hewgill below (for Python 2) shows whether the url string is a 'str' or 'unicode'.
from urllib2 import urlopen
from bs4 import BeautifulSoup as soup

# Code from Greg Hewgill
def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"

with open('urls.txt', 'r') as f:
    for url in f:
        print(url)
        whatisthis(url)
        uClient = urlopen(url)
        page_html = uClient.read()
        uClient.close()
        page_soup = soup(page_html, "html.parser")
        containers = page_soup.findAll("tr", {"class":"data"})
        for container in containers:
            unform_name = container.findAll("th", {"width":"30%"})
            name = unform_name[0].text.strip()
            unform_delegate = container.findAll("td", {"id":"y000"})
            delegate = unform_delegate[0].text.strip()
            print(name)
            print(delegate)
Running the code with a text file containing the URLs listed above produces the following output:
http://www.thegreenpapers.com/PCC/AL-D.html
ordinary string
Gore, Al
54. 84%
Uncommitted
10. 16%
LaRouche, Lyndon
http://www.thegreenpapers.com/PCC/AL-R.html
ordinary string
Bush, George W.
44. 100%
Keyes, Alan
Uncommitted
http://www.thegreenpapers.com/PCC/AK-D.html
ordinary string
Gore, Al
13. 68%
Uncommitted
6. 32%
Bradley, Bill