Scraping data from multiple web pages with Python and Beautiful Soup, using a .txt file that contains the URLs

Date: 2018-03-31 02:49:57

Tags: python web-scraping beautifulsoup valueerror

I have a .txt file that contains the full URLs of a number of pages, each of which contains a table I want to scrape data from. My code works for a single URL, but when I add a loop and read the URLs in from the .txt file, I get the following error:

raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: ?

Here is my code:

from urllib2 import urlopen
from bs4 import BeautifulSoup as soup

with open('urls.txt', 'r') as f:
urls = f.read()
for url in urls:

    uClient = urlopen(url)
    page_html = uClient.read()
    uClient.close()

    page_soup = soup(page_html, "html.parser")

    containers = page_soup.findAll("tr", {"class":"data"})


    for container in containers:
        unform_name = container.findAll("th", {"width":"30%"})
        name = unform_name[0].text.strip()

        unform_delegate = container.findAll("td", {"id":"y000"})
        delegate = unform_delegate[0].text.strip()

        print(name)
        print(delegate)

f.close()

I have checked my .txt file and all the entries look fine. They start with http and end with .html, and there are no apostrophes or quotation marks around them. Did I write the for loop incorrectly?

Using:

with open('urls.txt', 'r') as f:
    for url in f:
        print(url)

I get the following:

??http://www.thegreenpapers.com/PCC/AL-D.html

http://www.thegreenpapers.com/PCC/AL-R.html

http://www.thegreenpapers.com/PCC/AK-D.html

and so on for 100 lines. Only the first line has the question marks. My .txt file contains those URLs, with only the state and party abbreviations changing.
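The leading ?? makes me suspect the file starts with a UTF-8 byte-order mark (BOM) that urlopen chokes on. A minimal sketch of reading the file with the BOM stripped (assuming Python 2 and that urls.txt is UTF-8 encoded):

import io

# 'utf-8-sig' decodes UTF-8 and silently strips a leading BOM if present
with io.open('urls.txt', 'r', encoding='utf-8-sig') as f:
    urls = [line.strip() for line in f]

print(urls[0])   # first URL, with no ?? prefix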

2 answers:

Answer 0 (score: 2)

What you have tried can be fixed by changing just two lines in your code.

Try this:

with open('urls.txt', 'r') as f:
    urls = f.readlines()   #make sure this line is properly indented.
for url in urls:
    uClient = urlopen(url.strip())
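Put together, the corrected reading loop might look like this (a sketch, assuming the rest of the original code stays the same):

from urllib2 import urlopen
from bs4 import BeautifulSoup as soup

with open('urls.txt', 'r') as f:
    urls = f.readlines()   # one URL per list element, trailing newline included

for url in urls:
    uClient = urlopen(url.strip())   # strip() removes the trailing newline
    page_html = uClient.read()
    uClient.close()
    page_soup = soup(page_html, "html.parser")

readlines() gives you one URL per iteration, and strip() removes the trailing newline that would otherwise be passed to urlopen().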

Answer 1 (score: 1)

You can't read the whole file into a single string with 'f.read()' and then iterate over that string. To fix it, see the changes below. I also removed your last line: when you use a 'with' statement, it closes the file for you when the block completes.
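To see why, note that iterating over a string yields one character at a time, so urlopen() is handed single characters such as '?' or 'h' rather than complete URLs. A quick illustration (the string below is a hypothetical stand-in for what f.read() returns):

contents = "http://example.com/a.html\nhttp://example.com/b.html"
for url in contents:
    print(url)   # prints 'h', 't', 't', 'p', ... one character per iteration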

The code below from Greg Hewgill (for Python 2) shows whether the url string is a 'str' or 'unicode'.

from urllib2 import urlopen
from bs4 import BeautifulSoup as soup

# Code from Greg Hewgill
def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"

with open('urls.txt', 'r') as f:
    for url in f:
        print(url)
        whatisthis(url)
        uClient = urlopen(url)
        page_html = uClient.read()
        uClient.close()

        page_soup = soup(page_html, "html.parser")

        containers = page_soup.findAll("tr", {"class":"data"})

        for container in containers:
            unform_name = container.findAll("th", {"width":"30%"})
            name = unform_name[0].text.strip()

            unform_delegate = container.findAll("td", {"id":"y000"})
            delegate = unform_delegate[0].text.strip()

            print(name)
            print(delegate)

Running the code with a text file containing the URLs listed above produces the following output:

http://www.thegreenpapers.com/PCC/AL-D.html

ordinary string
Gore, Al
54.   84%
Uncommitted
10.   16%
LaRouche, Lyndon

http://www.thegreenpapers.com/PCC/AL-R.html

ordinary string
Bush, George W.
44.  100%
Keyes, Alan

Uncommitted

http://www.thegreenpapers.com/PCC/AK-D.html
ordinary string
Gore, Al
13.   68%
Uncommitted
6.   32%
Bradley, Bill