Why can't I scrape this link in Python?

Time: 2015-08-18 13:17:19

Tags: python beautifulsoup web-crawler

I am trying to scrape the content of a web page, but I don't understand why I get this error: http.client.IncompleteRead: IncompleteRead(2268 bytes read, 612 more expected)

This is the link I am trying to scrape: www.rc2.vd.ch

Here is the Python code I use to scrape it:

import requests
from bs4 import BeautifulSoup
def spider_list():
    url = 'http://www.rc2.vd.ch/registres/hrcintapp-pub/companySearch.action?lang=FR&init=false&advancedMode=false&printMode=false&ofpCriteria=N&actualDate=18.08.2015&rowMin=0&rowMax=0&listSize=0&go=none&showHeader=false&companyName=&companyNameSearchType=CONTAIN&companyOfsUid=&companyOfrcId13Part1=&companyOfrcId13Part2=&companyOfrcId13Part3=&limitResultCompanyActive=ACTIVE&searchRows=51&resultFormat=STD_COMP_NAME&display=Rechercher#result'

    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')

    for link in soup.findAll('a', {'class': 'hoverable'}):
        print(link)

spider_list()

I tried other website links and they work fine, so why can't I scrape this one?

If it cannot be done with this code, how should I do it instead?

------------ EDIT ------------

Here is the full error message:

http.client.IncompleteRead: IncompleteRead(2268 bytes read, 612 more expected)

2 answers:

Answer 0 (score: 2)

It may be a problem with your editor.

I get the correct result when I run your code under Python 3.

A screenshot is attached below for reference:

[screenshot]

The only other thing I can think of is to bypass the error somehow:

import requests
from bs4 import BeautifulSoup
def spider_list():
    url = 'http://www.rc2.vd.ch/registres/hrcintapp-pub/companySearch.action?lang=FR&init=false&advancedMode=false&printMode=false&ofpCriteria=N&actualDate=18.08.2015&rowMin=0&rowMax=0&listSize=0&go=none&showHeader=false&companyName=&companyNameSearchType=CONTAIN&companyOfsUid=&companyOfrcId13Part1=&companyOfrcId13Part2=&companyOfrcId13Part3=&limitResultCompanyActive=ACTIVE&searchRows=51&resultFormat=STD_COMP_NAME&display=Rechercher#result'
    try:
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')

        for link in soup.findAll('a', {'class': 'hoverable'}):
            print(link)
    except Exception:
        # I am passing here, but do whatever you want in case of an error
        pass
spider_list()
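A narrower variant of the bypass above would catch only the incomplete-read error and keep whatever bytes did arrive, instead of swallowing every exception. The following is a minimal sketch using only the standard library, and it rests on assumptions: that the failure surfaces as `http.client.IncompleteRead` (as in the error line quoted in the question; with `requests` it may be wrapped in a different exception), and that the partial body is still useful to parse. The helper names `read_tolerant` and `fetch` are hypothetical.

```python
import http.client
import urllib.request


def read_tolerant(response):
    """Read a response body, keeping the partial bytes on IncompleteRead.

    Returns (body_bytes, complete), where complete is False if the
    connection ended before the announced length was received.
    """
    try:
        return response.read(), True
    except http.client.IncompleteRead as exc:
        # exc.partial holds the bytes received before the body was cut short.
        return exc.partial, False


def fetch(url, timeout=30):
    """Fetch url with urllib and return (body_bytes, complete)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return read_tolerant(response)
```

Calling `fetch(url)` with the URL from the question would then return the truncated HTML together with a `False` flag rather than raising, and the bytes can be decoded and passed to `BeautifulSoup` as before. Whether the truncated page still contains the links you need depends on where the server cut the response.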

Let me know if this helps.

Answer 1 (score: 1)

How about this!

import requests
from lxml import html
def spider_list():
    url = 'https://www.rc2.vd.ch/registres/hrcintapp-pub/companySearch.action?lang=FR&init=false&advancedMode=false&printMode=false&ofpCriteria=N&actualDate=18.08.2015&rowMin=0&rowMax=0&listSize=0&go=none&showHeader=false&companyName=&companyNameSearchType=CONTAIN&companyOfsUid=&companyOfrcId13Part1=&companyOfrcId13Part2=&companyOfrcId13Part3=&limitResultCompanyActive=ACTIVE&searchRows=51&resultFormat=STD_COMP_NAME&display=Rechercher#result'
    code = requests.get(url)
    tree = html.fromstring(code.text)
    skim = tree.xpath('//a[@class="hoverable"]/@href')
    print(skim)
spider_list()
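To check what this kind of selection returns without touching the network, here is a dependency-free sketch that uses the standard library `html.parser` instead of `lxml`. The sample markup is made up for illustration and is not taken from the real site, and the names `HoverableLinkParser` and `extract_hoverable_hrefs` are hypothetical. One difference to keep in mind: the XPath `@class="hoverable"` compares the whole attribute value, whereas this sketch (like BeautifulSoup's class matching) accepts `hoverable` as one of several classes.

```python
from html.parser import HTMLParser


class HoverableLinkParser(HTMLParser):
    """Collect href values of <a> tags whose class list contains 'hoverable'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # A class attribute may hold several space-separated class names.
        classes = (attr_map.get("class") or "").split()
        if "hoverable" in classes and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])


def extract_hoverable_hrefs(html_text):
    """Return the hrefs of all 'hoverable' links found in html_text."""
    parser = HoverableLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


# Hypothetical sample markup, not taken from the real site.
SAMPLE_HTML = """
<table>
  <tr><td><a class="hoverable" href="/company/1">Company One</a></td></tr>
  <tr><td><a class="other" href="/ignored">Ignored</a></td></tr>
  <tr><td><a class="hoverable extra" href="/company/2">Company Two</a></td></tr>
</table>
"""

if __name__ == "__main__":
    print(extract_hoverable_hrefs(SAMPLE_HTML))
```

The same function can be fed the text of a real response once the download itself succeeds.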