Web crawler not fetching all URLs

Asked: 2015-05-19 21:35:18

Tags: python-2.7 web-scraping request beautifulsoup web-crawler

So I wrote the following program to extract all the profile URLs from this search results page: https://stackoverflow.com/a/649673/816536

There are about 18,400+ links to extract.

However, when I run the code, it does not go beyond URL #1623 and simply stops without raising any error or exception.

Here is my code:

from bs4 import BeautifulSoup
import requests

url = 'https://www.ohiobar.org/Pages/Find-a-Lawyer.aspx?sFN=&sLN=&sPA=&sCI=&sST=OH&sZC='

with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30'}

    response = session.get(url)
    soup = BeautifulSoup(response.content, "lxml")

    for link in soup.select("div#content_findResults div#content_column1 ul li a[href*=MemberProfile]"):
        print 'https://www.ohiobar.org' + link.get("href")

Please suggest what I am doing wrong here.

Thanks

1 Answer:

Answer 0 (score: 1)

Since I can't comment, I'll add this as an answer. I tried running your code on Python 3.4, and this is what I got:

Good Results!

So you might just need to update your Python version.

I made one small change on this line:

soup = BeautifulSoup(response.content)

Code:

from bs4 import BeautifulSoup
import requests

url = 'https://www.ohiobar.org/Pages/Find-a-Lawyer.aspx?sFN=&sLN=&sPA=&sCI=&sST=OH&sZC='

with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30'}

    response = session.get(url)
    soup = BeautifulSoup(response.content)
    counter = 0

    for link in soup.select("div#content_findResults div#content_column1 ul li a[href*=MemberProfile]"):
        print(counter , ": " , 'https://www.ohiobar.org' , link.get("href"))
        counter += 1
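One thing worth checking, since the loop stops partway through a large result set: different BeautifulSoup parsers handle malformed HTML differently, and a strict parser can silently truncate the tree. A minimal sketch of that diagnostic, using a small hypothetical snippet that mimics the page structure (the snippet itself is an assumption, not the real page), is:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the results-page structure;
# the real page is much larger and may contain malformed HTML.
html = """
<div id="content_findResults">
  <div id="content_column1">
    <ul>
      <li><a href="/MemberProfile.aspx?id=1">A</a></li>
      <li><a href="/MemberProfile.aspx?id=2">B</a></li>
      <li><a href="/Other.aspx">C</a></li>
    </ul>
  </div>
</div>
"""

# Count selector matches; on the real page, comparing this count
# across parsers (e.g. "lxml" vs "html.parser") shows whether the
# parser is dropping part of the document.
soup = BeautifulSoup(html, "html.parser")
links = soup.select(
    "div#content_findResults div#content_column1 ul li a[href*=MemberProfile]")
print(len(links))  # 2 of the 3 anchors contain "MemberProfile"
```

If the count from the real page is far below the expected ~18,400, the problem is in parsing or in the page itself (e.g. pagination) rather than in the loop.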

Regards, Alex