Extracting data from sub-links

Time: 2017-01-15 15:28:02

Tags: python web-crawler

I'm new to Python and I'm trying to extract information from a web page (http://findanrd.eatright.org/listing/search?zipCode=page=1).

My code can collect all the links to the "information pages", but I cannot extract the information from those pages.

<div class="user-info-box clearfix">
<dl class="details-left">
<dl class="details-left">
<dl class="details-right">
<dd>26850 Providence Parkway, Suite 425</dd>
<dd>Novi, MI 48374</dd>
<dd>Email: info@aartibatavia.com</dd>
<dd>
Website:
<a href="http://www.aartibatavia.com/" target="_blank">www.aartibatavia.com/</a>
</dd>
</dl>

I want to extract the information above, such as the street address, the email address, and the website. My code looks like this:

import requests
from bs4 import BeautifulSoup

def nutrispider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://findanrd.eatright.org/listing/search?zipCode=&page=' + str(page)
        source_code = requests.get(url)
        text = source_code.text
        soup = BeautifulSoup(text)
        x = 0
        while x<=19:
            rows = soup.findAll('tr', {'data-index':x})
            for row in rows:
                link_elm = row.find('div', {'class':'search-address-list-address'}).a
                link = 'http://findanrd.eatright.org' + link_elm['href']

                users = soup.findAll('div', {'class': 'user-info-box clearfix'})
                for user in users:
                    information = user.find('dd')
                    text = information.get_Text()
                    print(text)
                print(link)
            x += 1
        page += 1

nutrispider(1)

There are currently no errors, but it only prints the links to the subpages where the information is located.

1 answer:

Answer 0: (score: 0)

import requests, bs4

url = 'http://findanrd.eatright.org/listing/search?zipCode=page=1'
r = requests.get(url)
soup = bs4.BeautifulSoup(r.text, 'lxml')

for tr in soup.table('tr'):
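    # str.strip('View details') would strip any of those characters from both
    # ends, not the literal text, so remove the trailing link text explicitly.
    address = tr.find(class_='search-address-list-address').get_text(strip=True)
    if address.endswith('View details'):
        address = address[:-len('View details')].strip()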
    name = tr.find(class_='search-address-list-name').get_text(strip=True)
    link = tr.p.a['href']
    print(name, address, link)

Output:

Aarti Batavia, MS  RD  IFMCP 26850 Providence Parkway, Suite 425Novi, MI 48374 http://maps.google.com/maps?saddr=&daddr=26850 Providence Parkway, Suite 425 Novi, MI 48374
Aarti Batavia, MS  RD  IFMCP 26850 Providence Parkway, Suite 425Novi, MI 48374 http://maps.google.com/maps?saddr=&daddr=26850 Providence Parkway, Suite 425 Novi, MI 48374
Abbey Carlson, RD 3935 N 75 WHyde Park, UT 84318 http://maps.google.com/maps?saddr=&daddr=3935 N 75 W Hyde Park, UT 84318
Abbi Kifer, MED  RDN  LD PO Box 120Mount Storm, WV 26739 http://maps.google.com/maps?saddr=&daddr=PO Box 120 Mount Storm, WV 26739
Abbie Scott, RD  LD Hy-Vee, Inc.3221 SE 14th StreetDes Moines, IA 50320 http://maps.google.com/maps?saddr=&daddr=3221 SE 14th Street Des Moines, IA 50320
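
The output above comes from the search-results table only. For the fields the question asks about (street, email, website), the parsing has to run on the HTML of each detail page. Here is a minimal sketch of that step, assuming every detail page contains the `user-info-box` block quoted in the question. `parse_details` and `SAMPLE_HTML` are hypothetical names introduced for illustration, and the sample markup is a trimmed copy of the question's snippet, not a live response.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup quoted in the question (closing </div> added).
# On the live site this would be the HTML of one detail page.
SAMPLE_HTML = """
<div class="user-info-box clearfix">
<dl class="details-right">
<dd>26850 Providence Parkway, Suite 425</dd>
<dd>Novi, MI 48374</dd>
<dd>Email: info@aartibatavia.com</dd>
<dd>
Website:
<a href="http://www.aartibatavia.com/" target="_blank">www.aartibatavia.com/</a>
</dd>
</dl>
</div>
"""


def parse_details(html):
    """Return address lines, email and website from one detail page (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    details = {'address': [], 'email': None, 'website': None}
    box = soup.find('div', class_='user-info-box')
    if box is None:
        # The block is absent, e.g. when the search-results page is passed in.
        return details
    for dd in box.find_all('dd'):
        text = dd.get_text(' ', strip=True)
        if text.startswith('Email:'):
            details['email'] = text[len('Email:'):].strip()
        elif text.startswith('Website:'):
            link = dd.find('a')
            details['website'] = link['href'] if link else None
        elif text:
            # Assumption: every other non-empty <dd> is an address line.
            details['address'].append(text)
    return details


print(parse_details(SAMPLE_HTML))
```

In the question's loop, this would be called as `parse_details(requests.get(link).text)` once `link` has been built, so that the search runs against the detail page rather than the listing page's `soup`. Note also that the Beautiful Soup method is `get_text()`; `get_Text()` would raise an error as soon as a matching element was found.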