Scraping and navigating to links for more info

Date: 2014-04-01 19:33:36

Tags: python csv beautifulsoup

Not sure if what I'm trying to do is possible... but here goes. I'm trying to navigate and scrape information from this table (simplified)...

> <tr class="transaction odd" id="transaction_7"><td><a href="/show_customer/11111">Erin</a></td></tr>
> <tr class="transaction even" id="transaction_6"><td><a href="/show_customer/2222">Jack</a></td></tr>
> <tr class="transaction odd" id="transaction_5"><td><a href="/show_customer/3333">Carl</a></td></tr>
> <tr class="transaction even" id="transaction_4"><td><a href="/show_customer/4444">Kelly</a></td></tr>

Here's the code I'm using to scrape the table and output it to a csv... which works fine.

columns = ["User Name", "Source", "Staff", "Location", "Attended On", "Used", "Date"]
table = []

# table_1 is the BeautifulSoup tag for the transactions table, parsed earlier
for row in table_1.find_all('tr'):
    tds = row.find_all('td')
    try:
        data = [td.get_text() for td in tds]
        for field, value in zip(columns, data):
            print("{}: {}".format(field, value))
        table.append(data)
    except:
        print("Bad string value")


import csv

with open("myfile.csv", "wb") as outf:  # "wb" is the Python 2 csv idiom
    outcsv = csv.writer(outf)
    # header row
    outcsv.writerow(columns)
    # data
    outcsv.writerows(table)

What I need to do is navigate to each link in the table

<a href="/show_customer/11111">Erin</a>

and grab the customer's email address, which lives in this bit of html

<div class="field">
  <div class="label">Email</div>
  <p>XXXX@email.com</p>
</div>

and add it to the relevant row in my csv.

Any help is greatly appreciated!

1 Answer:

Answer 0: (score: 1)

You have to make an http request for the href in each td. Here's how you can modify your existing code:

from urllib2 import urlopen
from bs4 import BeautifulSoup

for row in table_1.find_all('tr'):
    tds = row.find_all('td')
    # Get all the hrefs in this row to make http requests for
    # (find_all returns a ResultSet, so pull the href off each <a>)
    links = [a.get('href') for a in row.find_all('a')]
    try:
        data = [td.get_text() for td in tds]
        for field, value in zip(columns, data):
            print("{}: {}".format(field, value))
        # For every href make a request, get the page,
        # create a BS object
        for link in links:
            # NOTE: these hrefs are relative, so they must be joined
            # with the site's base url first -- see the note below
            link_soup = BeautifulSoup(urlopen(link))

            # Use the link_soup BS instance to get the email
            # by navigating the div and p, and add it to your data
            # (a sketch of that lookup follows below)

        table.append(data)
    except:
        print("Bad string value")

Note that your hrefs are relative to the site's url. So after you extract an href, you have to prepend the site's base url to it to form a valid url.
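For example, with urljoin from the standard library (the base url below is a made-up placeholder; substitute the site's actual address):

from urlparse import urljoin  # Python 2; use urllib.parse.urljoin on Python 3

BASE_URL = "http://www.example.com"  # placeholder for the real site

# urljoin handles the leading slash in the relative href correctly
full_url = urljoin(BASE_URL, "/show_customer/11111")
# full_url is now "http://www.example.com/show_customer/11111"
link_soup = BeautifulSoup(urlopen(full_url))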