Python TypeError: 'NoneType' object is not callable

Asked: 2016-04-23 13:18:14

Tags: python csv web-scraping beautifulsoup

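I am trying to scrape a website and write the data to a CSV file (that part works). I am running into the following problems:

  1. The data in the CSV file is saved in rows rather than in columns.
  2. The website is paginated (1, 2, 3, 4, ... Next), and I cannot walk through all the pages to scrape the data. Data is scraped from the first page only.
  3. The error: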

    if last_link.startswith('Next'):
    TypeError: 'NoneType' object is not callable
    

Code:

    import requests
    import csv
    from bs4 import BeautifulSoup
    
    url = 'http://localhost:8088/wiki.html'
    
    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html)
    
    table = soup.find('table', {'class' : 'tab_operator'})
    
    list_of_rows = []
    for rows in table.findAll('tr'):
        list_of_cells = []
        for cell in rows.findAll('td'):
            list_of_links = []
            for links in cell.findAll('a'):
                text = links.text.replace(' ', '')
                list_of_links.append(text)
            list_of_rows.append(list_of_links)
    
    outfile = open('./outfile.csv', 'w')
    writer = csv.writer(outfile)
    writer.writerows(list_of_rows)
    
    try:
        last_link = soup.find('table', {'id' : 'str_nav'}).find_all('a')[-1]
        if last_link.startswith('Next'):
            next_url_parts = urllib.parse.urlparse(last_link['href'])
            url = urllib.parse.urlunparse((base_url_parts.scheme, base_url_parts.netloc, next_url_parts.path, next_url_parts.params, next_url_parts.query, next_url_parts.fragment))
    
    except ValueError:
        print("Oops! Try again...")
    

Sample HTML from the website:

    ### Numbers to scrape ###
    
    <table cellpadding="10" cellspacing="0" border="0" style="margin-top:20px;" class="tab_operator">
    <tbody><tr>
    <td valign="top">
    <a href="http://localhost:8088/wiki/9400000">9400000</a><br>
    <a href="http://localhost:8088/wiki/9400001">9400001</a><br>
    </td>
    </tr></tbody>
    </table>
    
    ###  Paging Sample Code: ###
    
    <div class="pstrnav" align="center">
    <table cellpadding="0" cellspacing="2" border="0" id="str_nav">
    <tbody>
    <tr>
    <td style="background-color:#f5f5f5;font-weight:bold;">1</td>
    <td><a href="http://localhost:8088/wiki/2">2</a></td>
    <td><a href="http://localhost:8088/wiki/3">3</a></td>
    <td><a href="http://localhost:8088/wiki/4">4</a></td>
    <td><a href="http://localhost:8088/wiki/2">Next &gt;&gt;</a></td>
    <td><a href="http://localhost:8088/wiki/100">Last</a></td>
    </tr>
    </tbody>
    </table>
    </div>
    

1 Answer:

Answer 0 (score: 0)

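last_link is a Tag object, not a string. BeautifulSoup treats any attribute name on a tag that is not an existing attribute or method as a search for a child tag with that name. Since there is no startswith tag inside the link, that search returns None, and you are then trying to call that None object: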

    >>> last_link = soup.find('table', {'id' : 'str_nav'}).find_all('a')[-1]
    >>> last_link
    <a href="http://localhost:8088/wiki/100">Last</a>
    >>> last_link.startswith is None
    True
    >>> last_link.startswith('Next')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'NoneType' object is not callable
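
To see the mechanism in isolation, here is a minimal self-contained sketch. The HTML string is a made-up stand-in for the paging table in the question, and `html.parser` is chosen only so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Made-up snippet modelled on the paging table from the question.
html = '<table id="str_nav"><tr><td><a href="http://localhost:8088/wiki/100">Last</a></td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

# Attribute access that names a real descendant tag returns that tag.
link = soup.table.a
# Attribute access that names no descendant returns None, not AttributeError.
missing = link.startswith

print(repr(link.get_text()))  # 'Last'
print(missing)                # None
print(callable(missing))      # False
```

Because the lookup quietly yields None, the failure only shows up at the call, which is why the message mentions NoneType rather than a missing attribute.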

You want to test the text the tag contains instead:

    if last_link.get_text(strip=True).startswith('Next'):

This uses the Tag.get_text() method to access all the text inside the link; it works correctly even if the link contains other markup (such as <b> or <i> tags).
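
As a rough illustration of that point, the sketch below runs get_text() on a link with nested markup. The link is hypothetical (the sample HTML in the question has no nested tags); it is only there to show the difference from `.string`.

```python
from bs4 import BeautifulSoup

# Hypothetical link with nested markup and stray whitespace.
html = '<a href="http://localhost:8088/wiki/2">  <b>Next</b> <i>&gt;&gt;</i> </a>'
link = BeautifulSoup(html, "html.parser").find("a")

# .string is None because the link has more than one child node.
direct = link.string

# get_text() gathers the text of all descendants; strip=True trims each
# piece and drops whitespace-only pieces before joining them.
label = link.get_text(strip=True)
is_next = label.startswith("Next")

print(direct)   # None
print(label)    # Next>>
print(is_next)  # True
```

Here `label` is an ordinary str, so startswith is the real string method and the call succeeds.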

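Note that in the sample HTML the final link is Last rather than Next, so the [-1] lookup would not find the Next link. You may want to search for the Next link directly instead:

    import re

    table = soup.select_one('table#str_nav')
    last_link = table.find('a', text=re.compile('^Next'))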

The regular expression specifies that an a tag matches only if its directly contained text starts with Next.
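
Putting this together, one possible way to walk through every page is to keep following the Next link until none is found. The sketch below is illustrative only: the `PAGES` dictionary stands in for real HTTP responses so the loop can run offline, its contents are invented, and `fetch` and `scrape_all` are hypothetical helper names. In a real scraper `fetch` would presumably wrap `requests.get(url).content`. It uses the `string=` argument, which newer BeautifulSoup releases prefer over the older `text=` spelling.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Invented stand-in for the site: URL -> HTML (assumption, not real data).
PAGES = {
    "http://localhost:8088/wiki/1": """
        <table class="tab_operator"><tr><td>
          <a href="/wiki/9400000">9400000</a><br>
          <a href="/wiki/9400001">9400001</a><br>
        </td></tr></table>
        <table id="str_nav"><tr>
          <td>1</td>
          <td><a href="/wiki/2">2</a></td>
          <td><a href="/wiki/2">Next &gt;&gt;</a></td>
          <td><a href="/wiki/2">Last</a></td>
        </tr></table>""",
    "http://localhost:8088/wiki/2": """
        <table class="tab_operator"><tr><td>
          <a href="/wiki/9400002">9400002</a><br>
        </td></tr></table>
        <table id="str_nav"><tr>
          <td><a href="/wiki/1">1</a></td>
          <td>2</td>
        </tr></table>""",
}


def fetch(url):
    """Hypothetical helper: return the HTML for url from the fake site."""
    return PAGES[url]


def scrape_all(start_url):
    """Collect link texts from every page by following the Next link."""
    numbers = []
    seen = set()  # guards against a navigation loop
    url = start_url
    while url and url not in seen:
        seen.add(url)
        soup = BeautifulSoup(fetch(url), "html.parser")
        table = soup.find("table", {"class": "tab_operator"})
        numbers.extend(a.get_text(strip=True) for a in table.find_all("a"))

        # Look for the Next link directly; stop when there is none.
        nav = soup.select_one("table#str_nav")
        next_link = nav.find("a", string=re.compile(r"^Next")) if nav else None
        url = urljoin(url, next_link["href"]) if next_link else None
    return numbers


numbers = scrape_all("http://localhost:8088/wiki/1")
print(numbers)  # ['9400000', '9400001', '9400002']
```

Resolving the href with urljoin handles both absolute and relative links, so the manual urlparse/urlunparse reassembly from the question is not needed in this sketch.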