Problem parsing with BeautifulSoup and requests

Date: 2017-01-28 17:03:32

Tags: python parsing python-requests bs4

I get an error when trying to use BeautifulSoup together with the requests module.

Running my script (vkp.py) produces the following error:

C:\Users\PANDEMIC\Desktop\Python-Test>vkp.py
Traceback (most recent call last):
  File "C:\Users\PANDEMIC\Desktop\Python-Test\vkp.py", line 23, in <module>
    main()
  File "C:\Users\PANDEMIC\Desktop\Python-Test\vkp.py", line 20, in main
    total_pages = get_total_pages(get_html)
  File "C:\Users\PANDEMIC\Desktop\Python-Test\vkp.py", line 13, in get_total_pages
    soup = BeautifulSoup(get_html, 'lxml')
  File "C:\Users\PANDEMIC\AppData\Local\Programs\Python\Python36-32\lib\site-packages\bs4\__init__.py", line 192, in __init__
    elif len(markup) <= 256 and (
TypeError: object of type 'function' has no len()

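3 answers:

Answer 0 (score: 0)

You forgot the () and the argument when calling get_html() in

total_pages = get_total_pages( get_html(base_url) )

By the way, you don't need the url = ... line inside get_html, because it would overwrite the argument you pass in on every call: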

def get_html(url):
    #url = ('https://m.vk.com/bageto?act=members&offset=0')
    r = requests.get(url)
    return r.text

Or you can use a default value:

def get_html(url='https://m.vk.com/bageto?act=members&offset=0'):
    r = requests.get(url)
    return r.text
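The difference between the two variants can be checked without any network access. The following is only a minimal sketch: the function names resolve_overwritten and resolve_with_default and the constant DEFAULT_URL are hypothetical and simply return the URL that would be requested.

```python
# Hypothetical constant holding the URL from the answer above.
DEFAULT_URL = 'https://m.vk.com/bageto?act=members&offset=0'


def resolve_overwritten(url):
    # Reassigning the parameter discards whatever the caller passed in.
    url = DEFAULT_URL
    return url


def resolve_with_default(url=DEFAULT_URL):
    # The default is used only when the caller passes no argument.
    return url


other = 'https://m.vk.com/bageto?act=members&offset=50'
print(resolve_overwritten(other))    # always DEFAULT_URL
print(resolve_with_default(other))   # the URL that was passed in
print(resolve_with_default())        # DEFAULT_URL
```

With the overwriting version every page request would go to offset=0, which is why the assignment is commented out in the answer.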

Full version, with base_url+"0" passed as the argument to get_html:

import requests
from bs4 import BeautifulSoup

def get_html(url):
    #url = ('https://m.vk.com/bageto?act=members&offset=0')
    r = requests.get(url)
    return r.text

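def get_total_pages(html):
    soup = BeautifulSoup(html, 'lxml')
    # href of the last pagination link, which points to the last page
    pages = soup.find('div', class_='pagination').find_all('a', class_='pg_link')[-1].get('href')
    # the offset value is the text after the second '=' in the href
    total_pages = pages.split('=')[2]
    return int(total_pages)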

def main():
    base_url = 'https://m.vk.com/bageto?act=members&offset='
    total_pages = get_total_pages(get_html(base_url+"0"))

    print(total_pages)

    for i in range(50, total_pages, 50):
        print(i)
        #print(base_url + str(i))

main()
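The split('=')[2] lookup above depends on the offset being the third '='-separated piece of the href. As a minimal sketch, assuming the last pagination link has the form /bageto?act=members&offset=150 (an illustrative value, not taken from the real page), the offset can also be read by parameter name with the standard library. The helper name offset_from_href is hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def offset_from_href(href):
    """Return the integer value of the 'offset' query parameter in href."""
    query = parse_qs(urlparse(href).query)
    # parse_qs maps each parameter name to a list of values; take the first.
    return int(query['offset'][0])


# Assumed example href; the real page may differ.
print(offset_from_href('/bageto?act=members&offset=150'))
```

Reading the parameter by name keeps working if the order of the query parameters changes, whereas the positional split would then pick up the wrong piece.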

Answer 1 (score: 0)

import requests
from bs4 import BeautifulSoup


def get_html(url):
    # use the url passed in by the caller instead of overwriting it
    r = requests.get(url)
    return r.text

def get_total_pages(html):
    soup = BeautifulSoup(html, 'lxml')
    pages = soup.find('div', class_='pagination').find_all('a', class_='pg_link')[-1].get('href')
    total_pages = pages.split('=')[2]
    return int(total_pages)

def main():
    base_url = 'https://m.vk.com/bageto?act=members&offset=0'
    # call get_html to obtain the page text, then pass that text on
    html = get_html(base_url)
    total_pages = get_total_pages(html)
    print(total_pages)

if __name__ == '__main__':
    main()

You should pass the HTML string to BeautifulSoup, not the function.
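The traceback in the question fails at len(markup) inside bs4, so the error can be reproduced without BeautifulSoup or a network connection. This is only a sketch: get_html here is a stand-in that returns a fixed string, and describe_len is a hypothetical helper.

```python
def get_html():
    """Stand-in for the real downloader; returns a fixed HTML string."""
    return "<div class='pagination'></div>"


def describe_len(markup):
    """Return len(markup), or the TypeError message if markup has no length."""
    try:
        return len(markup)
    except TypeError as exc:
        return str(exc)


print(describe_len(get_html))    # the function object itself: an error message
print(describe_len(get_html()))  # the string the function returns: a length
```

Without the parentheses the name get_html refers to the function object, which has no length. With them, the function runs and its returned string is what gets measured.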

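Answer 2 (score: 0)

# ThreadPool is assumed to come from multiprocessing.pool;
# get_html, get_total_pages and get_page_data are defined elsewhere in the script
from multiprocessing.pool import ThreadPool

def main():
    try:
        urll = []
        base_url = 'https://m.vk.com/bageto?act=members&offset='
        # fetch the first page (offset 0) to find the last offset
        total_pages = int(get_total_pages(get_html(base_url + '0')))
        # build one URL per step of 50
        for i in range(0, total_pages, 50):
            url_gen = base_url + str(i)
            urll.append(url_gen)
            #get_page_data(url_gen)
        # process the URLs in 8 worker threads
        pool = ThreadPool(8)
        results = pool.map(get_page_data, urll)

    except KeyboardInterrupt:
        print('You stopped the script yourself')

if __name__ == '__main__':
    main()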
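The URL generation and the thread pool can be tried in isolation. In this minimal sketch get_page_data is a hypothetical stand-in that only parses the offset out of the URL instead of downloading anything, and build_urls and fetch_all are illustrative helper names. The total of 120 is an assumed value.

```python
from multiprocessing.pool import ThreadPool


def build_urls(base_url, total, step=50):
    """Return one URL per offset 0, step, 2*step, ... below total."""
    return [base_url + str(offset) for offset in range(0, total, step)]


def get_page_data(url):
    """Hypothetical worker: return the offset parsed from the URL."""
    return int(url.rsplit('=', 1)[1])


def fetch_all(urls, workers=8):
    """Run get_page_data over urls in a thread pool, keeping input order."""
    with ThreadPool(workers) as pool:
        return pool.map(get_page_data, urls)


urls = build_urls('https://m.vk.com/bageto?act=members&offset=', 120)
print(fetch_all(urls))
```

pool.map returns the results in the same order as the input list, so each result can be matched to its URL by position.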