My web-scraping code with BeautifulSoup doesn't go past the first page

Asked: 2019-04-22 13:29:42

Tags: python web-scraping beautifulsoup

It doesn't seem to go beyond the first page. What's wrong? Also, it doesn't report the correct number of occurrences of the word I'm searching for in the link: it shows five outputs, with five given as the occurrence count.

import requests
from bs4 import BeautifulSoup

for i in range (1,5):

    url = 'https://www.nairaland.com/search/ipob/0/0/0/{}'.format(i)
    the_word = 'is' 
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content, 'lxml')
    words = soup.find(text=lambda text: text and the_word in text) 
    print(words) 
    count =  len(words)
    print('\nUrl: {}\ncontains {} occurrences of word: {}'.format(url, count, the_word))

4 Answers:

Answer 0 (score: 1):

If you want to go through the first six pages, change the loop range (a sketch of the full corrected loop follows after these snippets):

for i in range (6):   # the first page is addressed at index `0`

or:

for i in range (0,6):

instead of:

for i in range (1,5):    # this will start from the second page, since the second page is indexed at `1`
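
For reference, here is a minimal sketch of the whole corrected loop, assuming the same URL pattern and search word as in the question (the separate issue that len() of the matched text is not a real occurrence count is addressed in the other answers):

import requests
from bs4 import BeautifulSoup

the_word = 'is'
for i in range(6):  # pages are indexed 0..5
    url = 'https://www.nairaland.com/search/ipob/0/0/0/{}'.format(i)
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content, 'lxml')
    # find() returns only the first text node containing the word on the page
    first_match = soup.find(text=lambda text: text and the_word in text)
    print('\nUrl: {}\nfirst match containing "{}": {}'.format(url, the_word, first_match))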

Answer 1 (score: 0):

This works fine for me:

import requests
from bs4 import BeautifulSoup

if __name__ == "__main__":

    # adjust the range: (0, 6) covers pages 0 through 5 (six pages, counting from 0)
    # or use (0, 5) to cover pages 0 through 4 (five pages in total)
    for i in range(0, 6): # range(0, 4)

        url = 'https://www.nairaland.com/search/ipob/0/0/0/{}'.format(i)
        print(url, "url")
        the_word = 'is'
        r = requests.get(url, allow_redirects=False)
        soup = BeautifulSoup(r.content, 'lxml')
        words = soup.find(text=lambda text: text and the_word in text)
        print(words)
        count =  len(words)
        print('\nUrl: {}\ncontains {} occurrences of word: {}'.format(url, count, the_word))

Here is the output:

https://www.nairaland.com/search/ipob/0/0/0/0 url
 is somewhere in Europe sending semi nude video on the internet.Are you proud of such groups with such leader?

Url: https://www.nairaland.com/search/ipob/0/0/0/0
contains 110 occurrences of word: is
https://www.nairaland.com/search/ipob/0/0/0/1 url
Notre is a French word; means 'Our"...and Dame means "Lady" So Notre Dame means Our Lady.

Url: https://www.nairaland.com/search/ipob/0/0/0/1
contains 89 occurrences of word: is
https://www.nairaland.com/search/ipob/0/0/0/2 url
How does all this uselessness Help Foolish 

Url: https://www.nairaland.com/search/ipob/0/0/0/2
contains 43 occurrences of word: is
https://www.nairaland.com/search/ipob/0/0/0/3 url
Dumb fuckers everywhere. I thought I was finally going to meet someone that has juju and can show me. Instead I got a hopeless broke buffoon that loves boasting online. Nairaland I apologize on the behalf of this waste of space and time. He is not even worth half of the data I have spent writing this post. 

Url: https://www.nairaland.com/search/ipob/0/0/0/3
contains 308 occurrences of word: is
https://www.nairaland.com/search/ipob/0/0/0/4 url
People like FFK, Reno, Fayose etc have not been touched, it is an unknown prophet that hasn't said anything against the FG that you expect the FG to waste its time on. 

Url: https://www.nairaland.com/search/ipob/0/0/0/4
contains 168 occurrences of word: is
https://www.nairaland.com/search/ipob/0/0/0/5 url
 children send them to prison

Url: https://www.nairaland.com/search/ipob/0/0/0/5
contains 29 occurrences of word: is

Process finished with exit code 0

Answer 2 (score: 0):

Try:

import requests
from bs4 import BeautifulSoup 

for i in range(6):
    url = 'https://www.nairaland.com/search/ipob/0/0/0/{}'.format(i)
    the_word = 'afonja' 
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content, 'lxml')
    words = soup.find(text=lambda text: text and the_word in text) 
    print(words)
    count = 0
    if words:
        count = len(words)
    print('\nUrl: {}\ncontains {} occurrences of word: {}'.format(url, count, the_word))

Edit after the new specification:

Assuming the word you want to count is the same one that appears in the URL, you can note that the word is highlighted on the page and can be identified in the HTML by span class=highlight.

So you can use this code:

import requests
from bs4 import BeautifulSoup 

for i in range(6):
    url = 'https://www.nairaland.com/search/afonja/0/0/0/{}'.format(i)
    the_word = 'afonja' 
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content, 'lxml')
    count = len(soup.find_all('span', {'class':'highlight'})) 
    print('\nUrl: {}\ncontains {} occurrences of word: {}'.format(url, count, the_word))

You get:

Url: https://www.nairaland.com/search/afonja/0/0/0/0
contains 30 occurrences of word: afonja

Url: https://www.nairaland.com/search/afonja/0/0/0/1
contains 31 occurrences of word: afonja

Url: https://www.nairaland.com/search/afonja/0/0/0/2
contains 36 occurrences of word: afonja

Url: https://www.nairaland.com/search/afonja/0/0/0/3
contains 30 occurrences of word: afonja

Url: https://www.nairaland.com/search/afonja/0/0/0/4
contains 45 occurrences of word: afonja

Url: https://www.nairaland.com/search/afonja/0/0/0/5
contains 50 occurrences of word: afonja

Answer 3 (score: 0):

By the way, the search term has its own class name, so you can simply count those elements. The following correctly returns 0 for pages where the term is not found. You can use this approach in a loop.

import requests 
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.nairaland.com/search?q=afonja&board=0&topicsonly=2')
soup = bs(r.content, 'lxml')
occurrences = len(soup.select('.highlight'))
print(occurrences)

import requests 
from bs4 import BeautifulSoup as bs

for i in range(9):
    r = requests.get('https://www.nairaland.com/search/afonja/0/0/0/{}'.format(i))
    soup = bs(r.content, 'lxml')
    occurrences = len(soup.select('.highlight'))
    print(occurrences)
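
As a small extension beyond the original answer, the per-page counts from that loop could also be summed for a rough total across all the pages scanned; this is only a sketch reusing the same .highlight selector:

import requests
from bs4 import BeautifulSoup as bs

total = 0
for i in range(9):
    r = requests.get('https://www.nairaland.com/search/afonja/0/0/0/{}'.format(i))
    soup = bs(r.content, 'lxml')
    occurrences = len(soup.select('.highlight'))  # one span per highlighted hit
    total += occurrences
    print('page {}: {} occurrences'.format(i, occurrences))

print('total across pages: {}'.format(total))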