Question

I have this Python code, but it searches the actual page content rather than the page's source.

import requests
from bs4 import BeautifulSoup

def count_words(url, the_word):
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content, 'lxml')
    words = soup.find(text=lambda text: text and the_word in text)
    print(words)
    return len(words)


def main():
    url = 'google.com'
    word = 'google'
    count = count_words(url, word)
    print('\nUrl: {}\ncontains {} occurrences of word: {}'.format(url, count, word))

if __name__ == '__main__':
    main()

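How can I make it search the page's source instead?

I don't want to count occurrences. Yes, I know I have to remove the count {} part. But how can I make it load a list of websites from a text file and, if it finds word x, print "X found on this website"?

Any help is appreciated!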

Answer 1

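If you want to search the source code for occurrences of a substring, you don't need BeautifulSoup. It only lets you parse and search the actual page content (such as its text nodes), not the raw source.

Replace count_words() with the following code.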

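def count_words(url, the_word):
    # Fetch the raw page source as a string; the URL must include a scheme.
    source = requests.get(url).text
    # Count non-overlapping, case-sensitive occurrences of the substring.
    return source.count(the_word)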

Output (do NOT include this in the final code):
>>> count_words('https://google.com', 'Google')
8

You only need requests to fetch the page source as a string, and then .count() to count the occurrences of the substring.

Also, make sure the URL includes a scheme (for example http or https). Otherwise, requests will reject the URL and raise a MissingSchema error.
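For the second part of the question (loading a list of sites from a text file and reporting where the word is found), one possible sketch follows. It is not part of the original answer. The helper names (normalize_url, read_urls, fetch_source, find_word_in_sites), the file name sites.txt, the one-URL-per-line file layout, the https default, and the 10-second timeout are all assumptions made for illustration.

```python
def normalize_url(url):
    """Prepend a scheme when the line has none (assumption: default to https)."""
    if not url.startswith(('http://', 'https://')):
        return 'https://' + url
    return url


def read_urls(path):
    """Return the non-empty lines of a text file, assuming one URL per line."""
    with open(path, encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]


def fetch_source(url):
    """Download the raw page source as a string."""
    # Imported here so the other helpers work even without requests installed.
    import requests
    return requests.get(url, timeout=10).text


def find_word_in_sites(urls, the_word, fetch=fetch_source):
    """Print a message for each site whose source contains the_word.

    Returns the list of matching URLs. The match is case-sensitive.
    """
    found = []
    for url in urls:
        url = normalize_url(url)
        try:
            source = fetch(url)
        except OSError as exc:
            # requests' exceptions derive from OSError, so connection
            # failures and timeouts land here and the site is skipped.
            print('Could not fetch {}: {}'.format(url, exc))
            continue
        if the_word in source:
            print('{} found on {}'.format(the_word, url))
            found.append(url)
    return found


# Example usage (sites.txt is a hypothetical file with one URL per line):
# find_word_in_sites(read_urls('sites.txt'), 'google')
```

The fetch parameter exists only so the search logic can be tried with canned pages instead of live requests. If the match should ignore case, comparing the_word.lower() against source.lower() is one option.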

Scrape a website's source and search for a word
