Question

So I'm trying to write a program that spiders a website and searches each page for certain code snippets that are stored in a large file.

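To do this, I need to pull the HTML source from each page and then create an HTML object from it; I'm using BeautifulSoup for that. Initially I tried using this function to actually match the code against the HTML source: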

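import re

def textsearch(soup, exploit):
    # Look for a text node in the parsed document matching `exploit`,
    # which re.compile treats as a regular expression
    code = soup.find(text=re.compile(exploit))
    if code is None:
        print("Couldn't find the bad stuff!\n")
        return False
    else:
        print("Found the bad code!\n")
        return True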

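After reading the BS4 documentation, I realized this isn't going to work, so I started looking at parsers that would take in the "exploit" and parse it into text. The problem is that not all of the exploits have this format (some are JavaScript scripts). So is there any way I can treat the whole HTML source as one big "text document", using the exact characters shown in the source but with no formatting, and then just search for any matching sequence of characters?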

Is there a good module that can turn HTML source fetched from the web into an object like that?

Answer 1

I think this will help:

import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 0
    while page <= max_pages:
        url = 'http://www.indeed.co.in/jobs?q=tripura&start=' + str(page)
        source_code = requests.get(url)
        # The raw HTML source of the page as one plain string
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        # Each job title is an <h2 class="jobtitle"> wrapping a link
        for jobs in soup.find_all('h2', {'class': 'jobtitle'}):
            href = "http://www.indeed.co.in/" + jobs.a.get('href')
            title = jobs.a.string
            print(title)
            print(href)
        # Move to the next results offset once per page, not once per job
        page += 10

trade_spider(20)
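
For the plain-text matching the question asks about, one option is to skip parsing and search the raw source string (plain_text above) directly with Python's `in` operator. This is only a minimal sketch, assuming the snippets are already loaded into a list of strings. The names find_snippets and page_source and the sample snippets are hypothetical, and page_source stands in for requests.get(url).text:

```python
def find_snippets(source, snippets):
    """Return the snippets that occur verbatim in the page source."""
    # `in` does an exact substring match on the raw characters,
    # so tags, scripts and attributes are all searched as plain text.
    # Empty strings are skipped because "" is contained in every string.
    return [s for s in snippets if s and s in source]


# Hypothetical stand-in for the text returned by requests.get(url).text
page_source = '<html><body><script>eval(unescape("%61"))</script></body></html>'

# Hypothetical snippets; in the question these come from a large file
snippets = ['eval(unescape(', '<iframe src="http://bad.example/">']

found = find_snippets(page_source, snippets)
print(found)
```

If you would rather keep using regular expressions, wrapping each snippet in re.escape() before re.compile() makes characters such as ( and . match literally instead of being read as regex syntax.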

How do I search an HTML document as if it were plain text in Python?
