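I've written some code in Python. My aim is to feed the links newly generated by the "web_parser" class to the "get_docs" class. However, I can't think of any effective way to do it. All I want is to bridge the two classes so that the "web_parser" class produces the links and the "get_docs" class processes them to produce the refined output. Any idea on how I can do this cleanly would be highly appreciated. Thanks in advance.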
from lxml import html
import requests

class web_parser:
    page_link = "https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=San%20Francisco%2C%20CA"
    main_url = "https://www.yellowpages.com"

    def __init__(self, link):
        self.link = link
        self.vault = []

    def parser(self):
        self.get_link(self.page_link)

    def get_link(self, url):
        page = requests.get(url)
        tree = html.fromstring(page.text)
        item_links = tree.xpath('//h2[@class="n"]/a[@class="business-name"][not(@itemprop="name")]/@href')
        for item_link in item_links:
            self.vault.append(self.main_url + item_link)

class get_docs(web_parser):
    def __init__(self, new_links):
        web_parser.__init__(self, link)
        self.new_links = [new_links]

    def procuring_links(self):
        for link in self.vault:
            self.using_links(link)

    def using_links(self, newly_created_link):
        page = requests.get(newly_created_link)
        tree = html.fromstring(page.text)
        name = tree.findtext('.//div[@class="sales-info"]/h1')
        phone = tree.findtext('.//p[@class="phone"]')
        print(name, phone)

if __name__ == '__main__':
    crawl = web_parser(web_parser.page_link)
    parse = get_docs(crawl)
    parse.parser()
    parse.procuring_links()
I know very little about creating classes, so please forgive my ignorance. When I run the code at this stage, I get this error:
web_parser.__init__(self, link)
NameError: name 'link' is not defined
Answer 0: (score: 1)
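I'm not quite sure how you want to use this: do you want to pass an argument to web_parser, or use the link hard-coded in the class?
Based on the calls you make in __main__, you could handle it as follows: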
class get_docs(object):
    def __init__(self, web_parser):
        self.vault = web_parser.vault

if __name__ == '__main__':
    crawl = web_parser()  # create an instance
    crawl.parser()
    parse = get_docs(crawl)  # give the instance to get_docs, or directly the vault with crawl.vault
    parse.procuring_links()  # execute get_docs processing
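You also need to correct the web_parser class:
You have to choose between the argument given at creation (link) and the hard-coded page_link, then adjust the parser() method so that it targets the right one.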
class web_parser:
    def __init__(self, link=''):
        self.link = link
        self.vault = []
        self.page_link = "https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=San%20Francisco%2C%20CA"
        self.main_url = "https://www.yellowpages.com"
Answer 1: (score: 1)
To fix the NameError you posted in the question, you need to add another parameter to the subclass's __init__ and pass something in for it.
class get_docs(web_parser):
    #def __init__(self, new_links):
    def __init__(self, link, new_links):
        web_parser.__init__(self, link)
        self.new_links = [new_links]
That said, web_parser doesn't seem to do anything with that data, so you could probably just remove it from the base class.
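As a rough illustration of the inheritance fix, the sketch below strips both classes down to their constructors. The names `WebParser` and `GetDocs` and the example arguments are assumptions, and `super().__init__(link)` is used as the Python 3 equivalent of `web_parser.__init__(self, link)`.

```python
class WebParser:
    def __init__(self, link):
        self.link = link
        self.vault = []


class GetDocs(WebParser):
    def __init__(self, link, new_links):
        # link is now a real parameter, so the name exists when it is
        # forwarded to the base class and the NameError goes away.
        super().__init__(link)
        self.new_links = [new_links]


# Made-up arguments, only to show the call shape.
docs = GetDocs("https://www.yellowpages.com", "placeholder")
```

Note that with this signature the call in `__main__` has to supply both arguments, for example `get_docs(web_parser.page_link, crawl)` instead of `get_docs(crawl)`.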