Problem using a decorator in a Python scraper

Asked: 2018-11-30 12:08:00

Tags: python python-3.x function web-scraping decorator

I've written a Python script to fetch the links to certain posts by following the base links on their landing page. If I stuck to the traditional approach, I could scrape the same things myself.

However, my goal is to use a decorator. It seems I'm close, but I'm having trouble passing the links from the function get_links() to get_target_link(). I used return func() in the function get_target_link() as a placeholder because I couldn't figure out how to pass the links along. There are print statements in the function get_links() (which work if uncommented) to make sure I'm on the right track.

How can I pass the links (return linklist) from get_links() to get_target_link() so that I can reuse them when necessary?
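For context, the shape being asked about can be sketched without any network calls. This is only an illustration, not the asker's code: the function names mirror the question, but the "/processed" suffix is a made-up stand-in for real per-link work.

```python
# Network-free sketch of the pattern in question: the decorator's inner
# function obtains the link list by calling the decorated function,
# instead of leaving a placeholder like `return func()`.
def get_links(func):
    def get_target_link(*args, **kwargs):
        links = func(*args, **kwargs)  # the wrapped function's return value
        return [link + "/processed" for link in links]  # reuse the links here
    return get_target_link

@get_links
def get_info():
    # stand-in for the real scraping logic
    return ["https://example.com/a", "https://example.com/b"]

print(get_info())
```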

Here is what I've tried so far:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://www.janglo.net/component/option,com_sobi2/"

def get_links(func):
    linklist = []
    res = requests.get(func())
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select(".sobi2ItemTitle a"):
        linklist.append(urljoin(url,item.get("href")))
    #print(linklist)
    return linklist

    def get_target_link():
        return func()  #All I need to do is fix this line
    return get_target_link

@get_links
def get_info():
    res = requests.get(url)
    soup = BeautifulSoup(res.text,"lxml")
    for items in soup.select("#sobi2CatListSymbols .sobi2SubcatsListItems a[title]"):
        if items.text=="Tutors":
            ilink = f"{urljoin(url,items.get('href'))}"
    return ilink

if __name__ == '__main__':
    for links in get_info():
        print(links)

Post Script: I would only like to stick with the logic I've tried to apply above.

Update for @Sir Andersson ("Can you explain how you want to re-use them if necessary?"):

def get_target_link():
    titles = []
    new_links =  func()
    for new_link in new_links:
        res = requests.get(new_link)
        soup = BeautifulSoup(res.text)
        titles.append(soup.select_one("h1").text)
    return titles
return get_target_link

I want to create the decorated function to behave like the following (per @Carlos Mermingas):

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://www.janglo.net/component/option,com_sobi2/"

def get_info(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    for items in soup.select("#sobi2CatListSymbols .sobi2SubcatsListItems a[title]"):
        if items.text=="Tutors":
            ilink = f"{urljoin(url,items.get('href'))}"
    return ilink

def get_links(tlink):
    linklist = []
    res = requests.get(tlink)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select(".sobi2ItemTitle a"):
        linklist.append(urljoin(url,item.get("href")))
    return linklist

def get_target_link(link):
    titles = []
    res = requests.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    titles.append(soup.select_one("h1").text)
    return titles

if __name__ == '__main__':
    item = get_info(url)
    for nlink in get_links(item):
        for ititle in get_target_link(nlink):
            print(ititle)

2 Answers:

Answer 0 (score: 0):

Long time no see; honestly, I stopped reading at your first Python mistake.

Let me fix it for you, and you tell me if I've missed anything. This is how the decorator pattern works in Python.

It's a bit strange at first, a bit like functions wrapping functions, but it's quite clever.

A decorator is a function that returns a function to be called in place of another function.

Let's imagine this function, without decoration:

>>> def some_func(number):
...      return f'Number is {number}'
...
>>> print(some_func(10))
Number is 10

To decorate this function, let's say we want to add fuzzing: something often used to add small delays here and there.

>>> def fuzz():
...     def fuzz_decorator(func):
...         def fuzz_wrapper(*args, **kwargs):
...             print('fuzz') # this is our added functionality
...             return func(*args, **kwargs) # call whatever we're decorating
...         return fuzz_wrapper
...     return fuzz_decorator
...
>>> @fuzz()
... def some_func(number):
...     return f'Number is {number}'
...
>>> print(some_func(10))
fuzz
Number is 10

fuzz() is a function that returns a function, fuzz_decorator(func), which accepts a function and in turn returns a new function that adds some functionality to func while still calling func itself. A bit of a mouthful.
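As a sanity check (an addition to the answer, not part of it), the @fuzz() syntax is just sugar for an ordinary function call and assignment:

```python
def fuzz():
    def fuzz_decorator(func):
        def fuzz_wrapper(*args, **kwargs):
            print('fuzz')                 # the added functionality
            return func(*args, **kwargs)  # call whatever we're decorating
        return fuzz_wrapper
    return fuzz_decorator

def some_func(number):
    return f'Number is {number}'

# Equivalent to defining some_func with @fuzz() above it:
some_func = fuzz()(some_func)
print(some_func(10))  # prints 'fuzz', then 'Number is 10'
```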

I hope this isn't confusing, but that's where you went wrong.

Answer 1 (score: 0):

It seems I was really close to what I wanted to achieve. The site in my post above throws a connection error, so I used stackoverflow.com instead. What my script does is collect all the links to the individual posts from its target page, and then fetch the title of each post from its inner page.

The following code is fully functional.

One issue I can't find any solution to: is there nothing productive I can do in the quoted block between the functions below?

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://stackoverflow.com/questions/tagged/web-scraping"

def get_links(func):

    def get_target_link(*args):
        titles = []
        for link in func(*args):
            res = requests.get(link)
            soup = BeautifulSoup(res.text,"lxml")
            title = soup.select_one("h1[itemprop='name'] a").text
            titles.append(title)
        return titles
    return get_target_link

@get_links
def get_info(*args):
    ilink = []
    res = requests.get(*args)
    soup = BeautifulSoup(res.text,"lxml")
    for items in soup.select(".summary .question-hyperlink"):
        ilink.append(urljoin(url,items.get('href')))
    return ilink

if __name__ == '__main__':
    for item in get_info(url):
        print(item)
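One way to keep the raw links reusable "when necessary", as the question asked, is functools.wraps, which exposes the undecorated function as __wrapped__. This is not part of the answer above, just a network-free sketch with dummy data in place of the requests/BeautifulSoup calls:

```python
import functools

def get_links(func):
    @functools.wraps(func)  # preserves the name and exposes __wrapped__
    def get_target_link(*args):
        # turn each collected link into a title, as in the answer above
        return [f"Title for {link}" for link in func(*args)]
    return get_target_link

@get_links
def get_info(url):
    # dummy stand-in for the real scraping of post links
    return [url + "/questions/1", url + "/questions/2"]

print(get_info("https://stackoverflow.com"))              # decorated: titles
print(get_info.__wrapped__("https://stackoverflow.com"))  # raw link list
```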