I've written a script in Python to get the links to certain posts by tracing the base links from their landing page. If I stuck to the traditional approach, I could scrape the same content myself. However, my goal is to do it using a decorator. It seems I'm close, but I'm having trouble passing the links from the function get_links() to get_target_link(). I used return func() as a placeholder within get_target_link() because I couldn't figure out how to pass the links along. There are print statements within get_links() (which work if uncommented) to make sure I'm on the right track. How can I pass return linklist from get_links() to get_target_link() so that the links can be reused when necessary?

This is what I've tried so far:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = "https://www.janglo.net/component/option,com_sobi2/"
def get_links(func):
    linklist = []
    res = requests.get(func())
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select(".sobi2ItemTitle a"):
        linklist.append(urljoin(url,item.get("href")))
    #print(linklist)
    return linklist

    def get_target_link():
        return func() #All I need to do is fix this line
    return get_target_link
@get_links
def get_info():
    res = requests.get(url)
    soup = BeautifulSoup(res.text,"lxml")
    for items in soup.select("#sobi2CatListSymbols .sobi2SubcatsListItems a[title]"):
        if items.text=="Tutors":
            ilink = f"{urljoin(url,items.get('href'))}"
    return ilink

if __name__ == '__main__':
    for links in get_info():
        print(links)
Post Script: I would only like to comply with the logic I've tried to apply above.

Update for @sir Andersson (Can you explain how you want to re-use them if necessary?):
def get_links(func):  # outer decorator, as in the attempt above
    def get_target_link():
        titles = []
        new_links = func()
        for new_link in new_links:
            res = requests.get(new_link)
            soup = BeautifulSoup(res.text,"lxml")
            titles.append(soup.select_one("h1").text)
        return titles
    return get_target_link
I want to create the decorated function so that it behaves like the following plain version, @Carlos Mermingas:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://www.janglo.net/component/option,com_sobi2/"

def get_info(link):
    res = requests.get(url)
    soup = BeautifulSoup(res.text,"lxml")
    for items in soup.select("#sobi2CatListSymbols .sobi2SubcatsListItems a[title]"):
        if items.text=="Tutors":
            ilink = f"{urljoin(url,items.get('href'))}"
    return ilink

def get_links(tlink):
    linklist = []
    res = requests.get(tlink)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select(".sobi2ItemTitle a"):
        linklist.append(urljoin(url,item.get("href")))
    return linklist

def get_target_link(link):
    titles = []
    res = requests.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    titles.append(soup.select_one("h1").text)
    return titles

if __name__ == '__main__':
    item = get_info(url)
    for nlink in get_links(item):
        for ititle in get_target_link(nlink):
            print(ititle)
Answer 0 (score: 0)
Quite a long post; to be honest, I stopped reading at your first Python error. Let me fix it for you, and you tell me if I've missed something. This is how the decorator pattern works in Python.
It's a bit strange at first, like shipping a ship inside a ship, but it's quite clever.
A decorator is a function that returns a function to be called in place of another function.

Let's imagine this function, without decoration:
>>> def some_func(number):
...     return f'Number is {number}'
...
>>> print(some_func(10))
Number is 10
To decorate this function, let's say we want to add fuzzing, something often used to add small delays here and there:
>>> def fuzz():
...     def fuzz_decorator(func):
...         def fuzz_wrapper(*args, **kwargs):
...             print('fuzz')  # this is our added functionality
...             return func(*args, **kwargs)  # call whatever we're decorating
...         return fuzz_wrapper
...     return fuzz_decorator
...
>>> @fuzz()
... def some_func(number):
...     return f'Number is {number}'
...
>>> print(some_func(10))
fuzz
Number is 10
fuzz() is a function that returns a decorator, fuzz_decorator(func), which accepts a function and returns a new function that adds a little functionality to func while still calling func itself.
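In other words (a minimal sketch of my own, not part of the original answer): applying @fuzz() is just shorthand for calling fuzz() and passing some_func to the decorator it returns.

>>> def some_func(number):
...     return f'Number is {number}'
...
>>> some_func = fuzz()(some_func)  # equivalent to decorating with @fuzz()
>>> print(some_func(10))
fuzz
Number is 10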
I hope this isn't confusing, but that's where your code went wrong.
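Applied to the question's code, the same pattern might look like this (a sketch of my own, reusing the url and CSS selectors from the question; not the only way to wire it up). The wrapper collects linklist itself and returns it to the caller, which is exactly the hand-off the question asks about:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://www.janglo.net/component/option,com_sobi2/"

def get_links(func):
    def get_target_link():
        linklist = []
        res = requests.get(func())  # func() returns the category link
        soup = BeautifulSoup(res.text, "lxml")
        for item in soup.select(".sobi2ItemTitle a"):
            linklist.append(urljoin(url, item.get("href")))
        return linklist  # the wrapper hands the collected links to the caller
    return get_target_link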
Answer 1 (score: 0)
It seems I was really close to what I wanted to achieve. The site in my post above throws a connection error, so I used stackoverflow.com instead.
What my script does is collect all the links to the individual posts from the target page, and then fetch the title of each post from its inner page.

One issue remains that I couldn't find any solution for: is there nothing productive I can do in the block between the two functions below? (See the sketch after the code.)

The following code is fully functional:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://stackoverflow.com/questions/tagged/web-scraping"

def get_links(func):

    def get_target_link(*args):
        titles = []
        for link in func(*args):
            res = requests.get(link)
            soup = BeautifulSoup(res.text,"lxml")
            title = soup.select_one("h1[itemprop='name'] a").text
            titles.append(title)
        return titles
    return get_target_link

@get_links
def get_info(*args):
    ilink = []
    res = requests.get(*args)
    soup = BeautifulSoup(res.text,"lxml")
    for items in soup.select(".summary .question-hyperlink"):
        ilink.append(urljoin(url,items.get('href')))
    return ilink

if __name__ == '__main__':
    for item in get_info(url):
        print(item)
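For what it's worth, here is a minimal sketch of my own (not part of the original post) of what that in-between block can usefully do: the body of get_links() outside get_target_link() runs exactly once, when @get_links is applied to get_info, so it is a natural place for one-time setup such as creating a shared requests.Session that every wrapped call then reuses:

import requests
from bs4 import BeautifulSoup

def get_links(func):
    # This block runs once, at decoration time, when @get_links is applied.
    # One-time setup can live here; the wrapper below closes over it.
    session = requests.Session()

    def get_target_link(*args):
        titles = []
        for link in func(*args):
            res = session.get(link)  # reuses the shared session for every request
            soup = BeautifulSoup(res.text, "lxml")
            titles.append(soup.select_one("h1[itemprop='name'] a").text)
        return titles
    return get_target_link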