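I created a simple desktop application, without Flask, that scrapes multiple websites and returns their titles. I am now trying to convert this desktop application into a web application using Flask. Below I have two HTML pages (one for pasting in a document whose links the program picks out, and one for the resulting titles). The problem is that the function that fetches the titles gives me errors, generic 404 pages, and bad-request errors for most of the websites, even though the same function returns the correct title for every site when it is run on its own. I tried running the title function (quickMLA) from a separate, connected Python script. I also tried running the whole program both inside and outside a virtual environment, but none of these attempts solved the problem. Any insight into what is going on would be great! If you want to test the actual web app, copy and paste the following list of websites into it: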
https://covid19tracker.ca/provincevac.html?p=ON
https://www.aboutamazon.com/news/company-news/amazons-covid-19-blog-updates-on-how-were-responding-to-the-crisis#covid-latest
https://www.bcg.com/en-us/publications/2021/advantages-of-remote-work-flexibility
https://news.prudential.com/increasingly-workers-expect-pandemic-workplace-adaptations-to-stick.htm
https://www.mckinsey.com/featured-insights/future-of-work/whats-next-for-remote-work-an-analysis-of-2000-tasks-800-jobs-and-nine-countries
https://www.gsb.stanford.edu/faculty-research/publications/does-working-home-work-evidence-chinese-experiment
https://www.livecareer.com/resources/careers/planning/is-remote-work-here-to-stay
https://www.bcg.com/en-us/publications/2021/advantages-of-remote-work-flexibility
Here is the function running on its own:
import requests
import bs4 as bs

lst = [
    "https://covid19tracker.ca/provincevac.html?p=ON",
    "https://www.ontario.ca/page/reopening-ontario#foot-1",
    "https://blog.twitter.com/en_us/topics/company/2020/keeping-our-employees-and-partners-safe-during-coronavirus.html",
    "https://www.aboutamazon.com/news/company-news/amazons-covid-19-blog-updates-on-how-were-responding-to-the-crisis#covid-latest",
    "https://www.bcg.com/en-us/publications/2021/advantages-of-remote-work-flexibility",
    "https://news.prudential.com/increasingly-workers-expect-pandemic-workplace-adaptations-to-stick.htm",
    "https://www.mckinsey.com/featured-insights/future-of-work/whats-next-for-remote-work-an-analysis-of-2000-tasks-800-jobs-and-nine-countries",
    "https://www.gsb.stanford.edu/faculty-research/publications/does-working-home-work-evidence-chinese-experiment",
    "https://www.livecareer.com/resources/careers/planning/is-remote-work-here-to-stay",
]

def quickMLA(lst):
    cited_lst = []
    for websites in range(len(lst)):
        url=lst[websites]
        headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36"}
        source = requests.get(url,headers=headers, timeout=10).content
        soup_page = bs.BeautifulSoup(source,'html.parser')
        title = soup_page.find("title").get_text().strip()
        print(title)

quickMLA(lst)
Here is the complete Python code for the Flask project, which produces the bad-request errors when run:
from flask import Flask,redirect, url_for,render_template,request,session
import requests
import bs4 as bs

app = Flask(__name__)
app.secret_key = "sourcerer"

@app.route("/", methods=["POST","GET"])
def home():
    return render_template("index.html")

@app.route("/cited", methods=["POST"])
def cited():
    doc = request.form['the_document']
    lst = identifyUrls(doc)
    info = quickMLA(lst)
    return render_template("results.html",info=info)

def identifyUrls(doc):
    temp = -1
    website_urls = []
    for chars in range(len(doc)):
        if doc[chars:chars+6] == "https:":
            temp = chars
        if (doc[chars] == " " or doc[chars] == "\n" or chars+1 == len(doc)) and temp != -1:
            website_urls.append(doc[temp:chars])
            temp = -1
    return website_urls

def quickMLA(lst):
    cited_lst = []
    for websites in range(len(lst)):
        url=lst[websites]
        headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36"}
        source = requests.get(url,headers=headers, timeout=10).content
        soup_page = bs.BeautifulSoup(source,'html.parser')
        title = soup_page.find("title").get_text().strip()
        cited_lst.append(title)
    return cited_lst

if __name__ == "__main__":
    app.run(debug=True)
HTML home page (CSS omitted):
<!DOCTYPE html>
<html lang="en">
  <head>
    <link rel="stylesheet"/>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width,initial-scale=1.0" />
    <meta name="description" content="..." />
    <title>Sourcerer</title>
  </head>
  <body>
    <section>
      <div class="header">
        <h2>The Sourcerer</h2>
        <p>Wizard of Citations</p>
      </div>
    </section>
    <div class="main">
      <div class="position">
        <form action="{{url_for('cited')}}" method="POST">
          <textarea class="beg" placeholder="Copy and Paste Document" name="the_document"></textarea>
          <div class='main_button'>
            <input class="c" value="Cite It" type="submit" name="citeme">
          </div>
        </form>
      </div>
    </div>
  </body>
</html>
HTML results page (CSS omitted):
<!DOCTYPE html>
<html lang="en">
  <head>
    <link rel="stylesheet"/>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width,initial-scale=1.0"/>
    <meta name="description" content="..." />
    <title>Sourcerer</title>
  </head>
  <body>
    <section>
      <div class="header">
        <h2>The Sourcerer</h2>
        <p>Wizard of Citations</p>
      </div>
    </section>
    <div class="main">
      <form action="{{url_for('home')}}" method="POST">
        <textarea class="results">
{% for cites in info%}
{{cites}}
{%endfor%}
</textarea>
        <div class="main">
          <input class="c" value="Cite Another" type="submit" name="another">
        </div>
      </form>
    </div>
  </body>
</html>
If you know why this is happening, any help would be greatly appreciated!
Answer 0 (score: 0)
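Here is a debugging suggestion. You have observed that some URLs give you problems, but this may help pin down what the problem actually is. Try taking BeautifulSoup out entirely and doing something like this: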
import requests

def quickMLA(lst):
    # Temporary diagnostic version: record each URL and its HTTP status (or the exception raised).
    cited_lst = []
    headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36"}
    for url in lst:
        try:
            cited_lst.append(url)
            response = requests.get(url,headers=headers, timeout=10)
            cited_lst.append(str(response.status_code))
        except Exception as ex:
            cited_lst.append(repr(ex))
    return cited_lst
This will show why some of the URLs are giving you trouble.
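When inspecting the output, it may also help to look at `repr(url)` rather than the bare URL, since invisible characters would otherwise go unnoticed. One possible cause, offered as a guess rather than a confirmed diagnosis: browsers generally submit textarea line breaks as `\r\n`, while `identifyUrls` only cuts on a space or `\n`, so each URL could keep a trailing `\r`; the slice `doc[temp:chars]` also drops the last character when a URL ends the document. The sketch below reproduces `identifyUrls` from the question unchanged and compares it with a hypothetical `extract_urls` helper (not part of the original code) on a made-up two-line payload.

```python
def identifyUrls(doc):
    """Copy of the question's URL scanner, unchanged, for comparison."""
    temp = -1
    website_urls = []
    for chars in range(len(doc)):
        if doc[chars:chars+6] == "https:":
            temp = chars
        if (doc[chars] == " " or doc[chars] == "\n" or chars+1 == len(doc)) and temp != -1:
            website_urls.append(doc[temp:chars])
            temp = -1
    return website_urls


def extract_urls(doc):
    """Hypothetical replacement: split on any whitespace, keep https tokens."""
    # str.split() with no arguments splits on runs of any whitespace,
    # including "\r", "\n" and spaces, so no stray characters are kept.
    return [token for token in doc.split() if token.startswith("https:")]


# Assumed form payload: two placeholder URLs separated by a CRLF line break.
doc = "https://example.com/a\r\nhttps://example.com/b"

# repr() makes invisible characters such as "\r" visible.
for url in identifyUrls(doc):
    print(repr(url))
# 'https://example.com/a\r'
# 'https://example.com/'

for url in extract_urls(doc):
    print(repr(url))
# 'https://example.com/a'
# 'https://example.com/b'
```

Note that `extract_urls` only keeps tokens that start with `https:`, whereas the original scanner also picks up a URL embedded in the middle of a token, so it is not an exact drop-in if the pasted document wraps links in parentheses or quotes.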