Beautiful Soup gives Bad Request errors when scraping websites with Flask

Time: 2021-06-20 16:43:30

Tags: python html flask beautifulsoup

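I created a simple desktop application without Flask that scrapes multiple websites and returns their titles, and I am now trying to convert it into a web application using Flask. Below are two HTML pages (one for pasting in the links, which the program then extracts, and one for displaying the resulting titles). The problem is that the function that fetches the titles gives me errors, generic 404 pages, and Bad Request responses for most of the websites, even though the same function returns the correct title for every website when it is run on its own. I tried running the title function (quickMLA) from a separate Python script that I connected to the app. I also tried running the whole program both inside and outside a virtual environment, but none of these attempts fixed the problem. Any insight into what is happening would be great! If you want to test the actual web app, copy and paste the following list of websites into it: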

https://covid19tracker.ca/provincevac.html?p=ON
https://www.aboutamazon.com/news/company-news/amazons-covid-19-blog-updates-on-how-were-responding-to-the-crisis#covid-latest
https://www.bcg.com/en-us/publications/2021/advantages-of-remote-work-flexibility
https://news.prudential.com/increasingly-workers-expect-pandemic-workplace-adaptations-to-stick.htm
https://www.mckinsey.com/featured-insights/future-of-work/whats-next-for-remote-work-an-analysis-of-2000-tasks-800-jobs-and-nine-countries
https://www.gsb.stanford.edu/faculty-research/publications/does-working-home-work-evidence-chinese-experiment
https://www.livecareer.com/resources/careers/planning/is-remote-work-here-to-stay
https://www.bcg.com/en-us/publications/2021/advantages-of-remote-work-flexibility

Here is the function running on its own:

import requests
import bs4 as bs

lst = [
    "https://covid19tracker.ca/provincevac.html?p=ON",
    "https://www.ontario.ca/page/reopening-ontario#foot-1",
    "https://blog.twitter.com/en_us/topics/company/2020/keeping-our-employees-and-partners-safe-during-coronavirus.html",
    "https://www.aboutamazon.com/news/company-news/amazons-covid-19-blog-updates-on-how-were-responding-to-the-crisis#covid-latest",
    "https://www.bcg.com/en-us/publications/2021/advantages-of-remote-work-flexibility",
    "https://news.prudential.com/increasingly-workers-expect-pandemic-workplace-adaptations-to-stick.htm",
    "https://www.mckinsey.com/featured-insights/future-of-work/whats-next-for-remote-work-an-analysis-of-2000-tasks-800-jobs-and-nine-countries",
    "https://www.gsb.stanford.edu/faculty-research/publications/does-working-home-work-evidence-chinese-experiment",
    "https://www.livecareer.com/resources/careers/planning/is-remote-work-here-to-stay",
    ]


def quickMLA(lst):
    cited_lst = []
    for websites in range(len(lst)):
        url=lst[websites]
        headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36"}
        source = requests.get(url,headers=headers, timeout=10).content
        soup_page = bs.BeautifulSoup(source,'html.parser')
        title = soup_page.find("title").get_text().strip()
        print(title)
quickMLA(lst)

Here is the full Python code of the Flask project, which produces the Bad Request errors when run:

from flask import Flask,redirect, url_for,render_template,request,session
import requests
import bs4 as bs

app = Flask(__name__)
app.secret_key = "sourcerer"

@app.route("/", methods=["POST","GET"])
def home():
    return render_template("index.html")

@app.route("/cited", methods=["POST"])
def cited():
    doc = request.form['the_document']
    lst = identifyUrls(doc)
    info = quickMLA(lst) 
    return render_template("results.html",info=info)

def identifyUrls(doc):
    temp = -1
    website_urls = []
    for chars in range(len(doc)):
        if doc[chars:chars+6] == "https:":
            temp = chars
        if (doc[chars] == " " or doc[chars] == "\n" or chars+1 == len(doc)) and temp != -1: 
            website_urls.append(doc[temp:chars])
            temp = -1
    return website_urls

def quickMLA(lst):
    cited_lst = []
    for websites in range(len(lst)):
        url=lst[websites]
        headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36"}
        source = requests.get(url,headers=headers, timeout=10).content
        soup_page = bs.BeautifulSoup(source,'html.parser')
        title = soup_page.find("title").get_text().strip()
        cited_lst.append(title) 
    return cited_lst

if __name__ == "__main__":
    app.run(debug=True)

HTML home page (without CSS):

<!DOCTYPE html>
<html lang="en">
  <head>
    <link rel="stylesheet"/>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width,initial-scale=1.0" />
    <meta name="description" content="..." />
    <title>Sourcerer</title>
  </head>
  <body>
    <section>
      <div class="header">
        <h2>The Sourcerer</h2>
        <p>Wizard of Citations</p>
      </div>
    </section>
      <div class="main">
        <div class="position">
          <form action="{{url_for('cited')}}" method="POST">
            <textarea class="beg" placeholder="Copy and Paste Document" name="the_document"></textarea>
            <div class='main_button'>
              <input class="c" value="Cite It" type="submit" name="citeme">
            </div>   
          </form>
        </div>
      </div>
  </body>
</html>

HTML results page (without CSS):

<!DOCTYPE html>
<html lang="en">
  <head>
    <link rel="stylesheet"/>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width,initial-scale=1.0"/>
    <meta name="description" content="..." />
    <title>Sourcerer</title>
  </head>
  <body>
    <section>
      <div class="header">
        <h2>The Sourcerer</h2>
        <p>Wizard of Citations</p>
      </div>
    </section>
      <div class="main">
        <form action="{{url_for('home')}}" method="POST">
          <textarea class="results">
            {% for cites in info%}
            {{cites}}
            {%endfor%}
          </textarea>
          <div class="main">
            <input class="c" value="Cite Another" type="submit" name="another">
          </div>  
        </form>
      </div>
  </body>
</html>

If you know why this is happening, any help would be greatly appreciated!

1 answer:

Answer 0: (score: 0)

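Here is a debugging suggestion. You have observed that some URLs give you problems, and this may help you pin down what the problem actually is. Try taking BeautifulSoup out entirely and doing something like this: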

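import requests

def quickMLA(lst):
    # Diagnostic version: record each URL, then its HTTP status code
    # or the exception that the request raised.
    cited_lst = []
    headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36"}
    for url in lst:
        try:
            cited_lst.append(url)
            response = requests.get(url, headers=headers, timeout=10)
            cited_lst.append(str(response.status_code))
        except Exception as ex:
            cited_lst.append(repr(ex))
    return cited_lst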

This should show why certain URLs are giving you problems.
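
One thing worth checking once you can see the raw URLs (this is a guess, not something the question confirms): browsers submit line breaks in a textarea as "\r\n", and identifyUrls only splits on a space or "\n", so each pasted URL may keep a trailing "\r". The loop also slices doc[temp:chars] when it reaches the last character of the text, which drops the final character of the last URL. The sketch below copies that scanning logic as identify_urls_original and compares it with a hypothetical extract_urls helper that splits on any whitespace. Both function names and the two-URL sample string are made up for illustration.

```python
def identify_urls_original(doc):
    """Same scanning logic as identifyUrls in the question, copied for comparison."""
    temp = -1
    website_urls = []
    for chars in range(len(doc)):
        if doc[chars:chars + 6] == "https:":
            temp = chars
        if (doc[chars] == " " or doc[chars] == "\n" or chars + 1 == len(doc)) and temp != -1:
            website_urls.append(doc[temp:chars])
            temp = -1
    return website_urls


def extract_urls(doc):
    """Hypothetical alternative: split on any whitespace and keep the https tokens."""
    return [token for token in doc.split() if token.startswith("https:")]


# Simulated form value (an assumption): two pasted URLs separated by "\r\n",
# which is how browsers submit line breaks from a textarea.
doc = (
    "https://covid19tracker.ca/provincevac.html?p=ON\r\n"
    "https://www.bcg.com/en-us/publications/2021/advantages-of-remote-work-flexibility"
)

# repr() makes invisible characters such as "\r" show up in the output.
for url in identify_urls_original(doc):
    print("original:", repr(url))
for url in extract_urls(doc):
    print("split():", repr(url))
```

If the first list shows a trailing "\r" or a truncated last URL, that would be consistent with the 404 and Bad Request responses appearing only in the Flask version. Note that extract_urls keeps only tokens that start with "https:", so a URL glued to other text (for example inside parentheses) would be skipped, unlike in the original scan.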