Question

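I have a bunch of URLs (more than 50k) in a CSV file, taken from different newspapers. I'm mainly looking for the main headline <h1> and the main paragraphs <p>. I'm getting an exception that I'm not familiar with and don't know how to handle. Here is the message I get back: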

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connection.py", line 141, in _new_conn
    (self.host, self.port), self.timeout, **extra_kw)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/util/connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 8] nodename nor servname provided, or not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 346, in _make_request
    self._validate_conn(conn)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 850, in _validate_conn
    conn.connect()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connection.py", line 284, in connect
    conn = self._new_conn()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connection.py", line 150, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x118e1a6a0>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/adapters.py", line 440, in send
    timeout=timeout
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 639, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/util/retry.py", line 388, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.cnn.com', port=443): Max retries exceeded with url: /2019/02/01/us/chicago-volunteer-homeless-cold-trnd/index.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+rss%2Fcnn_topstories+%28RSS%3A+CNN+-+Top+Stories%29 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x118e1a6a0>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Volumes/FELIPE/english_news/pass_news.py", line 24, in <module>
    request_to_url = requests.get(urls).text
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 640, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 640, in <listcomp>
    history = [resp for resp in gen] if allow_redirects else []
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 218, in resolve_redirects
    **adapter_kwargs
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/adapters.py", line 508, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.cnn.com', port=443): Max retries exceeded with url: /2019/02/01/us/chicago-volunteer-homeless-cold-trnd/index.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+rss%2Fcnn_topstories+%28RSS%3A+CNN+-+Top+Stories%29 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x118e1a6a0>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known',)))

Here is the code:

import uuid
import pandas as pd
import os
import requests
from bs4 import BeautifulSoup

cwd = os.path.dirname(os.path.realpath(__file__))

csv_file = os.path.join(cwd, "csv_data", "data.csv")

text_data = os.path.join(cwd, "raw_text2")

if not os.path.exists(text_data):
    os.makedirs(text_data)

df = pd.read_csv(csv_file)


for link, source in df.iterrows():
    urls = source['Link']
    source_name = source["Source"]
    request_to_url = requests.get(urls).text
    soup = BeautifulSoup(request_to_url, 'html.parser')
    try:
        h = soup.find_all('h1')

        try:
            text_h = h.get_text()
        except AttributeError:
            text_h = ""

        p = soup.find_all('p')
        text_p = ([p.get_text() for p in soup('p')])
        text_bb = str(" ".join(repr(e) for e in text_p))

        source_dir = os.path.join(text_data, source_name)

        try:
            os.makedirs(source_dir)
        except FileExistsError as e:
            pass

        filename = str(uuid.uuid4())
        write = open(os.path.join(source_dir, filename + ".txt"), "w+", encoding="utf-8")
        write.write(text_h + "\n" + text_bb)
        write.close()

        data = pd.Series(text_h + text_bb)
        with open("raw_text.csv", "a") as f:
            data.to_csv(f, encoding="utf-8", header=False, index=None)

    except:
        # Removes all <div> with id "sponsor-slug"
        for child_div in soup.find_all("div", id="sponsor-slug"):
            child_div.decompose()

        # Remove all <p> with class "copyright"
        for child_p in soup.find_all('p', attrs={'class': "copyright"}):
            child_p.decompose()

        # Removes all <a> tags and keeps the content if any
        a_remove = soup.find_all("a")
        for unwanted_tag in a_remove:
            unwanted_tag.replaceWithChildren()

        # Removes all <span> content and keeps content if any
        span_remove = soup.find_all("span")
        for unwanted_tag in span_remove:
            unwanted_tag.replaceWithChildren()

        # Removes all <em> content and keeps content if any
        span_remove = soup.find_all("em")
        for unwanted_tag in span_remove:
            unwanted_tag.replaceWithChildren()

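What is the best way to handle these exceptions? If possible, can I ignore the failed connection and move on to the next URL?

I want to scrape the content and add it to another CSV file or, if possible, to the current CSV. At the same time, I want to create a separate folder for each source and add the corresponding text to that folder.

Basically, that is what this code is doing: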

        filename = str(uuid.uuid4())
        write = open(os.path.join(source_dir, filename + ".txt"), "w+", encoding="utf-8")
        write.write(text_h + "\n" + text_bb)
        write.close()

        data = pd.Series(text_h + text_bb)
        with open("raw_text.csv", "a") as f:
            data.to_csv(f, encoding="utf-8", header=False, index=None)

I want to run NLP on each text and then try some sentiment analysis tools on it.

Answer 1

Before you read the text value of the response in this line

request_to_url = requests.get(urls).text

you can check whether the link is available. I wrote a simple function for this:

import requests

# Open session
s = requests.Session()

page_url = "http://wp.meQ/testBadUrl" # example of bad URL

def get_response(page_url):
    """ Get good or bad response from page_url"""
    # Create 'bad' Response object
    bad_resp = requests.Response()
    bad_resp.status_code = 404
    try:
        # By default 'allow_redirects' = True
        good_resp = s.get(page_url, timeout=(3, 10))
        if good_resp.ok:
            return good_resp
        else:
            return bad_resp
    except requests.exceptions.ConnectionError:
        print("Exception! Bad Request for URL: " + page_url)
        return bad_resp
    except requests.exceptions.Timeout:
        print("Exception! Timeout for URL: " + page_url)
        return bad_resp
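    except requests.exceptions.RequestException:
        # Any other requests error (e.g. too many redirects, malformed URL).
        # A bare "except:" would also swallow KeyboardInterrupt.
        print("Unknown Exception!: " + page_url)
        return bad_resp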

page_resp = get_response(page_url)

if page_resp.ok:
    # Your code for good URLs
    print("Append URL into 'GOOD' list")
else:
    # Your code for bad URLs
    print("Skip BAD url here...")
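
Applied to the loop from the question, the same check could look roughly like the sketch below. It assumes the `Link` and `Source` columns from the question; `fetch_html` and `iter_pages` are hypothetical helper names, and the rows are treated as plain mappings so that the example does not depend on pandas.

```python
import requests


def fetch_html(session, url, timeout=(3, 10)):
    """Return the page HTML, or None if the URL cannot be fetched."""
    try:
        resp = session.get(url, timeout=timeout)
    except requests.exceptions.RequestException as exc:
        # Base class of ConnectionError, Timeout, TooManyRedirects, ...
        print("Skipping {}: {}".format(url, type(exc).__name__))
        return None
    if not resp.ok:
        # The server answered, but with a 4xx/5xx status code
        print("Skipping {}: HTTP {}".format(url, resp.status_code))
        return None
    return resp.text


def iter_pages(rows, session):
    """Yield (source_name, html) for every row whose URL could be fetched."""
    for row in rows:
        html = fetch_html(session, row["Link"])
        if html is None:
            continue  # bad URL: move on to the next row
        yield row["Source"], html
```

With the DataFrame from the question, `rows` could be `(source for _, source in df.iterrows())`, and the parsing and file-writing code would then sit inside `for source_name, html in iter_pages(rows, requests.Session()):`.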

If needed, you can also add and handle other requests exceptions (see the full list in the requests documentation). Hope this helps.
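
As one possible way to tell those exceptions apart, the sketch below maps each failure to a short label that could be logged next to the URL. The names `describe_failure` and `try_get` and the label strings are made up for illustration; the exception classes are the ones defined in `requests.exceptions`. The order of the checks matters because some classes inherit from others (for example, `ConnectTimeout` is both a `ConnectionError` and a `Timeout`).

```python
import requests
from requests import exceptions as rex


def describe_failure(exc):
    """Map a requests exception to a short label for logging."""
    # Check the more specific classes before their parent classes.
    if isinstance(exc, rex.Timeout):
        return "timeout"
    if isinstance(exc, rex.SSLError):
        return "ssl error"
    if isinstance(exc, rex.ConnectionError):
        return "connection failed"  # e.g. the DNS error in the question
    if isinstance(exc, rex.TooManyRedirects):
        return "too many redirects"
    if isinstance(exc, (rex.MissingSchema, rex.InvalidSchema, rex.InvalidURL)):
        return "malformed url"
    if isinstance(exc, rex.HTTPError):
        return "http error"
    return "other request error"


def try_get(session, url):
    """Return (response, None) on success or (None, label) on failure."""
    try:
        resp = session.get(url, timeout=(3, 10))
        resp.raise_for_status()  # turn 4xx/5xx responses into HTTPError
        return resp, None
    except rex.RequestException as exc:
        return None, describe_failure(exc)
```

Returning a label instead of printing keeps the decision about what to do with a failed URL (skip it, retry it later, write it to a log file) in the calling loop.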

Handling exceptions in requests
