Web Scraping - Python - Need Help

Asked: 2021-02-18 17:03:31

Tags: python web-scraping

This is my first post, so please excuse any mistakes. I have a file containing a list of URLs, and I am trying to write a Python program that visits each URL, extracts the text from the HTML page, and saves it to a .txt file. I am currently using BeautifulSoup to scrape these sites, but many of them throw errors that I am not sure how to resolve. I am looking for a better approach; my current code is posted below.

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import os
 
# Extracts page contents using BeautifulSoup
def page_extract(url):
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    webpage = urlopen(req, timeout=5).read()
    page_soup = BeautifulSoup(webpage, "lxml")
    return page_soup
 
# Make sure the output folders exist before writing into them
os.makedirs("Politifact Files", exist_ok=True)
os.makedirs("Politifact Files Not Completed", exist_ok=True)
 
# Open the file that contains the links
with open('links.txt', 'r') as file1:
    lines = file1.readlines()
 
# Iterate through the list of URLs
for i, line in enumerate(lines):
    fileName = str(i) + ".txt"
    # strip() removes the trailing newline that readlines() leaves on
    # each line; a newline inside the URL makes many requests fail
    url = line.strip()
    print(i)
    try:
        # If the scraping succeeds, save the text contents in a text file
        # whose name is the index ("w" instead of "x" so reruns don't
        # fail on existing files)
        soup2 = page_extract(url)
        text = soup2.text
        with open("Politifact Files/" + fileName, "w") as f:
            f.write(text)
        print(url)
    except Exception:
        # Otherwise create an empty marker file in a folder that collects
        # all the sites that threw an error
        open("Politifact Files Not Completed/" + fileName, "w").close()
        print("NOT DONE: " + url)

1 Answer:

Answer 0 (score: 0)

Thanks to @Thierry Lathuille and @Dr Pi for their replies. I found a solution to this problem by looking into Python libraries that can scrape the meaningful text off a webpage. I came across one called 'Trafilatura' that is able to accomplish this task. The documentation for this library is at: https://pypi.org/project/trafilatura/
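For reference, here is a minimal sketch of how Trafilatura could replace the page_extract step above. It uses fetch_url and extract from Trafilatura's documented quick-start; the example URL and output filename are just placeholders standing in for one entry from links.txt:

import trafilatura

# Download the page; fetch_url returns None on failure
downloaded = trafilatura.fetch_url("https://www.politifact.com/")
if downloaded is not None:
    # Extract only the main text content, dropping navigation,
    # boilerplate, etc.; extract returns None if nothing is found
    text = trafilatura.extract(downloaded)
    if text is not None:
        with open("Politifact Files/0.txt", "w") as f:
            f.write(text)

Checking the two None returns plays the same role as the try/except in the original loop: failed downloads or empty extractions can be routed to the "not completed" folder instead of raising.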
