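I have tried several ways of fetching the Federal Reserve press conference transcripts (which are PDFs) and converting them to .txt files, but none of them worked. My original code is below. Any suggestions would be much appreciated.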
import csv
from bs4 import BeautifulSoup
import requests

source=requests.get('https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm').text
soup=BeautifulSoup(source,'lxml')
for b in soup.find_all("a",href=True):
    if b.text=='Press Conference':
        lnk='https://www.federalreserve.gov'+b['href']
        source2=requests.get(lnk).text
        soup2=BeautifulSoup(source2,'lxml')
        for c in soup2.find_all("a",href=True):
            if 'Press Conference Transcript' in c.text:
                lnk2='https://www.federalreserve.gov'+c['href']
                source3=requests.get(lnk2).text
                soup3=BeautifulSoup(source3,'lxml')
                for d in soup3.find_all("div",attrs={"id","content"}):
                    print(d)
                    fileout = open('conf.txt', 'a')
                    fileout.write(d)
Answer 0 (score: 0)
For the PDF scraping part, I would suggest the following:
import requests
import io
import PyPDF2

# Download the PDF as raw bytes
URL = 'https://www.federalreserve.gov/monetarypolicy/files/monetary20200129a1.pdf'
pdf_bytes = requests.get(URL).content

# The PDF reader expects a file-like object, so wrap the bytes
pdf_stream = io.BytesIO(pdf_bytes)
# PdfFileReader, getPage and extractText are the legacy names (PyPDF2 before 3.0)
reader = PyPDF2.PdfFileReader(pdf_stream)

# Read the first page
page = reader.getPage(0)
page_content = page.extractText()
print(page_content.encode('utf-8'))
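The URL above is hard-coded, while the question needs to discover the transcript links first. A minimal sketch of that step, using only the standard library, is shown here. It assumes the transcript links are ordinary `<a href>` tags whose target ends in `.pdf` (as the URL above does); `PdfLinkCollector` and `find_pdf_links` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkCollector(HTMLParser):
    """Collect absolute URLs of <a href> links that point at .pdf files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: a transcript link can be recognised by its .pdf suffix
        if href and href.lower().endswith(".pdf"):
            # urljoin resolves relative hrefs and leaves absolute ones unchanged
            self.pdf_links.append(urljoin(self.base_url, href))


def find_pdf_links(html, base_url):
    """Return every PDF link found in an HTML string, as absolute URLs."""
    collector = PdfLinkCollector(base_url)
    collector.feed(html)
    return collector.pdf_links
```

Each returned URL can then be downloaded with `requests.get(url).content` and handed to a PDF reader, rather than being parsed as HTML.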
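Answer 1 (score: 0)
Just a suggestion, if you don't mind checking out a library: https://firebase.google.com/docs/reference/js/firebase.auth.Auth#onauthstatechanged. It is very easy to use if your PDF is well formatted. A code sample is as simple as this: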
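from PyPDF2 import PdfFileReader

def extract_information(pdf_path):
    # Open in binary mode; the reader needs the file to stay open while it is used
    with open(pdf_path, 'rb') as f:
        pdf = PdfFileReader(f)
        information = pdf.getDocumentInfo()   # title, author and other metadata
        number_of_pages = pdf.getNumPages()
    return information, number_of_pages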
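Since the goal in the question is a .txt file rather than metadata, the same idea can be extended to every page. The following is a rough sketch, assuming a PyPDF2 release recent enough to provide `PdfReader`, `reader.pages` and `page.extract_text()` (the newer names for the calls used above); `pages_to_text` and `pdf_bytes_to_txt` are hypothetical helper names, and extraction quality will depend on how the PDF was produced.

```python
import io


def pages_to_text(reader):
    """Join the extracted text of every page of a reader-like object."""
    chunks = []
    for page in reader.pages:
        # Fall back to an empty string if nothing is extracted from a page
        chunks.append(page.extract_text() or "")
    return "\n".join(chunks)


def pdf_bytes_to_txt(pdf_bytes, out_path):
    """Write the text of an in-memory PDF to a UTF-8 text file."""
    # Imported here so pages_to_text stays usable without PyPDF2 installed
    from PyPDF2 import PdfReader

    reader = PdfReader(io.BytesIO(pdf_bytes))
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(pages_to_text(reader))
```

A call such as `pdf_bytes_to_txt(requests.get(url).content, 'conf.txt')` would then produce the text file the question asks for.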
PyPDF2 is also a good choice.
This article from the PDFMiner blog is somewhat dated, but it is still a good source of information.