I understand that we can extract text from a PDF file.
For example:
import pandas as pd
import PyPDF2
# =============================================================================
# Extracting from pdf files
# =============================================================================
pdfFileObj = open(r'C:\Users\User\Documents\Sentiment_Analysis\Urea Weekly Report 01-07-2016.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
totalpage = pdfReader.numPages
pg = [''] * (totalpage-2-2)
for i in [x for x in range(2, totalpage-2) if x != 7]:
    pageObj = pdfReader.getPage(i)
    pg[i-2] = pageObj.extractText().replace("\n", "").lower()
How can I do something similar when the PDF file is hosted on GitHub?
I know we can do this with an Excel (CSV) file, without downloading it first:
import pandas as pd
meg = pd.read_csv('https://raw.githubusercontent.com/James-smarttradz/arimax/master/MEG_marketprice_ICIS.csv')
For example, my file is located at https://github.com/James-smarttradz/sentiment/blob/master/Urea%20Weekly%20Report%2001-07-2016.pdf
Answer 0 (score: 0)
You can download the PDF file first:
import wget
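# Use the raw.githubusercontent.com URL; the /blob/ URL returns GitHub's HTML page, not the PDF itself
url = "https://raw.githubusercontent.com/James-smarttradz/sentiment/master/Urea%20Weekly%20Report%2001-07-2016.pdf"
# Raw string, so that "\U" in the Windows path is not parsed as an escape sequence
wget.download(url, r'C:\Users\User\Documents\Sentiment_Analysis\Urea Weekly Report 01-07-2016.pdf')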
Then continue with your earlier code to do the necessary processing.
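If you would rather not save the file to disk at all, one possible approach is to fetch the bytes into memory and hand them to PyPDF2 as a file-like object, much as `pd.read_csv` does with a URL. The sketch below is only an illustration under a few assumptions: the helper names (`github_blob_to_raw`, `fetch_pdf_stream`, `extract_page_texts`) are made up for this example, the branch name is assumed to contain no slash, the repository is assumed to be public, and the legacy PyPDF2 API from the question is assumed to be installed.

```python
import io
import urllib.request
from urllib.parse import urlsplit


def github_blob_to_raw(url):
    """Turn a github.com '/blob/' page URL into its raw.githubusercontent.com form.

    Hypothetical helper: assumes the path looks like
    <user>/<repo>/blob/<branch>/<file path> and the branch has no '/'.
    """
    parts = urlsplit(url)
    segments = parts.path.lstrip("/").split("/")
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError("not a GitHub blob URL: " + url)
    user, repo, _, branch = segments[:4]
    file_path = "/".join(segments[4:])  # percent-encoding is left untouched
    return "https://raw.githubusercontent.com/{}/{}/{}/{}".format(
        user, repo, branch, file_path
    )


def fetch_pdf_stream(raw_url):
    """Download the file at `raw_url` and return it as an in-memory binary stream."""
    with urllib.request.urlopen(raw_url) as response:
        return io.BytesIO(response.read())


def extract_page_texts(raw_url):
    """Return the lower-cased text of every page of the PDF at `raw_url`."""
    # Imported here so the URL helper above works even without PyPDF2 installed.
    import PyPDF2

    # Same legacy PyPDF2 calls as in the question; PdfFileReader accepts a stream.
    reader = PyPDF2.PdfFileReader(fetch_pdf_stream(raw_url))
    return [
        reader.getPage(i).extractText().replace("\n", "").lower()
        for i in range(reader.numPages)
    ]
```

Calling `extract_page_texts(github_blob_to_raw(url))` with the blob URL from the question should give one string per page, and the page filtering from your original loop (skipping the first two pages, the last two pages and page 7) can then be applied to that list. Note that PyPDF2 3.x replaced the legacy names with `PdfReader`, `len(reader.pages)`, `reader.pages[i]` and `extract_text()`, so the last function would need adjusting there.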