How can I read a PDF file directly from GitHub (without downloading it from GitHub first)?

Time: 2019-12-20 08:31:16

Tags: python pdf github pypdf2

I understand that we can extract text from a local PDF file.

For example:

import pandas as pd
import PyPDF2

# =============================================================================
# Extracting from pdf files
# =============================================================================
pdfFileObj = open(r'C:\Users\User\Documents\Sentiment_Analysis\Urea Weekly Report 01-07-2016.pdf', 'rb')

pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
totalpage = pdfReader.numPages

pg = [''] * (totalpage-2-2)
for i in [x for x in range(2,totalpage-2) if x != 7]:
    pageObj = pdfReader.getPage(i)
    pg[i-2] = pageObj.extractText().replace("\n","").lower()

How can I do something similar when the PDF file is hosted on GitHub?

I know this is possible for a CSV file (no separate download step is needed):

import pandas as pd
meg = pd.read_csv('https://raw.githubusercontent.com/James-smarttradz/arimax/master/MEG_marketprice_ICIS.csv')

For example, my file is located at https://github.com/James-smarttradz/sentiment/blob/master/Urea%20Weekly%20Report%2001-07-2016.pdf

1 answer:

Answer 0 (score: 0)

You can download the PDF file first:

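import wget
# Use the raw-file URL; the /blob/ URL returns GitHub's HTML viewer page, not the PDF itself.
url = "https://raw.githubusercontent.com/James-smarttradz/sentiment/master/Urea%20Weekly%20Report%2001-07-2016.pdf"
# Raw string so the backslashes in the Windows path are not read as escape sequences.
wget.download(url, r'C:\Users\User\Documents\Sentiment_Analysis\Urea Weekly Report 01-07-2016.pdf')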

Then continue with your earlier code to do the processing you need.
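
If the goal is to avoid saving anything to disk, one possible alternative is to fetch the bytes over HTTP and hand PyPDF2 an in-memory buffer instead of an open file. The following is only a sketch under a few assumptions: the helper names github_blob_to_raw, fetch_pdf_stream and extract_page_texts are made up for illustration, the repository is public, and the installed PyPDF2 is an older release that still provides PdfFileReader, numPages, getPage and extractText as used in the question (newer releases renamed these to PdfReader, pages and extract_text).

```python
import io
import urllib.request


def github_blob_to_raw(url):
    """Turn a github.com '/blob/' page URL into a raw.githubusercontent.com URL.

    Hypothetical helper for illustration; any other URL is returned unchanged.
    """
    prefix = "https://github.com/"
    if not url.startswith(prefix) or "/blob/" not in url:
        return url
    user_repo, path = url[len(prefix):].split("/blob/", 1)
    return "https://raw.githubusercontent.com/" + user_repo + "/" + path


def fetch_pdf_stream(url):
    """Fetch the file over HTTP and wrap its bytes in a seekable in-memory buffer."""
    with urllib.request.urlopen(github_blob_to_raw(url)) as response:
        return io.BytesIO(response.read())


def extract_page_texts(stream):
    """Return the lower-cased text of every page, with newlines removed."""
    # Third-party import kept local so the two helpers above need only the standard library.
    import PyPDF2

    # Assumes the legacy PyPDF2 API from the question; it accepts any binary file-like object.
    reader = PyPDF2.PdfFileReader(stream)
    return [
        reader.getPage(i).extractText().replace("\n", "").lower()
        for i in range(reader.numPages)
    ]
```

A call would then look like pages = extract_page_texts(fetch_pdf_stream(url)), with url set to the GitHub link from the question. The file is still transferred over the network; it is simply held in memory rather than written to the local disk, which is the same thing pd.read_csv does when given a URL.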