Background: I am using Python 3.4, PyPDF2, and regular expressions to extract data from the table on page 1 of the following PDF:
http://minerals.usgs.gov/minerals/pubs/commodity/gold/mcs-2015-gold.pdf
import re
import PyPDF2

gold_pdf = r'C:\Users\xxxxx_x\xxxxxxx\mcs_gold_2015.pdf'

# Extract the text of page 1 once and reuse it
with open(gold_pdf, 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    page_text = pdfReader.getPage(0).extractText()

# Keep only the text between the two table markers
start_pos = page_text.index('United States\n:')
end_pos = page_text.index('Recycling\n:')
table_text = page_text[start_pos:end_pos]
print(table_text)

# Match numbers of two or more digits, with optional thousands separators
print(re.findall(r'\d+[\d,]*\d', table_text))
*Results* - NOTE: Scroll Left & Right
['2010', '2011', '2012', '2013', '2014', '231', '234', '235', '230', '211', '175', '220', '222', '223', '200', '198', '263', '215', '210', '200', '616', '550', '326', '315', '315', '383', '644', '695', '691', '430', '180', '168', '147', '160', '165', '8,140', '8,140', '8,140', '8,140', '8,140', '1,228', '1,572', '1,673', '1,415', '1,270', '10,300', '11,100', '12,700', '12,958', '12,500']
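The flat list above holds 50 values, which looks like 10 rows of 5 columns (one column per year). As a minimal sketch, assuming every table cell was matched by the regex and that the first five values are the year headers, the list can be regrouped into rows like this. The helper name `to_rows` and the column count `n_cols` are my own choices for illustration, not part of PyPDF2.

```python
# Flat list as printed by re.findall for the Gold PDF
values = ['2010', '2011', '2012', '2013', '2014',
          '231', '234', '235', '230', '211',
          '175', '220', '222', '223', '200',
          '198', '263', '215', '210', '200',
          '616', '550', '326', '315', '315',
          '383', '644', '695', '691', '430',
          '180', '168', '147', '160', '165',
          '8,140', '8,140', '8,140', '8,140', '8,140',
          '1,228', '1,572', '1,673', '1,415', '1,270',
          '10,300', '11,100', '12,700', '12,958', '12,500']

n_cols = 5  # assumption: one column per year, 2010 through 2014


def to_rows(flat, width):
    """Split a flat list into consecutive rows of `width` items."""
    if len(flat) % width:
        raise ValueError('list length is not a multiple of the row width')
    return [flat[i:i + width] for i in range(0, len(flat), width)]


rows = to_rows(values, n_cols)
years, data_rows = rows[0], rows[1:]

# Strip thousands separators and convert each cell to int
table = [[int(v.replace(',', '')) for v in row] for row in data_rows]

for row in table:
    print(dict(zip(years, row)))
```

This only works when no cell is missed. The pattern `\d+[\d,]*\d` needs at least two digits, so a single-digit cell would be skipped and every later row would shift.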
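Problem: The USGS Mineral Commodity Summaries include more PDFs with a similar structure that I am trying to scrape with PyPDF2, but it is not working. I have checked with them, and the data is not available in any other format.
For example, if I use the Silver PDF (http://minerals.usgs.gov/minerals/pubs/commodity/silver/mcs-2015-silve.pdf) instead of the Gold PDF in the example above, I do not get the desired results.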
NOTE: Scroll left & right
*Output from pageObj.extractText():*
'SILVER\n \n\nDomestic Production and Use\n:\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSalient Statistics\nŠUnited States\n:2010 2011 20122013 2014e\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n \n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nRecycling\n:\n\nImport Sources (2010\nŒ13)\n:2\nTariff\n:\nDepletion Allowance\n:\n Government Stockpile\n:\nEvents, Trends, and Issues\n: \n\n\n\n\n\n\n\n\n\n\n\n\n\n\nFlorence C. Katrivanos\n [(703) 648\nŒ6782, fkatrivanos@usgs.gov]\n '
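To check whether the regex or the extraction step is at fault, the same slice-and-search steps can be run directly on the output above. This is only a diagnostic sketch: `silver_text` is an abbreviated copy of that output, with the long runs of newlines shortened for readability.

```python
import re

# Abbreviated copy of the Silver output shown above
# (assumption: the long runs of '\n' are shortened, nothing else is changed)
silver_text = ('Salient Statistics\nŠUnited States\n:2010 2011 20122013 2014e'
               '\n\n\n\nRecycling\n:\n\nImport Sources (2010\nŒ13)\n:2\n')

# Same markers and pattern as in the Gold example
start_pos = silver_text.index('United States\n:')
end_pos = silver_text.index('Recycling\n:')
table_text = silver_text[start_pos:end_pos]

numbers = re.findall(r'\d+[\d,]*\d', table_text)
print(numbers)  # ['2010', '2011', '20122013', '2014']
```

Only the year headers come back, and two of them are fused into one match. The table values are already missing from the text returned by `extractText()`, so changing the regex alone would not recover them.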
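Questions:
Why is the table data not extracted from the Silver PDF the way it is from the Gold PDF?
Which Python library should I use for this on Python 3.4? I have not been able to find a good solution for PDF scraping in Python 3.4 (see the following post: Best tool for text extraction from PDF in Python 3.4).
Thank you very much for your help!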