Extracting words from a PDF using Python 3?

Time: 2018-02-07 04:39:34

Tags: python-3.x pdftotext

We are trying to extract words from résumés in PDF format.

One approach:

# importing required modules
import PyPDF2

# creating a pdf file object
pdfFileObj = open('resume1.pdf', 'rb')

# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

# printing number of pages in pdf file
print(pdfReader.numPages)

# creating a page object
pageObj = pdfReader.getPage(0)

# extracting text from page
print(pageObj.extractText())

# closing the pdf file object
pdfFileObj.close()

Output:

2
Mostrecentversionalwaysavailableat
nlp.stanford.edu/
˘
rkarthik/cv.html
KarthikRaghunathan
Mobile:
+1-650-384-5782
Email:
kr
f
csDOTstanfordDOTedu
g
Homepage:
nlp.stanford.edu/
˘
rkarthik
ResearchInterests
Intelligence,NaturalLanguageProcessing,Human-RobotInteraction
EducationStanfordUniversity
,California2008onwards
MasterofScienceinComputerScienceCurrentGPA:3.91/4.00
NationalInstituteofTechnology(NIT)
,Calicut,India2004-2008
BachelorofTechnologyinComputerScienceandEngineeringCGPA:9.14/10.00
SoftwareSkills
ProgrammingLanguages
:C,C
++
,Perl,Java,C
#
,MATLAB,Lisp,SQL,MDX,Intelx86
assembly
Speech/NLP/AITools
:HMMToolkit(HTK),CMUSphinxAutomaticSpeechRecogni-
tionSystem,FestivalSpeechSynthesisSystem,VoiceXML,BerkeleyAligner,Giza++,Moses
StatisticalMachineTranslationToolkit,RobotOperatingSystem(ROS)
OtherTools
:L
A
T
E
X,LEX,YACC,Vim,Eclipse,MicrosoftVisualStudio,MicrosoftSQLServer
ManagementStudio,TestNGJavaTestingPlatform,SVN
OperatingSystems
:Linux,Windows,DOS
WorkExperienceMicrosoftCorporationSoftwareDevelopmentEngineerIntern
Redmond,WAJune2009-Sept2009
WorkedwiththeRevenue&RelevanceTeamatMicrosoftadCenteronthe
adCenterMarket-
placeScorecard
project,aimedatdevelopingastandardreliablesetofmetricsthatmeasure
thecompany'sperformanceintheonlineadvertisingmarketplaceandaidinmakinginformed
decisionstomaximizethemarketplacevalue.Alsoinitiatedtheonastatisticallearning
modelthatectivelypredictschangesintheadvertisers'biddingbehaviorwithtime.
StanfordNaturalLanguageProcessingGroupGraduateResearchAssistant
StanfordUniversity,CASept2008onwards
WorkingonStanford'sstatisticalmachinetranslation(SMT)system(aspartoftheDARPA
GALEProgram)undertheguidanceofProf.ChristopherManning.LedStanford'sfor
theGALEPhase3Chinese-EnglishMTevaluationaspartoftheIBM-Rosettateam.
MicrosoftResearch(MSR)LabIndiaResearchIntern
Bangalore,IndiaApr2007-Jul2007
Investigatedthetoleranceofstatisticalmachinetranslationsystemstonoiseinthetraining
corpus,particularlythekindofnoisethataccompaniesautomaticextractionofparallelcorpora
fromcomparablecorpora.AlsoworkedonthedesignofanonlinegameforNLPdataacquisition.
InternationalInstituteofInformationTechnology(IIIT)SummerIntern
Hyderabad,IndiaApr2006-Jun2006
Workedontherapidprototypingofrestricteddomainspokendialogsystems(SDS)forIndian
languages.Developedthe
IIITReceptionist
,aSDSinTamil,TeluguandEnglishlanguages,
whichfunctionedasanautomaticreceptionistforIIIT.
CourseProjectsNormalizationoftextinSMSmessagesusinganSMTsystem
Apr2009-Jun2009
Developedasystemforconvertingtextspeak(languageusedinSMScommunication)toproper
EnglishusingtheMosesstatisticalmachinetranslationsystem.
STAIRspokendialogproject
Jan2009-Apr2009
DevelopedaspokendialoginterfacetotheStanfordAIRobot(STAIR)forgivinginstructions
forfetchingtasks,undertheguidanceofProf.DanJurafskyandProf.AndrewNg.

The text is not extracted as separate words or keywords; instead it comes out run together, with the spaces missing, as shown above.
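Since the spaces are already missing in what `extractText()` returns, no later tokenization step can bring them back; the extractor may simply not emit them for this PDF. One thing that might be worth trying is a different extractor. The sketch below assumes the third-party package pdfminer.six is installed and uses its `extract_text` function; the helper names `extract_text_pdfminer` and `normalize_extracted_text` are made up for this illustration, and whether the result is better for a given résumé has to be checked by hand.

```python
import re


def extract_text_pdfminer(path):
    """Extract text with pdfminer.six (assumed installed: pip install pdfminer.six)."""
    # Imported lazily so the rest of this sketch works without the package.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def normalize_extracted_text(text):
    """Tidy raw extractor output before tokenizing it."""
    # Rejoin words hyphenated across a line break ("Recogni-\ntion").
    # Caveat: this also joins genuine hyphenated compounds split at a line end.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse runs of spaces and newlines into single spaces.
    return re.sub(r"\s+", " ", text).strip()
```

Usage would look like `text = normalize_extracted_text(extract_text_pdfminer('resume1.pdf'))`, after which the cleaned text can be passed to a tokenizer as before.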

Another approach:

import string

import PyPDF2
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Wrap the steps below in a for-loop to process many files.
filename = 'sample.pdf'
# Open the file in binary mode so that PyPDF2 can parse it.
pdfFileObj = open(filename, 'rb')
# pdfReader is the reader object that will be parsed.
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# Knowing the number of pages lets us walk through all of them.
num_pages = pdfReader.numPages
text = ""
# Read each page and append its text.
for count in range(num_pages):
    pageObj = pdfReader.getPage(count)
    text += pageObj.extractText()
pdfFileObj.close()

# PyPDF2 cannot read scanned files, so an empty result means we fall back
# to the OCR library textract for scanned/image-based PDF files.
if text == "":
    import textract  # third-party package; must be installed separately
    # textract.process() returns bytes, so decode the result to str.
    text = textract.process(filename, method='tesseract', language='eng').decode('utf-8')

# text now contains all the text derived from the PDF file.
# Next, clean it and return it as a list of keywords.
print(text)

# word_tokenize() breaks the text into individual words.
tokens = word_tokenize(text)
# stop_words is a list of words like "the", "I", "and", etc. that
# don't hold much value as keywords.
stop_words = stopwords.words('english')
# Keep only the words that are NOT in stop_words and NOT in string.punctuation.
keywords = [word for word in tokens
            if word not in stop_words and word not in string.punctuation]
print(keywords)

Output: the output is the same as before, and in addition the textract module cannot be found.
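A "module not found" error for textract usually just means the package is not installed for the interpreter that runs the script. A minimal sketch of checking for it before use is given below; `has_module` and `ocr_pdf` are hypothetical helper names, and the `textract.process` call mirrors the one in the script above. Note that the `tesseract` method also needs the Tesseract OCR program installed on the system, which this sketch does not check.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module can be imported in this environment."""
    return importlib.util.find_spec(name) is not None


def ocr_pdf(path):
    """Run OCR on a scanned PDF through textract, if it is available."""
    if not has_module("textract"):
        raise RuntimeError("textract is not installed; try: pip install textract")
    import textract
    # textract.process() returns bytes, so decode to get a str.
    return textract.process(path, method="tesseract", language="eng").decode("utf-8")
```

Checking up front turns the import failure into a clear message and keeps the OCR dependency optional for PDFs that already contain a text layer.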

Question: Can anyone correct this code, or provide new code that gets the job done?
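For the keyword-cleaning step on its own, here is a rough standard-library sketch that does not depend on NLTK. The stop word set is a tiny illustrative sample (an assumption; NLTK's English list is much longer), the function names are hypothetical, and the regular expression is only one possible definition of a "word" (it keeps `+` and `#` so that tokens such as `c++` and `c#` survive). It only works well if the extracted text actually contains spaces between words.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; replace with a fuller one in practice.
STOP_WORDS = {"a", "an", "and", "at", "for", "i", "in", "of", "on", "the", "to", "with"}


def extract_keywords(text):
    """Return lowercase word tokens that are not stop words."""
    tokens = re.findall(r"[a-z][a-z0-9+#]*", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def top_keywords(text, n=5):
    """Return the n most frequent keywords with their counts."""
    return Counter(extract_keywords(text)).most_common(n)
```

For example, `extract_keywords("Worked on the design of an online game for NLP data acquisition.")` keeps `worked`, `design`, `online`, `game`, `nlp`, `data` and `acquisition`, and drops the stop words and the trailing period.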

0 answers:

No answers.