We are extracting words from resumes in PDF format.
# importing required modules
import PyPDF2
# creating a pdf file object
pdfFileObj = open('resume1.pdf', 'rb')
# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# printing number of pages in pdf file
print(pdfReader.numPages)
# creating a page object
pageObj = pdfReader.getPage(0)
# extracting text from page
print(pageObj.extractText())
# closing the pdf file object
pdfFileObj.close()
Output:
2
Mostrecentversionalwaysavailableat
nlp.stanford.edu/
˘
rkarthik/cv.html
KarthikRaghunathan
Mobile:
+1-650-384-5782
Email:
kr
f
csDOTstanfordDOTedu
g
Homepage:
nlp.stanford.edu/
˘
rkarthik
ResearchInterests
Intelligence,NaturalLanguageProcessing,Human-RobotInteraction
EducationStanfordUniversity
,California2008onwards
MasterofScienceinComputerScienceCurrentGPA:3.91/4.00
NationalInstituteofTechnology(NIT)
,Calicut,India2004-2008
BachelorofTechnologyinComputerScienceandEngineeringCGPA:9.14/10.00
SoftwareSkills
ProgrammingLanguages
:C,C
++
,Perl,Java,C
#
,MATLAB,Lisp,SQL,MDX,Intelx86
assembly
Speech/NLP/AITools
:HMMToolkit(HTK),CMUSphinxAutomaticSpeechRecogni-
tionSystem,FestivalSpeechSynthesisSystem,VoiceXML,BerkeleyAligner,Giza++,Moses
StatisticalMachineTranslationToolkit,RobotOperatingSystem(ROS)
OtherTools
:L
A
T
E
X,LEX,YACC,Vim,Eclipse,MicrosoftVisualStudio,MicrosoftSQLServer
ManagementStudio,TestNGJavaTestingPlatform,SVN
OperatingSystems
:Linux,Windows,DOS
WorkExperienceMicrosoftCorporationSoftwareDevelopmentEngineerIntern
Redmond,WAJune2009-Sept2009
WorkedwiththeRevenue&RelevanceTeamatMicrosoftadCenteronthe
adCenterMarket-
placeScorecard
project,aimedatdevelopingastandardreliablesetofmetricsthatmeasure
thecompany'sperformanceintheonlineadvertisingmarketplaceandaidinmakinginformed
decisionstomaximizethemarketplacevalue.Alsoinitiatedtheonastatisticallearning
modelthatectivelypredictschangesintheadvertisers'biddingbehaviorwithtime.
StanfordNaturalLanguageProcessingGroupGraduateResearchAssistant
StanfordUniversity,CASept2008onwards
WorkingonStanford'sstatisticalmachinetranslation(SMT)system(aspartoftheDARPA
GALEProgram)undertheguidanceofProf.ChristopherManning.LedStanford'sfor
theGALEPhase3Chinese-EnglishMTevaluationaspartoftheIBM-Rosettateam.
MicrosoftResearch(MSR)LabIndiaResearchIntern
Bangalore,IndiaApr2007-Jul2007
Investigatedthetoleranceofstatisticalmachinetranslationsystemstonoiseinthetraining
corpus,particularlythekindofnoisethataccompaniesautomaticextractionofparallelcorpora
fromcomparablecorpora.AlsoworkedonthedesignofanonlinegameforNLPdataacquisition.
InternationalInstituteofInformationTechnology(IIIT)SummerIntern
Hyderabad,IndiaApr2006-Jun2006
Workedontherapidprototypingofrestricteddomainspokendialogsystems(SDS)forIndian
languages.Developedthe
IIITReceptionist
,aSDSinTamil,TeluguandEnglishlanguages,
whichfunctionedasanautomaticreceptionistforIIIT.
CourseProjectsNormalizationoftextinSMSmessagesusinganSMTsystem
Apr2009-Jun2009
Developedasystemforconvertingtextspeak(languageusedinSMScommunication)toproper
EnglishusingtheMosesstatisticalmachinetranslationsystem.
STAIRspokendialogproject
Jan2009-Apr2009
DevelopedaspokendialoginterfacetotheStanfordAIRobot(STAIR)forgivinginstructions
forfetchingtasks,undertheguidanceofProf.DanJurafskyandProf.AndrewNg.
The words are not extracted as clean keywords or individual words; instead they come out fused together like this.
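Part of the damage is recoverable: many of the fused tokens in the output above are CamelCase runs such as "ResearchInterests" and "NaturalLanguageProcessing", so a regex can reinsert spaces at lowercase-to-uppercase boundaries. A stdlib-only sketch (note this cannot recover fully lowercased fusions, e.g. a word glued to a year):

```python
import re

def split_camel(text):
    # Insert a space wherever a lowercase letter is immediately
    # followed by an uppercase letter.
    return re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', text)

print(split_camel("ResearchInterestsNaturalLanguageProcessing"))
# → Research Interests Natural Language Processing
```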
import PyPDF2
# import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string
# Write a for-loop to open many files -- leave a comment if you'd like to
# learn how
filename = 'sample.pdf'
# open allows you to read the file
pdfFileObj = open(filename, 'rb')
# The pdfReader variable is a readable object that will be parsed
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# Discerning the number of pages will allow us to parse through all the pages
num_pages = pdfReader.numPages
count = 0
text = ""
# The while loop will read each page
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()
# This if statement checks whether the above library returned words.
# It's needed because PyPDF2 cannot read scanned files.
if text != "":
    text = text
# If the above returns False, we run the OCR library textract to convert
# scanned/image-based PDF files into text
else:
    text = textract.process(filename, method='tesseract', language='eng')
# Now we have a text variable which contains all the text derived from our
# PDF file.
# Next, we clean our text variable and return it as a list of keywords.
print(text)
# The word_tokenize() function breaks our text into individual words
tokens = word_tokenize(text)
# print(tokens)
# We create a list of punctuation we wish to clean
punctuations = ['(', ')', ';', ':', '[', ']', ',', ' ']
# We initialize the stopwords variable, a list of words like "the",
# "I", "and", etc. that don't hold much value as keywords
stop_words = stopwords.words('english')
# A list comprehension that keeps only words that are NOT in stop_words
# and NOT in punctuation
keywords = [word for word in tokens
            if word not in stop_words and word not in string.punctuation]
print(keywords)
Output: the output is the same as above, except that the textract module cannot be found.
Question: can anyone correct this code, or provide new code, to get the job done?
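For what it's worth, the final cleaning step does not actually need NLTK, so it can be tested independently of the textract problem. A minimal dependency-free sketch, where the tiny inline stop-word set is just a stand-in for `stopwords.words('english')`:

```python
import string

# Stand-in for NLTK's English stop-word list (assumption: a few common words).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is"}

def extract_keywords(text):
    # Lowercase, split on whitespace, strip surrounding punctuation,
    # then drop empty strings and stop words.
    tokens = text.lower().split()
    cleaned = [t.strip(string.punctuation) for t in tokens]
    return [t for t in cleaned if t and t not in STOP_WORDS]

print(extract_keywords("Master of Science in Computer Science, Stanford."))
# → ['master', 'science', 'computer', 'science', 'stanford']
```

This only works once the PDF text has proper spaces between words, which is exactly what PyPDF2's extractText() fails to preserve here.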