How to extract a specific section from a text file

Posted: 2019-07-23 09:42:57

Tags: python-3.x

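I am trying to extract the "Experience" section from a text file. However, after converting the PDF to a text file, several extra blank lines appear, and because of them I cannot extract the data correctly. Below is the text produced by the conversion. Can anyone tell me how to extract the "Experience" section from this file?

The code below works fine for text files that have no blank lines.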

with open('E:/cvparser/sampath.txt', 'r', encoding = 'utf-8') as f:
    exp_summary_flag = False
    exp_summary = ''
    for line in f:
        if line.startswith('EXPERIENCE'):
            exp_summary_flag = True
        elif exp_summary_flag:
            exp_summary += line
            if not line.strip(): break

print(exp_summary)

This is the text file I got after converting with pdfminer.

Sampath XYZ 

8th Semester Undergraduate | Computer Science Engineering | UCE RTU, Kota 

+91 654876352 | ABCDEFG@gmail.com | 7/108, Malviya Nagar Jaipur (302017) 

SUMMARY 



To seek an opportunity to apply my technology expertise along with my creative problem solving skills in an 
innovative software company. 



EXPERIENCE 





  Machine Learning Engineering Intern , Forsk Technologies , Jaipur  (May,2017 – July,2017)     

Learned the foundational concepts of data science and machine learning including python and statistics, 
enough time was spent on understanding the concept behind each algorithm and examples and case 
studies were done. Built some mid-scaled machine learning models using supervised and unsupervised 
learning. 

  Software Engineering Intern , Proxbotics Creations Technologies , Jaipur (May,2016 – July,2016) 

Developed  and  optimized  various  projects  including  ecommerce,  booking  &  reservation,  non-profit 
organization Websites, using technologies: HTML, CSS, PHP, JavaScript, MySQL etc.                          

  Trainee at TecheduSoft , Kota  (May,2015) 

The course contains 15+ modules including Android Basics, fragments, screen designing, intents, various 
views, signing app, web servers, web services, notifications, etc.                                                       

PROJECTS 

All projects are available on git: https://github.com/JAIJANYANI 

  Video Analysis for surveillance  

-A command line app which takes all your CCTV feeds as input and filters feeds with abnormal events 
which results in 90% less videos to watch, Used image processing and deep learning algorithms, 
outputs all time-stamps of interesting events for all feeds. 

  Food Calorie Estimator 

-An android app to estimate calories present in food with still image. Trained own Data-set (Meal-net) 
using Transfer learning Built upon Inception V3, Proposed a Deep Convolutional Neural Network (CNN) 
with 48 Layers, Developed a REST API to integrate it in Mobile apps, Optimized total computation time 
~ 2 Seconds. 

  CryptoCurrency Market Predictor 

- A Flask app to predict the future prices of various Crypto Currencies, implemented various supervised 
and deep learning algorithms such as LSTM (RNN), polynomial regression, using scikit-learn, tensorflow, 
keras etc.  

  Spam Filter 

-A REST API to Detect Incoming SMS or Email as Spam or Ham which can be trained on your own data 
set. Used NLP with Naive Bayes for Sentiment Analysis. 


 

Image Classifier using CNN 
-An application which detects objects present in a still image, implemented convolutional neural 
network using open source machine learning library which can be run on multiple machines to reduce 
training workloads, classifies objects using pre-trained image-net model. 

  Online Student and Faculty Portal 

-A Web Portal to manage attendance of students and faculties, can be integrated to mobile apps. Uses 
Php, MySQL, HTML, CSS, JavaScript, etc. 

  Tax Accounting 

-A Decentralized web app built on Ethereum Block-Chain using Truffle and Embark framework, which 
can be used to transfer funds between accounts which automatically deducts tax from the account. 



TECHNICAL SKILLS 

Programming Languages 

Web Technologies  



Scripting Languages     







Database Management System  



Operating Systems  

Strongest Areas 



COURSES 







: 

: 

: 

: 

: 

C, C++ 

HTML, CSS 

Python, PHP, BASH 

MySQL, SQLite 

Microsoft Windows, Linux, UNIX 

             :  

Machine Learning, Data Science 

Applied  Machine  Learning  ,  Applied  Data  Science  ,  Exploratory  Data  Analysis  &  Data  Visualization  ,  Neural 
Networks & Deep Learning , Computer networks , Data Structures & Algorithms , Operating Systems , Cloud 
Computing , Data Mining , Block chain Essentials , Database Management Systems. 



EDUCATION 

  University College of Engineering , Kota : Btech (Pursuing) in Computer Science Engineering  (2018) 
  St. Edmunds School , Jaipur : Senior Secondary (XII) Education Rajasthan  (2012) 
  St. Edmunds School , Jaipur : Secondary (X) Education Rajasthan  (2010) 

How can I extract the experience section from this text file?

2 Answers:

Answer 0 (score: 0)

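It looks like you want to extract data from a résumé. This is a complex problem, and a full answer would be too long to give here, but here are some hints that may help.

First, convert the PDF to JSON or XML rather than plain text. These formats carry more information, such as the position of words on the page, paragraphs or word sequences, fonts, and so on. Try using this information to extract the data you need. Fonts may help you identify section headings, and the position of the text may help you identify paragraphs.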
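As a rough illustration of the font hint, here is a minimal sketch. It assumes the converter output has already been reduced to a list of records with a `text` and a font `size` field; those field names, the sample sizes, and the `group_by_heading` helper are all hypothetical and not the output format of any particular tool. Any line set in a font larger than the typical body size is treated as a heading.

```python
from statistics import median


def group_by_heading(spans):
    """Group text under headings, treating larger-than-body font as a heading.

    `spans` is a list of dicts with hypothetical keys "text" and "size".
    """
    body_size = median(span["size"] for span in spans)
    sections = {}
    current = None
    for span in spans:
        text = span["text"].strip()
        if not text:
            continue  # blank lines carry no information here
        if span["size"] > body_size:
            current = text  # a new heading starts a new section
            sections[current] = []
        elif current is not None:
            sections[current].append(text)
    return sections


# Hypothetical layout records; the sizes are invented for illustration.
spans = [
    {"text": "SUMMARY", "size": 14.0},
    {"text": "To seek an opportunity to apply my technology expertise", "size": 10.0},
    {"text": "", "size": 10.0},
    {"text": "EXPERIENCE", "size": 14.0},
    {"text": "", "size": 10.0},
    {"text": "Machine Learning Engineering Intern , Forsk Technologies , Jaipur", "size": 10.0},
    {"text": "Software Engineering Intern , Proxbotics Creations Technologies , Jaipur", "size": 10.0},
    {"text": "PROJECTS", "size": 14.0},
    {"text": "Video Analysis for surveillance", "size": 10.0},
]

sections = group_by_heading(spans)
print(sections["EXPERIENCE"])
```

Using the median as the body size only works when body lines outnumber headings, which is usually true for a résumé but is still an assumption.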

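Answer 1 (score: 0)

Judging from your code, it fails when there are blank lines between EXPERIENCE and the rest of the content, because "if not line.strip(): break" exits the loop at the first blank line. You need a specific identifier to break on instead.

Something like the following might work. I tried it on a personal résumé to extract the experience summary, using "Technical Expertise" as the end marker.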

from docx import Document  # provided by the python-docx package

document = Document(r'cv.docx')
exp_summary_flag = False
for p in document.paragraphs:
    if p.text == 'Experience Summary':
        # Start heading found: print everything after it
        exp_summary_flag = True
    elif p.text == 'Technical Expertise':
        # End heading found: stop reading
        break
    elif exp_summary_flag:
        print(p.text)

Reference: Reading .docx files in Python to find strikethrough, bullets and other formats
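The same end-marker idea can be applied directly to the plain-text file from the question. The sketch below is one possible approach, not a tested solution for every résumé: it assumes each section heading sits alone on its own line, and the heading set is simply copied from the sample file. The `extract_section` helper and the abbreviated `sample` text are made up for illustration.

```python
# Section headings as they appear in the sample file (an assumption about
# the input; other résumés will use different headings).
SECTION_HEADINGS = {
    "SUMMARY", "EXPERIENCE", "PROJECTS",
    "TECHNICAL SKILLS", "COURSES", "EDUCATION",
}


def extract_section(lines, start="EXPERIENCE", headings=SECTION_HEADINGS):
    """Return the non-blank lines between `start` and the next heading."""
    end_markers = set(headings) - {start}
    collecting = False
    collected = []
    for line in lines:
        stripped = line.strip()
        if not collecting:
            if stripped == start:
                collecting = True
            continue
        if stripped in end_markers:
            break  # stop at the next heading, not at a blank line
        if stripped:
            collected.append(stripped)  # blank lines are skipped
    return "\n".join(collected)


# Abbreviated excerpt of the converted text, blank lines included.
sample = """SUMMARY

To seek an opportunity to apply my technology expertise.

EXPERIENCE



  Machine Learning Engineering Intern , Forsk Technologies , Jaipur

Learned the foundational concepts of data science and machine learning.

  Trainee at TecheduSoft , Kota  (May,2015)

PROJECTS

  Video Analysis for surveillance
"""

experience = extract_section(sample.splitlines())
print(experience)
```

For the real file, an open file object can be passed in place of `sample.splitlines()`, since the function only iterates over lines.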

For a more general solution, it is better to convert the document to XML and read a specific tag, so that you do not need any end marker.

Reference: Extracting specific xml tag value using python https://www.tutorialspoint.com/How-to-get-specific-nodes-in-xml-file-in-Python
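A minimal sketch of reading one tag with the standard library's `xml.etree.ElementTree` follows. The XML layout here (`resume`, `section`, `title`, `p`) is entirely hypothetical; real converter output is usually lower level (pages, text boxes, fonts), so mapping it to sections like this would be an extra step.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written XML; not the output of any specific converter.
xml_text = """<resume>
  <section title="SUMMARY">
    <p>To seek an opportunity to apply my technology expertise.</p>
  </section>
  <section title="EXPERIENCE">
    <p>Machine Learning Engineering Intern , Forsk Technologies , Jaipur</p>
    <p>Trainee at TecheduSoft , Kota</p>
  </section>
  <section title="PROJECTS">
    <p>Video Analysis for surveillance</p>
  </section>
</resume>"""

root = ET.fromstring(xml_text)

# ElementTree supports a limited XPath subset, including attribute predicates.
section = root.find("./section[@title='EXPERIENCE']")

experience = []
if section is not None:
    experience = [(p.text or "").strip() for p in section.findall("p")]

print(experience)
```

Because the section is selected by its own tag and attribute, no end marker is needed: the closing tag delimits the content.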