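Extracting certain items from a text file for tokenization

Time: 2018-11-11 13:10:43

Tags: python regex tokenize

Below is the structure of the text file "info.txt". From this file I need to extract the ID and the description (any method that extracts them correctly will do). The file contains about 500 ID/description instances. Each ID identifies one title and one description, as shown in the text file.

The first part I am not sure about is whether to store the ID and description information in two lists. If I use lists, I can then use the "description" list to tokenize each description (bearing in mind that this list will hold 500 descriptions).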

ID: #22579462
Title: Quality Engineer
Description: Our client are a leading supplier of precision machined, high integrity components, integrated kits of parts and complete mechanical assemblies. Due to an large increase in workload they are recruiting a Quality Engineer Reporting to the Quality Manager, the successful individual will be responsible for providing documentation to fulfil our customers quality assurance requirements on specific contracts, whilst maintaining a system of storage and retrieval for documentation. The role will also support the internal audit schedule, performing audits as required. Responsibilities include: Documentation Checking all vendor supplied documentation to ensure it complies with the requirements or Express s customer specifications. Produce accurate, legible documentation packs, in accordance with customer requirements. Quality Systems Maintain system of storage and retrieval of all associated QA documentation in accordance with ISO9001:**** Certification Ensure certificates of conformance are checked, in accordance with the C of C matrix and any applicable concessions are referenced Material Certification Verify and approve certification on receipt for conformance to customer requirements and resolve discrepancies with suppliers Non conformance Raise and submit supplier reject reports and concessions. Store all responses received in relevant databases. Internal Auditing Carry out internal audits as and when required in line with the internal audit schedule. Identify and report all nonconformances within Quality Management System, and assist in corrective actions to close them out Supplier Rejects Ensure corrective action is received for supplier rejects submitted to key suppliers The Individual: Has experience within the quality department of a related company in a similar role Ideally from a mechanical or manufacturing engineering background. 
Ideally be familiar with the range of processes involved in the markets of Oil Must have good communication and organisational skills Has the ability to work as part of a team or as an individual. Has the ability to be customer facing and discuss technical / quality issues with vendors and customers

ID: #22933091
Title: Chef de Partie  Award Winning Dining  Live In  Share of Tips
Description: A popular hotel located in Norfolk which is a very busy operation has a position available for a Chef de Partie Role: A Chef de Partie capable of coping well under pressure is required to join the kitchen team at a hotel that has an excellent reputation for offering high quality dining to its guests and has gained accreditations in the main restaurant.The busy Brasserie style restaurant regularly serves **** covers for lunch and dinner so this Chef de Partie role will require you to be organised on your section ensuring all prep is complete to the standards expected by the Head Chef before each service. Requirements: All Chef de Parties applying for this role must have a strong background with highlights previous AA Rosette experience in a high volume operation.A candidate who is self motivated and capable of working well in a busy team of chefs would be ideal for this role. Benefits Include: Uniform Provided Meals on Duty Accommodation Available Share of Tips – IRO **** Per Month Excellent Opportunities To Progress If you are interested in this position or would like information on the other positions we are recruiting for or any temporary assignments please send your CV by clicking on the 'apply now' button below and our consultant Sean Bosley will do his utmost to assist you in your search for employment. In line with the requirements of the Asylum Immigration Act **** all applicants must be eligible to live and work in the UK. Documented evidence of the eligibility will be required from candidates as part of the recruitment process. This job was originally posted as  

ID: #23528672
Title: Senior Fatigue and Damage Tolerance Engineer
Description: Senior Fatigue Static stress (metallic or composite) Finite element analysis. Senior Fatigue Aerospace  ****K****K (dep on exp)  benefits package Bristol, Avon

ID: #23529949
Title: C I Design Engineer
Description: We are currently recruiting on behalf of our client who have an exciting opportunity available for a CE Produce CE Control Panel designs  Genera Arrangements, Detail drawings, Schematics Diagrams, Interlock Diagrams for typically PLC Specification of hardware and production of parts list. Manufacturing specification. Ensure Company policies and procedures are being applied across the projects. Manage the interface between CE Communicate at all levels with both internal and external customers to meet their expectations while meeting the project budget and programme constraints. Support the Lead Engineer in the delivery of scope to budget and programme. Provide technical expertise to tenders as and when required. Provide input to the development of the CE&l function and resource

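I want to achieve two things here. The first is to create a unigram vocabulary of all words, in the format word_string:integer_index. The second is to create a text file in which each line corresponds to one description. The line starts with the ID (keeping the #). The rest of each line is the sparse representation of the corresponding description, in the form word_index:word_freq, separated by commas.

That is why I think storing the ID and Description information in lists would be ideal. That way, index 0 of the ID list would be #22579462 and index 0 of the description list would be the corresponding description text.

Thanks in advance.

2 answers:

Answer 0 (score: 1)

You can read the file in one go and then parse it with regex findall. The "rslt" list contains (ID, description) tuples: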

import re

with open("info.txt") as ff:
    rslt = re.findall(r"(?sm)^\s*ID:\s*#(\d+)\s*$.*?^Description:(.*?)(?:\s*(?=^ID: #)|\Z)", ff.read())

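(?sm) -> m: multiline mode; s: the dot (.) also matches newlines;

^\s*ID:\s*#(\d+) -> matches the start of a line, optional whitespace and the "ID: #" pattern, then the digits, which are captured as a group (see the parentheses; the "#" itself is outside the group);

\s*$ -> after the digits, the line may contain only whitespace;

.*?^Description: -> skips the title and matches the "Description:" pattern;

(.*?)(?:\s*(?=^ID: #)|\Z) -> (.*?) captures the description text (as a group) up to the next block that starts with "ID: #", or up to the end of the string (\Z).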
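To see what the pattern returns, here is a minimal sketch that runs it on a made-up two-record sample instead of the real info.txt, and then splits the tuples into the two parallel lists the question asks about. The sample text and the names `sample`, `ids` and `descriptions` are assumptions for illustration only; the whitespace clean-up is an extra step, not part of the pattern.

```python
import re

# The pattern from this answer, unchanged.
PATTERN = r"(?sm)^\s*ID:\s*#(\d+)\s*$.*?^Description:(.*?)(?:\s*(?=^ID: #)|\Z)"

# Made-up sample in the same layout as info.txt (not real data).
sample = (
    "ID: #111\n"
    "Title: First job\n"
    "Description: Build widgets.\n"
    "Test widgets too.\n"
    "\n"
    "ID: #222\n"
    "Title: Second job\n"
    "Description: Sell widgets.\n"
)

rslt = re.findall(PATTERN, sample)

# Split the (ID, description) tuples into two parallel lists.
# The "#" is not part of the captured group, so it is added back here.
ids = ["#" + id_digits for id_digits, _ in rslt]
# Collapse newlines and repeated spaces inside each description.
descriptions = [" ".join(text.split()) for _, text in rslt]

print(ids)
print(descriptions)
```

Because the description group runs up to the next "ID: #" line, a description that wraps onto a second line (as in the first record of the question) stays in one piece.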

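Answer 1 (score: 0)

As mentioned in the comments, your data seems to be leading you towards a dictionary. First, create a function that ignores blank lines. The non-blank-lines function can be found here, together with a good explanation. Then call that function to import the txt line by line and store it in a dictionary. Finally, a DataFrame is generated whose index is your IDs.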

import pandas as pd
file = r"C:\***\***\info.txt".replace('\\', '/')
d = {}

def nonblank_lines(f):  # ignore blank lines
    for l in f:
        line = l.rstrip()
        if line:
            yield line

# import the txt line by line into a dictionary
with open(file) as my_file:
    for line in nonblank_lines(my_file):
        # split on the first ': ' only, so a ': ' inside a description is kept
        key, sep, value = line.partition(': ')
        if not sep:  # wrapped line without "key: value"; skipped here
            continue
        d.setdefault(key, []).append(value)  # create the key if needed, then populate it
# drop unwanted keys (build a new dict; deleting while iterating raises RuntimeError)
my_keys = ['Description', 'ID', 'Title']
d = {key: value for key, value in d.items() if key in my_keys}
# create a df with ID as index and the rest of the data in columns
df = pd.DataFrame(data={your_key: d[your_key] for your_key in ['Description', 'Title']}, index=d.get('ID'), columns=['Description', 'Title'])
df.to_csv(r'path\filename.txt', sep=',', index=True, header=True)  # save your df
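
Neither answer covers the second half of the question, so here is a rough sketch of how the vocabulary and the sparse lines might be built once the IDs and descriptions are in two parallel lists. The input lists are made-up examples, the `tokenize` rule (lowercase runs of letters) is an assumption, and so is the comma between the ID and the first word_index:word_freq pair, since the question does not say which separator follows the ID.

```python
import re
from collections import Counter

# Hypothetical inputs: two parallel lists, as described in the question.
ids = ["#111", "#222"]
descriptions = ["Build widgets. Test widgets too.", "Sell widgets."]

def tokenize(text):
    # Assumed tokenizer: lowercase runs of ASCII letters; swap in your own rule.
    return re.findall(r"[a-z]+", text.lower())

token_lists = [tokenize(text) for text in descriptions]

# Vocabulary: every distinct word, sorted, mapped to an integer index.
all_words = sorted({word for tokens in token_lists for word in tokens})
vocab = {word: index for index, word in enumerate(all_words)}

# One "word_string:integer_index" line per vocabulary entry.
vocab_lines = [f"{word}:{index}" for word, index in vocab.items()]

# One line per description: the ID, then "word_index:word_freq" pairs.
sparse_lines = []
for doc_id, tokens in zip(ids, token_lists):
    counts = Counter(vocab[word] for word in tokens)
    pairs = ",".join(f"{index}:{freq}" for index, freq in sorted(counts.items()))
    sparse_lines.append(f"{doc_id},{pairs}")

print(vocab_lines)
print(sparse_lines)
```

Writing the two outputs to disk is then a matter of joining `vocab_lines` and `sparse_lines` with newlines and writing each to its own file.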