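Extracting data from a text field using pandas

Time: 2015-12-25 22:20:43

Tags: python pandas

I am trying to create a pandas DataFrame by extracting information from the notes shown below. I would like to get a few columns out of the Notation text.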

Note:

phonenumber    | status  | result     | notation
(999) 555-9898 | Partial | Generic VM | VOICE MAIL LEFT

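I would then build a second DataFrame, in which I would try to pull out the word or words inside the single quotes for the Procedure events.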

Event   Notation
Call    Call to (Home) (999) 555-9898 ended. Partial – Generic VM --> - VOICE MAIL LEFT 
Call    Call to (Work) (999) 555-9898 ended. Partial - Voice Mail, No Message left -->
Call    Call to (Work) (999) 555-9898 ended. Positive –  Spoke to Receptionist --> 
Call    Call to (Mobile) (999) 555-9898 ended. Partial – Generic VM --> - Unable to reach customer, voice message left and text sent
Procedure   Procedure 'Verify' is checked
Procedure   Procedure 'Duplicate Check' is checked
Procedure   Procedure 'Check Something' is checked
Procedure   Procedure 'Scenario' is checked
Procedure   Procedure 'Attempt' is checked

1 answer:

Answer 0: (score: 2)

To give you an idea, here is something to start from (note, though, that this is my first time using regular expressions):

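import re

# Read the notes file, skipping the header row ("Event   Notation").
data = []
with open('notes.txt', 'r') as f:
    next(f)
    for line in f:
        data.append(line.strip('\n'))
data  # in an interactive session this displays the list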

['Call Call to (Home) (999) 555-9898 ended. Partial – Generic VM --> - VOICE MAIL LEFT ', 'Call Call to (Work) (999) 555-9898 ended. Partial - Voice Mail, No Message left -->', 'Call Call to (Work) (999) 555-9898 ended. Positive – Spoke to Receptionist --> ', 'Call Call to (Mobile) (999) 555-9898 ended. Partial – Generic VM --> - Unable to reach customer, voice message left and text sent', "Procedure Procedure 'Verify' is checked", "Procedure Procedure 'Duplicate Check' is checked", "Procedure Procedure 'Check Something' is checked", "Procedure Procedure 'Scenario' is checked", "Procedure Procedure 'Attempt' is checked"]
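At this point each row is still a single string. As a minimal sketch, assuming the Event value is always one word followed by whitespace (as in the sample), the two columns can be separated with a single split. `split_row` is a hypothetical helper name, and the sample lines are inlined here instead of being read from notes.txt.

```python
def split_row(line):
    """Split 'Event   Notation text' into an (event, notation) pair."""
    # With no separator and maxsplit=1, split() cuts at the first run of
    # whitespace, so the Notation text keeps its internal spaces.
    event, notation = line.strip().split(None, 1)
    return event, notation


# Two lines copied from the question's sample data.
sample = [
    "Call    Call to (Work) (999) 555-9898 ended. Partial - Voice Mail, No Message left -->",
    "Procedure   Procedure 'Duplicate Check' is checked",
]

rows = [split_row(line) for line in sample]
for event, notation in rows:
    print(event, "|", notation)
```

A list of pairs like `rows` can be handed directly to `pd.DataFrame(rows, columns=["Event", "Notation"])`.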

# Compile the patterns once, outside the loop; raw strings keep \d intact.
p_phone = re.compile(r'[(]\d{3}[)] \d{3}-\d{4}')
p_status = re.compile(r'Partial|Positive')

phone = []
status = []
for line in data:
    event = line.split(' ')[0]
    if event == 'Call':
        phone.append(p_phone.findall(line))
        status.append(p_status.findall(line))
    elif event == 'Procedure':
        pass  # Procedure rows are not parsed yet
print(phone)
print(status)

[['(999) 555-9898'], ['(999) 555-9898'], ['(999) 555-9898'], ['(999) 555-9898']]
[['Partial'], ['Partial'], ['Positive'], ['Partial']]
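The same idea can be carried over to pandas itself with `Series.str.extract`, which turns named regex groups into columns. The sketch below is not part of the answer above. It assumes every Call note follows the layout seen in the sample (phone number, "ended.", a status, a dash, a result, "-->", then an optional trailing note), and the pattern and variable names are illustrative only.

```python
import pandas as pd

# Sample rows copied from the question; in practice they would come from notes.txt.
rows = [
    ("Call", "Call to (Home) (999) 555-9898 ended. Partial – Generic VM --> - VOICE MAIL LEFT "),
    ("Call", "Call to (Work) (999) 555-9898 ended. Partial - Voice Mail, No Message left -->"),
    ("Call", "Call to (Work) (999) 555-9898 ended. Positive –  Spoke to Receptionist --> "),
    ("Procedure", "Procedure 'Verify' is checked"),
    ("Procedure", "Procedure 'Duplicate Check' is checked"),
]
df = pd.DataFrame(rows, columns=["Event", "Notation"])

# Assumed layout of a Call note:
#   ... (ddd) ddd-dddd ended. <status> <dash> <result> --> [- <notation>]
# The dash may be an en dash or a plain hyphen, as both occur in the sample.
call_pattern = (
    r"(?P<phonenumber>\(\d{3}\) \d{3}-\d{4}) ended\. "
    r"(?P<status>Partial|Positive)\s*[–-]\s*"
    r"(?P<result>.*?)\s*-->"
    r"(?:\s*-\s*(?P<notation>.*\S))?"
)
# One row per Call event, one column per named group; a missing trailing
# note is left as a missing value.
calls = df.loc[df["Event"] == "Call", "Notation"].str.extract(call_pattern)

# Second frame: the text between single quotes for Procedure events.
procedure_pattern = r"'(?P<procedure>[^']+)'"
procedures = df.loc[df["Event"] == "Procedure", "Notation"].str.extract(procedure_pattern)

print(calls)
print(procedures)
```

Rows that do not match the assumed layout come back as missing values in every column, so it is worth checking `calls.isna().all(axis=1)` on real data before relying on the result.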