Python - How to split text input into separate elements

Asked: 2017-04-19 20:54:54

Tags: python parsing

The line breaks in the input are inconsistent, so I can't use a newline as a delimiter. The incoming text has the following format:

  

IDNumber FirstName LastName Score Letter Location

  • IDNumber: 9 digits
  • Score: 0-100
  • Letter: A or B
  • Location: anything from an abbreviated state name to a fully spelled-out city and state. This is optional.

Example:

123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John

Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR

The elements would be:

123456789 John Doe 90 A New York City
987654321 Jane Doe 70 B CAL
432167895 John Cena 60 B FL
473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR

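I need to access each element individually for each person. So for the John Cena object, I need to be able to access the ID: 432167895, first name: John, last name: Cena, and B or A: B. I don't really need the location, but it will be part of the input.

Edit: It's worth mentioning that I'm not allowed to import any modules, such as the regular expression module.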

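5 Answers:

Answer 0: (score: 0)

You can use a regular expression that requires each record to start with 9 digits, groups words together where necessary, and skips the location: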

import re
res = re.findall(r"(\d{9})\s+(\S*)\s+(\S*(?:\s+\D\S*)*)\s+(\d+)\s+(\S*)", data)

The result is:

[('123456789', 'John', 'Doe', '90', 'A'), 
 ('987654321', 'Jane', 'Doe', '70', 'B'), 
 ('432167895', 'John', 'Cena', '60', 'B'), 
 ('473829105', 'Donald', 'Trump', '70', 'E'), 
 ('098743215', 'Bernie', 'Sanders', '92', 'A')]
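
To reach individual fields per person, as the question asks, the tuples can be turned into dictionaries. This is only a minimal sketch: it assumes the sample text from the question is stored in `data`, and the field names and the `people` variable are illustrative choices, not part of the original answer.

```python
import re

# Sample input from the question (records are broken across lines arbitrarily)
data = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John

Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR"""

pattern = r"(\d{9})\s+(\S*)\s+(\S*(?:\s+\D\S*)*)\s+(\d+)\s+(\S*)"

# Hypothetical field names, chosen for illustration only
fields = ("id", "first_name", "last_name", "score", "letter")
people = [dict(zip(fields, match)) for match in re.findall(pattern, data)]

# Look up one person by ID and read individual fields
cena = next(p for p in people if p["id"] == "432167895")
print(cena["first_name"], cena["last_name"], cena["letter"])
```

A list of dictionaries keeps the records in input order; if lookups by ID are frequent, a dictionary keyed by ID may be a better fit.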

Answer 1: (score: 0)

Since splitting on whitespace doesn't help identify the location, I'd go straight for a regular expression:

import re

input_string = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John

Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR"""

search_string=re.compile(r"([0-9]{9})\W+([a-zA-Z ]+)\W+([a-zA-Z ]+)\W+([0-9]{1,3})\W+([AB])\W+([a-zA-Z ]+)\W+")
person_list = re.findall(search_string, input_string)

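This produces the output below. Note that only three of the five records match: the pattern accepts only A or B as the letter, so the record with the letter E is skipped, and it requires a location followed by at least one non-word character, so the last record, which ends the string, is skipped as well: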

[('123456789', 'John', 'Doe', '90', 'A', 'New York City'),
 ('987654321', 'Jane', 'Doe', '70', 'B', 'CAL'),
 ('432167895', 'John', 'Cena', '60', 'B', 'FL')]

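Explanation of the groups in the regular expression:

  • ID: 9 digits (followed by at least one space)
  • First and last name: 2 separate groups of characters separated by at least one space (followed by at least one space)
  • Score: one, two, or three digits (followed by at least one space)
  • Letter: A or B (followed by at least one space)
  • Location: a group of characters (followed by at least one space)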
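
The groups described above can also be written as named groups, which makes the pattern easier to read. The following is a sketch rather than the answer's original pattern: it assumes names consist of ASCII letters only, accepts any single uppercase letter (the sample contains an E), and treats the location as optional. The names `record_re` and `people` are illustrative.

```python
import re

input_string = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John

Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR"""

# re.VERBOSE lets the pattern span several lines with comments
record_re = re.compile(r"""
    (?P<id>\d{9})\s+                  # ID: exactly 9 digits
    (?P<first>[A-Za-z]+)\s+           # first name (assumed: letters only)
    (?P<last>[A-Za-z]+)\s+            # last name (assumed: letters only)
    (?P<score>\d{1,3})\s+             # score: 1 to 3 digits
    (?P<letter>[A-Z])\b               # letter (assumed: any uppercase letter)
    (?P<location>(?:\s+[A-Za-z]+)*)   # optional location: zero or more words
""", re.VERBOSE)

people = []
for m in record_re.finditer(input_string):
    person = m.groupdict()
    # Collapse the whitespace captured at the start of (and inside) the location
    person["location"] = " ".join(person["location"].split())
    people.append(person)

for person in people:
    print(person)
```

Because the location group can match nothing, records without a location and the record at the very end of the string are both picked up.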

Answer 2: (score: 0)

Since you know the ID number will be at the start of each "record" and is 9 digits long, try splitting on the 9-digit ID:

# Assuming your file is read in as a string s:
import re

# Split on any whitespace (spaces or newlines) that precedes a 9-digit ID
records = re.split(r'\s+(?=[0-9]{9}\b)', s.strip())

# record_locator will end up holding your records as:
# {'<full name>': {'ID': <ID value>, 'FirstName': <FirstName value>, 'LastName': <LastName value>, 'Letter': <Letter value>}, '<full name 2>': {...}, ...}
record_locator = {}

field_names = ['ID', 'FirstName', 'LastName', 'Letter']

# Get the individual records and store their values:
for record in records:

    # split() with no argument also copes with newlines inside a record
    values = record.split()[:5]

    # Discard the int after the name, e.g. 90 in the first record
    del values[3]

    # Create a new entry for the full name. This will overwrite entries with the same name, so you might want to use a unique ID instead
    record_locator[values[1] + ' ' + values[2]] = dict(zip(field_names, values))

Then, to access the information:

print(record_locator['John Doe']['ID'])  # 123456789

Answer 3: (score: 0)

I think splitting on the 9-digit number is probably the best option.

import re

with open('data.txt') as f:
    data = f.read()
    results = re.split(r'(\d{9}[\s\S]*?(?=[0-9]{9}))', data)
    results = list(filter(None, results))
    print(results)

This gives me these results:

['123456789 John Doe 90 A New York City ', '987654321\nJane Doe 70 B CAL ', '432167895 John\n\nCena 60 B FL ', '473829105 Donald Trump 70 E\n', '098743215 Bernie Sanders 92 A AR']
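
Each chunk still has to be broken into fields. One possible way, sketched here under the assumption that every chunk contains at least the five mandatory fields, is `str.split` with a split limit so that the remainder becomes the location. The dictionary keys and the `people` variable are illustrative.

```python
# Chunks as printed in the results above
results = [
    '123456789 John Doe 90 A New York City ',
    '987654321\nJane Doe 70 B CAL ',
    '432167895 John\n\nCena 60 B FL ',
    '473829105 Donald Trump 70 E\n',
    '098743215 Bernie Sanders 92 A AR',
]

people = []
for chunk in results:
    # At most 5 splits: the five mandatory fields, then any remainder as the location
    parts = chunk.split(None, 5)
    id_number, first_name, last_name, score, letter = parts[:5]
    location = parts[5].strip() if len(parts) > 5 else ''
    people.append({
        'id': id_number,
        'first_name': first_name,
        'last_name': last_name,
        'score': int(score),
        'letter': letter,
        'location': location,
    })

print(people[2]['first_name'], people[2]['last_name'], people[2]['letter'])
```

Splitting on arbitrary whitespace means the stray newlines inside a chunk need no special handling.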

Answer 4: (score: 0)

There may be a more elegant way, but here is an idea based on the sample input string below.

input_string = "123456789 John Doe 90 A New York City 987654321 Jane Doe 70 B CAL 473829105 Donald Trump 70 E 098743215 Bernie Sanders 92 A AR"

# Split on whitespace (input_string avoids shadowing the built-in input)
output = input_string.split()

# Store the output as a dictionary; this could then be dumped to a JSON file
data = {'output':[]}
end = len(output)

i=0

while i< end:
    tmp = {}
    tmp['id'] = output[i]
    i=i+1
    tmp['fname']=output[i]
    i=i+1
    tmp['lname']=output[i]
    i=i+1
    tmp['score']=output[i]
    i=i+1
    tmp['letter']=output[i]
    i=i+1
    location = ""
    # Catch the IndexError raised when the end of the token list is reached
    try:
        is_id = output[i].isdigit()
        while not is_id:
            location = location + " " + output[i]
            i=i+1
            is_id = output[i].isdigit()
    except IndexError:
        print('Completed Array')

    # Drop the leading space added while joining the location words
    tmp['location'] = location.strip()
    data['output'].append(tmp)

print(data)
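
Since the question rules out imports, the same token-walking idea can be wrapped in a small function that works directly on the multi-line sample. This is a sketch under the assumption that any 9-digit token starts a new record, so a location containing a 9-digit number would be misread. The name `parse_records` is hypothetical.

```python
def parse_records(text):
    """Group whitespace-separated tokens into records, one per 9-digit ID."""
    records = []
    for token in text.split():
        if len(token) == 9 and token.isdigit():
            records.append([token])        # a new record begins at each ID
        elif records:
            records[-1].append(token)      # tokens before the first ID are ignored

    people = []
    for tokens in records:
        if len(tokens) < 5:
            continue                       # incomplete record, skipped in this sketch
        people.append({
            'id': tokens[0],
            'fname': tokens[1],
            'lname': tokens[2],
            'score': int(tokens[3]),
            'letter': tokens[4],
            'location': ' '.join(tokens[5:]),  # empty string when no location is given
        })
    return people


sample = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John

Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR"""

people = parse_records(sample)
for person in people:
    print(person['id'], person['fname'], person['lname'], person['letter'])
```

Collecting the tokens per record first avoids manual index bookkeeping and the need to catch an IndexError at the end of the input.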