Question

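The newlines in the input are inconsistent, so I can't use a newline as a delimiter. The incoming text will be in the following format:

IDNumber FirstName LastName Score Letter Location


IDNumber: 9 digits

Score: 0-100

Letter: A or B

Location: anything from an abbreviated state name to a fully spelled-out city and state. This is optional.

Example: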

123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John

Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR

The elements would be:

123456789 John Doe 90 A New York City
987654321 Jane Doe 70 B CAL
432167895 John Cena 60 B FL
473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR

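I need to access each element individually for each person. So for the John Cena object, I need to be able to access the ID: 432167895, first name: John, last name: Cena, B or A: B. I don't really need the location, but it will be part of the input.

Edit: It's worth mentioning that I'm not allowed to import any modules, such as regular expressions.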

Answer 1

You can use a regular expression that requires each record to start with 9 digits, groups words together where necessary, and skips the location:

import re
res = re.findall(r"(\d{9})\s+(\S*)\s+(\S*(?:\s+\D\S*)*)\s+(\d+)\s+(\S*)", data)

The result is:

[('123456789', 'John', 'Doe', '90', 'A'), 
 ('987654321', 'Jane', 'Doe', '70', 'B'), 
 ('432167895', 'John', 'Cena', '60', 'B'), 
 ('473829105', 'Donald', 'Trump', '70', 'E'), 
 ('098743215', 'Bernie', 'Sanders', '92', 'A')]
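
Since the question asks for access to each field by name, the tuples above can be zipped with a list of field names. This is only a minimal sketch: the field names and the `people` variable are assumptions made for this example, and `data` is taken to be the sample input from the question.

```python
import re

# Sample input copied from the question
data = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John

Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR"""

pattern = r"(\d{9})\s+(\S*)\s+(\S*(?:\s+\D\S*)*)\s+(\d+)\s+(\S*)"

# Hypothetical field names, chosen for this example only
fields = ("id", "first_name", "last_name", "score", "letter")

# Pair each captured group with its field name
people = [dict(zip(fields, match)) for match in re.findall(pattern, data)]

cena = people[2]
print(cena["id"], cena["first_name"], cena["last_name"], cena["letter"])
# 432167895 John Cena B
```

Every value stays a string here; convert the score with `int()` if you need to compare it numerically.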

Answer 2

Since splitting on whitespace doesn't help with identifying the location, I would go straight for a regular expression:

import re

input_string = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John

Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR"""

search_string=re.compile(r"([0-9]{9})\W+([a-zA-Z ]+)\W+([a-zA-Z ]+)\W+([0-9]{1,3})\W+([AB])\W+([a-zA-Z ]+)\W+")
person_list = re.findall(search_string, input_string)

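This yields the following. Only three records match: the fourth has the letter E, which [AB] rejects, and the last has nothing after its location to satisfy the trailing \W+: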

[('123456789', 'John', 'Doe', '90', 'A', 'New York City'),
 ('987654321', 'Jane', 'Doe', '70', 'B', 'CAL'),
 ('432167895', 'John', 'Cena', '60', 'B', 'FL')]

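Explanation of the groups in the regular expression:

ID: 9 digits (followed by at least one space or other non-word character)
First and last name: 2 separate groups of letters, divided by at least one space (followed by at least one space)
Score: one, two, or three digits (followed by at least one space)
Letter: A or B (followed by at least one space)
Location: one group of letters and spaces (followed by at least one space)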
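
The groups described above could also be given names, with the location made optional so that records without one still match. The following is only a sketch under a few assumptions: the group names are invented for this example, the letter is loosened to any capital letter (the sample data contains an E), and names and locations are assumed to consist of plain ASCII letters.

```python
import re

input_string = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John

Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR"""

# Named groups; the trailing location group is optional
record_pattern = re.compile(
    r"(?P<id>\d{9})\s+"
    r"(?P<first>[A-Za-z]+)\s+"
    r"(?P<last>[A-Za-z]+)\s+"
    r"(?P<score>\d{1,3})\s+"
    r"(?P<letter>[A-Z])"
    r"(?:\s+(?P<location>[A-Za-z]+(?:\s+[A-Za-z]+)*))?"
)

# groupdict() returns one dictionary per record, keyed by group name
people = [m.groupdict() for m in record_pattern.finditer(input_string)]

print(people[0]["location"])  # New York City
print(people[3]["location"])  # None
```

A missing location comes back as `None` rather than an empty string, so check for that before using it.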

Answer 3

Since you know the ID number will be at the start of each "record" and is 9 digits long, try splitting on the 9-digit ID:

# Assuming your file is read in as a string s:
import re
# Split on any whitespace (spaces or newlines) that precedes a 9-digit ID
records = re.split(r'\s+(?=[0-9]{9}\b)', s)

# record locator will end up holding your records as: {'<full name>' -> {'ID'-><ID value>, 'FirstName'-><FirstName value>, 'LastName'-><LastName value>, 'Letter'-><LetterValue>}, 'full name 2'->{...} ...}
record_locator = {}

field_names = ['ID', 'FirstName', 'LastName', 'Letter']

# Get the individual records and store their values:
for record in records:

    # split() with no argument also copes with newlines inside a record
    values = record.split()[:5]

    # Discard the int after the name eg. 90 in the first record
    del values[3]

    # Create a new entry for the full name. This will overwrite entries with the same name so you might want to use a unique id instead
    record_locator[values[1] + ' ' + values[2]] = dict(zip(field_names, values))

Then, to access the information:

print(record_locator['John Doe']['ID']) # 123456789
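
As the comment in the code above notes, keying on the full name overwrites people who share a name. One possible variation keys the dictionary on the ID instead. This is a sketch only: `records_by_id` is a name made up for this example, and `s` is assumed to hold the sample input from the question.

```python
import re

s = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John

Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR"""

records_by_id = {}

# Split on whitespace that is immediately followed by a 9-digit ID
for record in re.split(r"\s+(?=[0-9]{9}\b)", s):
    # values: ID, first name, last name, score, letter, then any location words
    values = record.split()
    records_by_id[values[0]] = {
        "FirstName": values[1],
        "LastName": values[2],
        "Letter": values[4],
    }

print(records_by_id["432167895"]["LastName"])  # Cena
```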

Answer 4

I think trying to split on the 9 digits is probably the best option.

import re

with open('data.txt') as f:
    data = f.read()
    results = re.split(r'(\d{9}[\s\S]*?(?=[0-9]{9}))', data)
    results = list(filter(None, results))
    print(results)

This gives me these results:

['123456789 John Doe 90 A New York City ', '987654321\nJane Doe 70 B CAL ', '432167895 John\n\nCena 60 B FL ', '473829105 Donald Trump 70 E\n', '098743215 Bernie Sanders 92 A AR']
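
Each chunk still contains stray newlines and an unsplit location, so one more step is needed to reach the individual fields. A minimal sketch of that step follows, starting from the list printed above. The helper `parse_chunk` and its field names are hypothetical, and it assumes every chunk has at least the five mandatory fields.

```python
# The chunks exactly as printed above
chunks = ['123456789 John Doe 90 A New York City ',
          '987654321\nJane Doe 70 B CAL ',
          '432167895 John\n\nCena 60 B FL ',
          '473829105 Donald Trump 70 E\n',
          '098743215 Bernie Sanders 92 A AR']


def parse_chunk(chunk):
    """Split one record chunk into named fields (hypothetical helper)."""
    parts = chunk.split()  # splits on spaces and newlines alike
    return {
        "id": parts[0],
        "first_name": parts[1],
        "last_name": parts[2],
        "score": int(parts[3]),
        "letter": parts[4],
        "location": " ".join(parts[5:]),  # empty string when absent
    }


people = [parse_chunk(chunk) for chunk in chunks]

print(people[2]["id"], people[2]["last_name"])  # 432167895 Cena
```

The ID is kept as a string on purpose, because converting it to an integer would drop the leading zero of 098743215.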

Answer 5

There may be a more elegant way, but here is one idea based on the sample input string below.

text = "123456789 John Doe 90 A New York City 987654321 Jane Doe 70 B CAL 473829105 Donald Trump 70 E 098743215 Bernie Sanders 92 A AR"

# Split on any run of whitespace (spaces and newlines alike)
tokens = text.split()

# Store the records in a dictionary; this could then be dumped to a JSON file
data = {'output': []}
end = len(tokens)

i = 0

while i < end:
    # The first five tokens of a record are always present, in this order
    tmp = {}
    tmp['id'] = tokens[i]
    tmp['fname'] = tokens[i + 1]
    tmp['lname'] = tokens[i + 2]
    tmp['score'] = tokens[i + 3]
    tmp['letter'] = tokens[i + 4]
    i = i + 5

    # The optional location is every token up to the next all-digit token (the next ID)
    location = []
    while i < end and not tokens[i].isdigit():
        location.append(tokens[i])
        i = i + 1

    tmp['location'] = " ".join(location)
    data['output'].append(tmp)

print(data)

Python - How to split text input into individual elements

5 answers: