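The input is inconsistent about where newlines fall, so I cannot use the newline as a delimiter. The incoming text will be in the following format:
IDNumber FirstName LastName Score Letter Location
- IDNumber: 9 digits
- Score: 0-100
- Letter: A or B
- Location: anything from an abbreviated state name to a fully spelled-out city and state. This field is optional.
Example: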
123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John
Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR
The elements would be:
123456789 John Doe 90 A New York City
987654321 Jane Doe 70 B CAL
432167895 John Cena 60 B FL
473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR
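I need to access each element individually for each person. So for the John Cena object, I need to be able to access the ID: 432167895, the first name: John, the last name: Cena, and the B or A: B. I don't really need the location, but it will be part of the input.
Edit: It is worth mentioning that I am not allowed to import any modules, such as regular expressions.
Answer 0 (score: 0)
You can use a regular expression that requires each record to start with 9 digits, groups words together where necessary, and skips the location: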
import re
res = re.findall(r"(\d{9})\s+(\S*)\s+(\S*(?:\s+\D\S*)*)\s+(\d+)\s+(\S*)", data)  # data is the whole input as one string
The result is:
[('123456789', 'John', 'Doe', '90', 'A'),
('987654321', 'Jane', 'Doe', '70', 'B'),
('432167895', 'John', 'Cena', '60', 'B'),
('473829105', 'Donald', 'Trump', '70', 'E'),
('098743215', 'Bernie', 'Sanders', '92', 'A')]
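To access each field by name rather than by tuple position, the tuples can be zipped with field names. The following is a minimal sketch, assuming `data` holds the sample input from the question; the names `FIELDS` and `people` are illustrative and not part of the original answer:

```python
import re

# Sample input from the question (newlines fall in inconsistent places)
data = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John
Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR"""

# Same pattern as in the answer above
pattern = r"(\d{9})\s+(\S*)\s+(\S*(?:\s+\D\S*)*)\s+(\d+)\s+(\S*)"

# Hypothetical field names, chosen for illustration
FIELDS = ("id", "first_name", "last_name", "score", "letter")

# One dictionary per person, keyed by field name
people = [dict(zip(FIELDS, match)) for match in re.findall(pattern, data)]

john_cena = people[2]
print(john_cena["id"], john_cena["first_name"], john_cena["last_name"], john_cena["letter"])
```

Each entry in `people` is then an ordinary dictionary, so `people[2]["letter"]` gives the letter for John Cena.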
Answer 1 (score: 0)
Since splitting on whitespace does not help with identifying the location, I would go straight to a regular expression:
import re
input_string = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John
Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR"""
search_string=re.compile(r"([0-9]{9})\W+([a-zA-Z ]+)\W+([a-zA-Z ]+)\W+([0-9]{1,3})\W+([AB])\W+([a-zA-Z ]+)\W+")
person_list = re.findall(search_string, input_string)
This produces:
[('123456789', 'John', 'Doe', '90', 'A', 'New York City'),
('987654321', 'Jane', 'Doe', '70', 'B', 'CAL'),
('432167895', 'John', 'Cena', '60', 'B', 'FL')]
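Explanation of the groups in the regular expression:
- ([0-9]{9}): the 9-digit ID number
- ([a-zA-Z ]+): the first name
- ([a-zA-Z ]+): the last name
- ([0-9]{1,3}): the score, 1 to 3 digits
- ([AB]): the letter, A or B
- ([a-zA-Z ]+): the location
- \W+: one or more non-word characters (spaces or newlines) separating the fields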
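Note that the result above contains only three of the five records. Donald Trump's letter is E, which `[AB]` rejects, and Bernie Sanders's record ends the string, so the trailing `\W+` has nothing to match. The following is a hedged variant, not the original answerer's pattern, which assumes any single capital letter is acceptable and treats the location as optional:

```python
import re

input_string = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John
Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR"""

# Assumed variant: any single capital letter, then zero or more location words
pattern = re.compile(
    r"(\d{9})\s+([A-Za-z]+)\s+([A-Za-z]+)\s+(\d{1,3})\s+([A-Z])\b((?:\s+[A-Za-z]+)*)"
)

person_list = []
for id_number, first, last, score, letter, location in pattern.findall(input_string):
    # Collapse spaces/newlines inside the location; empty string if absent
    clean_location = " ".join(location.split())
    person_list.append((id_number, first, last, score, letter, clean_location))

for person in person_list:
    print(person)
```

With the sample input this sketch yields all five records, with an empty location for the record that has none.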
Answer 2 (score: 0)
Since you know the ID number will be at the start of each "record" and is 9 digits long, try splitting on the 9-digit ID:
# Assuming your file is read in as a string s:
import re
# Split on any whitespace (spaces or newlines) that comes right before a 9-digit ID
records = re.split(r'\s+(?=[0-9]{9}\b)', s.strip())
# record_locator will end up holding your records as: {'<full name>': {'ID': <ID value>, 'FirstName': <FirstName value>, 'LastName': <LastName value>, 'Letter': <Letter value>}, '<full name 2>': {...}, ...}
record_locator = {}
field_names = ['ID', 'FirstName', 'LastName', 'Letter']
# Get the individual records and store their values:
for record in records:
    # split() with no argument also copes with newlines inside a record
    values = record.split()[:5]
    # Discard the score after the name, e.g. 90 in the first record
    del values[3]
    # Create a new entry keyed by the full name. This will overwrite entries with the same name, so you might want to use a unique ID instead
    record_locator[values[1] + ' ' + values[2]] = dict(zip(field_names, values))
Then, to access the information:
print(record_locator['John Doe']['ID'])  # 123456789 for the sample input
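Because two people can share a full name, keying the dictionary by ID may be safer. The following is a minimal sketch of that idea, assuming the sample input from the question; the name `records_by_id` is illustrative, and unlike the answer above it keeps the score:

```python
import re

s = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John
Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR"""

field_names = ['ID', 'FirstName', 'LastName', 'Score', 'Letter']

# Hypothetical lookup table: 9-digit ID -> dictionary of fields
records_by_id = {}
for record in re.split(r'\s+(?=[0-9]{9}\b)', s.strip()):
    # The first five whitespace-separated tokens are the fixed fields;
    # anything after them is the optional location, which is ignored here
    values = record.split()[:5]
    records_by_id[values[0]] = dict(zip(field_names, values))

print(records_by_id['432167895']['LastName'])  # Cena
```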
Answer 3 (score: 0)
I think splitting on the 9-digit numbers is probably the best option.
import re
with open('data.txt') as f:
    data = f.read()
results = re.split(r'(\d{9}[\s\S]*?(?=[0-9]{9}))', data)
results = list(filter(None, results))
print(results)
This gives me these results:
['123456789 John Doe 90 A New York City ', '987654321\nJane Doe 70 B CAL ', '432167895 John\n\nCena 60 B FL ', '473829105 Donald Trump 70 E\n', '098743215 Bernie Sanders 92 A AR']
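Each chunk is still a raw string with stray spaces and newlines, so the individual fields need one more step. The following is a minimal sketch that starts from the list shown above; the dictionary keys are illustrative, and it assumes every chunk has at least the five fixed fields:

```python
# The chunks produced by the split above
results = ['123456789 John Doe 90 A New York City ', '987654321\nJane Doe 70 B CAL ',
           '432167895 John\n\nCena 60 B FL ', '473829105 Donald Trump 70 E\n',
           '098743215 Bernie Sanders 92 A AR']

people = []
for chunk in results:
    # split() with no argument splits on spaces and newlines alike
    tokens = chunk.split()
    id_number, first, last, score, letter = tokens[:5]
    people.append({
        "id": id_number,
        "first_name": first,
        "last_name": last,
        "score": int(score),
        "letter": letter,
        # Whatever follows the letter is the optional location
        "location": " ".join(tokens[5:]),
    })

print(people[2])
```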
Answer 4 (score: 0)
There is probably a more elegant way, but here is one idea based on the sample input string below.
input_string = "123456789 John Doe 90 A New York City 987654321 Jane Doe 70 B CAL 473829105 Donald Trump 70 E 098743215 Bernie Sanders 92 A AR"
# split on whitespace (this also handles newlines)
output = input_string.split()
# build a dictionary for the output; this could then be dumped to a JSON file
data = {'output': []}
end = len(output)
i = 0
while i < end:
    tmp = {}
    tmp['id'] = output[i]
    i = i + 1
    tmp['fname'] = output[i]
    i = i + 1
    tmp['lname'] = output[i]
    i = i + 1
    tmp['score'] = output[i]
    i = i + 1
    tmp['letter'] = output[i]
    i = i + 1
    location = ""
    # Catch the index-out-of-bounds error raised after the last record
    try:
        # Every token up to the next all-digit token belongs to the location
        is_digit = output[i].isdigit()
        while not is_digit:
            location = location + " " + output[i]
            i = i + 1
            is_digit = output[i].isdigit()
    except IndexError:
        print('Completed Array')
    tmp['location'] = location
    data['output'].append(tmp)
print(data)
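The same token-walking idea can be wrapped in a function that needs no imports, which fits the constraint stated in the question. This is a sketch under assumptions rather than a definitive implementation: the function name `parse_people` and the dictionary keys are hypothetical, and it assumes well-formed input in which only ID numbers are 9-digit tokens.

```python
def parse_people(text):
    """Parse whitespace-separated records that each start with a 9-digit ID."""
    tokens = text.split()  # splits on spaces and newlines alike
    people = []
    i = 0
    while i < len(tokens):
        # Assumes the five fixed fields are always present, in this order
        person = {
            "id": tokens[i],
            "first_name": tokens[i + 1],
            "last_name": tokens[i + 2],
            "score": int(tokens[i + 3]),
            "letter": tokens[i + 4],
        }
        i += 5
        # Everything up to the next 9-digit token is the optional location
        location = []
        while i < len(tokens) and not (len(tokens[i]) == 9 and tokens[i].isdigit()):
            location.append(tokens[i])
            i += 1
        person["location"] = " ".join(location)
        people.append(person)
    return people


sample = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John
Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR"""

people = parse_people(sample)
print(people[2]["id"], people[2]["first_name"], people[2]["last_name"], people[2]["letter"])
```

Checking the token length as well as `isdigit()` means a numeric location token such as a ZIP code is not mistaken for the next ID, and the bounds check in the inner loop replaces the `try`/`except IndexError`.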