My original text looks like this:
<00> 0000001AB111172323235-8THis is the description.00000323CD41111944322Soft Dimcase andRating00000033322S112212234-3100 BN SN OPTION (LINUX)000022664422444433AVaaaaaccaaaaaaa
I am trying to convert it into the following:
0000001AB11117, 2323235-8, This is the description.
00000323CD4111, 1944322, Soft Dimcase andRating
00000033322S11, 2212234-3, 100 BN SN OPTION (LINUX)
00000226644224, 44433AV, aaaaaccaaaaaaa
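The rule is: take 14 characters (may be a mix of letters and digits) and insert ","; then take the next 7 characters (again a mix of letters and digits), and if the character after them is "-", include the hyphen and the digit immediately following it, then insert ","; then read the description until three consecutive zeros (000) appear. Once three consecutive zeros are found, I need to insert a newline before the 000 and repeat the same process, so that the whole text gets formatted. Basically I want to read out all the column values. Please suggest what can be done.
I tried the code below, but it requires me to hardcode every value after which "," or "\n" should be inserted, and I don't know how to make it dynamic.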
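def format_file(filename, find, insert):
    # Open the file passed in (the original opened page2 and ignored filename).
    with open(filename, 'r+') as file:
        lines = file.read()
        # Locate the hardcoded text; repr() adds a leading quote, hence the -1.
        index = repr(lines).find(find) - 1
        if index < 0:
            raise ValueError("The text was not found.")
        len_found = len(find) - 1
        existing_lines = lines[index + len_found:]
        # Rewrite the file from the match onwards with the separator inserted.
        file.seek(index)
        file.write(find)
        file.write(insert)
        file.write(existing_lines)

# page2 is the path to the input file, defined elsewhere.
format_file(page2, "0000001AB11117", ', ')
format_file(page2, "2323235-8", ', ')
format_file(page2, "This is the description.", '\n')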
Answer 0 (score: 3)
In this case, you can parse the file's text with a regular expression.
The rule above is to take 14 characters (could be a mix of letters and numbers)
- [a-zA-Z\d]{14}
then take the next 7 characters (could be a mix of letters and numbers) and, if the next character is "-", include the hyphen and the immediate digit, and then insert ","
- [\da-zA-Z]{7}(\-\d)?
and then read out all the description until we find three consecutive 000
- .+?(?=(000|$))
Please check this:
import re

# first: 14 alphanumerics; second: 7 alphanumerics plus optional "-digit";
# third: description, read lazily up to the next "000" or the end of the text.
expr = re.compile(r'(?P<first>[\da-zA-Z]{14})(?P<second>[\da-zA-Z]{7}(\-\d)?)(?P<third>.+?(?=(000|$)))')
text = '''0000001AB111172323235-8THis is the description.00000323CD41111944322Soft Dimcase andRating00000033322S112212234-3100 BN SN OPTION (LINUX)000022664422444433AVaaaaaccaaaaaaa'''
for m in expr.finditer(text):
    print("{}, {}, {}".format(m.group('first'), m.group('second'), m.group('third')))
Output:
0000001AB11117, 2323235-8, THis is the description.
00000323CD4111, 1944322, Soft Dimcase andRating
00000033322S11, 2212234-3, 100 BN SN OPTION (LINUX)
00002266442244, 4433AVa, aaaaccaaaaaaa
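If you would rather have the formatted lines as data than print them directly, the same pattern can be wrapped in a small function. This is only a sketch: the names RECORD and format_records are mine and do not come from the question, and the lookahead is moved outside the third group, which should not change what is captured.

```python
import re

# Same three fields as described above, split over several lines for readability.
RECORD = re.compile(
    r'(?P<first>[\da-zA-Z]{14})'            # 14-character key
    r'(?P<second>[\da-zA-Z]{7}(?:-\d)?)'    # 7 characters, optional "-digit"
    r'(?P<third>.+?)(?=000|$)'              # description, up to the next "000" or the end
)

def format_records(text):
    """Return one 'first, second, third' string per record found in text."""
    return [
        ", ".join(m.group('first', 'second', 'third'))
        for m in RECORD.finditer(text)
    ]

sample = ('0000001AB111172323235-8THis is the description.'
          '00000323CD41111944322Soft Dimcase andRating'
          '00000033322S112212234-3100 BN SN OPTION (LINUX)'
          '000022664422444433AVaaaaaccaaaaaaa')

for line in format_records(sample):
    print(line)
```

Note that a description which itself contains 000 would be cut short at that point, since the pattern treats any 000 as the start of the next record.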
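Answer 1 (score: 2)
The re regular expression module is a good way to parse simple text structures. In your case, the trick is to break the stream into records whenever you hit the next 000. Iterating over the string with the lookahead pattern (?=000) matches your delimiter. We use a lookahead because you also want the 000 to be part of the following record. We also want to terminate a record at the end of the file, hence the alternative $ pattern. The rest of the pattern just breaks out the fields.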
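import re

# 14 chars, 7 chars, optional "-digit", then the description up to "000" or the end.
re_line = re.compile(r'(.{14})(.{7})(-\d|)(.*?)((?=000)|$)')
with open(page2, 'r') as f:  # page2 is the path to the input file
    for m in re_line.finditer(f.read()):
        print('{0}, {1}{2}, {3}'.format(*m.groups()))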
Output:
0000001AB11117, 2323235-8, THis is the description.
00000323CD4111, 1944322, Soft Dimcase andRating
00000033322S11, 2212234-3, 100 BN SN OPTION (LINUX)
00002266442244, 4433AVa, aaaaccaaaaaaa
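Since the question was about ending up with a formatted file, one option is to write the lines to a new file rather than seeking and rewriting the original in place. A minimal sketch, assuming the whole input fits in memory; convert_file and the file names are hypothetical, and the temporary directory is only there to keep the demo self-contained.

```python
import re
import tempfile
from pathlib import Path

RE_LINE = re.compile(r'(.{14})(.{7})(-\d|)(.*?)((?=000)|$)')

def convert_file(src, dst):
    """Read run-together records from src and write one formatted line per record to dst."""
    text = Path(src).read_text()
    with open(dst, 'w') as out:
        for m in RE_LINE.finditer(text):
            # The fifth group is the (empty) terminator, so only the first four are used.
            first, second, suffix, description = m.groups()[:4]
            out.write('{0}, {1}{2}, {3}\n'.format(first, second, suffix, description))

# Demo on a two-record input written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / 'page2.txt'
    dst = Path(tmp) / 'page2_formatted.txt'
    src.write_text('0000001AB111172323235-8THis is the description.'
                   '00000323CD41111944322Soft Dimcase andRating')
    convert_file(src, dst)
    result = dst.read_text()
    print(result, end='')
```

Writing to a separate file leaves the input untouched, so the conversion can be re-run if the pattern needs adjusting.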