My laboratory works with software that produces very messy output, so I'm trying to make it easier to handle with Python. So far, I believe the best approach is to split each line into a list and treat it as chunks of data, but that is not so simple. The first chunk is easy: its 3 columns are fixed and can be obtained with:
chunk1 = my_data[:3]
The 2nd chunk is harder because it can have 2, 3 or 4 columns. I believe the key is that the 2nd chunk ends at the first token that contains a letter (as in 1 3 7 CCC). It should be possible to use the re module to parse the two, three or four columns and stop before the first letter, but I don't know how to do it. I intend to "normalize" these columns by filling the vacant spots with zeros or "-": the 2-column case becomes [x, y, 0, 0] and the 3-column case becomes [x, y, z, 0].
The 3rd chunk is fixed (two, three or four letters and a number) like this: CCC 119.62
And the 4th chunk is the rest.
A sample of the messy output is given under "Data sample" below.
The final result could be something like: ["s 91", "1.00", "OUT"] ["9", "3", "12", "7"] ["OCCC", "0.34"] ["f829", "27", "f752", "33"]
So far, I'm stuck trying to figure out how to make the re module do this.
Any help is much appreciated, guys.
Data sample
s 27 1.00 STRE 30 16 OC 1.355049 f1291 50
s 28 -1.00 STRE 8 6 CC 1.494281 f1340 12 f1271 17
s 29 -1.00 STRE 14 15 NC 1.421282 f1358 49
s 30 1.00 STRE 14 15 NC 1.421282 f1337 10 f1290 33
s 31 1.00 STRE 8 6 CC 1.494281 f1171 15 f323 11
s 32 1.00 STRE 30 31 OC 1.419982 f1082 51 f1077 24
s 33 1.00 STRE 13 11 ClC 1.740581 f842 15 f323 19
s 34 -1.00 BEND 1 3 7 CCC 119.62 f1037 26 f485 10
s 35 -1.00 BEND 3 1 4 CCC 119.74 f1124 29
s 36 1.00 BEND 7 3 1 CCC 119.62 f733 25 f288 13
s 37 1.00 BEND 21 14 15 HNC 116.16 f1578 40 f1560 20
s 38 1.00 BEND 24 5 2 HCC 119.73 f1186 67
s 39 1.00 BEND 25 2 6 HCC 118.80 f1536 53 f1082 10 f1077 17
s 40 -1.00 BEND 24 5 2 HCC 119.73 f1508 44 f1171 14 f1124 13
s 41 1.00 BEND 25 2 6 HCC 118.80 f1669 14 f1271 32 f1124 15
s 42 -1.00 BEND 26 19 18 HCC 119.04 f1578 10 f1560 37 f1291 11
s 89 1.00 TORS 31 30 16 19 COCC 0.24 f161 14 f104 46 f87 19 f43 10
s 90 1.00 OUT 8 2 3 6 CCCC 1.09 f466 36 f125 22
s 91 1.00 OUT 9 3 12 7 OCCC 0.34 f829 27 f752 33
Answer 0 (score: 2)
You don't need a regular expression to solve this. You can do it like this:
from io import StringIO

text = """s 27 1.00 STRE 30 16 OC 1.355049 f1291 50
s 34 -1.00 BEND 1 3 7 CCC 119.62 f1037 26 f485 10
s 89 1.00 TORS 31 30 16 19 COCC 0.24 f161 14 f104 46 f87 19 f43 10
s 91 1.00 OUT 9 3 12 7 OCCC 0.34 f829 27 f752 33"""

my_file = StringIO(text)
chunks = []
for line in my_file:
    my_data = line.split()
    chunk1 = my_data[:4]
    # chunk2 always has at least two columns; the next two are optional
    chunk2 = my_data[4:6]
    for i in range(6, 8):
        if my_data[i].isdigit():
            chunk2.append(my_data[i])
        else:
            break
    # chunk3 starts right after chunk1 and chunk2
    chunk3_start = len(chunk1) + len(chunk2)
    chunk3 = my_data[chunk3_start:chunk3_start+2]
    chunk4 = my_data[chunk3_start+2:]
    chunks.append({1: chunk1, 2: chunk2, 3: chunk3, 4: chunk4})
This produces the following output:
[{1: ['s', '27', '1.00', 'STRE'],
2: ['30', '16'],
3: ['OC', '1.355049'],
4: ['f1291', '50']},
{1: ['s', '34', '-1.00', 'BEND'],
2: ['1', '3', '7'],
3: ['CCC', '119.62'],
4: ['f1037', '26', 'f485', '10']},
{1: ['s', '89', '1.00', 'TORS'],
2: ['31', '30', '16', '19'],
3: ['COCC', '0.24'],
4: ['f161', '14', 'f104', '46', 'f87', '19', 'f43', '10']},
{1: ['s', '91', '1.00', 'OUT'],
2: ['9', '3', '12', '7'],
3: ['OCCC', '0.34'],
4: ['f829', '27', 'f752', '33']}]
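Basically, you keep adding elements to chunk2 until you hit something that is not a number. Then use the lengths of chunk1 and chunk2 to locate the remaining chunks.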
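The question also asked for the 2- and 3-column cases to be padded out to four columns. A minimal sketch of that, built on the same digit scan; the function name `split_chunks` and its `width` and `fill` parameters are hypothetical, and it assumes every line has the layout shown in the data sample:

```python
def split_chunks(line, width=4, fill="0"):
    """Split one output line into four chunks, padding chunk2 to `width`."""
    tokens = line.split()
    chunk1 = tokens[:4]
    # Scan the numeric columns until the first non-digit token (e.g. "CCC").
    end = 4
    while end < len(tokens) and tokens[end].isdigit():
        end += 1
    chunk2 = tokens[4:end]
    # Pad the shorter cases so every row has `width` entries.
    chunk2 += [fill] * (width - len(chunk2))
    chunk3 = tokens[end:end + 2]
    chunk4 = tokens[end + 2:]
    return chunk1, chunk2, chunk3, chunk4


print(split_chunks("s 27 1.00 STRE 30 16 OC 1.355049 f1291 50")[1])
# ['30', '16', '0', '0']
```

Passing fill="-" gives the other placeholder mentioned in the question.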
Answer 1 (score: 2)
I wrote a generator that pulls from an iterator until it finds an alphabetic string.
from itertools import chain

def while_not_alpha(iterator):
    iterator = iter(iterator)
    for s in iterator:
        if not str(s).isalpha():
            yield s
        else:
            # First alphabetic token: yield it together with the unconsumed remainder
            yield chain([s], iterator)
            break

def parse(line):
    # The first four tokens form chunk1; `rest` is the remainder of the line
    *chunk1, rest = line.split(maxsplit=4)
    *chunk2, rest = while_not_alpha(rest.split())
    rest = list(rest)
    chunk3 = rest[:2]
    chunk4 = rest[2:]
    return chunk1, chunk2, chunk3, chunk4

# See below for definition of `txt`
chunk1, chunk2, chunk3, chunk4 = map(list, zip(*map(parse, txt.splitlines())))
We can see what chunk2 looks like:
chunk2[:4]
[['30', '16'],
['8', '6'],
['14', '15'],
['14', '15']]
And chunk3:
chunk3[:4]
[['OC', '1.355049'],
['CC', '1.494281'],
['NC', '1.421282'],
['NC', '1.421282']]
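As an aside on why the generator above yields the remainder explicitly: itertools.takewhile looks like a natural fit, but it discards the token that ends the run. A minimal sketch of a takewhile-based split that works around this by slicing the original list; the helper name `split_at_first_alpha` is hypothetical, and it assumes `tokens` is a list rather than a one-shot iterator:

```python
from itertools import takewhile


def split_at_first_alpha(tokens):
    """Return (tokens before the first alphabetic token, the remaining tokens)."""
    head = list(takewhile(lambda t: not t.isalpha(), tokens))
    # takewhile drops the token that ended the run, so slice the original
    # list instead of reusing a partially consumed iterator.
    return head, tokens[len(head):]


head, rest = split_at_first_alpha("30 16 OC 1.355049 f1291 50".split())
print(head)  # ['30', '16']
print(rest)  # ['OC', '1.355049', 'f1291', '50']
```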
We can go a step further and build DataFrames:
import pandas as pd

chunk1, chunk2, chunk3, chunk4 = map(
    pd.DataFrame, map(list, zip(*map(parse, txt.splitlines()))))
chunk2
       0     1     2     3
 0    30    16  None  None
 1     8     6  None  None
 2    14    15  None  None
 3    14    15  None  None
 4     8     6  None  None
 5    30    31  None  None
 6    13    11  None  None
 7     1     3     7  None
 8     3     1     4  None
 9     7     3     1  None
10    21    14    15  None
11    24     5     2  None
12    25     2     6  None
13    24     5     2  None
14    25     2     6  None
15    26    19    18  None
16    31    30    16    19
17     8     2     3     6
18     9     3    12     7
Or go even further:
df = pd.concat(
    map(pd.DataFrame, map(list, zip(*map(parse, txt.splitlines())))),
    axis=1, keys=[f'chunk{i}' for i in range(1, 5)]
)
df
    chunk1              chunk2                  chunk3          chunk4
    0   1      2     3     0     1     2     3     0         1      0     1      2     3      4     5     6     7
 0  s  27   1.00  STRE    30    16  None  None    OC  1.355049  f1291    50   None  None   None  None  None  None
 1  s  28  -1.00  STRE     8     6  None  None    CC  1.494281  f1340    12  f1271    17   None  None  None  None
 2  s  29  -1.00  STRE    14    15  None  None    NC  1.421282  f1358    49   None  None   None  None  None  None
 3  s  30   1.00  STRE    14    15  None  None    NC  1.421282  f1337    10  f1290    33   None  None  None  None
 4  s  31   1.00  STRE     8     6  None  None    CC  1.494281  f1171    15   f323    11   None  None  None  None
 5  s  32   1.00  STRE    30    31  None  None    OC  1.419982  f1082    51  f1077    24   None  None  None  None
 6  s  33   1.00  STRE    13    11  None  None   ClC  1.740581   f842    15   f323    19   None  None  None  None
 7  s  34  -1.00  BEND     1     3     7  None   CCC    119.62  f1037    26   f485    10   None  None  None  None
 8  s  35  -1.00  BEND     3     1     4  None   CCC    119.74  f1124    29   None  None   None  None  None  None
 9  s  36   1.00  BEND     7     3     1  None   CCC    119.62   f733    25   f288    13   None  None  None  None
10  s  37   1.00  BEND    21    14    15  None   HNC    116.16  f1578    40  f1560    20   None  None  None  None
11  s  38   1.00  BEND    24     5     2  None   HCC    119.73  f1186    67   None  None   None  None  None  None
12  s  39   1.00  BEND    25     2     6  None   HCC    118.80  f1536    53  f1082    10  f1077    17  None  None
13  s  40  -1.00  BEND    24     5     2  None   HCC    119.73  f1508    44  f1171    14  f1124    13  None  None
14  s  41   1.00  BEND    25     2     6  None   HCC    118.80  f1669    14  f1271    32  f1124    15  None  None
15  s  42  -1.00  BEND    26    19    18  None   HCC    119.04  f1578    10  f1560    37  f1291    11  None  None
16  s  89   1.00  TORS    31    30    16    19  COCC      0.24   f161    14   f104    46    f87    19   f43    10
17  s  90   1.00   OUT     8     2     3     6  CCCC      1.09   f466    36   f125    22   None  None  None  None
18  s  91   1.00   OUT     9     3    12     7  OCCC      0.34   f829    27   f752    33   None  None  None  None
Setup
txt = """\
s 27 1.00 STRE 30 16 OC 1.355049 f1291 50
s 28 -1.00 STRE 8 6 CC 1.494281 f1340 12 f1271 17
s 29 -1.00 STRE 14 15 NC 1.421282 f1358 49
s 30 1.00 STRE 14 15 NC 1.421282 f1337 10 f1290 33
s 31 1.00 STRE 8 6 CC 1.494281 f1171 15 f323 11
s 32 1.00 STRE 30 31 OC 1.419982 f1082 51 f1077 24
s 33 1.00 STRE 13 11 ClC 1.740581 f842 15 f323 19
s 34 -1.00 BEND 1 3 7 CCC 119.62 f1037 26 f485 10
s 35 -1.00 BEND 3 1 4 CCC 119.74 f1124 29
s 36 1.00 BEND 7 3 1 CCC 119.62 f733 25 f288 13
s 37 1.00 BEND 21 14 15 HNC 116.16 f1578 40 f1560 20
s 38 1.00 BEND 24 5 2 HCC 119.73 f1186 67
s 39 1.00 BEND 25 2 6 HCC 118.80 f1536 53 f1082 10 f1077 17
s 40 -1.00 BEND 24 5 2 HCC 119.73 f1508 44 f1171 14 f1124 13
s 41 1.00 BEND 25 2 6 HCC 118.80 f1669 14 f1271 32 f1124 15
s 42 -1.00 BEND 26 19 18 HCC 119.04 f1578 10 f1560 37 f1291 11
s 89 1.00 TORS 31 30 16 19 COCC 0.24 f161 14 f104 46 f87 19 f43 10
s 90 1.00 OUT 8 2 3 6 CCCC 1.09 f466 36 f125 22
s 91 1.00 OUT 9 3 12 7 OCCC 0.34 f829 27 f752 33"""
Answer 2 (score: 2)
Here is my variant:
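import re

def simple_parsing(string):
    # Split on runs of whitespace; strip first so no empty tokens appear
    parts = re.split(r'\s+', string.strip())
    result = []
    i = 4
    # Collect columns until the first purely alphabetic token (e.g. 'OCCC')
    while not parts[i].isalpha():
        result.append(parts[i])
        i += 1
    return [parts[0:4], result, parts[i:i+2], parts[i+2:]]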
For example, taking one of your lines, the result is:
simple_parsing('s 91 1.00 OUT 9 3 12 7 OCCC 0.34 f829 27 f752 33')
[['s', '91', '1.00', 'OUT'], ['9', '3', '12', '7'], ['OCCC', '0.34'], ['f829', '27', 'f752', '33']]
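Since the question asked specifically about the re module, the same split can also be sketched as a single pattern with named groups. This is only an illustration under the assumption that every line follows the layout in the data sample; the names `LINE_RE` and `parse_with_re` are hypothetical:

```python
import re

LINE_RE = re.compile(
    r"(?P<chunk1>\S+\s+\d+\s+-?\d+\.\d+\s+[A-Z]+)\s+"  # e.g. "s 91 1.00 OUT"
    r"(?P<chunk2>\d+(?:\s+\d+){1,3})\s+"               # two to four integers
    r"(?P<chunk3>[A-Za-z]{2,4}\s+\d+\.\d+)\s*"         # letters and a number
    r"(?P<chunk4>.*)"                                  # everything else
)


def parse_with_re(line):
    """Return the four chunks of one line as lists of tokens."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised line: {line!r}")
    names = ("chunk1", "chunk2", "chunk3", "chunk4")
    return [match.group(name).split() for name in names]


print(parse_with_re("s 91 1.00 OUT 9 3 12 7 OCCC 0.34 f829 27 f752 33"))
# [['s', '91', '1.00', 'OUT'], ['9', '3', '12', '7'], ['OCCC', '0.34'], ['f829', '27', 'f752', '33']]
```

The pattern is stricter than plain splitting, so a line that does not fit the assumed layout raises an error instead of being silently mis-split.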