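I'm parsing some text from a source outside my control, and it is not in a very convenient format. I have lines like this:
Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber.
I want to split each line by its keys, like this: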
Problem_Category = "Human Endeavors"
Problem_Subcategory = "Space Exploration"
Problem_Type = "Failure to Launch"
Software_Version = "9.8.77.omni.3"
Problem_Details = "Issue with signal barrier chamber."
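The keys always appear in the same order and are always followed by a colon, but there is not necessarily a space or newline between a value and the next key. I'm not sure what to use as a delimiter, because colons and spaces can also appear inside the values. How can I parse this text?
Answer 0 (score: 4)
If your block of text is this string: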
text = 'Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber.'
then
import re
names = ['Problem Category', 'Problem Subcategory', 'Problem Type', 'Software Version', 'Problem Details']
text = 'Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber.'
pat = r'({}):'.format('|'.join(names))
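# Note: the third positional argument of re.split is maxsplit, not flags,
# so re.MULTILINE must not be passed there (it is not needed here anyway).
data = dict(zip(*[iter(re.split(pat, text)[1:])]*2))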
print(data)
produces the dict
{'Problem Category': ' Human Endeavors ',
'Problem Details': ' Issue with signal barrier chamber.',
'Problem Subcategory': ' Space Exploration',
'Problem Type': ' Failure to Launch',
'Software Version': ' 9.8.77.omni.3'}
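The values in that dict still carry their surrounding whitespace, and the key names go into the pattern unescaped. As a minimal sketch, assuming the same `names` and `text` as above, the variant below escapes the key names with `re.escape`, strips each value, and prints the pairs in the underscore form the question asked for. The name `clean` is only illustrative.

```python
import re

names = ['Problem Category', 'Problem Subcategory', 'Problem Type',
         'Software Version', 'Problem Details']
text = ('Problem Category: Human Endeavors Problem Subcategory: Space Exploration'
        'Problem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3'
        'Problem Details: Issue with signal barrier chamber.')

# re.escape guards against key names that contain regex metacharacters.
pat = r'({}):'.format('|'.join(re.escape(name) for name in names))

# With a capturing group, re.split returns [prefix, key, value, key, value, ...];
# [1:] drops the (empty) prefix before the first key.
parts = re.split(pat, text)[1:]

# Pair every key with the value that follows it and trim the whitespace.
clean = {key: value.strip() for key, value in zip(parts[0::2], parts[1::2])}

for name in names:
    print('{} = "{}"'.format(name.replace(' ', '_'), clean[name]))
```

Iterating over `names` rather than over the dict keeps the output in the known key order regardless of the Python version.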
So you could assign
text = df_dict['NOTE_DETAILS'][0]
...
df_dict['NOTE_DETAILS'][0] = data
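Then you can access the subcategories with dict indexing (the keys contain spaces, exactly as they appear in the text):
df_dict['NOTE_DETAILS'][0]['Problem Category']
But be careful. Deeply nested dicts, lists, and DataFrames are usually a bad design. As the Zen of Python says, flat is better than nested.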
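One way to follow that advice is to merge the parsed fields into the surrounding record instead of nesting a dict inside it. This is only a sketch in plain Python: the `NOTE_ID` field, the sample record, and the helper `flatten_record` are hypothetical and stand in for whatever the real data looks like.

```python
def flatten_record(record, nested_key):
    """Return a copy of record with the dict under nested_key merged in."""
    # Keep every top-level field except the nested one.
    flat = {k: v for k, v in record.items() if k != nested_key}
    # Promote each parsed field to a top-level key, using underscores
    # so the names are also usable as column names.
    for sub_key, sub_value in record[nested_key].items():
        flat[sub_key.replace(' ', '_')] = sub_value
    return flat


# Hypothetical record: one note whose details have already been parsed.
record = {
    'NOTE_ID': 17,
    'NOTE_DETAILS': {
        'Problem Category': 'Human Endeavors',
        'Problem Type': 'Failure to Launch',
    },
}

flat = flatten_record(record, 'NOTE_DETAILS')
print(flat)
```

A list of such flat records maps directly onto rows and columns, which is usually easier to query than a cell that holds a dict.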
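Answer 1 (score: 3)
Since you know the keys in advance, partition the text into "value before the current key" and "remaining text", then keep partitioning the remaining text with the next key.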
# get input from somewhere
raw = 'Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber.'
# these are the keys, in order, without the colon, that will be captured
keys = ['Problem Category', 'Problem Subcategory', 'Problem Type', 'Software Version', 'Problem Details']
prev_key = None
remaining = raw
out = {}
for key in keys:
    # get the value from before the key and after the key
    prev_value, _, remaining = remaining.partition(key + ':')
    # start storing values after the first iteration, since we need to partition the second key to get the first value
    if prev_key is not None:
        out[prev_key] = prev_value.strip()
    # what key to store next iteration
    prev_key = key
# capture the final value (since it lags behind the parse loop)
out[prev_key] = remaining.strip()
# out now contains the parsed values, print it out nicely
for key in keys:
    print('{}: {}'.format(key, out[key]))
Prints:
Problem Category: Human Endeavors
Problem Subcategory: Space Exploration
Problem Type: Failure to Launch
Software Version: 9.8.77.omni.3
Problem Details: Issue with signal barrier chamber.
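One caveat with partitioning: `str.partition` returns the whole string and two empty strings when the separator is not found, so a missing key would go unnoticed and shift the values. As a hedged sketch of the same idea, the function below (the name `parse_ordered` is made up for this example) locates each key with `str.find`, starting after the previous key, and raises an error when a key is absent.

```python
def parse_ordered(raw, keys):
    """Split raw into {key: value}, assuming keys appear in the given order."""
    positions = []
    start = 0
    for key in keys:
        marker = key + ':'
        idx = raw.find(marker, start)
        if idx == -1:
            raise ValueError('missing key: {!r}'.format(key))
        # Remember where the key begins and where its value begins.
        positions.append((key, idx, idx + len(marker)))
        start = idx + len(marker)

    out = {}
    for i, (key, _, value_start) in enumerate(positions):
        # A value runs up to the start of the next key, or to the end of the text.
        value_end = positions[i + 1][1] if i + 1 < len(positions) else len(raw)
        out[key] = raw[value_start:value_end].strip()
    return out


raw = ('Problem Category: Human Endeavors Problem Subcategory: Space Exploration'
       'Problem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3'
       'Problem Details: Issue with signal barrier chamber.')
keys = ['Problem Category', 'Problem Subcategory', 'Problem Type',
        'Software Version', 'Problem Details']

parsed = parse_ordered(raw, keys)
for key in keys:
    print('{}: {}'.format(key, parsed[key]))
```

Searching from the end of the previous key also means a key name that happens to reappear earlier in the text cannot be matched out of order.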
Answer 2 (score: 3)
I hate and fear regular expressions, so here is a solution that uses only built-in methods.
# splits a string using multiple delimiters; empty pieces are dropped.
def multi_split(s, delims):
    strings = [s]
    for delim in delims:
        strings = [x for part in strings for x in part.split(delim) if x]
    return strings
s = "Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber."
categories = ["Problem Category", "Problem Subcategory", "Problem Type", "Software Version", "Problem Details"]
headers = [category + ": " for category in categories]
details = multi_split(s, headers)
print(details)
details_dict = dict(zip(categories, details))
print(details_dict)
Result (I added line breaks for readability):
[
'Human Endeavors ',
'Space Exploration',
'Failure to Launch',
'9.8.77.omni.3',
'Issue with signal barrier chamber.'
]
{
'Problem Subcategory': 'Space Exploration',
'Problem Details': 'Issue with signal barrier chamber.',
'Problem Category': 'Human Endeavors ',
'Software Version': '9.8.77.omni.3',
'Problem Type': 'Failure to Launch'
}
Answer 3 (score: 2)
This is a job for general BNF parsing, which handles ambiguity well. I used Perl and Marpa, a general BNF parser. Hope this helps.
use 5.010;
use strict;
use warnings;
use Marpa::R2;
my $g = Marpa::R2::Scanless::G->new( { source => \(<<'END_OF_SOURCE'),
:default ::= action => [ name, values ]
pairs ::= pair+
pair ::= name (' ') value
name ::= 'Problem Category:'
name ::= 'Problem Subcategory:'
name ::= 'Problem Type:'
name ::= 'Software Version:'
name ::= 'Problem Details:'
value ::= [\s\S]+
:discard ~ whitespace
whitespace ~ [\s]+
END_OF_SOURCE
} );
my $input = <<EOI;
Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber.
EOI
my $ast = ${ $g->parse( \$input ) };
my @pairs;
ast_traverse($ast);
for my $pair (@pairs){
    my ($name, $value) = @$pair;
    say "$name = $value";
}
sub ast_traverse{
    my $ast = shift;
    if (ref $ast){
        my ($id, @children) = @$ast;
        if ($id eq 'pair'){
            my ($name, $value) = @children;
            chop $name->[1];
            shift @$value;
            $value = join('', @$value);
            chomp $value;
            push @pairs, [ $name->[1], '"' . $value . '"' ];
        }
        else {
            ast_traverse($_) for @children;
        }
    }
}
Prints:
Problem Category = "Human Endeavors "
Problem Subcategory = "Space Exploration"
Problem Type = "Failure to Launch"
Software Version = "9.8.77.omni.3"
Problem Details = "Issue with signal barrier chamber."