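I have a dataset similar to the one below. I'm very rusty on regular expressions and don't know how to "walk the tree", despite a few meager attempts. Excel's Text to Columns doesn't help, both because of the asinine way the classes/tags for individual terms are organized in the EFFECT_DATA field and because of errors introduced by manual adjustments.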
ROW_ID|NAME | UNORDERED_CSV_CONCATD_TAG_DATA_STRING
123456|Prod123|"Minoxidistuff [MoA], Direct [PE], Agonists [EPC]"
123457|Prod124|"Minoxion [Chem], InterferonA [EPC], Delayed [PE]"
123458|Prod125|"Anotherion [EPC], Direct [MoA], Agonists [EPC]"
123459|Prod126|"Competitor [PE], Progestin [EPC], Agonists [EPC]"
123460|Prod127|"Minoxidistuff [Chem]"
What I'd like to end up with is one table per tag, for example:
PRODUCT|EPC |
Prod125|Anotherion|
Prod125|Agonists |
PRODUCT|CMPD |
Prod127|Minoxidistuff|
Prod124|Minoxion |
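...and so on for every tag [j] of every product [i], if that makes sense. Basically, each CSVD_TAG_DATA field is unordered and contains multiple tags, each one in brackets at the end of the term I want.
I started with just a multidimensional hash approach; forgive my butchered regex pseudocode.
Thanks very much.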
Answer 0 (score: 1)
Here is a Perl approach. Save the code below as parser.pl and run it as perl parser.pl data.csv, where data.csv is the name of your data file. (Or make it executable and run ./parser.pl data.csv.)
#!/usr/bin/perl
use strict;
use warnings;

# Take the first argument as the data file
my $file = shift @ARGV;
die "Usage: $0 data.csv\n" unless defined $file;

# Open a filehandle, stopping with the system error if that fails
open(my $fh, '<', $file) or die "Cannot open $file: $!\n";

# Hashref of the form $products->{tag}{product} = [values]
my $products = {};

# Loop through the file
while (<$fh>) {
    # Remove line breaks
    chomp;

    # Split into our primary sections
    my ($id, $product, $csv) = split(/\|/);

    # Skip the header line and any other line without a numeric ID
    next unless defined $id && $id =~ /^\d+$/;

    # Remove the quotes; skip the line if the tag field is missing
    next unless defined $csv && $csv =~ /"(.*)"/;
    $csv = $1;

    # Split the CSV on a comma possibly followed by whitespace
    my @items = split(/,\s*/, $csv);

    # Loop through each item in the CSV
    foreach my $item (@items) {
        # Our keys and values are reversed: "value [key]"
        next unless $item =~ /(.*)\[(.*)\]/;
        my ($value, $key) = ($1, $2);

        # Remove trailing whitespace
        $value =~ s/\s+$//;

        # push creates the arrayref on first use, so no exists() check is needed
        push @{ $products->{$key}{$product} }, $value;
    }
}
close($fh);

# We have a nicely formed hashref now. Loop through and print how we want
foreach my $key (keys %$products) {
    # Header for this section
    print "PRODUCT|$key\n";

    # Go through each product and print its values.
    # Iterate instead of shift-ing so that a value of "0" is not dropped
    # and the data structure is left intact.
    foreach my $product (keys %{ $products->{$key} }) {
        foreach my $value (@{ $products->{$key}{$product} }) {
            print "$product|$value\n";
        }
    }

    # Add a blank line to divide the groups cleanly
    print "\n";
}
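The step that does most of the work is the pattern that splits one item, such as Minoxidistuff [MoA], into a value and its tag. Below is a minimal sketch of that step on its own. The name parse_item is a hypothetical helper, not part of parser.pl, and the pattern is anchored and trims whitespace a little more strictly than the one in the script, which is an assumption about how tolerant you want the parsing to be.

```perl
use strict;
use warnings;

# parse_item is a hypothetical helper, not part of parser.pl.
# It splits "Value [Tag]" into (value, tag) and returns an empty list
# when the item does not end in a bracketed tag.
sub parse_item {
    my ($item) = @_;
    return unless $item =~ /^\s*(.*?)\s*\[([^\]]*)\]\s*$/;
    return ($1, $2);
}

my ($value, $tag) = parse_item('Minoxidistuff [MoA]');
print "$tag => $value\n";
```

Returning an empty list for malformed items lets the caller skip them with a simple check, which may help with the manually edited rows mentioned in the question.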
Sample output from parser.pl for the data above:
PRODUCT|MoA
Prod123|Minoxidistuff
Prod125|Direct

PRODUCT|Chem
Prod127|Minoxidistuff
Prod124|Minoxion

PRODUCT|PE
Prod123|Direct
Prod124|Delayed
Prod126|Competitor

PRODUCT|EPC
Prod123|Agonists
Prod124|InterferonA
Prod126|Progestin
Prod126|Agonists
Prod125|Anotherion
Prod125|Agonists
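The order of the groups and products in that output is not guaranteed, because Perl does not return hash keys in a fixed order. If you need a reproducible report, one option is to sort at each level. The following is a sketch, assuming the same $products->{tag}{product} = [values] structure; format_report is a hypothetical name, and the small hash is hand-built from a few of the sample rows rather than read from a file.

```perl
use strict;
use warnings;

# format_report is a hypothetical helper, not part of parser.pl.
# It takes the $products->{tag}{product} = [values] structure and
# returns the report as one string, with tags and products sorted.
sub format_report {
    my ($products) = @_;
    my @groups;
    foreach my $tag (sort keys %$products) {
        my @lines = ("PRODUCT|$tag");
        foreach my $product (sort keys %{ $products->{$tag} }) {
            # Values keep the order in which they were stored
            foreach my $value (@{ $products->{$tag}{$product} }) {
                push @lines, "$product|$value";
            }
        }
        push @groups, join("\n", @lines) . "\n";
    }
    # A blank line separates the groups
    return join("\n", @groups);
}

# Hand-built subset of the sample data (illustration only)
my $sample = {
    MoA  => { Prod123 => ['Minoxidistuff'], Prod125 => ['Direct'] },
    Chem => { Prod127 => ['Minoxidistuff'], Prod124 => ['Minoxion'] },
};
print format_report($sample);
```

Building the report as a string instead of printing it directly is a design choice for this sketch; it makes the result easy to compare in a test or write to a file.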