Perl - Parsing a tagged text file to dump the data into new text files

Time: 2014-11-21 22:35:32

Tags: perl parsing text tabs tags

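I have data in a .txt file that I need to format into something that can be uploaded to a database. The text is tag-based. Depending on the tag, the data needs to be dumped into specific tab-delimited .txt files. I have used very little Perl in my life, but I know Perl can handle this type of job easily; I am just lost as to where to start. Outside of Java, SQL and R I am not much use. Here is a sample of a single entry (I have close to 1000 to process):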

<PaperTitle>True incidence of all complications following immediate and delayed breast reconstruction.</PaperTitle>
<Abstract>BACKGROUND: Improved self-image and psychological well-being after breast reconstruction are well documented. To determine methods that optimized results with minimal morbidity, the authors examined their results and complications based on reconstruction method and timing. METHODS: The authors reviewed all breast reconstructions after mastectomy for breast cancer performed under the supervision of a single surgeon over a 6-year period at a tertiary referral center. Reconstruction method and timing, patient characteristics, and complication rates were reviewed. RESULTS: Reconstruction was performed on 240 consecutive women (94 bilateral and 146 unilateral; 334 total reconstructions). Reconstruction timing was evenly split between immediate (n = 167) and delayed (n = 167). Autologous tissue (n = 192) was more common than tissue expander/implant reconstruction (n = 142), and the free deep inferior epigastric perforator was the most common free flap (n = 124). The authors found no difference in the complication incidence with autologous reconstruction, whether performed immediately or delayed. However, there was a significantly higher complication rate following immediate placement of a tissue expander when compared with delayed reconstruction (p = 0.008). Capsular contracture was a significantly more common late complication following immediate (40.4 percent) versus delayed (17.0 percent) reconstruction (p &lt; 0.001; odds ratio, 5.2; 95 percent confidence interval, 2.3 to 11.6). CONCLUSIONS: Autologous reconstruction can be performed immediately or delayed, with optimal aesthetic outcome and low flap loss risk. However, the overall complication and capsular contracture incidence following immediate tissue expander/implant reconstruction was much higher than when performed delayed. Thus, tissue expander placement at the time of mastectomy may not necessarily save the patient an extra operation and may compromise the final aesthetic outcome.</Abstract>
<BookTitle>Book1</BookTitle>
<Publisher>Publisher01, Boston</Publisher>
<Edition>1st</Edition>
<EditorList>
    <Editor>
        <LastName>Lewis</LastName>
        <ForeName>Philip M</ForeName>
        <Initials>PM</Initials>
    </Editor>
    <Editor>
        <LastName>Kiffer</LastName>
        <ForeName>Michael</ForeName>
        <Initials>M</Initials>
    </Editor>
</EditorList>
<Page>19-28</Page>
<Year>2008</Year>
<AuthorList>
                <Author ValidYN="Y">
                    <LastName>Sullivan</LastName>
                    <ForeName>Stephen R</ForeName>
                    <Initials>SR</Initials>
                </Author>
                <Author ValidYN="Y">
                    <LastName>Fletcher</LastName>
                    <ForeName>Derek R D</ForeName>
                    <Initials>DR</Initials>
                </Author>
                <Author ValidYN="Y">
                    <LastName>Isom</LastName>
                    <ForeName>Casey D</ForeName>
                    <Initials>CD</Initials>
                </Author>
                <Author ValidYN="Y">
                    <LastName>Isik</LastName>
                    <ForeName>F Frank</ForeName>
                    <Initials>FF</Initials>
                </Author>
</AuthorList>
//

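PaperTitle, Abstract and Page need to go into the Papers.txt file.

PaperTitle, BookTitle, Edition, Publisher and Year need to go into the Book.txt file.

PaperTitle and all editor data (LastName, ForeName, Initials) need to go into Editors.txt.

PaperTitle and all author information (LastName, ForeName, Initials) need to go into Authors.txt.

// marks the end of an entry. All files need to be tab-delimited. While I won't say no to finished code, I am hoping for at least some ideas that point me in the right direction for producing one of the files (say, Book.txt); there is a good chance I can figure out the rest from there. Thank you very much.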

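2 answers:

Answer 0 (score: 0)

This example may help you. It uses XML::Twig, as I suggested, to extract the fields for the Papers.txt output file. The record separator is set to "//\n" so that a whole block of data is read at a time, and each block is wrapped in <Paper>...</Paper> tags before parsing so that it becomes valid XML.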

use strict;
use warnings;
use 5.010;
use autodie;

use XML::Twig;

my $twig = XML::Twig->new;

open my $fh, '<', 'papers.txt';
local $/ = "//\n";

while (<$fh>) {
  $twig->parse("<Paper>\n$_\n</Paper>\n");
  my $root = $twig->root;
  say $root->field($_) for qw/ PaperTitle Abstract Page/;
  say '---';
}

Output

True incidence of all complications following immediate and delayed breast reconstruction.
BACKGROUND: Improved self-image and psychological well-being after breast reconstruction are well documented. To determine methods that optimized results with minimal morbidity, the authors examined their results and complications based on reconstruction method and timing. METHODS: The authors reviewed all breast reconstructions after mastectomy for breast cancer performed under the supervision of a single surgeon over a 6-year period at a tertiary referral center. Reconstruction method and timing, patient characteristics, and complication rates were reviewed. RESULTS: Reconstruction was performed on 240 consecutive women (94 bilateral and 146 unilateral; 334 total reconstructions). Reconstruction timing was evenly split between immediate (n = 167) and delayed (n = 167). Autologous tissue (n = 192) was more common than tissue expander/implant reconstruction (n = 142), and the free deep inferior epigastric perforator was the most common free flap (n = 124). The authors found no difference in the complication incidence with autologous reconstruction, whether performed immediately or delayed. However, there was a significantly higher complication rate following immediate placement of a tissue expander when compared with delayed reconstruction (p = 0.008). Capsular contracture was a significantly more common late complication following immediate (40.4 percent) versus delayed (17.0 percent) reconstruction (p < 0.001; odds ratio, 5.2; 95 percent confidence interval, 2.3 to 11.6). CONCLUSIONS: Autologous reconstruction can be performed immediately or delayed, with optimal aesthetic outcome and low flap loss risk. However, the overall complication and capsular contracture incidence following immediate tissue expander/implant reconstruction was much higher than when performed delayed. Thus, tissue expander placement at the time of mastectomy may not necessarily save the patient an extra operation and may compromise the final aesthetic outcome.
19-28
---
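
The same approach can be extended to the other files. The following is only a sketch, not part of the original answer: the helper name `record_to_rows` and the row layout (one tab-separated Book.txt row per entry, one Authors.txt row per author, each prefixed with the paper title) are assumptions about what the question wants. It relies on the XML::Twig element methods `field` and `descendants`.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: turn one "//"-terminated record into a Book.txt row
# and a list of Authors.txt rows. The column order is an assumption.
sub record_to_rows {
    my ($record) = @_;
    $record =~ s{^//\s*\z}{}m;    # drop the trailing "//" end-of-entry marker

    my $twig = XML::Twig->new;
    $twig->parse("<Paper>$record</Paper>");
    my $root  = $twig->root;
    my $title = $root->field('PaperTitle');

    # Single-valued fields: one row per entry.
    my $book = join "\t", $title,
        map { $root->field($_) } qw/ BookTitle Edition Publisher Year /;

    # Repeated <Author> elements: one row per author.
    my @authors = map {
        my $author = $_;
        join "\t", $title,
            map { $author->field($_) } qw/ LastName ForeName Initials /;
    } $root->descendants('Author');

    return ($book, \@authors);
}

my $sample = <<'END';
<PaperTitle>Sample title</PaperTitle>
<BookTitle>Book1</BookTitle>
<Publisher>Publisher01, Boston</Publisher>
<Edition>1st</Edition>
<Year>2008</Year>
<AuthorList>
    <Author ValidYN="Y">
        <LastName>Sullivan</LastName>
        <ForeName>Stephen R</ForeName>
        <Initials>SR</Initials>
    </Author>
</AuthorList>
//
END

my ($book, $authors) = record_to_rows($sample);
print "$book\n";
print "$_\n" for @$authors;
```

In a full script the rows would be printed to filehandles opened on Book.txt and Authors.txt inside the `while (<$fh>)` loop, and Editors.txt could be handled the same way with `descendants('Editor')`.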

Answer 1 (score: -1)

Please check this:

use strict;
use warnings;
use Cwd;

#Get Directory
my $dir = getcwd();

#Grep files from the directory
opendir(DIR, $dir) || die "Couldn't open/read the $dir: $!";
my @AllFiles = grep(/\.txt$/i, readdir(DIR));
closedir(DIR);

#Check that at least one input file was found
#(the original test, scalar(@AllFiles) ne '', was always true)
if(@AllFiles)
{
    #Create Text Files as per Requirement
    open(PAP, ">$dir/Papers.txt") || die "Couldn't create the file: $!";
    open(BOOK, ">$dir/Book.txt") || die "Couldn't create the file: $!";
    open(EDT, ">$dir/Editors.txt") || die "Couldn't create the file: $!";
    open(AUT, ">$dir/Authors.txt") || die "Couldn't create the file: $!";
}
else {  die "File Not found...$dir\n"; } #Die if not found files
foreach my $input (@AllFiles)
{
    print "Processing file $input\n";
    open(IN, "$dir/$input") || die "Couldn't open the file: $!";
    local $/; $_=<IN>; my $tmp=$_;
    close(IN);
    #Loop from <PaperTitle> to // end slash
    while($tmp=~m/(<PaperTitle>((?:(?!\/\/).)*)\/\/)/gs)
    {
        my $LoopCnt = $1;
        my ($pptle) = $LoopCnt=~m/<PaperTitle>([^<>]*)<\/PaperTitle>/g;
        my ($abstr) = $LoopCnt=~m/<Abstract>([^<>]*)<\/Abstract>/gs;
        my ($pgrng) = $LoopCnt=~m/<Page>([^<>]*)<\/Page>/g;
        my ($bktle) = $LoopCnt=~m/<BookTitle>([^<>]*)<\/BookTitle>/g;
        my ($edtns) = $LoopCnt=~m/<Edition>([^<>]*)<\/Edition>/g;
        my ($publr) = $LoopCnt=~m/<Publisher>([^<>]*)<\/Publisher>/g;
        my ($years) = $LoopCnt=~m/<Year>([^<>]*)<\/Year>/g;

        #Initialise both strings (a single "" would leave $AuthorNames undefined)
        my ($EditorNames, $AuthorNames) = ("", "");
        $LoopCnt=~s#<EditorList>((?:(?!<\/EditorList>).)*)</EditorList>#
        my $edtList = $1; my @Edlines = split/\n/, $edtList;
        my $i ='1'; \#Editor Count to check
        foreach my $EdsngLine(@Edlines)
        {
            if($EdsngLine=~m/<LastName>([^<>]*)<\/LastName>/)
            {  $EditorNames .= $i."".$1."\t"; $i++; }
            elsif($EdsngLine=~m/<ForeName>([^<>]*)<\/ForeName>/)
            {  $EditorNames .= $1."\t"; }
            elsif($EdsngLine=~m/<Initials>([^<>]*)<\/Initials>/)
            {  $EditorNames .= $1."\t"; }
        }
        #esg;
        $LoopCnt=~s#<AuthorList>((?:(?!<\/AuthorList>).)*)</AuthorList>#
        my $autList = $1; my @Autlines = split/\n/, $autList;
        my $j ='1'; \#Author Count to check
        foreach my $AutsngLine(@Autlines)
        {
            if($AutsngLine=~m/<LastName>([^<>]*)<\/LastName>/)
            {  $AuthorNames .= $j."".$1."\t"; $j++; }
            elsif($AutsngLine=~m/<ForeName>([^<>]*)<\/ForeName>/)
            {  $AuthorNames .= $1."\t"; }
            elsif($AutsngLine=~m/<Initials>([^<>]*)<\/Initials>/)
            {  $AuthorNames .= $1."\t"; }
        }
        #esg;

        #Print the output to the corresponding text files
        print PAP "$pptle\t$abstr\t$pgrng\t//\n";
        print BOOK "$pptle\t$bktle\t$edtns\t$publr\t$years\t//\n";
        print EDT "$pptle\t$EditorNames//\n";
        print AUT "$pptle\t$AuthorNames//\n";
    }
}

print "Process Completed...\n";

#Don't forget to close the files
close(PAP);
close(BOOK);
close(EDT);
close(AUT);
#End
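
One thing to watch with the regex approach: the captured text is raw, so the `&lt;` in the sample abstract is written out as-is (the XML::Twig answer prints it as `<`), and any tab or newline inside a value would break the tab-delimited layout. Below is a minimal sketch of a clean-up step that could be applied to each captured value before printing. The helper name `clean_field` is hypothetical, and it only decodes the five predefined XML entities, not numeric character references.

```perl
use strict;
use warnings;

# Hypothetical helper: prepare one extracted value for a tab-delimited file.
sub clean_field {
    my ($text) = @_;
    return '' unless defined $text;

    # Decode the predefined XML entities. &amp; is handled last so that
    # "&amp;lt;" becomes the literal text "&lt;" rather than "<".
    my %entity = (lt => '<', gt => '>', quot => '"', apos => "'");
    $text =~ s/&(lt|gt|quot|apos);/$entity{$1}/g;
    $text =~ s/&amp;/&/g;

    # Collapse tabs, newlines and runs of spaces so the value stays in
    # one column, then trim the ends.
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return $text;
}

print clean_field("p &lt; 0.001;\todds ratio,\n 5.2"), "\n";
```

With this in place, a line such as `print PAP "$pptle\t$abstr\t$pgrng\t//\n";` could wrap each variable in `clean_field(...)` before interpolating it.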