文本空间和制表符分隔的表以分号分隔

时间:2019-07-15 17:48:37

标签: html linux perl csv delimiter

一个表空间和制表符分隔开,并且需要用分号分隔字段,已经直接用awk尝试过,但是没有用。如果一个我没有足够的东西来完成相同的工作,那么用一个perl脚本来处理带有ASCII样式的管道的表,这些表必须分开并加下划线。

            string str;
            bool over = false;
            string pattern = "^[a-z]{1,500}@[a-z]{1,500}.com|((?=.*.co)(?=.*.il))$";
            Regex Filter = new Regex(pattern);

            while (!over)
            {
                Console.WriteLine("please enter a string and we will tell you if its good or not");
                str=Console.ReadLine();
                Console.WriteLine(Filter.IsMatch(str).ToString());



                if (pattern == "over")
                    over = true;
            }

已经尝试过

Name full                  CI       FG   AG DG Date (UTC) Virnia Ray  
34842865 093161455 -    -     2019-07-12T12:09:31.378Z Vitoxia Sureez 
40151215 094063155 36.3   -     2019-07-14T13:18:11.733Z

删除所有空格

L。Scott制作的Perl脚本最初用于转换表ASCII样式的

 sed -e 's/^[ t]*//' -e 's/ /\;/g'

期望:

 while(<>) {
     @vals = split / /; # split fields into the val array taking space separator
     $size = @vals;
     for( $i = 0 ; $i < $size ; $i++ )
     {
         #clean up the values: remove underscores and extra spaces in the fields and remove possible semicolons there
         $vals[$i] =~ s/_/ /g;
         $vals[$i] =~ s/;/ /g;
         $vals[$i] =~ s/^ *//;
         $vals[$i] =~ s/ *$//;

         # append the value to the data record for this field
         $data[$i] .= $vals[$i];

         # special handling for first field: use spaces when joining
         $data[$i] .= " " if ($i==0); #do not know if this is necessary to the new requirement as we have space in more than the first field.

     }
    if(/\R/)  # Taking carriage return as the end of record
     {
         # clean up the first record; trim spaces
         $data[0] =~ s/^ *//;
         $data[0] =~ s/ *$//;
         $data[3] =~ s/\..*//; # remove the point and decimal for the field four
         # join the records with semicolons
         $line = join (";", @data);

         # collapse multiple spaces
         $line =~ s/ +/ /g;

         # print this line and start over
         print "$line\n" unless ($line eq '');
         @data = ();
     } }

当前输出:

Name full;CI;FG;AG;DG;Date (UTC)
Virnia Ray;34842865;093161455;-;-;2019-07-12T12:09:31.378Z
Vitoxia Sureez;40151215;094063155;36;-;2019-07-14T13:18:11.733Z

我在某些情况下与第一个字段类似:

Name;full;;;;;;;;;;;;;;;;;;CI;;;;;;;FG;;;AG;DG;Date;(UTC)    
Virnia;Ray;;;;;;;;;;;;;;;;;;;34842865;093161455;-;;;;-;;;;;2019-07-1T12:09:31.378Z    
Vitoxia;Sureez;;;;;;;;;;;;;;;;;;40151215;094063155;36;;;-;;;;;2019-07-14T13:18:11.733Z

我只需要在第一个字段中输入逗号前的数据,之后的所有内容都将被删除。如您所见,该行中其余数据的“行”不在同一行中。

原始数据来自一个由html2text解析的HTML代码,原始代码为:

Mar▒a Xatia Mecrdiz
M▒ndrz, yrcr▒a
cdcsurtmz at ruy opdx
lxtrb mxs2axs rl tsactfg
re xorts tdz drfod t     33743642 095518568 41   -     2019-06-12T13:48:40.200Z
zude def rtexetggacvc
opyxo ae f▒xuda tcso
dxzdtctfgs ti x9mdfggfhh
sx 7dfgab, asvro oi sz op
dgeto jxgdmszdd.

也许这里有一些实用程序可以代替html2text来直接通过渲染工具以更好的形状完成这项工作。

这是包含更多记录的html表。

  <b>Mon Jul 05 2019</b><hr><table style="border: 1px solid
#dddddd;border-collapse: collapse;text-align: left;"><tr><th style="padding: 8px;background-color: #cce6ff">Name Full</th><th style="padding: 8px;background-color: #cce6ff">FG</th><th style="padding: 8px;background-color: #cce6ff">CG</th><th style="padding: 8px;background-color: #cce6ff">AG</th><th style="padding: 8px;background-color: #cce6ff">MG</th><th style="padding: 8px;background-color: #cce6ff">Date (UTC)</th><tr><th style="padding: 8px;background-color: #dddddd">Mrída Xatia Mecrdiz Míndrz, yrcrría cdcsurtmz at ruy opdxlxtrb mxs2axs rl tsactfgre xorts tdz drfod t   zude def rtexetggacvcopyxo ae féxuda tcsodxzdtctfgs ti x9mdfggfhhsx 7dfgab, asvro oi sz op
    dgeto jxgdmszdd.</th><th style="padding: 8px;background-color: #dddddd">33743642</th><th style="padding: 8px;background-color: #dddddd">095518568</th><th style="padding: 8px;background-color: #dddddd">41</th><th style="padding: 8px;background-color: #dddddd">-</th><th style="padding: 8px;background-color: #dddddd">2019-05-12T13:48:40.200Z</th></tr><tr><th style="padding: 8px;">Cdlga foxa</th><th style="padding: 8px;">45285726</th><th style="padding: 8px;">092641968</th><th style="padding: 8px;">28</th><th style="padding: 8px;">-</th><th style="padding: 8px;">2019-06-11T13:50:52.091Z</th></tr></table>

1 个答案:

答案 0 :(得分:1)

我仍然没有足够的代表来评论,所以我假设名称由名字和姓氏组成(上面写着全名),并且不是空白。

代码:

while (<>) {
    #Removes new line
    chomp;

    #Sets delimiter
    $delimiter = ";";

    #clean up text
    #removes _ and ;
    s![_;]!!g; 

    #removes leading spaces                 
    s!^ *!!;        

    #replace multiple whitespace (you mentioned it's delimited by space and tab) with ;            
    s/(\s)\1*/$delimiter/ge;    

    #Remove Delimiter from the Date (UTC). /i ignores the case
    s/Date.+\(UTC\)/Date (UTC)/i;   

    #Removes delimiter in Full name
    s/^([^;]+);([^;]+)/$1 $2/i; 

    #Removes decimal on Field 4
    s/^([^;]+;[^;]+;[^;]+;)([^;.]+)\.?[^;]*/"$1$2"/e; 

    print "$_\n";
}

输出

Name full;CI;FG;AG;DG;Date (UTC)
Virnia Ray;34842865;093161455;-;-;2019-07-12T12:09:31.378Z
Vitoxia Sureez;40151215;094063155;36;-;2019-07-14T13:18:11.733Z

注释

在使用!时,我仅在某些正则表达式中使用/来修复语法突出显示