Question

Here is the script in question:

for file in `ls products`
do
  echo -n `cat products/$file \
  | grep '<td>.*</td>' | grep -v 'img' | grep -v 'href' | grep -v 'input' \
  | head -1  | sed -e 's/^ *<td>//g' -e 's/<.*//g'`
done

I am going to run it on more than 50,000 files, and the script takes about 12 hours.

The algorithm is as follows:

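Find only the lines that contain a table cell (<td>) and do not contain any of "img", "href", or "input".
Select the first such line, then extract the data between the tags.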

The usual bash text filters (sed, grep, awk, etc.) can be used, as well as perl.

Answer 1

It looks like all of that can be replaced by a single gawk command:

gawk '
    /<td>.*<\/td>/ && !(/img/ || /href/ || /input/) {
        sub(/^ *<td>/,""); sub(/<.*/,"")
        print
        nextfile
    }
' products/*

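This uses the gawk extension nextfile, which skips the rest of the current file as soon as the first match has been printed.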
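
Where nextfile is not available, a per-file flag gives the same output in plain awk. This is only a minimal sketch, not part of the original answer: the function name first_text_cells and the variable found are made up for illustration, and unlike nextfile it still reads the rest of each file after the match.

```shell
# Print the first plain-text <td> cell of every file given as an argument.
# Hypothetical helper name; the awk program avoids gawk-only features.
first_text_cells() {
    awk '
        FNR == 1 { found = 0 }      # first line of a new input file: reset the flag
        !found && /<td>.*<\/td>/ && !/img|href|input/ {
            sub(/^ *<td>/, ""); sub(/<.*/, "")
            print
            found = 1               # ignore further matches in this file
        }
    ' "$@"
}

# Usage: first_text_cells products/*
```

The whole run still uses a single awk process, which is where most of the saving over the original loop comes from.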

If the wildcard expansion is too long for a single command line, use:

find products -type f -print | xargs gawk '...'
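
As a sketch of the same idea that also copes with file names containing spaces or newlines, find can batch the arguments itself with -exec ... {} +, so no xargs is needed. The function name first_text_cells_in is hypothetical, and the awk program here is a portable stand-in (a per-file flag instead of nextfile), not the gawk program from this answer.

```shell
# Run one awk process per batch of regular files found under directory $1.
# Hypothetical helper name; output order follows find, which is unspecified.
first_text_cells_in() {
    find "$1" -type f -exec awk '
        FNR == 1 { found = 0 }          # reset at the start of each file
        !found && /<td>.*<\/td>/ && !/img|href|input/ {
            sub(/^ *<td>/, ""); sub(/<.*/, "")
            print
            found = 1                   # ignore further matches in this file
        }
    ' {} +
}

# Usage: first_text_cells_in products
```

Because find passes each path as its own argument, nothing is split on whitespace the way it would be with a plain find | xargs pipeline.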

Answer 2

Here is some quick perl that does the whole thing and should be a lot faster.

#!/usr/bin/perl
use strict;
use warnings;

process_files($ARGV[0]);

# process each file in the supplied directory
sub process_files
{
  my $dirpath = shift;
  opendir(my $dh, $dirpath) or die "Can't readdir $dirpath. $!";
  # in scalar context readdir returns one directory entry per call
  while ( defined( my $ent = readdir($dh) ) ){
    if ( -f "$dirpath/$ent" ){
      get_first_text_cell("$dirpath/$ent");
    }
  }
  closedir($dh);
}

# print the content of the first html table cell
# that does not contain img, href or input
sub get_first_text_cell
{
  my $filename = shift;
  open(my $fh, '<', $filename) or die "Can't open $filename. $!";
  while ( my $line = <$fh> ){
    ## capture html and text inside a table cell
    if ( $line =~ /<td>([&;\d\w\s"'<>]+)<\/td>/i ){
      my $cell = $1;

      ## omit anything mentioning img, href or input
      ## (href is an attribute, so do not require a leading "<")
      if ( $cell !~ /img|href|input/ ){
        print "$cell\n";
        last;   # stop reading at the first match
      }
    }
  }
  close($fh);
}

Just call it with the directory to search as the first argument:

$ perl parse.pl /html/documents/

Answer 3

How about this (it should be faster and clearer):

for file in products/*; do
    grep -P -o '(?<=<td>).*(?=<\/td>)' "$file" | grep -vP -m 1 '(img|input|href)'
done

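The for loop goes through every file in products. Note the difference from your syntax: the glob products/* replaces the `ls products` command substitution.
As long as each cell is on a single line, the first grep outputs only the text between <td> and </td>, without the tags themselves.
Finally, the second grep outputs only the first of those lines (which is what I believe you wanted to achieve with head -1) that does not contain img, href, or input. It exits right there, which cuts the total time because the next file can be processed sooner.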

I would have liked to use just one grep, but the regex would be really awful. :-)
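
One way to get down to a single process per file without a complicated regex might be to let sed do the filtering, the extraction, and the early exit. The following is only a sketch under the same one-cell-per-line assumption: the function name first_text_cell is made up, and the two substitutions are the ones from the script in the question.

```shell
# Print the first <td>...</td> line of one file that mentions none of
# img, href or input, with the surrounding tags stripped.
# Hypothetical helper name.
first_text_cell() {
    sed -n '
        /<td>.*<\/td>/!d
        /img/d
        /href/d
        /input/d
        s/^ *<td>//
        s/<.*//
        p
        q
    ' "$1"
}

# Usage: for file in products/*; do first_text_cell "$file"; done
```

The d commands drop every line that fails a test, and q stops reading the file as soon as the first surviving line has been printed.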

Disclaimer: of course I have not tested it.

Bash script optimization
