Question

我想找到一种方法来做到这一点，我知道它应该是可能的。首先是一点背景。

我想自动创建NCBI Sequin块的过程，以便将DNA序列提交给GenBank。我总是最终创建一个表格，列出物种名称，标本ID值，序列类型，最后列出集合的位置。我很容易将其导出到制表符分隔的文件中。现在我做这样的事情：

while ($csv) {
  foreach ($_) {
    if ($_ =! m/table|species|accession/i) {
      @csv = split('\t', $csv);
      print NEWFILE ">[species=$csv[0]] [molecule=DNA] [moltype=genomic] [country=$csv[2]] [spec-id=$csv[1]]\n";
    }
    else {
      next;
    }
  }
}

我知道这很麻烦，我只是输入类似于我记忆的东西（家里的任何一台电脑上都没有脚本，只在工作时）。

现在这对我很有用，因为我知道我需要的信息（种类，位置和ID号）在哪些列中。

但是有没有办法（必须有）让我找到动态所需信息的列？也就是说，无论列的顺序如何，正确列中的正确信息都会转到正确的位置？

第一行通常是表X（其中X是出版物中表的编号），下一行通常会有感兴趣的列标题，并且几乎是标题中的通用。几乎所有的表都有标准的标题来搜索，我可以使用|在我的模式匹配。

Answer 1

首先，如果我不推荐优秀的Text::CSV_XS模块，那将是我的疏忽;它可以更加可靠地读取CSV文件，甚至可以处理Barmar在上面提到的列映射方案。

也就是说，巴马尔有正确的方法，但它忽略了“表X”行完全是一个单独的行。我建议采用一种明确的方法，也许就是这样的（这样做会有更多细节，只是为了清楚说明;我可能会在生产代码中更紧密地编写它）：

# Assumes the file has been opened and that the filehandle is stored in $csv_fh.
# Get header information first.

my $hdr_data = {};

while( <$csv_fh> ) {
  if( ! $hdr_data->{'table'} && /Table (\d+)/ ) {
    $hdr_data->{'table'} = $1;
    next;
  }
  if( ! $hdr_data->{'species'} && /species/ ) {
    my $n = 0;
    # Takes the column headers as they come, creating
    # a map between the column name and column number.
    # Assumes that column names are case-insensitively
    # unique.
    my %columns = map { lc($_) => $n++ } split( /\t/ );
    # Now pick out exactly the columns we want.
    foreach my $thingy ( qw{ species accession country } ) {
      $hdr_data->{$thingy} = $columns{$thingy};
    }
    last;
  }
}

# Now process the rest of the lines.

while( <$csv_fh> ) {
  my $col = split( /\t/ );
  printf NEWFILE ">[species=%s] [molecule=DNA] [moltype=genomic] [country=%s] [spec-id=%s]\n",
    $col[$hdr_data->{'species'}],
    $col[$hdr_data->{'country'}],
    $col[$hdr_data->{'accession'}];
}

一些变化将使您接近所需。

Answer 2

创建一个将列标题映射到列号的哈希：

my %columns;
...

if (/table|species|accession/i) {
  my @headings = split('\t');
  my $col = 0;
  foreach my $col (@headings) {
    $columns{"\L$col"} = $col++;
  }
}

然后您可以使用$csv[$columns{'species'}]。

解析CSV文件，查找列并记住它们

2 个答案: