How to count occurrences of a string in a CSV file

Time: 2016-10-20 20:06:25

Tags: csv unix text awk sed

I have a CSV file:

author,host,authority,contents
_angelsuman,http://twitter.com/_angelsuman,5,green tea piyo :( #kicktraileron6thjune
_angelsuman,http://twitter.com/_angelsuman,5,rt @121training fat burning foods: grapefruit  watermelon  berries  hot peppers  celery  greek yogurt  eggs  fish  green tea  coffee  water  oatmeal.
_angelsuman,http://twitter.com/_angelsuman,5,rt @121training fat burning foods: â´ grapefruit â´ watermelon â´ berries â´ hot peppers â´ celery â´ greek yogurt â´ eggs â´ fish â´ green tea â´ oatmeal
anukshp,http://twitter.com/anukshp,4,rt @_angelsuman dear green tea u suck..:/ but i need to sip uh for myh rsn :( zindagi ka kdwa such :/ :(

I want to count how many times each value of the first column, "author", occurs in the fourth column, "contents".

Example: finding "_angelsuman" in contents.

Please suggest how I can achieve this.

2 answers:

Answer 0 (score: 1)

Using Perl's Text::CSV module:

use strict;
use warnings;
use Text::CSV;

my $col = 4;    # 4th column

my $count = 0;
my $csv = Text::CSV->new ( { binary => 1 } )  # should set binary attribute.
    or die "Cannot use CSV: ".Text::CSV->error_diag ();

open my $fh, "<:encoding(utf8)", "/tmp/test.csv" or die "test.csv: $!";
while ( my $row = $csv->getline( $fh ) ) {
    # Count the rows whose 4th field is exactly 'author'
    if ($row->[$col - 1] eq 'author') {
        $count++;
    }
}
$csv->eof or $csv->error_diag();
close $fh;
print "There's $count occurrences of 'author'\n";

Output:

There's 1 occurrences of 'author'

Notes:

This parses the CSV properly, using a Perl module.

Replace /tmp/test.csv with the path to your own file.
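
To illustrate why a real CSV parser matters: a quoted field may itself contain commas, and naive splitting on commas then miscounts the columns. A minimal sketch, where the sample line and the helper name naive_field_count are made up for illustration:

```shell
# Count fields by splitting on every comma, ignoring CSV quoting rules.
naive_field_count() {
    printf '%s\n' "$1" | awk -F, '{ print NF }'
}

# The line below has 3 CSV fields (the second is quoted and contains a comma),
# but naive splitting reports 4.
naive_field_count 'a,"b,c",d'
```

A parser such as Text::CSV treats the quoted field as a single value, so column numbers stay reliable.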

Answer 1 (score: 1)

You can do it as follows (assuming, as you describe, that the values contain no commas).

One-liner:

awk -F, 'NR>1 {author[$1]=0; content[NR]=$4} END {for (a in author) {for (c in content) {count[a]+=gsub(a,"",content[c])} print a, count[a]}}' file

Expanded:

awk -F, '
    NR>1 {
        author[$1]=0;
        content[NR]=$4
    }
    END {
        for (a in author) {
          for (c in content) {
              count[a] += gsub(a,"",content[c])
          }
          print a, count[a]
        }
    }' file

How it works

  • Read the file using a comma as the field separator (-F,) and skip the header line (NR>1).

    awk -F, 'NR>1

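  • Store the first column as keys of the array author, so that each unique value is stored only once. Store the contents in the array content, keyed by the line number NR, so that the contents of every line are kept.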

    {
    author[$1]=0;
    content[NR]=$4
    }
    
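  • Finally, iterate over each unique author with for (a in author) and, for each author, over every stored contents value with for (c in content), adding the number of occurrences of the author in that value to the author's total: count[a]+=gsub(a,"",content[c]). Once the total for an author is complete, print it with print a, count[a].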

    END {
        for (a in author) {
          for (c in content) {
            count[a]+=gsub(a,"",content[c])
          }
          print a, count[a]
        }
    }' file
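
The counting step relies on gsub returning the number of substitutions it made. A minimal sketch of that behaviour in isolation (the helper name count_with_gsub and the variables pat, text and n are only illustrative):

```shell
# gsub(regex, replacement, target) returns how many matches it replaced.
# It also modifies the target, and it treats the pattern as a regular
# expression rather than as a literal string.
count_with_gsub() {
    # $1 = pattern (an author name), $2 = text to search
    awk -v pat="$1" -v text="$2" 'BEGIN { n = gsub(pat, "", text); print n }'
}

count_with_gsub "_angelsuman" "rt @_angelsuman dear green tea u suck"   # prints 1
count_with_gsub "anukshp" "rt @_angelsuman dear green tea u suck"       # prints 0
```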
    

Output (for the sample file in the question):

_angelsuman 1
anukshp 0
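
The awk command in this answer assumes that no value contains a comma, and it passes each author name to gsub as a regular expression, so a name such as a.b would also match axb. A possible variant that avoids both issues is sketched below. It assumes that only the last column may contain commas and that no field is quoted. The function names count_mentions and count_literal, and the sample data, are made up for illustration.

```shell
# count_mentions FILE: for each distinct author (column 1), print how many
# times that name appears literally anywhere in the contents column.
count_mentions() {
    awk '
    # Count non-overlapping literal occurrences of needle in haystack.
    function count_literal(haystack, needle,    n, pos) {
        if (needle == "") return 0
        n = 0
        while ((pos = index(haystack, needle)) > 0) {
            n++
            haystack = substr(haystack, pos + length(needle))
        }
        return n
    }
    NR > 1 {
        # Assumption: only the last column may contain commas, so the
        # contents are everything after the third comma.
        split($0, f, ",")
        author[f[1]] = 1
        content[NR] = substr($0, length(f[1]) + length(f[2]) + length(f[3]) + 4)
    }
    END {
        for (a in author) {
            total = 0
            for (c in content) total += count_literal(content[c], a)
            print a, total
        }
    }' "$1" | sort
}

# Usage with a small made-up file:
sample=$(mktemp)
cat > "$sample" <<'EOF'
author,host,authority,contents
alice,http://example.com/alice,5,hello, world
bob,http://example.com/bob,4,rt @alice hi, alice
EOF
count_mentions "$sample"   # prints "alice 2" then "bob 0"
rm -f "$sample"
```

The result is piped through sort because the order in which awk's for (a in author) loop visits keys is unspecified.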