Question

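I want to process some Twitter datasets with a Perl script. The file is in CSV format.

I want to remove self-mentions.

The CSV columns and data look like this, for example: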

user, mention(user), message  
vims789, vnjuei234, yea this is good  
dfion, youwen12, this is win  
don234, don234, this is green   
wen123, tileas, this is blue

"don234, don234" is a self-mention, so that row should be removed. Expected result:

user, mention(user), message
vims789, vnjuei234, yea this is good
dfion, youwen12, this is win
wen123, tileas, this is blue

Answer 1

Maybe something like this:

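#!/usr/bin/perl
use strict;
use warnings;

use Text::CSV;

# allow_whitespace strips the spaces around each field, so that
# "don234" and " don234" compare as equal.
my $csv = Text::CSV->new( { binary => 1, allow_whitespace => 1 } )
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag();

while ( my $row = $csv->getline( \*DATA ) ) {
    my ( $user, $mention ) = @$row;
    # Print the whole row unless the user mentions themselves.
    print join( ', ', @$row ), "\n" unless $user eq $mention;
}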
__DATA__
user, mention(user), Message  
vims789, vnjuei234, yea this is good  
dfion, youwen12, this is win  
don234, don234, this is green   
wen123, tileas, this is blue

Answer 2

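You can do this quickly with a backreference. Since you want to find something, a comma, some whitespace, and then that same thing again, and assuming the strings consist entirely of word characters, this should work: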

my $regex
    = qr{ ^     # beginning of the line
          (\w+) # A "word"
          ,     # A comma
          \s+   # space 
          \1    # a back reference to the first capture.
          \b    # demand that it end the sequence of word characters.
        }x;

my @filtered_lines = grep { !m/$regex/ } @lines;
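
For reference, here is a minimal self-contained sketch of the same idea applied to the sample rows from the question. The helper name filter_self_mentions and the hard-coded @lines array are assumptions made for illustration; in practice the lines would come from the CSV file. Note that this only works when user names are made up of word characters (letters, digits, underscore).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same pattern as above: a word, a comma, whitespace, then the same word again.
my $regex = qr{ ^ (\w+) , \s+ \1 \b }x;

# Hypothetical helper: returns only the lines that are not self-mentions.
sub filter_self_mentions {
    my @lines = @_;
    return grep { !m/$regex/ } @lines;
}

# Sample rows from the question, hard-coded for illustration.
my @lines = (
    "user, mention(user), message",
    "vims789, vnjuei234, yea this is good",
    "dfion, youwen12, this is win",
    "don234, don234, this is green",
    "wen123, tileas, this is blue",
);

print "$_\n" for filter_self_mentions(@lines);
```

The trailing \b matters: without it, a row such as "don, don234, ..." would be treated as a self-mention because "don" is a prefix of "don234".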

Perl script: removing rows with self-mentions
