Question

I'm inexperienced with the shell / Mac terminal, so any help or advice is much appreciated.

I have a very large tab-delimited dataset. Here is a sample of the data.

0001    User1    Tweet1
0002    User2    Tweet2
0003    User3    Tweet3
0004    User2    Tweet4
0005    User2    Tweet5

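I have been trying to export a csv that lists each unique user together with the number of times they appear, i.e. how many tweets they posted.

Here is my attempt at the code so far: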

cut -f 2 Twitter_Data_1 |sort | uniq -c | wc -l > TweetFreq.csv

Ideally, I would like to export a csv that looks like this:

User1    1
User2    3
User3    1

Answer 1

$ awk -F '\t' '{ print $2 }' tweet | sort | uniq -c

Output:

  1 User1
  3 User2
  1 User3
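
To get the layout asked for in the question (user first, then count) and save it to a file, the output of `uniq -c` can be reordered with a second `awk`. The `wc -l` at the end of the attempt in the question replaces the per-user list with a single line count, so it is dropped here. This is a minimal sketch, assuming the input really is tab-separated and no username contains whitespace; the `workdir` variable and the `printf` line that recreates the sample rows are only there so the snippet runs on its own:

```shell
# Work in a scratch directory so no real data is overwritten (illustrative only).
workdir=$(mktemp -d)

# Recreate the five sample rows from the question, tab-separated.
printf '0001\tUser1\tTweet1\n0002\tUser2\tTweet2\n0003\tUser3\tTweet3\n0004\tUser2\tTweet4\n0005\tUser2\tTweet5\n' > "$workdir/Twitter_Data_1"

# Count tweets per user, then swap the columns to "user<TAB>count".
cut -f 2 "$workdir/Twitter_Data_1" | sort | uniq -c | awk '{ print $2 "\t" $1 }' > "$workdir/TweetFreq.csv"

cat "$workdir/TweetFreq.csv"
```

Changing `"\t"` to `","` in the last `awk` would produce comma-separated output instead.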

Answer 2

Not the cleanest, but it works.

#!/bin/bash
mkdir tmptweet # create the temp directory
while IFS= read -r line; do
    user=$(printf '%s\n' "$line" | cut -f 2) # extract the username (second tab-separated field)
    printf '%s\n' "$line" >> "tmptweet/$user" # append the line to that user's file
done < Twitter_Data_1

for file in tmptweet/*; do
    i=$(wc -l < "$file") # count the lines stored for each user ...
    echo "${file##*/} $i" # ... and emit "user count"
done > TweetFreq.csv # write all counts to the final file
rm -rf tmptweet # remove the temp directory

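A temp directory with temp files is used to store the values, which is easier than using an array.

Each line of your Twitter_Data_1 is appended to a file named after the username, then the lines in each file are counted to create the TweetFreq.csv file.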

Test:

Will /home/will # ls
script.sh     Twitter_Data_1
Will /home/will # ./script.sh
Will /home/will # ls
script.sh     Twitter_Data_1     TweetFreq.csv
Will /home/will # cat TweetFreq.csv
User1        1
User2        3
User3        1
Will /home/will #
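
For comparison, the counts can also be kept in memory rather than in temporary files. The sketch below is not the script above but an alternative that uses an `awk` associative array; it assumes tab-separated input, and the `workdir` variable and the `printf` line that recreates the sample rows are only there so it runs standalone:

```shell
# Work in a scratch directory so no real data is overwritten (illustrative only).
workdir=$(mktemp -d)

# Recreate the five sample rows from the question, tab-separated.
printf '0001\tUser1\tTweet1\n0002\tUser2\tTweet2\n0003\tUser3\tTweet3\n0004\tUser2\tTweet4\n0005\tUser2\tTweet5\n' > "$workdir/Twitter_Data_1"

# count[$2]++ tallies each username in a single pass; END prints the totals.
# The order of awk's for-in loop is unspecified, so the result is sorted afterwards.
awk -F '\t' '{ count[$2]++ } END { for (user in count) print user "\t" count[user] }' \
    "$workdir/Twitter_Data_1" | sort > "$workdir/TweetFreq.csv"

cat "$workdir/TweetFreq.csv"
```

This reads the file once and creates no per-user files, which may matter for a very large dataset with many distinct users.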

Find the frequency of each item in a column using shell
