Question

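I have a .csv file of strings (about 5,400 of them) that occur, among many other strings, in a large .txt corpus file. I need to count how many times each of the 5,400 strings occurs in the .txt corpus file. I'm using the shell (I have a MacBook Pro), but I don't know how to write a for loop that takes its input from one file and then processes another file. input_file.csv looks like this: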

A_back
A_bill
A_boy
A_businessman
A_caress
A_chat
A_con
A_concur
A_cool
A_cousin
A_discredit
A_doctor
A_drone_AP_on
A_fellow
A_flatter
A_friend
A_gay
A_giddy
A_guilty
A_harangue
A_ignore
A_indulge
A_interested
A_kind
A_laugh
A_laugh_AP_at
...

The corpus_file.txt that I'm searching is a cleaned and lemmatized corpus with one sentence per line; here are 4 lines of the text:

A_recently N_pennsylvania N_state_N_university V_launch a N_program that V_pay A_black N_student AP_for V_improve their N_grade a N_c AP_to N_c A_average V_bring 550 and N_anything A_high V_bring 1,100
A_here V_be the N_sort AP_of A_guilty N_kindness that V_kill
what N_kind AP_of N_self_N_respect V_be a A_black N_student V_go AP_to V_have AP_as PR_he or PR_she V_reach AP_out AP_to V_take 550 AP_for N_c N_work A_when A_many A_white N_student V_would V_be V_embarrass AP_by A_so A_average a N_performance
A_white N_student V_would V_be V_embarrass AP_by A_so A_average a N_performance

I want to count exactly how many times each string in input_file.csv occurs in corpus_file.txt. I can do this one string at a time with the following code:

grep -c A_guilty corpus_file.txt

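Within a few seconds I get the number of times A_guilty occurs in corpus_file.txt (it occurs once in the corpus excerpt I've put here). However, I don't want to repeat that 5,400 times, so I'm trying to put it into a loop that will output each string and its count.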

I tried running the following code:

for input_file.csv in directory/path/folder/ do grep -c corpus_file.txt done

But it doesn't work. input_file.csv and corpus_file.txt are both in the same folder, so they have the same directory path.

I'd like to end up with a list of the 5,400 strings and the number of times each one occurs in the large corpus_file.txt file. Like this:

term - count
A_back - 2093
A_bill - 873
A_boy - 1877
A_businessman - 148
A_caress - 97
A_chat - 208
A_con - 633

Answer 1

This may be all you need:

$ cat words
sweet_talk
white_man
hispanic_american

$ cat corpus
foo
sweet_talk
bar
hispanic_american
sweet_talk

$ grep -Fowf words corpus | sort | uniq -c
      1 hispanic_american
      2 sweet_talk
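
The pipeline above prints only the terms that occur at least once, with the count in front. If the `term - count` layout from the question is wanted, including terms with zero hits, a single pass with awk is one option. This is a minimal sketch rather than a definitive solution: the function name `count_terms` is made up for illustration, and it assumes that the term file is non-empty and that every term appears in the corpus as a whole whitespace-separated token, as in the sample lines in the question.

```shell
# count_terms TERMS_FILE CORPUS_FILE
# Hypothetical helper: prints "term - count" for every listed term, zeros included.
count_terms() {
    awk '
        # First file (assumed non-empty): store each term once, in file order.
        NR == FNR {
            if (NF && !($1 in count)) { count[$1] = 0; order[++n] = $1 }
            next
        }
        # Second file: increment the counter of every token that is a listed term.
        {
            for (i = 1; i <= NF; i++)
                if ($i in count) count[$i]++
        }
        END {
            print "term - count"
            for (j = 1; j <= n; j++)
                print order[j] " - " count[order[j]]
        }
    ' "$1" "$2"
}

# Usage with the same sample data as above, written to a temporary directory.
dir=$(mktemp -d)
printf '%s\n' sweet_talk white_man hispanic_american > "$dir/words"
printf '%s\n' foo sweet_talk bar hispanic_american sweet_talk > "$dir/corpus"
count_terms "$dir/words" "$dir/corpus"
rm -r "$dir"
```

With the sample files this should list `sweet_talk - 2`, `white_man - 0` and `hispanic_american - 1` under the header line. The corpus is read only once, so the run time should not grow with the number of terms the way a per-term loop does.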

Otherwise, please edit your question to clarify your requirements and provide more realistic sample input/output.
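
For completeness, the loop attempted in the question can be written with `while read`. A `for x in some/path` loop iterates over the words typed after `in`, not over the lines of a file, which is why that attempt fails. The sketch below uses a made-up function name, `count_with_loop`, and counts occurrences with `grep -o ... | wc -l`; note that `grep -c` counts matching lines, so it would undercount a term that appears twice on one line.

```shell
# count_with_loop TERMS_FILE CORPUS_FILE
# Hypothetical helper: runs one grep per term and prints "term - count".
count_with_loop() {
    # The "|| [ -n ... ]" part keeps a last line that lacks a trailing newline.
    while IFS= read -r term || [ -n "$term" ]; do
        [ -n "$term" ] || continue    # skip blank lines
        # -F: fixed string, -o: one output line per match, -w: whole words only
        n=$(grep -Fow -e "$term" "$2" | wc -l | tr -d ' ')
        printf '%s - %s\n' "$term" "$n"
    done < "$1"
}

# Usage with the sample data from this answer, written to a temporary directory.
dir=$(mktemp -d)
printf '%s\n' sweet_talk white_man hispanic_american > "$dir/words"
printf '%s\n' foo sweet_talk bar hispanic_american sweet_talk > "$dir/corpus"
count_with_loop "$dir/words" "$dir/corpus"
rm -r "$dir"
```

This rescans the corpus once per term, so with about 5,400 terms it is likely to be much slower than a single `grep -f` pass; it is shown mainly to illustrate how a loop can read its input from one file while searching another.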
