Question

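I have been trying to search a specific column of a .csv file for cells that contain a specific word. However, it only works for the first row (i.e. the header) of my .csv file.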

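The file is a series of more than 10,000 forum posts, with a key for each post in column 1 and the text of the post in column 2. The headers are "key" and "annotated sentence", as shown below.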

key,annotated sentence
"(212, 2)","Got evidence to back that up??

I'm not sure how a stoner's worse than an alcoholic really.

-Wez"
"(537, 5)","Forgive me for laughing; no, not really ha, ha, ha ha ha
Could it be that people here as well as Canada and the rest of the world has figured out your infantile ""grading system of States"" is a complete sham and your very reason for existing is but an anti-constitutional farce and has lost any claims you have or will make? You stand alone now brady, with simply a few still clinging to the false hope of having others pay for your failures and unConstitutional stance so you can sit on your hands while you keep harping on overturning the 2A."
"(595, 0)",So you're actually claiming that it is a lie to say that the UK has a lower gun crime rate than the US? Even if the police were miscounting crimes it's still a huge and unjustified leap in logic to conclude from that that the UK does not have a lower gun crime rate.
"(736, 3)","The anti-abortionists claim a load of **** on many issues. I don't listen to them. To put the ""life"" of an unfertilized egg above that of a person is grotesquely sick IMO. I support any such stem cell research wholeheartedly."

The CSV delimiter is a comma and the text delimiter is a double quote (").

If I try:

awk -F, '$1 ~ /key/ {print}' posts_file.csv > output_file.csv

It outputs the header row without a problem. However, I have also tried:

awk -F, '$1 ~ /212/ {print}' posts_file.csv > output_file.csv
awk -F, '$2 ~ /Canada/ {print}' posts_file.csv > output_file.csv

and neither of these works: no matches are found, although there should be some. I can't figure out why. Any ideas? Thanks in advance.

Answer 1

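awk to the rescue!

In general awk does not cope with complex CSV, but in your case the key and the annotated sentence hold very different kinds of values, so you can run the pattern search against the whole record instead of against the key or the value separately. The trick is to define what a record is, which your format allows: every record starts with a double quote at the beginning of a line. Note that this relies on GNU awk: RT is a gawk extension, and POSIX leaves a multi-character RS unspecified. For example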

$ awk -v RS='\n"' '/Canada/{print RT $0}' csv    

"(537, 5)","Forgive me for laughing; no, not really ha, ha, ha ha ha
Could it be that people here as well as Canada and the rest of the world has figured out your infantile ""grading system of States"" is a complete sham and your very reason for existing is but an anti-constitutional farce and has lost any claims you have or will make? You stand alone now brady, with simply a few still clinging to the false hope of having others pay for your failures and unConstitutional stance so you can sit on your hands while you keep harping on overturning the 2A."

and this

$ awk -v RS='\n"' '/(212, 2)/{print RT $0}' csv       

"(212, 2)","Got evidence to back that up??

I'm not sure how a stoner's worse than an alcoholic really.

-Wez"
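For comparison, here is a minimal Python sketch of the same record-splitting idea, not a translation of the awk command itself. The function name find_records and the abridged sample text are made up for illustration. Like the awk version, it assumes that a line starting with a double quote always begins a new record, so a quoted field with a continuation line that itself starts with a double quote would be split incorrectly.

```python
import re

def find_records(text, pattern):
    """Return the records that match `pattern` (hypothetical helper).

    A record is assumed to start at any line that begins with a double quote.
    """
    # Split at every newline that is followed by a double quote.
    # The lookahead keeps the quote attached to the record it opens.
    records = re.split(r'\n(?=")', text)
    rx = re.compile(pattern)
    return [rec for rec in records if rx.search(rec)]

# Abridged sample in the same layout as the question's file.
sample = (
    'key,annotated sentence\n'
    '"(212, 2)","Got evidence to back that up??\n'
    '\n'
    "I'm not sure how a stoner's worse than an alcoholic really.\n"
    '\n'
    '-Wez"\n'
    '"(595, 0)",So you\'re actually claiming that it is a lie?\n'
)

for rec in find_records(sample, r'stoner'):
    print(rec)
```

The whole multi-line post is printed because the search runs over the complete record, not over a single physical line.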

Answer 2

OpenRefine can help you with the search.

Answer 3

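One way to use awk safely with complex CSV is to use a "csv2tsv" utility to convert the CSV file to a format that awk can handle properly. Usually the TSV ("tab-separated values") format is suited to the job.

(If the final output must be CSV, you can use a complementary "tsv2csv" utility, or awk itself can do the job, though some care may be needed to get it exactly right.)

So the pipeline might look like this: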

csv2tsv < input.csv | awk -F\\t 'BEGIN{OFS=FS} ....' | tsv2csv

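There are various options for the CSV-to-TSV conversion, from roll-your-own scripts to Excel, but I recommend taking the time to check that the tool or toolset you choose meets the "edge case" requirements that matter to you.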
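As an illustration of the roll-your-own option, here is a minimal Python sketch of a CSV-to-TSV converter. It is a hypothetical stand-in, not the csv2tsv utility mentioned above, and the escaping convention (backslash escapes for tabs and line breaks inside a field) is an assumption; whatever reads the TSV has to agree on it.

```python
import csv
import io

def csv_to_tsv(src, dst):
    """Copy CSV records from src to dst as one-line, tab-separated records.

    Hypothetical helper: the backslash escaping below is an assumed
    convention, not a standard that every TSV consumer understands.
    """
    for row in csv.reader(src):
        # Escape backslashes first, then tabs and line breaks, so that
        # every record ends up on exactly one physical line.
        cells = [
            cell.replace('\\', '\\\\')
                .replace('\t', '\\t')
                .replace('\r', '\\r')
                .replace('\n', '\\n')
            for cell in row
        ]
        dst.write('\t'.join(cells) + '\n')

# Small in-memory demonstration with a multi-line quoted field.
src = io.StringIO('key,text\n"(212, 2)","line one\nline two"\n')
dst = io.StringIO()
csv_to_tsv(src, dst)
print(dst.getvalue())
```

After the conversion, each post occupies one line and the comma inside the key no longer acts as a separator, so a field test such as $1 ~ /212/ with a tab field separator sees the whole key.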

Answer 4

Python's CSV parsing supports your format out of the box.

Below is a simple script that you can invoke as follows:

# csvfilter <1-basedFieldNdx> <regexStr> < <infile> > <outfile>
csvfilter 1 'key' < posts_file.csv > output_file.csv
csvfilter 1 '212' < posts_file.csv > output_file.csv
csvfilter 2 'Canada' < posts_file.csv > output_file.csv

Sample script csvfilter:

#!/usr/bin/env python
# coding=utf-8

import csv, sys, re

# Assign arguments to variables.
fieldNdx = int(sys.argv[1]) - 1  # Index of field to filter; Python arrays are 0-based!
reStr = sys.argv[2] if (len(sys.argv) > 2) else '' # Filter regex

# Read from stdin...
reader = csv.reader(sys.stdin)
# ... write to stdout.
writer = csv.writer(sys.stdout, reader.dialect)

# Read each line...
for row in reader:
  # Match the target field against the filter regex and
  # print the row only if it matches.  
  if (re.search(reStr, row[fieldNdx])):
    writer.writerow(row)
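To see why this works where line-based splitting does not, here is a small self-contained check, assuming Python 3 and a sample abridged from the question's data. It shows that csv.reader keeps a multi-line quoted field together as one cell and turns doubled quotes back into single ones before the regex is applied.

```python
import csv
import io
import re

# Abridged sample: one field spans several lines, another has doubled quotes.
sample = '''key,annotated sentence
"(212, 2)","Got evidence to back that up??

-Wez"
"(736, 3)","To put the ""life"" of an unfertilized egg above that of a person."
'''

rows = list(csv.reader(io.StringIO(sample)))

# Filter on the second column, as `csvfilter 2 'life'` would.
matches = [row for row in rows if re.search('life', row[1])]

print(len(rows))       # header plus two posts
print(matches[0][0])   # key of the matching post
print(matches[0][1])   # text with the doubled quotes already unescaped
```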

Searching a .csv file using AWK - only works on the first row
