我有一个Splunk生成的CSV文件,其格式类似于以下内容:
Category,URL,Hash,ID,"__mv_Hash","_mkv_ID"
binary,somebadsite.com/file.exe,12345abcdef,123,,,
callback,bad.com,,567,,,
我需要做的是遍历CSV文件,维护标题顺序,如果结果是二进制或回调则采取不同的操作。对于这个例子,如果结果是二进制,我将返回任意的" clean"或者"脏"评分,如果是回调,我会打印出详细信息。
以下是我目前计划使用的代码,但我是Python新手,希望对代码提供反馈,以及是否有更好的方法来实现这一目标。如果结果是二进制的,我还没有完全清楚我处理的方式之间的区别:for k in (k for k in r.fieldnames if (not k.startswith("""__mv_""") and not k.startswith("""_mkv_""")))
以及如果不是,我将如何处理。两者都取得了相同的结果,那么一方面的好处是什么呢?
import gzip
import csv
import json
csv_file = 'test_csv.csv.gz'
class GZipCSVReader:
def __init__(self, filename):
self.gzfile = gzip.open(filename)
self.reader = csv.DictReader(self.gzfile)
self.fieldnames = self.reader.fieldnames
def next(self):
return self.reader.next()
def close(self):
self.gzfile.close()
def __iter__(self):
return self.reader.__iter__()
def get_rating(hash):
if hash == "12345abcdef":
rating = "Dirty"
else:
rating = "Clean"
return hash, rating
def print_callback(result):
print json.dumps(result, sort_keys=True, indent=4, separators=(',',':'))
def process_results_content(r):
for row in r:
values = {}
values_misc = {}
if row["Category"] == "binary":
# Iterate through key:value pairs and add to dictionary
for k in (k for k in r.fieldnames if (not k.startswith("""__mv_""") and not k.startswith("""_mkv_"""))):
v = row[k]
values[k] = v
rating = get_rating(row["Hash"])
if rating[1] == "Dirty":
print rating
else:
for k in r.fieldnames:
if not k.startswith("""__mv_""") and not k.startswith("""_mkv_"""):
v = row[k]
values_misc[k] = v
print_callback(values_misc)
r.close()
if __name__ == '__main__':
r = GZipCSVReader(csv_file)
process_results_content(r)
最后,for...else
循环会更好而不是做if row["Category"] == "binary"
这样的事情吗?例如,我可以做一些如:
def process_results_content(r):
for row in r:
values = {}
values_misc = {}
for k in (k for k in r.fieldnames if (not row["Category"] == "binary")):
v = row[k]
...
else:
v = row[k]
...
似乎这就是相同的逻辑,其中第一个子句捕获任何不是二进制的东西,第二个子句捕获其他所有东西,但似乎没有产生正确的结果。
答案 0 :(得分:1)
我使用pandas
库。
<强>代码:强>
import pandas as pd
csv_file = 'test_csv.csv'
df = pd.read_csv(csv_file)
df = df[["Category","URL","Hash","ID"]] # Remove the other columns.
get_rating = lambda x: "Dirty" if x == "12345abcdef" else "Clean"
df["Rating"] = df["Hash"].apply(get_rating) # Assign a value to each row based on Hash value.
print df
j = df.to_json() # Self-explanatory. :)
print j
<强>结果:强>
Category URL Hash ID Rating
0 binary somebadsite.com/file.exe 12345abcdef 123 Dirty
1 callback bad.com NaN 567 Clean
{"Category":{"0":"binary","1":"callback"},"URL":{"0":"somebadsite.com\/file.exe","1":"bad.com"},"Hash":{"0":"12345abcdef","1":null},"ID":{"0":123,"1":567},"Rating":{"0":"Dirty","1":"Clean"}}
如果这是您的预期结果,请将上述内容替换为GZipReader
,因为我没有模拟gzip
文件的开头。