Parse a CSV and take action based on the row

Time: 2015-07-08 05:11:53

Tags: python json csv splunk

I have a CSV file generated by Splunk, in a format similar to the following:

Category,URL,Hash,ID,"__mv_Hash","_mkv_ID"
binary,somebadsite.com/file.exe,12345abcdef,123,,,
callback,bad.com,,567,,,

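What I need to do is iterate over the CSV file, preserving the header order, and take a different action depending on whether the result is a binary or a callback. For this example, if the result is a binary I return an arbitrary "Clean" or "Dirty" rating, and if it is a callback I print out the details.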

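Below is the code I currently plan to use, but I'm new to Python and would like feedback on the code and on whether there is a better way to accomplish this. I'm also not entirely clear on the difference between how I handle the fields when the result is a binary: for k in (k for k in r.fieldnames if (not k.startswith("""__mv_""") and not k.startswith("""_mkv_"""))) and how I handle them when it is not. Both produce the same result, so what is the benefit of one over the other?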

import gzip
import csv
import json

csv_file = 'test_csv.csv.gz'

class GZipCSVReader:
    def __init__(self, filename):
        self.gzfile = gzip.open(filename)
        self.reader = csv.DictReader(self.gzfile)
        self.fieldnames = self.reader.fieldnames

    def next(self):
        return self.reader.next()

    def close(self):
        self.gzfile.close()

    def __iter__(self):
        return self.reader.__iter__()

def get_rating(hash):
    if hash == "12345abcdef":
        rating = "Dirty"
    else:
        rating = "Clean"
    return hash, rating

def print_callback(result):
    print json.dumps(result, sort_keys=True, indent=4, separators=(',',':'))

def process_results_content(r):
    for row in r:
        values = {}
        values_misc = {}

        if row["Category"] == "binary":
            # Iterate through key:value pairs and add to dictionary
            for k in (k for k in r.fieldnames if (not k.startswith("""__mv_""") and not k.startswith("""_mkv_"""))):
                v = row[k]
                values[k] = v
            rating = get_rating(row["Hash"])
            if rating[1] == "Dirty":
                print rating
        else:
            for k in r.fieldnames:
                if not k.startswith("""__mv_""") and not k.startswith("""_mkv_"""):
                    v = row[k]
                    values_misc[k] = v
            print_callback(values_misc)
    r.close()

if __name__ == '__main__':
    r = GZipCSVReader(csv_file)
    process_results_content(r)
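
On the two filtering styles: as far as I can tell they do the same work. Both visit every field name once and skip the ones with the unwanted prefixes, so the difference is readability rather than behavior. A small side-by-side sketch in Python 3 syntax; `is_visible`, `FIELDNAMES` and the sample `row` are made up for this illustration and are not part of the code above.

```python
# Hypothetical sample data mirroring the CSV header and first row above.
FIELDNAMES = ["Category", "URL", "Hash", "ID", "__mv_Hash", "_mkv_ID"]
row = {
    "Category": "binary",
    "URL": "somebadsite.com/file.exe",
    "Hash": "12345abcdef",
    "ID": "123",
    "__mv_Hash": "",
    "_mkv_ID": "",
}


def is_visible(name):
    """True for columns that do not start with one of the unwanted prefixes."""
    # str.startswith accepts a tuple of prefixes.
    return not name.startswith(("__mv_", "_mkv_"))


# Style 1: filter inside a generator expression, then loop over the survivors.
values_a = {}
for k in (k for k in FIELDNAMES if is_visible(k)):
    values_a[k] = row[k]

# Style 2: loop over everything and filter with an if inside the loop body.
values_b = {}
for k in FIELDNAMES:
    if is_visible(k):
        values_b[k] = row[k]

# Style 3: the same filter written as a dict comprehension.
values_c = {k: row[k] for k in FIELDNAMES if is_visible(k)}
```

All three dictionaries end up equal. On Python 3.7 and later a plain dict keeps insertion order, so the header order survives as well; on older interpreters collections.OrderedDict would be needed for that. Note also that json.dumps(..., sort_keys=True) sorts keys alphabetically on output regardless of how the dict was built.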

Finally, would a for...else loop be better than doing something like if row["Category"] == "binary"? For example, I could do something like:

def process_results_content(r):
    for row in r:
        values = {}
        values_misc = {}

        for k in (k for k in r.fieldnames if (not row["Category"] == "binary")):
            v = row[k]
            ...
        else:
            v = row[k]
            ...

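This seems like the same logic, where the first clause catches anything that is not a binary and the second clause catches everything else, but it does not seem to produce the correct result.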
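
For reference, the else on a for loop is not the "otherwise" branch of the filter condition. It belongs to the loop itself and runs once when the loop finishes without hitting break. In the attempt above, the filter not row["Category"] == "binary" does not depend on k, so the generator yields either every field name or none of them, and since there is no break the else block runs for every row. A minimal sketch in Python 3 syntax of what for...else is suited to, next to the plain if/else that fits the per-row branching here; `find_first_binary` and `classify` are hypothetical names used only for illustration.

```python
def find_first_binary(rows):
    """Return the first row whose Category is "binary", or None if there is none."""
    for row in rows:
        if row["Category"] == "binary":
            found = row
            break  # leaving via break skips the else clause below
    else:
        # Reached only when the loop ran to completion without a break.
        found = None
    return found


def classify(row):
    """Pick an action per row with an ordinary if/else, as in the original code."""
    if row["Category"] == "binary":
        return "rate"
    else:
        return "print"
```

So for...else answers "did the search loop find anything?", while choosing between two actions for each row remains a job for a regular if/else inside the loop.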

1 Answer:

Answer 0 (score: 1)

I would use the pandas library.

Code:

import pandas as pd

csv_file = 'test_csv.csv'
df = pd.read_csv(csv_file)
df = df[["Category","URL","Hash","ID"]] # Remove the other columns.

get_rating = lambda x: "Dirty" if x == "12345abcdef" else "Clean"
df["Rating"] = df["Hash"].apply(get_rating) # Assign a value to each row based on Hash value.

print(df)

j = df.to_json() # Serialize the DataFrame to a JSON string (keyed by column by default).
print(j)

Result:

   Category                       URL         Hash   ID Rating
0    binary  somebadsite.com/file.exe  12345abcdef  123  Dirty
1  callback                   bad.com          NaN  567  Clean
{"Category":{"0":"binary","1":"callback"},"URL":{"0":"somebadsite.com\/file.exe","1":"bad.com"},"Hash":{"0":"12345abcdef","1":null},"ID":{"0":123,"1":567},"Rating":{"0":"Dirty","1":"Clean"}}

If this is your intended result, just adapt the above to your GZipCSVReader, since I did not emulate opening the gzip file.
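
Since the gzip step was left out above, here is a rough standard-library sketch of one way to handle it in Python 3, where gzip.open needs text mode ("rt") before the file can be handed to csv.DictReader. The function name `read_gzip_csv` and the `HIDDEN_PREFIXES` constant are assumptions made for this illustration. The sample rows carry one more trailing comma than the header has columns, and DictReader stores such extra fields under the key None, which is why the filter checks for it. (pandas.read_csv can also read a .gz file directly, since by default it infers the compression from the file extension.)

```python
import csv
import gzip

# Prefixes of the helper columns that the question filters out.
HIDDEN_PREFIXES = ("__mv_", "_mkv_")


def read_gzip_csv(path):
    """Yield one dict per CSV row, in header order, without the hidden columns."""
    # "rt" opens the gzip stream as text; newline="" is what the csv module expects.
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle)
        for row in reader:
            yield {
                key: value
                for key, value in row.items()
                # Surplus fields beyond the header land under the key None.
                if key is not None and not key.startswith(HIDDEN_PREFIXES)
            }
```

Usage would look like for row in read_gzip_csv('test_csv.csv.gz'): ..., with the binary/callback branching done on row["Category"] as in the question. Because this is a generator wrapped around a with block, the file is closed once iteration finishes, so no explicit close() call is needed.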