Parse a CSV file and split it into sub-files

Time: 2019-12-04 08:38:41

Tags: python csv lambda

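I am trying to create a generic filter that splits a file based on conditions in a YAML file.

My code works with Pandas, but since the Pandas module is not available in my environment, I am trying to implement it with the csv library.

It works when I hard-code the value of q, but not when I try to pass it in from the configuration file. I also want to allow multiple checks on the same column, for example ('', 'Balance'), so that Asset goes into one file and ('', 'Balance') goes into another.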

import sys
import yaml
import csv


def dynamicQuery(config_file, data_file):
    """Loading Configuration file into dataframe"""
    try:
        with open(config_file) as file:
            doc = yaml.full_load(file)

    except Exception as err:
        print("Error Configuration data file: ", err)

    try:
        for k, v in doc.items():
            if k != 'column':
                filename = k
                k = doc[k]
                q = ' , '.join(f'{v} ' for q, v in k.items())

                q = '"' + str(q.strip()) + '"'
                print(q) #-- "Asset"
                df = csv.reader(open(data_file), delimiter=',')
                df = filter(lambda x: (x[2] == q), df) # Not working here
                #df = filter(lambda x: x[2] == "Asset", df) --> this is working

                csv.writer(open(filename + ".txt", 'w', newline=''), delimiter=',').writerows(df)
                print("File is created for " + filename)

    except Exception as err:
        print("Error executing queries and saving output data file: ", err)


def main():
    if len(sys.argv) == 3:
        """File will be passed as parameter """
        config_file = sys.argv[1]
        data_file = sys.argv[2]

        dynamicQuery(config_file, data_file)
    else:
        usage()


def usage():
    print("Usage: python splitGenric.py config_file data_file ")


main()

Sample file

1233,ACV,Asset,sample
1235,ACV,Asset,sample
1232,ACV,Asset,sample
1234,ACV,Asset,sample
1237,ACV,,sample
1238,ACV,,sample
1234,ACV,Balance,sample
1254,ACV,Balance,sample
1244,ACV,Balance,sample
1264,ACV,Balance,sample

Config.yaml

Asset :
  filter1: '"Asset"'


Balance:
    filter1: '"Balance"'
    filter2: '""'

2 Answers:

Answer 0 (score: 1)

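The YAML configuration file format is not particularly convenient for this, and yaml is not a standard Python module. I would probably go with regular expressions instead of a YAML file. But just to fix the immediate problem: the issue is that you are mixing Python syntax with literal quote characters. You are assembling a string that contains literal double quotes around Asset, whereas your CSV file does not contain double quotes around this value. You are therefore effectively evaluating 'Asset' == '"Asset"', which is of course False.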
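The mismatch described above can be reproduced in a few lines. This is a minimal sketch; the variable names `from_config` and `from_csv` are made up for illustration and stand for the value loaded from Config.yaml and the field returned by `csv.reader`.

```python
# The value loaded from the YAML file still carries literal double quotes.
from_config = '"Asset"'
# The csv module returns the bare field text, without quotes.
from_csv = "Asset"

# Comparing the two directly fails because of the extra quote characters.
print(from_csv == from_config)

# Stripping the literal quotes makes the comparison succeed.
print(from_csv == from_config.strip('"'))

# An empty filter written as '""' becomes the empty string after stripping,
# which is what csv.reader yields for an empty field.
print('""'.strip('"') == "")
```

Stripping the quotes when the configuration is loaded, rather than at every comparison, keeps the row loop simple.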

The following probably does not do exactly what you want, but it should at least demonstrate a first cut at what I am attempting here.

import csv
import yaml  # PyYAML, a third-party package

with open(config_file) as file:
    config = yaml.full_load(file)

# One output file and csv writer per top-level key in the config
filters = dict()
for k, v in config.items():
    handle = open(k + '.txt', 'w', newline='')
    writer = csv.writer(handle, delimiter=',')
    filt = {'handle': handle, 'writer': writer, 'conditions': []}
    for _, expr in v.items():
        # Drop the literal double quotes stored in the YAML values
        filt['conditions'].append(expr.strip('"'))
    filters[k] = filt

with open(data_file, newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        for name, conf in filters.items():
            # Write the row once if column 2 matches any condition
            if row[2] in conf['conditions']:
                conf['writer'].writerow(row)

for name, conf in filters.items():
    conf['handle'].close()

I am guessing you are using pyyaml, which appears to be the main YAML module for Python.
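To check the matching logic without touching the disk, the same idea can be sketched on in-memory data. `build_conditions` and `split_rows` are hypothetical helper names, and the `config` dict is an assumption about what loading the Config.yaml from the question would produce.

```python
import csv
import io


def build_conditions(config):
    """Map each output name to the set of column values it accepts."""
    return {
        name: {str(expr).strip('"') for expr in filters.values()}
        for name, filters in config.items()
    }


def split_rows(rows, conditions, column=2):
    """Group rows by output name based on the value in the given column."""
    groups = {name: [] for name in conditions}
    for row in rows:
        for name, accepted in conditions.items():
            if row[column] in accepted:
                groups[name].append(row)
    return groups


# Assumed shape of the parsed configuration (quotes kept as in the YAML).
config = {
    "Asset": {"filter1": '"Asset"'},
    "Balance": {"filter1": '"Balance"', "filter2": '""'},
}
sample = "1233,ACV,Asset,sample\n1237,ACV,,sample\n1234,ACV,Balance,sample\n"

rows = list(csv.reader(io.StringIO(sample)))
groups = split_rows(rows, build_conditions(config))
print(groups["Asset"])
print(groups["Balance"])
```

Using a set per output name means a second filter on the same column, such as the empty string for Balance, needs no extra code path.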

Answer 1 (score: 0)

I tried using config.yaml, but got this error

File "C:\Users\XXXXXX\AppData\Local\Programs\Python\Python36-32\lib\site-packages\yaml\parser.py", line 439, in parse_block_mapping_key
"expected <block end>, but found %r" % token.id, token.start_mark)
yaml.parser.ParserError: while parsing a block mapping
in "config.yml", line 5, column 5
expected <block end>, but found ','
in "config.yml", line 5, column 17

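But I will pretend that it works and that the contents are loaded into a dictionary, since that seems to be the intent. The dictionary looks like this: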

import csv
import pandas as pd

# '' stands for the empty (null) value in the third column
doc = {'Asset':'Asset','Balance':['','Balance']}

#load directly to dataframe
df = pd.read_csv('sample.txt',header=None)

handler = ''
for k,v in doc.items():

    kList = {k:[]} #making empty lists with k values

    if isinstance(v,str): #Asset is string
        fil = v
    else:
        for i in range(len(v)): #Balance is list of values
            if v[i]:
                fil = v[i]
            else:
                handler = k #remember which file receives the null rows

    for types in df.values:

        if fil in types:
            kList[k].append(types) #append types to corresponding list

    csv.writer(open(k+".txt", 'a', newline=''), delimiter=',').writerows(kList[k])


if handler: #there is null values
    nulls = df[df.isnull().any(axis=1)].values.tolist()
    csv.writer(open(handler+".txt", 'a', newline=''), delimiter=',').writerows(nulls)

The result is two files with the following contents:

Asset.txt:

1233,ACV,Asset,sample
1235,ACV,Asset,sample
1232,ACV,Asset,sample
1234,ACV,Asset,sample

Balance.txt:

1234,ACV,Balance,sample
1254,ACV,Balance,sample
1244,ACV,Balance,sample
1264,ACV,Balance,sample
1237,ACV,nan,sample
1238,ACV,nan,sample
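The `nan` in the last two rows appears because pandas reads empty fields as missing values by default. Since the question says Pandas is not available, here is a rough csv-only sketch of the same split that leaves the empty field empty. The helper `as_set` is a hypothetical name, the `doc` dictionary is assumed to use `''` for the empty value, and the output goes to in-memory buffers instead of files so the result can be inspected directly.

```python
import csv
import io

# Assumed dictionary form of the configuration; '' matches an empty field.
doc = {"Asset": "Asset", "Balance": ["", "Balance"]}


def as_set(value):
    """Accept either a single string or a list of strings."""
    return {value} if isinstance(value, str) else set(value)


sample = "1233,ACV,Asset,sample\n1237,ACV,,sample\n1234,ACV,Balance,sample\n"

# One in-memory buffer and writer per output name (stand-ins for files).
outputs = {name: io.StringIO() for name in doc}
writers = {
    name: csv.writer(buf, lineterminator="\n") for name, buf in outputs.items()
}

for row in csv.reader(io.StringIO(sample)):
    for name, value in doc.items():
        # Column 2 holds the category; empty fields arrive as ''.
        if row[2] in as_set(value):
            writers[name].writerow(row)

print(outputs["Asset"].getvalue())
print(outputs["Balance"].getvalue())
```

Rows are written in input order here, so the empty-category rows are interleaved with the Balance rows rather than appended at the end.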