Split CSV rows into separate rows based on a value range in a column, in Python

Asked: 2018-08-29 05:35:42

Tags: python csv

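I want to split up the values given as a range in one CSV column, adding a new row for each number in the range while keeping the data in all the other columns matched to it.

It is important that I can keep the data from the other columns (such as Job ID) for every number in the (x-y) range, so the resulting CSV will obviously be much longer than the original.

I want the output CSV to have a separate row for each number in ranges such as 26-29, 66-67 and so on. So I would like an output CSV file in which, for example:

Job ID 21879 appears four times, once each for pages 26, 27, 28 and 29.

I want to do this before the steps in the script below, but I am stuck at the moment.

The rest of the script splits the date values on "/", assigns the parts to new columns, and concatenates them with the Page No field. It is the Page No field that I want to split over the range shown.

The list produced by this script contains only the required values from the Job ID column, with the concatenated date and page fields in the second position. That part works. What I still need is a final CSV file in which each number in a given range is represented individually.

Any help with splitting these value ranges while keeping the other data fields would be appreciated.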

A subset of my input data:

Job ID  Job summary Link    Locality    Received    Job status  Asset   Date       Page No
21879   Addition    Documents Link  CBD 15/06/2018  Completed   Water   28/06/2018  26-29
21878   Addition    Documents Link  CBD 28/06/2018  Completed   Water       
21877   Addition    Documents Link  CBD 28/06/2018  Completed   Water       
21876   Addition    Documents Link  CBD 28/06/2018  Completed   Water       
21875   Addition    Documents Link  CBD 28/06/2018  Completed   Water       
21874   Addition    Documents Link  CBD 28/06/2018  Completed   Water   26/07/2018  42-43
21873   Addition    Documents Link  CBD 27/06/2018  Completed   Water   26/07/2018  
21872   Addition    Documents Link  CBD 27/06/2018  Completed   Water   26/07/2018  66-67
21871   Addition    Documents Link  CBD 27/06/2018  Completed   Water   26/07/2018  07-08
21870   Addition    Documents Link  CBD 27/06/2018  Completed   Water   28/06/2018  59
21869   Addition    Documents Link  CBD 27/06/2018  Completed   Water   28/06/2018  58
21868   Addition    Documents Link  CBD 26/06/2018  Completed   Water       
21867   Addition    Documents Link  CBD 26/06/2018  Completed   Water       

My desired output is:

Job ID  Job summary Link    Locality    Received    Job status  Asset   Date       Page No
21879   Addition    Documents Link  CBD 15/06/2018  Completed   Water   28/06/2018  26
21879   Addition    Documents Link  CBD 15/06/2018  Completed   Water   28/06/2018  27  
21879   Addition    Documents Link  CBD 15/06/2018  Completed   Water   28/06/2018  28  
21879   Addition    Documents Link  CBD 15/06/2018  Completed   Water   28/06/2018  29  
21878   Addition    Documents Link  CBD 28/06/2018  Completed   Water       
21877   Addition    Documents Link  CBD 28/06/2018  Completed   Water       
21876   Addition    Documents Link  CBD 28/06/2018  Completed   Water       
21875   Addition    Documents Link  CBD 28/06/2018  Completed   Water       
21874   Addition    Documents Link  CBD 28/06/2018  Completed   Water   26/07/2018  42
21874   Addition    Documents Link  CBD 28/06/2018  Completed   Water   26/07/2018  43
21873   Addition    Documents Link  CBD 27/06/2018  Completed   Water   26/07/2018  
21872   Addition    Documents Link  CBD 27/06/2018  Completed   Water   26/07/2018  66
21872   Addition    Documents Link  CBD 27/06/2018  Completed   Water   26/07/2018  67
21871   Addition    Documents Link  CBD 27/06/2018  Completed   Water   26/07/2018  07
21871   Addition    Documents Link  CBD 27/06/2018  Completed   Water   26/07/2018  08
21870   Addition    Documents Link  CBD 27/06/2018  Completed   Water   28/06/2018  59
21869   Addition    Documents Link  CBD 27/06/2018  Completed   Water   28/06/2018  58
21868   Addition    Documents Link  CBD 26/06/2018  Completed   Water       
21867   Addition    Documents Link  CBD 26/06/2018  Completed   Water       

The current script is:

import os
import csv
with open('CSV_File.csv','r') as csvinput:  
    with open('temp__spreadsheet_cache_1.csv', 'w') as csvoutput:
        writer = csv.writer(csvoutput)
        for row in csv.reader(csvinput):
            if row[7] == "Date":
                writer.writerow(row+["day"])
            else:
                writer.writerow(row+row[4].split('/'))
with open('temp__spreadsheet_cache_1.csv','r') as csvinput:
    with open('temp__spreadsheet_cache_2.csv', 'w') as csvoutput:
        writer = csv.writer(csvoutput)
        for row in csv.reader(csvinput):
            if row[7] == "Date":
                writer.writerow(row+["month"])
            else:
                writer.writerow(row+row[4].split('/'))
with open('temp__spreadsheet_cache_2.csv','r') as csvinput:
    with open('temp__spreadsheet_cache_3.csv', 'w') as csvoutput:
        writer = csv.writer(csvoutput)
        for row in csv.reader(csvinput):
            if row[7] == "Date":
                writer.writerow(row+["year"])
            else:
                writer.writerow(row+row[4].split('/'))
with open('temp__spreadsheet_cache_3.csv','r') as csvinput:
    with open('temp__spreadsheet_cache_4.csv', 'w') as csvoutput:
        writer = csv.writer(csvoutput)
        for row in csv.reader(csvinput):
            if row[7] == "Date":
                writer.writerow(row+["Concatenation"])
            else:
                writer.writerow(row+row[4].split('/'))
#---Using Current output (temp__spreadsheet_cache_4.csv) to create new list--
blank =[]
with open (r'temp__spreadsheet_cache_4.csv', 'r') as NEW_CSV:
    csvReader = csv.reader(NEW_CSV, delimiter=',', quotechar='"')
    header = next(csvReader)
    JobIndex = header.index("Job ID")
    PageIndex = header.index("Page No")
    DayIndex = header.index("day")
    MonthIndex = header.index("month")
    YearIndex = header.index("year")
    Summary = header.index("Job summary")
    StatusIndex = header.index("Job status")
    class_1 = header.index("Asset")
    for row in csvReader:
        Page = row[PageIndex]
        Day = row[DayIndex]
        Month = row[MonthIndex]
        Year = row[YearIndex]
        JobID = row[JobIndex]
        To_be_overridden_concat = row[PageIndex]
        Type = row[Summary]
        Status = row[StatusIndex]
        waterclass = row[class_1]
        if waterclass == 'Water':
            blank.append([JobID,Day,Month,Year,Page,To_be_overridden_concat])
str(blank)
for column in blank:
    column[1] = column[1].lstrip('0')
    column[2] = column[2].lstrip('0')
    column[3] = column[3].lstrip('0')
    column[4] = column[4].lstrip('0')
for column in blank:
    column[0] = column[0].lstrip()
    column[1] = column[1].lstrip()
    column[2] = column[2].lstrip()
    column[3] = column[3].lstrip() 
    column[4] = column[4].lstrip()
for column in blank:
    column[0] = column[0].rstrip()
    column[1] = column[1].rstrip()
    column[2] = column[2].rstrip()
    column[3] = column[3].rstrip()
    column[4] = column[4].rstrip()
    column[5] = column[1]+column[2]+column[3]+column[4]
##os.remove("temp__spreadsheet_cache_4.csv")
os.remove("temp__spreadsheet_cache_3.csv")
os.remove("temp__spreadsheet_cache_2.csv")
os.remove("temp__spreadsheet_cache_1.csv")
for row in blank:
    del row[1:5]
print(blank[0:10])

1 Answer:

Answer 0 (score: 0)

First, I need to assume that you have a standard CSV file with the fields separated by commas, for example:

Job ID,Job summary,Link,Locality,Received,Job status,Asset,Date,Page No
21879,Addition,Documents,Link,CBD,15/06/2018,Completed,Water,28/06/2018,26-29
21878,Addition,Documents,Link,CBD,28/06/2018,Completed,Water,,
21874,Addition,Documents,Link,CBD,28/06/2018,Completed,Water,26/07/2018,42-43
21873,Addition,Documents,Link,CBD,27/06/2018,Completed,Water,26/07/2018,1

In that case, your data can be processed as follows:

import csv

fieldnames = ["Job ID", "Job summary", "Link", "Locality", "ReceivedDay", "ReceivedMonth", "ReceivedYear", "Job status", "Asset", "Day", "Month", "Year", "Page No"]

# Python 3: open CSV files in text mode with newline="" (Python 2 used "rb"/"wb")
with open("CSV_File.csv", newline="") as f_input, open("output.csv", "w", newline="") as f_output:
    csv_input = csv.reader(f_input)
    next(csv_input)  # skip the header

    csv_output = csv.writer(f_output)
    csv_output.writerow(fieldnames)

    for row in csv_input:
        date_received = row[5].split('/')

        # An empty date becomes three empty fields
        if row[8]:
            date = row[8].split('/')
        else:
            date = ["", "", ""]

        if '-' in row[9]:
            # Expand a range such as "26-29" into one output row per page
            first, last = map(int, row[9].split('-'))

            for page in range(first, last + 1):
                output_row = row[:5] + date_received + row[6:8] + date + [page]
                csv_output.writerow(output_row)
        else:
            # Single page or blank: write the row once, unchanged
            output_row = row[:5] + date_received + row[6:8] + date + [row[9]]
            csv_output.writerow(output_row)

This would give you an output file starting:

Job ID,Job summary,Link,Locality,ReceivedDay,ReceivedMonth,ReceivedYear,Job status,Asset,Day,Month,Year,Page No
21879,Addition,Documents,Link,CBD,15,06,2018,Completed,Water,28,06,2018,26
21879,Addition,Documents,Link,CBD,15,06,2018,Completed,Water,28,06,2018,27
21879,Addition,Documents,Link,CBD,15,06,2018,Completed,Water,28,06,2018,28
21879,Addition,Documents,Link,CBD,15,06,2018,Completed,Water,28,06,2018,29
21878,Addition,Documents,Link,CBD,28,06,2018,Completed,Water,,,,
21877,Addition,Documents,Link,CBD,28,06,2018,Completed,Water,,,,
21876,Addition,Documents,Link,CBD,28,06,2018,Completed,Water,,,,
21875,Addition,Documents,Link,CBD,28,06,2018,Completed,Water,,,,
21874,Addition,Documents,Link,CBD,28,06,2018,Completed,Water,26,07,2018,42
21874,Addition,Documents,Link,CBD,28,06,2018,Completed,Water,26,07,2018,43

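It works by first skipping the input header and writing a suitable output header. It assumes the received date is always present. split('/') is used to break each date into three parts. If the page number contains a - character, split('-') is used to get the two parts, which are then converted to two integers, and one row is written for every page from the first to the last inclusive.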
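The two splitting steps can also be pulled out into small helpers, which makes them easy to check on their own. This is only a sketch and not part of the script above: the names split_date and expand_page_range are made up for illustration, and it assumes a range is always written as start-end with start no greater than end. Pages come back as strings, so a range such as 07-08 loses its leading zeros.

```python
def split_date(value):
    """Split 'dd/mm/yyyy' into [day, month, year]; a blank field gives three blanks."""
    value = value.strip()
    return value.split("/") if value else ["", "", ""]


def expand_page_range(value):
    """Turn '26-29' into ['26', '27', '28', '29']; a single page or blank is kept as-is."""
    value = value.strip()
    if "-" in value:
        # Assumption: exactly two integers separated by one hyphen
        first, last = (int(part) for part in value.split("-"))
        return [str(page) for page in range(first, last + 1)]
    return [value]
```

For example, expand_page_range("26-29") gives ['26', '27', '28', '29'], while expand_page_range("59") gives ['59'] and a blank field gives [''], so every input row produces at least one output row.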

Each output row is built by combining parts of the input row with the two split dates.
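If you would rather not depend on column positions, the same row expansion can be sketched with csv.DictReader and csv.DictWriter, which look fields up by header name. This is an alternative under stated assumptions, not the approach used above: expand_rows is a hypothetical helper, the sample uses a cut-down set of columns held in memory, and the dates are left unsplit to keep the example short.

```python
import csv
import io


def expand_rows(reader, page_field="Page No"):
    """Yield each row once per page in its range; other rows pass through unchanged."""
    for row in reader:
        pages = (row[page_field] or "").strip()
        if "-" in pages:
            # Assumption: a range is two integers separated by one hyphen
            first, last = (int(part) for part in pages.split("-"))
            for page in range(first, last + 1):
                # Copy the row so all other columns stay matched to this page
                yield {**row, page_field: str(page)}
        else:
            yield row


# Cut-down sample data (hypothetical column subset) instead of a file on disk
sample = io.StringIO(
    "Job ID,Asset,Date,Page No\n"
    "21879,Water,28/06/2018,26-29\n"
    "21878,Water,,\n"
    "21870,Water,28/06/2018,59\n"
)

reader = csv.DictReader(sample)
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=reader.fieldnames, lineterminator="\n")
writer.writeheader()
writer.writerows(expand_rows(reader))
print(output.getvalue())
```

With real files, replace the two io.StringIO objects with files opened using newline="". Because the fields are addressed by name, the code keeps working if columns are added or reordered, as long as the Page No header is present.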