How do I clean up an entire column in a CSV by finding a keyword and removing certain delimiters?

Time: 2017-06-16 17:12:38

Tags: python python-2.7 csv

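I am trying to clean up a .csv exported from our badge system. One problem with this export is that it does not split the badge information (badge ID, activation status, company, etc.) into separate columns.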

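Here is what I need to do:

  1. Use only some of the columns
  2. Create a new .csv
  3. Rename the first (header) row
  4. Clean up the CREDENTIALS column so that it outputs only the active badge numbers
  5. Problem: I have already done steps 1 and 2, but I need help going through the CREDENTIALS column (index 3), finding the "Active" keyword, and removing everything except the first set of digits. However, some credentials contain multiple badges separated by |.

    For example, here is what the original .csv looks like: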

    COMMAND,PERSONID,PARTITION,CREDENTIALS,EMAIL,FIRSTNAME,LASTNAME
    NoCommand,43,Master,{9065~9065~Company~Active~~~},personone@company.com,person,one
    NoCommand,57,Master,{9482~9482~Company~Active~~~},persontwo@company.com,person,two
    NoCommand,323,Master,{8045~8045~Company~Disabled~~~},personthree@company.com,person,three
    NoCommand,84,Master,{8283~8283~Company~Disabled~~~|9861~9861~Company~Active~~~},personfour@company.com,person,four
    NoCommand,46,Master,{9693~9693~Company~Lost~~~|9648~9648~Company~Active~~~},personfive@company.com,person,five
    

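    As you can see, the CREDENTIALS column (index 3) holds a lot of data. It can also contain multiple badge credentials separated by |.

    Here is what I have so far, which covers steps 1 and 2: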

    import csv
    
    # Empty data set that will eventually be written with the new sanitized data
    data = []
    
    # Keyword to search for
    word = 'Active'
    
    # Source .csv file that we will be working with
    input_filename = '/path/to/original/csv'
    
    # Output .csv file that we will create with the data from input_filename
    output_filename = '/path/to/new/csv'
    
    with open(input_filename, "rb") as the_file:
        reader = csv.reader(the_file, delimiter=",")
        next(reader, None)
    
        # Test sanitizing column 3
        for row in reader:
            for col in row[3]:
                if word in row[3]:
                    print col
    
            new_row = [row[3], row[5], row[6], row[4]]
    
            data.append(new_row)
    
    
        with open(output_filename, "w+") as to_file:
            writer = csv.writer(to_file, delimiter=",")
    
    
            writer.writerow(['BadgeID', 'FirstName', 'LastName', 'EmployeeEmail'])
    
            for new_row in data:
                writer.writerow(new_row)
    

    So far, the new .csv looks like this:

    BadgeID,FirstName,LastName,EmployeeEmail
    {9065~9065~Company~Active~~~},person,one,personone@company.com
    {9482~9482~Company~Active~~~},person,two,persontwo@company.com
    {8045~8045~Company~Disabled~~~},person,three,personthree@company.com
    {8283~8283~Company~Disabled~~~|9861~9861~Company~Active~~~},person,four,personfour@company.com
    {9693~9693~Company~Lost~~~|9648~9648~Company~Active~~~},person,five,personfive@company.com
    

    I want it to look like this, using only the "Active" credentials:

    BadgeID,FirstName,LastName,EmployeeEmail
    9066,person,one,personone@company.com
    9482,person,two,persontwo@company.com
    8045,person,three,personthree@company.com
    8283,person,four,personfour@company.com
    9693,person,five,personfive@company.com
    

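    However, with my test code block for column 3, I was trying to at least make sure I am grabbing the right data. Oddly, when I print that column, the output looks strange: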

    # Test sanitizing column 3
    for row in reader:
        for col in row[3]:
            if word in row[3]:
                print col
    

    It outputs something like this:

    C
    a
    r
    d
    s
    ~
    A
    c
    t
    i
    v
    e
    ~
    ~
    ~
    }
    {
    8
    8
    2
    4
    ~
    8
    8
    2
    4
    ~
    

    Does anyone have any ideas?

1 Answer:

Answer 0: (score: 1)

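Judging by your output, you are grabbing the right data! The problem is that column 3 is a string. You are treating it like a list from the start, so iterating over it pulls out individual characters rather than words. Use string methods to get a list of words first.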
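To make this concrete, here is a minimal sketch using one CREDENTIALS value from the sample data (the variable names are illustrative). Looping over a string yields one character per iteration, and the keyword test checks the whole string, so it passes for every character:

```python
# One CREDENTIALS value copied from the sample data in the question.
credentials = "{9065~9065~Company~Active~~~}"

# Iterating over a string visits it one character at a time.
first_chars = [ch for ch in credentials][:5]
print(first_chars)  # ['{', '9', '0', '6', '5']

# The keyword test inspects the whole string, not the current character,
# so it is true on every pass through the loop.
print("Active" in credentials)  # True
```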

Step by step, using pseudocode:

Strip off those braces:

column3 = column3.strip("{}")

Since you may have multiple badges separated by "|", you should split on it:

badges_str = column3.split("|")

Now you have a list of strings, each representing one badge.
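As a small runnable illustration of these two steps, here is a sketch applied to the two-badge value from the sample data. Note that `strip("{}")` removes any leading or trailing `{` and `}` characters, not a literal `{}` substring:

```python
# CREDENTIALS value for "person four" in the sample data.
column3 = "{8283~8283~Company~Disabled~~~|9861~9861~Company~Active~~~}"

# Remove any leading or trailing brace characters.
column3 = column3.strip("{}")

# One string per badge.
badges_str = column3.split("|")
print(badges_str)
# ['8283~8283~Company~Disabled~~~', '9861~9861~Company~Active~~~']
```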

badges = []
for badge in badges_str:
    badges.append(badge.split("~"))

Now you have a list of lists, one per badge, that you can index into.
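The same loop can be written as a list comprehension. This sketch also shows what one parsed badge looks like: the trailing `~~~` produces empty strings at the end, the badge number sits at index 0, and the status sits at index 3 (the sample data does not say what the remaining fields mean).

```python
badges_str = ["8283~8283~Company~Disabled~~~", "9861~9861~Company~Active~~~"]

# Split every badge string into its "~"-separated fields.
badges = [badge.split("~") for badge in badges_str]

print(badges[1])
# ['9861', '9861', 'Company', 'Active', '', '', '']
```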

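for badge in badges:
    # Test for the Active badges (the status is field 3), then do things.
    # do_something and do_something_else are placeholders for your own code.
    if badge[3] == "Active":
        do_something(badge[0])
        do_something_else(badge[1])
        # ...and so on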
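The steps above can be collected into one small helper. This is only a sketch: the function name `active_badge_ids` and its `status` parameter are made up for illustration, and the length check is an added guard for empty or short records.

```python
def active_badge_ids(credentials, status="Active"):
    """Return the first field of every badge whose status field matches."""
    ids = []
    for badge_str in credentials.strip("{}").split("|"):
        fields = badge_str.split("~")
        # Guard against empty or short records before indexing field 3.
        if len(fields) > 3 and fields[3] == status:
            ids.append(fields[0])
    return ids


print(active_badge_ids("{9065~9065~Company~Active~~~}"))
# ['9065']
print(active_badge_ids("{8283~8283~Company~Disabled~~~|9861~9861~Company~Active~~~}"))
# ['9861']
print(active_badge_ids("{8045~8045~Company~Disabled~~~}"))
# []
```

Returning a list leaves the caller free to decide what to do when someone has zero Active badges or more than one.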

This does not hand you the finished code, but it should give you the next steps toward your goal.
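For reference, one way those steps might fit into the CSV loop from the question is sketched below. It makes several assumptions: it is written for Python 3 and reads from an in-memory string so that it runs as-is, the helper names `first_active_badge` and `clean` are hypothetical, only the first Active badge per person is kept, and rows with no Active badge are skipped.

```python
import csv
import io

# A few rows copied from the question; a real script would open files instead.
SAMPLE = (
    "COMMAND,PERSONID,PARTITION,CREDENTIALS,EMAIL,FIRSTNAME,LASTNAME\n"
    "NoCommand,43,Master,{9065~9065~Company~Active~~~},personone@company.com,person,one\n"
    "NoCommand,323,Master,{8045~8045~Company~Disabled~~~},personthree@company.com,person,three\n"
    "NoCommand,84,Master,{8283~8283~Company~Disabled~~~|9861~9861~Company~Active~~~},"
    "personfour@company.com,person,four\n"
)


def first_active_badge(credentials):
    """Return the first Active badge number in a CREDENTIALS value, or None."""
    for badge in credentials.strip("{}").split("|"):
        fields = badge.split("~")
        if len(fields) > 3 and fields[3] == "Active":
            return fields[0]
    return None


def clean(source, target):
    """Copy selected columns from source to target, replacing CREDENTIALS."""
    reader = csv.reader(source)
    writer = csv.writer(target, lineterminator="\n")
    next(reader, None)  # drop the original header row
    writer.writerow(["BadgeID", "FirstName", "LastName", "EmployeeEmail"])
    for row in reader:
        badge_id = first_active_badge(row[3])
        if badge_id is not None:  # assumption: skip people with no Active badge
            writer.writerow([badge_id, row[5], row[6], row[4]])


output = io.StringIO()
clean(io.StringIO(SAMPLE), output)
print(output.getvalue())
# BadgeID,FirstName,LastName,EmployeeEmail
# 9065,person,one,personone@company.com
# 9861,person,four,personfour@company.com
```

With real files on Python 3, the csv documentation recommends opening them with `newline=""`; on Python 2.7 the files would be opened in binary mode (`"rb"` and `"wb"`) instead, and the `io.StringIO` demo part would need adapting.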