Question

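I'm trying to clean up a .csv exported from our badge system. One problem with this export is that it doesn't split the badge information (badge ID, activation status, company, etc.) into separate columns.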

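Here is what I need to do:

1. Use only some of the columns
2. Rename the header row
3. Clean up the CREDENTIALS column so that it outputs only the active badge numbers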

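The problem: I've done steps 1 and 2, but I need help going through the CREDENTIALS column [3], finding the "Active" keyword, and removing everything except the first set of digits. However, some credentials contain multiple badges separated by |.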

For example, here is what the original .csv looks like:

COMMAND,PERSONID,PARTITION,CREDENTIALS,EMAIL,FIRSTNAME,LASTNAME
NoCommand,43,Master,{9065~9065~Company~Active~~~},personone@company.com,person,one
NoCommand,57,Master,{9482~9482~Company~Active~~~},persontwo@company.com,person,two
NoCommand,323,Master,{8045~8045~Company~Disabled~~~},personthree@company.com,person,three
NoCommand,84,Master,{8283~8283~Company~Disabled~~~|9861~9861~Company~Active~~~},personfour@company.com,person,four
NoCommand,46,Master,{9693~9693~Company~Lost~~~|9648~9648~Company~Active~~~},personfive@company.com,person,five

As you can see, the CREDENTIALS column [3] contains a lot of data. It can also hold multiple badges separated by |.

Here is what I have so far for steps 1 and 2:

import csv

# Empty data set that will eventually be written with the new sanitized data
data = []

# Keyword to search for
word = 'Active'

# Source .csv file that we will be working with
input_filename = '/path/to/original/csv'

# Output .csv file that we will create with the data from input_filename
output_filename = '/path/to/new/csv'

with open(input_filename, "rb") as the_file:
    reader = csv.reader(the_file, delimiter=",")
    next(reader, None)

    # Test sanitizing column 3
    for row in reader:
        for col in row[3]:
            if word in row[3]:
                print col

        new_row = [row[3], row[5], row[6], row[4]]

        data.append(new_row)


    with open(output_filename, "w+") as to_file:
        writer = csv.writer(to_file, delimiter=",")


        writer.writerow(['BadgeID', 'FirstName', 'LastName', 'EmployeeEmail'])

        for new_row in data:
            writer.writerow(new_row)

So far, the new .csv looks like this:

BadgeID,FirstName,LastName,EmployeeEmail
{9065~9065~Company~Active~~~},person,one,personone@company.com
{9482~9482~Company~Active~~~},person,two,persontwo@company.com
{8045~8045~Company~Disabled~~~},person,three,personthree@company.com
{8283~8283~Company~Disabled~~~|9861~9861~Company~Active~~~},person,four,personfour@company.com
{9693~9693~Company~Lost~~~|9648~9648~Company~Active~~~},person,five,personfive@company.com

I want it to look like this, using only the "Active" credentials:

BadgeID,FirstName,LastName,EmployeeEmail
9066,person,one,personone@company.com
9482,person,two,persontwo@company.com
8045,person,three,personthree@company.com
8283,person,four,personfour@company.com
9693,person,five,personfive@company.com

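However, with my test block for column 3, I'm trying to at least make sure I'm grabbing the right data. Oddly, when I print that column, the output looks strange: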

# Test sanitizing column 3
    for row in reader:
        for col in row[3]:
            if word in row[3]:
                print col

The output looks something like this:

C
a
r
d
s
~
A
c
t
i
v
e
~
~
~
}
{
8
8
2
4
~
8
8
2
4
~

Does anyone have any ideas?

Answer 1

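Judging by your output, you are grabbing the right data! The problem is that column 3 is a string. You are treating it like a list, so iterating over it pulls out individual characters instead of words. Use string methods to get a list of words first.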
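To make the cause concrete, here is a small sketch using the CREDENTIALS value from the first sample row. It is written for Python 3 (the question's `print col` is Python 2 syntax), and the variable names are only illustrative. Iterating over a string walks it one character at a time, while splitting on the delimiter gives whole fields.

```python
# CREDENTIALS value copied from the first sample row in the question
credentials = "{9065~9065~Company~Active~~~}"

# Iterating over a str yields single characters, which is what the
# inner "for col in row[3]" loop in the question was printing
chars = [ch for ch in credentials]
print(chars[:6])

# Stripping the braces and splitting on "~" yields whole fields instead
fields = credentials.strip("{}").split("~")
print(fields[:4])
```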

Step by step, in pseudocode:

Strip the braces:

column3 = column3.strip("{}")

Since you may have multiple badges separated by "|", you should split on it:

badges_str = column3.split("|")

Now you have a list of strings, each representing one badge.

badges = []
for badge in badges_str:
    badges.append(badge.split("~"))

Now you have a list of lists, one per badge, that you can index into.
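As a quick check of what that structure looks like, the sketch below runs the same strip and split steps on the CREDENTIALS value of the fourth sample row, which holds two badges. It assumes Python 3, and that the status is always the fourth `~`-separated field, as it is in the sample data.

```python
# CREDENTIALS value copied from the fourth sample row in the question
column3 = "{8283~8283~Company~Disabled~~~|9861~9861~Company~Active~~~}"

# One inner list per badge: strip the braces, split on "|", then on "~"
badges = [badge.split("~") for badge in column3.strip("{}").split("|")]

for badge in badges:
    # badge[0] is the badge number, badge[3] is the status
    print(badge[0], badge[3])
```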

for badge in badges:
    # test for the Active badges, then do things
    if badge[3] == "Active":
        do_something(badge[0])
        do_something_else(badge[1])
        # ...and so on

This doesn't give you the actual code, but it should give you the next steps toward your goal.
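For reference, here is one way the steps above might be put together. This is only a sketch under a few assumptions: it targets Python 3, it reads the sample rows from an in-memory string instead of the real file paths, the helper name `active_badge_ids` is made up for illustration, and a person with no active badge gets an empty BadgeID. It follows the stated goal of keeping only active badge numbers, so adjust the fallback if you want different behavior.

```python
import csv
import io

def active_badge_ids(credentials, keyword="Active"):
    """Return the badge numbers whose status field equals `keyword`."""
    ids = []
    for badge in credentials.strip("{}").split("|"):
        fields = badge.split("~")
        # fields[0] is the badge number, fields[3] is the status
        if len(fields) > 3 and fields[3] == keyword:
            ids.append(fields[0])
    return ids

# Sample rows from the question, standing in for the exported file
sample = """\
COMMAND,PERSONID,PARTITION,CREDENTIALS,EMAIL,FIRSTNAME,LASTNAME
NoCommand,43,Master,{9065~9065~Company~Active~~~},personone@company.com,person,one
NoCommand,323,Master,{8045~8045~Company~Disabled~~~},personthree@company.com,person,three
NoCommand,84,Master,{8283~8283~Company~Disabled~~~|9861~9861~Company~Active~~~},personfour@company.com,person,four
"""

reader = csv.reader(io.StringIO(sample))
next(reader, None)  # skip the original header row

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["BadgeID", "FirstName", "LastName", "EmployeeEmail"])

for row in reader:
    ids = active_badge_ids(row[3])
    # Assumption: use the first active badge, or leave the cell empty
    badge_id = ids[0] if ids else ""
    writer.writerow([badge_id, row[5], row[6], row[4]])

print(out.getvalue())
```

With real files, the two `io.StringIO` objects would be replaced by `open(input_filename, newline="")` and `open(output_filename, "w", newline="")`, which is how the `csv` module expects files to be opened in Python 3.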
