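I'm trying to clean up a .csv exported from our badge system. One problem with this export is that it does not split the badge information (badge ID, activation status, company, etc.) into separate columns.
Here is what I need to do:
Problem: I have already done steps 1 and 2, but I need help going through the CREDENTIALS column [3], finding the "Active" keyword, and removing everything except the first set of digits. However, some credentials contain multiple badges separated by |.
For example, here is what the original .csv looks like: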
COMMAND,PERSONID,PARTITION,CREDENTIALS,EMAIL,FIRSTNAME,LASTNAME
NoCommand,43,Master,{9065~9065~Company~Active~~~},personone@company.com,person,one
NoCommand,57,Master,{9482~9482~Company~Active~~~},persontwo@company.com,person,two
NoCommand,323,Master,{8045~8045~Company~Disabled~~~},personthree@company.com,person,three
NoCommand,84,Master,{8283~8283~Company~Disabled~~~|9861~9861~Company~Active~~~},personfour@company.com,person,four
NoCommand,46,Master,{9693~9693~Company~Lost~~~|9648~9648~Company~Active~~~},personfive@company.com,person,five
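As you can see, the CREDENTIALS column [3] contains a lot of data. It can also hold multiple badge credentials separated by |.
Here is what I have so far, which completes steps 1 and 2: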
import csv
# Empty data set that will eventually be written with the new sanitized data
data = []
# Keyword to search for
word = 'Active'
# Source .csv file that we will be working with
input_filename = '/path/to/original/csv'
# Output .csv file that we will create with the data from input_filename
output_filename = '/path/to/new/csv'
with open(input_filename, "rb") as the_file:
    reader = csv.reader(the_file, delimiter=",")
    next(reader, None)
    # Test sanitizing column 3
    for row in reader:
        for col in row[3]:
            if word in row[3]:
                print col
        new_row = [row[3], row[5], row[6], row[4]]
        data.append(new_row)
with open(output_filename, "w+") as to_file:
    writer = csv.writer(to_file, delimiter=",")
    writer.writerow(['BadgeID', 'FirstName', 'LastName', 'EmployeeEmail'])
    for new_row in data:
        writer.writerow(new_row)
So far, the new .csv looks like this:
BadgeID,FirstName,LastName,EmployeeEmail
{9065~9065~Company~Active~~~},person,one,personone@company.com
{9482~9482~Company~Active~~~},person,two,persontwo@company.com
{8045~8045~Company~Disabled~~~},person,three,personthree@company.com
{8283~8283~Company~Disabled~~~|9861~9861~Company~Active~~~},person,four,personfour@company.com
{9693~9693~Company~Lost~~~|9648~9648~Company~Active~~~},person,five,personfive@company.com
I would like it to look like this, using the "Active" credential:
BadgeID,FirstName,LastName,EmployeeEmail
9065,person,one,personone@company.com
9482,person,two,persontwo@company.com
8045,person,three,personthree@company.com
8283,person,four,personfour@company.com
9693,person,five,personfive@company.com
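However, with my test block for column 3, I am trying to at least make sure I am grabbing the right data. Strangely, when I print that column, the output looks odd:
# Test sanitizing column 3
for row in reader:
    for col in row[3]:
        if word in row[3]:
            print col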
The output looks something like this:
C
a
r
d
s
~
A
c
t
i
v
e
~
~
~
}
{
8
8
2
4
~
8
8
2
4
~
Does anyone have any ideas?
Answer 0 (score: 1)
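Judging by your output, you are grabbing the right data! The problem is that column 3 is a string. You treat it like a list from the start, so looping over it pulls out individual characters instead of words. Use string methods to get a list of words first.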
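To illustrate why the loop prints one character per line, here is a minimal sketch in Python 3 syntax. The sample value is copied from the CSV in the question; the variable names are only illustrative.

```python
# Sample CREDENTIALS value taken from the CSV in the question.
credentials = "{8283~8283~Company~Disabled~~~|9861~9861~Company~Active~~~}"

# Iterating over a string yields one character at a time,
# which is exactly what the nested "for col in row[3]" loop does.
chars = [ch for ch in credentials]
print(chars[:5])  # ['{', '8', '2', '8', '3']

# Splitting the string first yields one substring per badge instead.
badges = credentials.strip("{}").split("|")
print(badges)
```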
Step by step, in pseudocode:
Strip off the braces:
column3 = column3.strip("{}")
Since you may have multiple badges separated by "|", you should split on that character:
badges_str = column3.split("|")
Now you have a list of strings, each representing one badge.
badges = []
for badge in badges_str:
    badges.append(badge.split("~"))
Now you have a list of individual badges, each of which is itself a list that you can index into.
for badge in badges:
    # Test for the Active badges, then do things with them
    if badge[3] == "Active":
        do_something(badge[0])
        do_something_else(badge[1])
        # etc.
This doesn't give you the actual code, but it should give you the next steps toward your goal.
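If it helps, here is one way the steps above might be combined, written as a sketch in Python 3. The helper name `first_active_badge_id`, the in-memory sample rows, and the choice to return an empty string when no badge is Active are all assumptions of mine, not something stated in the question. The question does not say what should happen for a person with no Active badge, so adjust that fallback to suit.

```python
import csv
import io


def first_active_badge_id(credentials, status="Active"):
    """Return the ID of the first badge with the given status, or ''."""
    for badge_str in credentials.strip("{}").split("|"):
        fields = badge_str.split("~")
        # Expected layout: [id, id, company, status, ...]; skip short records.
        if len(fields) > 3 and fields[3] == status:
            return fields[0]
    return ""  # Assumption: leave BadgeID blank when nothing matches.


# A few rows copied from the question, standing in for the real file.
sample = """COMMAND,PERSONID,PARTITION,CREDENTIALS,EMAIL,FIRSTNAME,LASTNAME
NoCommand,43,Master,{9065~9065~Company~Active~~~},personone@company.com,person,one
NoCommand,323,Master,{8045~8045~Company~Disabled~~~},personthree@company.com,person,three
NoCommand,84,Master,{8283~8283~Company~Disabled~~~|9861~9861~Company~Active~~~},personfour@company.com,person,four
"""

reader = csv.reader(io.StringIO(sample))
next(reader, None)  # Skip the header row.

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["BadgeID", "FirstName", "LastName", "EmployeeEmail"])
for row in reader:
    writer.writerow([first_active_badge_id(row[3]), row[5], row[6], row[4]])

print(out.getvalue())
```

With real files in Python 3, the `csv` module expects text-mode files opened with `newline=""` rather than the `"rb"` mode used in the Python 2 code in the question.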