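I have a CSV file in which the first row holds the product name, the second row holds the column headers, and the rows from the third onward hold the actual data: each user's status.
The CSV file looks like this: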
adidas,,
USER_ID,USER_NAME
b012345,zaihan,Process
b212345,nurhanani,Check
b843432,nasirah,Call
b712345,ibrahim,Check
nike,,
USER_ID,USER_NAME
b842134,khalee,Call
h123455,shabree,Process
b777345,ibrahim,Process
b012345,zaihan,Check
b843432,nasirah,Call
b312451,nurhanani,Process
I want to split the data by product and rearrange the header and data as follows.
From this header:
adidas,,
USER_ID,USER_NAME
b012345,zaihan,Process
To a header like this:
USER_ID,USER_NAME,adidas
b012345,zaihan,Process
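I have been writing code, and I think I will have to hard-code the header names (e.g. 'adidas' and 'nike'), because my understanding from reading SO answers is that I need unique header names. The following code does not give me what I want.
My Python code is: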
import csvkit
import sys
import os
from csvkit import convert

with open('/tmp/csvdata.csv', 'rb') as q:
    reader = csvkit.reader(q)
    with open('/tmp/csvdata2.csv', 'wb') as s:
        data = csvkit.writer(s)
        data.writerow(['Name', 'Userid', 'adidas', 'nike'])
        for row in reader:
            row_data = [row[0], row[1], row[2], '']
            data = csvkit.writer(s)
            data.writerow(row_data)
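Edit
I got a solution from @piRSquared that works when each product has a unique set of records, but a user may have more than one status for the same product. In that case the solution raises ValueError: Index contains duplicate entries, cannot reshape
An example of input CSV data with multiple statuses that triggers this problem: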
adidas,,
USER_ID,USER_NAME
b012345,zaihan,Process
h003455,shabree,Check
b212345,nurhanani,Check
b843432,nasirah,Call
b712345,ibrahim,Check
b712345,ibrahim,Process
nike,,
USER_ID,USER_NAME
b842134,khalee,Call
h123455,shabree,Process
b777345,ibrahim,Process
b012345,zaihan,Check
b843432,nasirah,Call
b312451,nurhanani,Process
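The ValueError quoted above can be reproduced in isolation. The following is a minimal sketch, not the original solution: the long-format column names `brand` and `status` are assumptions introduced for illustration. It shows that pandas refuses to pivot when the same (USER_ID, USER_NAME, brand) key occurs twice, which is what the two ibrahim rows in the adidas block cause.

```python
import pandas as pd

# Long-format records; "brand" and "status" are illustrative column names.
# b712345/ibrahim appears twice under adidas, as in the sample data.
records = pd.DataFrame(
    [
        ("b712345", "ibrahim", "adidas", "Check"),
        ("b712345", "ibrahim", "adidas", "Process"),
        ("b012345", "zaihan", "adidas", "Process"),
    ],
    columns=["USER_ID", "USER_NAME", "brand", "status"],
)

indexed = records.set_index(["USER_ID", "USER_NAME", "brand"])["status"]

try:
    # Pivoting "brand" into columns needs one value per index key.
    indexed.unstack("brand")
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The pivot has a single cell for each user and brand, so two statuses for the same key cannot both be placed without an extra distinguishing key.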
I would like to achieve a result like the one below, where a user within the same brand can appear with the same ID and name for both Process and Check.
USER_ID,USER_NAME,adidas,nike
b012345,zaihan,Process
h003455,shabree,Check,Process
b212345,nurhanani,Check,Process
b843432,nasirah,Call,Call
b712345,ibrahim,Check
b712345,ibrahim,Process
b777345,ibrahim,,Process
b842134,khalee,,Call
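For a user who has both Check and Process within the same brand, the final result should contain an extra row, as above (in this case user ibrahim under the adidas brand).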
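One possible way to get an extra row per repeated status is to number the repeats within each (user, brand) group and include that number in the pivot key. This is a hedged sketch rather than a confirmed solution from the thread: the column names `brand`, `status` and `occurrence` are assumptions, and the records are assumed to have already been parsed into long format.

```python
import pandas as pd

# Long-format records (illustrative subset of the sample data).
records = pd.DataFrame(
    [
        ("b012345", "zaihan", "adidas", "Process"),
        ("b712345", "ibrahim", "adidas", "Check"),
        ("b712345", "ibrahim", "adidas", "Process"),
        ("b843432", "nasirah", "adidas", "Call"),
        ("b012345", "zaihan", "nike", "Check"),
        ("b843432", "nasirah", "nike", "Call"),
        ("b842134", "khalee", "nike", "Call"),
    ],
    columns=["USER_ID", "USER_NAME", "brand", "status"],
)

# Number repeated statuses within each (user, brand) group: 0, 1, 2, ...
records["occurrence"] = records.groupby(
    ["USER_ID", "USER_NAME", "brand"]
).cumcount()

# With "occurrence" in the index every key is unique, so unstack succeeds.
wide = (
    records.set_index(["USER_ID", "USER_NAME", "occurrence", "brand"])["status"]
    .unstack("brand")
    .sort_index()
    .fillna("")
    .reset_index()
    .drop(columns="occurrence")
)

print(wide.to_csv(index=False))
```

Here ibrahim gets two rows (Check, then Process) in the adidas column, while users with a single status per brand keep one row.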
Answer 0 (score: 2)
OK, this is complicated.
from io import StringIO
import re
import pandas as pd

text = """adidas,,
USER_ID,USER_NAME
b012345,zaihan,Process
b212345,nurhanani,Check
b451234,nasirah,Call
c234567,ibrahim,Check
nike,,
USER_ID,USER_NAME
b842134,khalee,Call
h123455,shabree,Process
c234567,ibrahim,Process
c143322,zaihan,Check
b451234,nasirah,Call
"""

m = re.findall(r'(.*,,\n(.*([^,]|,[^,])\n)*)', text)
dfs = []
keys = []
for f in m:
    lines = f[0].split('\n')
    lines[1] += ','
    keys.append(lines[0].split(',')[0])
    dfs.append(pd.read_csv(StringIO('\n'.join(lines[1:]))))

df = pd.concat(dfs, keys=keys)
df = df.set_index(['USER_ID', 'USER_NAME'], append=True).unstack(0)
df.index = df.index.droplevel(0)
df.columns = df.columns.droplevel(0)
df = df.stack().unstack()
print(df.to_csv())
USER_ID,USER_NAME,adidas,nike
b012345,zaihan,Process,
b212345,nurhanani,Check,
b451234,nasirah,Call,Call
b842134,khalee,,Call
c143322,zaihan,,Check
c234567,ibrahim,Check,Process
h123455,shabree,,Process
# Regular expression to match a line with a single value, identified
# by having two commas at the end of the line.
# This grabs nike and adidas.
# It also grabs all lines after that until the next single-valued line.
m = re.findall(r'(.*,,\n(.*([^,]|,[^,])\n)*)', text)
# List of sub-dataframes, one per product
dfs = []
# List of keys. In this example this will be nike and adidas
keys = []
# Loop through each regex match. This example will only have 2.
for f in m:
    # Split on newline so I can grab and fix stuff
    lines = f[0].split('\n')
    # Fix that the header row only has 2 columns and the data has 3
    lines[1] += ','
    # Grab nike or adidas or other single value
    keys.append(lines[0].split(',')[0])
    # Create a dataframe by reading in the rest of the lines
    dfs.append(pd.read_csv(StringIO('\n'.join(lines[1:]))))
# Concat dataframes with appropriate keys and pivot stuff
df = pd.concat(dfs, keys=keys)
df = df.set_index(['USER_ID', 'USER_NAME'], append=True).unstack(0)
df.index = df.index.droplevel(0)
df.columns = df.columns.droplevel(0)
df = df.stack().unstack()
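The sectioning step above (take a single-valued line as the product, then attach every following line to it until the next single-valued line) can also be sketched without a regular expression, using only the standard `csv` module. This is an illustrative alternative, not the answer's method: the helper name `split_sections` is hypothetical, and it assumes that header rows always start with `USER_ID` and that data rows have three fields.

```python
import csv
import io


def split_sections(lines):
    """Yield (product, user_id, user_name, status) for each data row.

    Hypothetical helper: assumes header rows start with "USER_ID" and
    data rows carry exactly three fields.
    """
    product = None
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        if row[0] and not any(row[1:]):
            # Only the first field is filled, e.g. "adidas,,": a product line.
            product = row[0]
        elif row[0] == "USER_ID":
            continue  # header row, carries no data
        else:
            yield (product, row[0], row[1], row[2])


sample = """adidas,,
USER_ID,USER_NAME
b012345,zaihan,Process
b712345,ibrahim,Check
b712345,ibrahim,Process
nike,,
USER_ID,USER_NAME
b842134,khalee,Call
"""

records = list(split_sections(io.StringIO(sample)))
for record in records:
    print(record)
```

Because the product name is read from the file, brand names such as adidas and nike do not need to be hard-coded, and repeated statuses for one user simply become separate records.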
Answer 1 (score: 1)
First, copy your sample data with Ctrl+C, then try running the code below.
import pandas as pd
import numpy as np

# Read the copied text; each line lands in column 0 as a single string
df = pd.read_clipboard(header=None)
# Row positions of the product lines (e.g. "adidas" or "adidas,,"),
# plus the end of the data
i = np.where(df[0].str.match(r'^[^,]+,*$'))[0].tolist() + [len(df)]
frames = []
for n in range(len(i) - 1):
    part = df.iloc[i[n]:i[n+1]]
    # Skip the product and header lines, then split each data line into 3 fields
    part_df = part.iloc[2:, 0].str.extract('(.+),(.+),(.+)')
    part_df.columns = ['USER_ID', 'USER_NAME', part.iloc[0, 0].split(',')[0]]
    frames.append(part_df.set_index(['USER_ID', 'USER_NAME']))
final = pd.concat(frames, axis=1).fillna('')
final.to_csv('result.csv')
The result is:
USER_ID,USER_NAME,adidas,nike
b012345,zaihan,Process,
b212345,nurhanani,Check,
b451234,nasirah,Call,
b712345,ibrahim,,Process
b842134,khalee,,Call
b843432,nasirah,,Call
c143322,zaihan,,Check
c234567,ibrahim,Check,
h123455,shabree,,Process
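Since clipboard contents are hard to reproduce, the same idea (find the product lines, slice the rows between them, and join the slices side by side) can be sketched from an in-memory string instead. This is a hedged variant under stated assumptions: the `raw` sample and the names `starts` and `is_product` are illustrative, and a product line is assumed to be a single value optionally followed by trailing commas.

```python
import pandas as pd

raw = """adidas,,
USER_ID,USER_NAME
b012345,zaihan,Process
b451234,nasirah,Call
nike,,
USER_ID,USER_NAME
b842134,khalee,Call
b451234,nasirah,Call
"""

lines = pd.Series(raw.splitlines())

# A product line is one value followed only by commas (assumption).
is_product = lines.str.match(r"^[^,]+,*$")
starts = [pos for pos, flag in enumerate(is_product) if flag] + [len(lines)]

frames = []
for begin, end in zip(starts[:-1], starts[1:]):
    brand = lines.iloc[begin].split(",")[0]
    # Skip the product line and the header line, then split the data rows.
    body = lines.iloc[begin + 2:end].str.split(",", expand=True)
    body.columns = ["USER_ID", "USER_NAME", brand]
    frames.append(body.set_index(["USER_ID", "USER_NAME"]))

# Align the per-brand frames on (USER_ID, USER_NAME).
final = pd.concat(frames, axis=1).sort_index().fillna("")
print(final.to_csv())
```

Like the clipboard version, this relies on each (USER_ID, USER_NAME) pair being unique within a brand; repeated statuses for one user would need an extra key before the frames can be aligned.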
Answer 2 (score: -1)
Maybe this will help: you can use Pandas to merge your 2 datasets.
import pandas as pd
df1 = pd.read_csv("csvdata.csv")
df2 = pd.read_csv("csvdata2.csv")
df3 = df1.merge(df2, on='USER_ID', how='left')
df3 = df3[['USER_ID', 'USER_NAME', 'NIKE', 'ADIDAS']]
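print(df3)
You should change your data so that it contains a header for Nike / Adidas: remove all the headers in it and write the headers with Pandas, just as you did in your original code: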
df1 = pd.read_csv("csvdata.csv", names = ['USER_ID', 'USER_NAME', 'NIKE'])
Or rename the header:
USER_ID,USERNAME,NIKE
b842134,khalee,Call
h123455,shabree,Process
b712345,ibrahim,Process
c143322,zaihan,Check
b843432,nasirah,Call
Edit: If your data is in one file, you could try splitting it into 2 dataframes like this:
index = df1.index[df1['adidas'] == 'nike'].tolist()[0]
df2 = df1[index:]
df1 = df1[:index]
It's a bit sloppy, but it should work...
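The merge idea can be sketched end to end with two small in-memory tables. This is an illustration under assumptions, not the answer's exact code: the inline CSV strings stand in for the two files, and each table is assumed to already have a proper three-column header. It merges on both `USER_ID` and `USER_NAME` so that pandas does not create suffixed `USER_NAME_x` / `USER_NAME_y` columns, and it uses `how="outer"` because a left merge would drop users who appear only in the second table.

```python
import io

import pandas as pd

# Stand-ins for the two per-brand files (illustrative data).
adidas_csv = """USER_ID,USER_NAME,adidas
b012345,zaihan,Process
b843432,nasirah,Call
"""
nike_csv = """USER_ID,USER_NAME,nike
b842134,khalee,Call
b843432,nasirah,Call
"""

adidas = pd.read_csv(io.StringIO(adidas_csv))
nike = pd.read_csv(io.StringIO(nike_csv))

# Outer merge keeps users that exist in only one of the two tables.
merged = adidas.merge(nike, on=["USER_ID", "USER_NAME"], how="outer").fillna("")
print(merged.to_csv(index=False))
```

As with the other approaches, a user with several statuses under one brand would produce several matching rows here, so the merge result should be checked against the expected row count.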