我有一个带逗号分隔符的csv,它在列中有多个由管道分隔的值,我需要将它们映射到具有多个管道分隔值的另一列,然后为它们提供自己的行以及原始数据没有多个值的行。我的CSV看起来像这样(在类别之间使用逗号):
row name city amount
1 frank | john | dave toronto | new york | anaheim 10
2 george | joe | fred fresno | kansas city | reno 20
我需要它看起来像这样:
row name city amount
1 frank toronto 10
2 john new york 10
3 dave anaheim 10
4 george fresno 20
5 joe kansas city 20
6 fred reno 20
答案 0 :(得分:1)
也许不是最好但有效的解决方案: (不使用管道线和不同的管道长度)
df = pd.read_csv('<your_data>.csv')
str_split = ' | '
# Calculate maximum length of piped (' | ') values
df['max_len'] = df[['name', 'city']].apply(lambda x: max(len(x[0].split(str_split)),
len(x[0].split(str_split))), axis=1)
max_len = df['max_len'].max()
# Split '|' piped cell values into columns (needed at unpivot step)
# Create as many new 'name_<x>' & 'city_<x>' columns as 'max_len'
df[['name_{}'.format(i) for i in range(max_len)]] = df['name'].apply(lambda x: \
pd.Series(x.split(str_split)))
df[['city_{}'.format(i) for i in range(max_len)]] = df['city'].apply(lambda x: \
pd.Series(x.split(str_split)))
# Unpivot 'name_<x>' & 'city_<x>' columns into rows
df_pv_name = pd.melt(df, value_vars=['name_{}'.format(i) for i in range(max_len)],
id_vars=['amount'])
df_pv_city = pd.melt(df, value_vars=['city_{}'.format(i) for i in range(max_len)],
id_vars=['amount'])
# Rename upivoted columns (these are the final columns)
df_pv_name = df_pv_name.rename(columns={'value':'name'})
df_pv_city = df_pv_city.rename(columns={'value':'city'})
# Rename 'city_<x>' values (rows) to be 'key' for join (merge)
df_pv_city['variable'] = df_pv_city['variable'].map({'city_{}'.format(i):'name_{}'\
.format(i) for i in range(max_len)})
# Join unpivoted 'name' & 'city' dataframes
df_res = df_pv_name.merge(df_pv_city, on=['variable', 'amount'])
# Drop 'variable' column and NULL rows if you have not equal pipe-length in original rows
# If you want to drop any NULL rows then replace 'all' to 'any'
df_res = df_res.drop(['variable'], axis=1).dropna(subset=['name', 'city'], how='all',
axis=0).reset_index(drop=True)
结果是:
amount name city
0 10 frank toronto
1 20 george fresno
2 10 john new york
3 20 joe kansas city
4 10 dave anaheim
5 20 fred reno
另一个测试输入:
name city amount
0 frank | john | dave | joe | bill toronto | new york | anaheim | los angeles | caracas 10
1 george | joe | fred fresno | kansas city 20
2 danny miami 30
此测试的结果(如果您不想要NaN
行,请在合并时将how='all'
替换为代码中的how='any'
:
amount name city
0 10 frank toronto
1 20 george fresno
2 30 danny miami
3 10 john new york
4 20 joe kansas city
5 10 dave anaheim
6 20 fred NaN
7 10 joe los angeles
8 10 bill caracas
答案 1 :(得分:1)
给出一行:
['1','frank|joe|dave', 'toronto|new york|anaheim', '20']
你可以使用
itertools.izip_longest(*[value.split('|') for value in row])
获取以下结构:
[('1', 'frank', 'toronto', '20'),
(None, 'joe', 'new york', None),
(None, 'dave', 'anaheim', None)]
在这里,我们要将所有None
值替换为相应列中最后看到的值。循环结果时可以完成。
所以鉴于TSV已经被代码分割后的代码应该可以解决问题:
import itertools
def flatten_tsv(lines):
result = []
for line in lines:
flat_lines = itertools.izip_longest(*[value.split('|') for value in line])
for flat_line in flat_lines:
result.append([result[-1][i] if v is None else v
for i, v in enumerate(flat_line)])
return result