Question

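I have a comma-delimited CSV in which some columns hold multiple pipe-separated values. I need to map each value in one such column to the corresponding value in another pipe-separated column and give every pair its own row, while rows in the original data that have no multiple values stay as single rows. My CSV looks like this (with commas between the columns in the actual file):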

row    name                  city                          amount
1      frank | john | dave   toronto | new york | anaheim  10
2      george | joe | fred   fresno | kansas city | reno   20

I need it to look like this:

row    name    city          amount
1      frank   toronto       10
2      john    new york      10
3      dave    anaheim       10
4      george  fresno        20
5      joe     kansas city   20
6      fred    reno          20

Answer 1

Perhaps not the nicest solution, but it works (including rows with no pipes and rows with differing pipe lengths):

import pandas as pd

df = pd.read_csv('<your_data>.csv')
str_split = ' | '

# Calculate maximum length of piped (' | ') values
df['max_len'] = df[['name', 'city']].apply(lambda x: max(len(x['name'].split(str_split)),
    len(x['city'].split(str_split))), axis=1)
max_len = df['max_len'].max()

# Split '|' piped cell values into columns (needed at unpivot step)
# Create as many new 'name_<x>' & 'city_<x>' columns as 'max_len'
df[['name_{}'.format(i) for i in range(max_len)]] = df['name'].apply(lambda x: \
    pd.Series(x.split(str_split)))
df[['city_{}'.format(i) for i in range(max_len)]] = df['city'].apply(lambda x: \
    pd.Series(x.split(str_split)))

# Unpivot 'name_<x>' & 'city_<x>' columns into rows
df_pv_name = pd.melt(df, value_vars=['name_{}'.format(i) for i in range(max_len)],
    id_vars=['amount'])
df_pv_city = pd.melt(df, value_vars=['city_{}'.format(i) for i in range(max_len)],
    id_vars=['amount'])

# Rename unpivoted columns (these are the final columns)
df_pv_name = df_pv_name.rename(columns={'value':'name'})
df_pv_city = df_pv_city.rename(columns={'value':'city'})

# Rename 'city_<x>' values (rows) to be 'key' for join (merge)
df_pv_city['variable'] = df_pv_city['variable'].map({'city_{}'.format(i):'name_{}'\
    .format(i) for i in range(max_len)})

# Join unpivoted 'name' & 'city' dataframes
df_res = df_pv_name.merge(df_pv_city, on=['variable', 'amount'])

# Drop the 'variable' column and the all-NaN rows produced when original rows have unequal pipe lengths
# To drop rows that contain any NaN, replace 'all' with 'any'
df_res = df_res.drop(['variable'], axis=1).dropna(subset=['name', 'city'], how='all',
    axis=0).reset_index(drop=True)

The result is:

   amount    name         city
0      10   frank      toronto
1      20  george       fresno
2      10    john     new york
3      20     joe  kansas city
4      10    dave      anaheim
5      20    fred         reno

Another test input:

                               name                                                  city  amount
0  frank | john | dave | joe | bill  toronto | new york | anaheim | los angeles | caracas      10
1               george | joe | fred                                  fresno | kansas city      20
2                             danny                                                 miami      30

The result for this test (if you do not want rows containing NaN, replace how='all' with how='any' in the dropna call):

   amount    name         city
0      10   frank      toronto
1      20  george       fresno
2      30   danny        miami
3      10    john     new york
4      20     joe  kansas city
5      10    dave      anaheim
6      20    fred          NaN
7      10     joe  los angeles
8      10    bill      caracas
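
As an alternative sketch (not the method used in this answer), newer pandas versions can do the same reshaping with str.split and DataFrame.explode. This assumes pandas 1.3 or later for multi-column explode, and it assumes that name and city hold the same number of values in every row; with unequal lengths explode raises an error, so the melt/merge approach above is still needed for that case. The inline csv_text sample is a hypothetical stand-in for the real file.

```python
import io

import pandas as pd

# Hypothetical in-memory copy of the question's data; the real file is not known.
csv_text = """row,name,city,amount
1,frank | john | dave,toronto | new york | anaheim,10
2,george | joe | fred,fresno | kansas city | reno,20
"""

df = pd.read_csv(io.StringIO(csv_text))

# Turn each piped cell into a list of values with surrounding spaces removed.
for col in ["name", "city"]:
    df[col] = df[col].str.split("|").apply(lambda parts: [p.strip() for p in parts])

# Explode both list columns in lockstep so name[i] stays paired with city[i].
flat = df.explode(["name", "city"], ignore_index=True)

# Renumber the rows 1..n, as in the desired output of the question.
flat["row"] = range(1, len(flat) + 1)

print(flat)
```

Splitting on the single character "|" and stripping afterwards avoids passing the multi-character pattern ' | ' to str.split, which pandas may interpret as a regular expression.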

Answer 2

Given a row:

['1','frank|joe|dave', 'toronto|new york|anaheim', '20']

You can use (on Python 3; the function is named izip_longest on Python 2)

itertools.zip_longest(*[value.split('|') for value in row])

to get the following structure:

[('1', 'frank', 'toronto', '20'),
 (None, 'joe', 'new york', None),
 (None, 'dave', 'anaheim', None)]

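Here we want to replace every None value with the last value seen in the corresponding column. This can be done while looping over the result.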

So, given that each line of the TSV has already been split into fields, the following code should do the trick:

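import itertools


def flatten_tsv(lines):
    """Expand each pre-split line into one row per pipe-separated value."""
    result = []
    for line in lines:
        # zip_longest (izip_longest on Python 2) pads shorter columns with None
        flat_lines = itertools.zip_longest(*[value.split('|') for value in line])
        for flat_line in flat_lines:
            # Replace each None with the value from the same column of the previous output row
            result.append([result[-1][i] if v is None else v
                           for i, v in enumerate(flat_line)])
    return result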
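
To connect this to the comma-separated file from the question, here is a minimal sketch using the standard csv module on Python 3. The sample data, the flatten_rows name, the whitespace stripping around each pipe, and the renumbering of the first column are assumptions added for illustration; the fill logic follows the same idea as flatten_tsv.

```python
import csv
import io
from itertools import zip_longest


def flatten_rows(rows):
    """Yield one output row per pipe-separated value, repeating single values."""
    for row in rows:
        # Split every cell on '|' and trim the spaces around each piece.
        columns = [[part.strip() for part in cell.split("|")] for cell in row]
        previous = None
        for flat in zip_longest(*columns):
            # None means this column ran out of values: reuse the previous row's value.
            current = [previous[i] if v is None else v for i, v in enumerate(flat)]
            previous = current
            yield current


# Hypothetical sample mirroring the question's data.
csv_text = """row,name,city,amount
1,frank | john | dave,toronto | new york | anaheim,10
2,george | joe | fred,fresno | kansas city | reno,20
"""

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)
flattened = list(flatten_rows(reader))

# Renumber the first column so every output row gets its own row number.
for number, row in enumerate(flattened, start=1):
    row[0] = str(number)

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(header)
writer.writerows(flattened)
print(out.getvalue())
```

Note that, unlike Answer 1, a column that runs out of values repeats its last value instead of producing NaN.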

Splitting multiple pipe-separated values in multiple columns of a comma-delimited CSV and mapping them to each other
