Question

Here is my CSV:

languages,    origin,     other_test1,       other_test2
"[{'name': 'French', 'vowel_count': 3}, {'name': 'Dutch', 'vowel_count': 4}, {'name': 'English', 'vowel_count': 5}]",Germanic,ABC,DEF

I want to convert the languages column of the CSV into the following output:

Language_name ,Language_vowel_count, origin,    other.test1, other.test2
French,        3,                    Germanic,  ABC,         DEF
Dutch,         4,                    Germanic,  ABC,         DEF
English,       5,                    Germanic,  ABC,         DEF

The code I have tried:

from itertools import chain

a = df['languages'].str.findall("'(.*?)'").astype(np.object)
lens = a.str.len()

df = pd.DataFrame({
    'origin': df['origin'].repeat(lens),
    'other_test1': df['other_test1'].repeat(lens),
    'other_test2': df['other_test2'].repeat(lens),
    'name': list(chain.from_iterable(a.tolist())),
    'vowel_count': list(chain.from_iterable(a.tolist())),
})

df

But it does not give me the expected output.

Answer 1

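You can use a nested list comprehension to unpack the data, and ast.literal_eval to convert each string in the languages column into a list of Python dictionaries. The cells use single quotes, so they are Python literals rather than strict JSON.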

import ast

>>> pd.DataFrame(
    [[languages.get('name'), languages.get('vowel_count'), row['origin'], row['other_test1'], row['other_test2']]
     for idx, row in df.iterrows() 
     for languages in ast.literal_eval(row['languages'])],
    columns=['Language_name', 'Language_vowel_count', 'origin', 'other.test1', 'other.test2'])
  Language_name  Language_vowel_count    origin other.test1 other.test2
0        French                     3  Germanic         ABC         DEF
1         Dutch                     4  Germanic         ABC         DEF
2       English                     5  Germanic         ABC         DEF
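
As a small illustration of why ast.literal_eval is used here, the sketch below parses a shortened copy of the cell from the question. The variable names (cell, parsed, names, json_ok) are made up for this example; it only assumes the cell text looks like the one shown in the CSV.

```python
import ast
import json

# A shortened copy of the cell from the question: single-quoted Python literal.
cell = "[{'name': 'French', 'vowel_count': 3}, {'name': 'Dutch', 'vowel_count': 4}]"

# literal_eval safely evaluates Python literals (lists, dicts, strings, numbers).
parsed = ast.literal_eval(cell)
names = [d['name'] for d in parsed]

# json.loads rejects the same text because JSON requires double-quoted strings.
try:
    json.loads(cell)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

print(names, json_ok)
```

If the file had been written with real JSON (double quotes), json.loads would be the more natural choice.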

Another approach, which avoids iterrows, concatenates the unpacked languages with the remaining columns:

# Parse each cell into a list of dicts
languages = df['languages'].apply(ast.literal_eval)

# One (name, vowel_count) row per language
df_lang = pd.DataFrame(
    [(lang.get('name'), lang.get('vowel_count'))
     for language in languages
     for lang in language])

# Repeat each base row once per language it contains, then join side by side
df_new = pd.concat([
    df_lang,
    df.iloc[:, 1:].reindex(df.index.repeat([len(x) for x in languages])).reset_index(drop=True)], axis=1)

df_new.columns = ['Language_name', 'Language_vowel_count', 'origin', 'other.test1', 'other.test2']
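
For comparison, here is a self-contained sketch of a different technique, DataFrame.explode, which is not what the answer above uses and assumes pandas 0.25 or later. The CSV text is the sample from the question with the padding spaces removed from the header; csv_text, exploded, lang and result are names chosen for this example.

```python
import ast
import io

import pandas as pd

# Sample data from the question (header spaces removed for simplicity).
csv_text = '''languages,origin,other_test1,other_test2
"[{'name': 'French', 'vowel_count': 3}, {'name': 'Dutch', 'vowel_count': 4}, {'name': 'English', 'vowel_count': 5}]",Germanic,ABC,DEF
'''

df = pd.read_csv(io.StringIO(csv_text))
df['languages'] = df['languages'].apply(ast.literal_eval)

# explode yields one row per list element and repeats the other columns.
exploded = df.explode('languages').reset_index(drop=True)

# Turn the dicts into columns and prefix them to match the desired header.
lang = pd.DataFrame(exploded['languages'].tolist()).add_prefix('Language_')
result = pd.concat([lang, exploded.drop(columns='languages')], axis=1)

print(result)
```

The remaining columns keep their original names (other_test1, other_test2); rename them afterwards if the dotted names are required.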

Answer 2

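import json
import re

import pandas as pd

csv = """"[{'name': 'French', 'vowel_count': 3}, {'name': 'Dutch', 'vowel_count': 4}, {'name': 'English', 'vowel_count': 5}]",Germanic,ABC,DEF"""
# Split on commas that are not inside the [...] list (raw string avoids invalid-escape warnings)
csv = re.split(r'(?![^)(]*\([^)(]*?\)\)),(?![^\[]*\])', csv)
# Swap single quotes for double quotes so the list is valid JSON, then drop the outer quotes
df = pd.DataFrame(json.loads(csv[0].replace("'", '"')[1:-1]))
# Scalar assignment broadcasts the remaining fields to every row
df['Origin'] = csv[1]
df['other.test1'] = csv[2]
df['other.test2'] = csv[3]
df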
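
The regular expression above is one way to split the line. A possibly simpler sketch, assuming the line is ordinary quoted CSV as in the question, lets the standard csv module handle the quoted field and ast.literal_eval parse the list. The names line, fields and rows are illustrative, and this is a substitute technique rather than the answer's own method.

```python
import ast
import csv
import io

line = '''"[{'name': 'French', 'vowel_count': 3}, {'name': 'Dutch', 'vowel_count': 4}, {'name': 'English', 'vowel_count': 5}]",Germanic,ABC,DEF'''

# csv.reader treats the double-quoted field as one value despite its commas.
fields = next(csv.reader(io.StringIO(line)))

# The first field is a Python literal: a list of dicts.
languages = ast.literal_eval(fields[0])

# One output row per language, repeating the other three fields.
rows = [
    {
        'Language_name': d['name'],
        'Language_vowel_count': d['vowel_count'],
        'origin': fields[1],
        'other.test1': fields[2],
        'other.test2': fields[3],
    }
    for d in languages
]

for row in rows:
    print(row)
```

Passing rows to pd.DataFrame would give the same shape of table as the answers above.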

Convert a DataFrame column into multiple rows, repeating the values of the other columns
