If a cell contains multiple strings, put each one into a new cell in Pandas

Time: 2017-05-10 08:22:32

Tags: python database pandas

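I am working with Pandas, and I have a cell that contains multiple words (i.e. strings). I need to put each word on its own row while keeping the associated data aligned. I found a method that would help, but it works with numbers, not strings. Which method should I use?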

A simple example of my table:

id name     method
1  adenosis mammography, mri

I need it to look like this:

id name     method
1  adenosis mammography
            mri

Thanks!

Update

Following @jezrael's suggestion, this is what I am trying to do:

import pandas as pd
import numpy as np
xl = pd.ExcelFile("./dev/eyetoai/google_form_pure.xlsx")
xl.sheet_names
df = xl.parse("Form Responses 1")
df.groupby(['Name of condition','Condition description','Relevant Modality','Type of finding Mammography', 'Type of finding MRI', 'Type of finding US']).mean()
splitted = df['Relevant Modality'].str.split(',')
l = splitted.str.len()
df = pd.DataFrame({col: np.repeat(df[col], l) for col in ['Name of condition','Condition description']})
df['Relevant Modality'] = np.concatenate(splitted)

But I get this error: TypeError: repeat() takes exactly 2 arguments (3 given)

2 answers:

Answer 0 (score: 1)

You can use read_excel + split + stack + drop + join + reset_index:

import pandas as pd

#define the columns that need to be split on ',' and then flattened
cols = ['Condition description','Relevant Modality']

#read the Excel file into a dataframe
df = pd.read_excel('Untitled 1.xlsx')
#print (df)

df1 = pd.DataFrame({col: df[col].str.split(',', expand=True).stack() for col in cols})
print (df1)
                                 Condition description Relevant Modality
0 0  Fibroadenomas are the most common cause of a b...       Mammography
  1                                                NaN                US
  2                                                NaN               MRI
1 0                    Papillomas are benign neoplasms       Mammography
  1                                  arising in a duct                US
  2   either centrally or peripherally within the b...               MRI
  3   leading to a nipple discharge. As they are of...               NaN
  4                 the discharge may be bloodstained.               NaN
2 0                                                 OK       Mammography
3 0                                      breast cancer       Mammography
  1                                                NaN                US
4 0                                breast inflammation       Mammography
  1                                                NaN                US

#remove the original columns
df = df.drop(cols, axis=1)
#create a MultiIndex in the original df so its rows align with df1
df.index = [df.index, [0]* len(df.index)]
#join the original to the flattened columns, then drop the MultiIndex
df = df1.join(df).reset_index(drop=True)
#print (df)
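
For the approach in the question's update, the TypeError most likely arises because np.repeat hands the call over to the Series' own repeat method, which in pandas versions of that era did not accept the extra axis argument. A minimal sketch of a workaround, assuming that is the cause, is to repeat plain NumPy arrays instead of Series. The sample data below is hypothetical and only mirrors the small table from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data modelled on the question's table.
df = pd.DataFrame({
    "id": [1, 2],
    "name": ["adenosis", "cyst"],
    "method": ["mammography, mri", "us"],
})

# Split each cell into a list of strings and count the pieces per row.
parts = df["method"].str.split(",")
counts = parts.str.len().values

# Repeat the other columns as plain NumPy arrays (.values), not as Series.
out = pd.DataFrame({col: np.repeat(df[col].values, counts) for col in ["id", "name"]})

# Flatten the lists into one column and trim the space left after each comma.
out["method"] = np.concatenate(parts.tolist())
out["method"] = out["method"].str.strip()
print(out)
```

On pandas 0.25 or later, DataFrame.explode offers a shorter route to the same result, but it did not exist when this question was asked.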

Answer 1 (score: 1)

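The previous answer is correct, though I think you should use the id as the reference. A simpler approach might be to parse the method string into a list: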

method_list = method.split(',')
method_list = np.asarray(method_list)

If you run into index problems when initializing the DataFrame, just set the index explicitly:

df = pd.DataFrame(data, index=[0, 0])
df = df.set_index('id')

Passing the list as the value of the 'method' key automatically repeats the 'id' and 'name' values for each row:

id       method      name
1   mammography  adenosis
1           mri  adenosis
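
A minimal runnable sketch of this idea for a single record, assuming the values come from the question's example table. It relies on the DataFrame constructor broadcasting scalar values when at least one column is given as a list:

```python
import pandas as pd

# Assumed input: the single row from the question's example.
method = "mammography, mri"

# Split on commas and trim the surrounding whitespace from each piece.
method_list = [m.strip() for m in method.split(",")]

# The scalar 'id' and 'name' values are repeated to match the list length.
df = pd.DataFrame({"id": 1, "name": "adenosis", "method": method_list})
df = df.set_index("id")
print(df)
```

This handles one record at a time; for a whole table, each row would need the same treatment before the pieces are concatenated.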

I hope this helps. All the best!