I have in.csv:
Box,Color,Contents
1,Blue,"thing one [version 1] [dd/mm/yyyy]
thing two [version 1] [dd/mm/yyyy]
thing three [version 1] [dd/mm/yyyy]"
2,Red,thing four [version 1] [dd/mm/yyyy]
3,Green,"thing five [version 1] [dd/mm/yyyy]
thing six version 1] [dd/mm/yyyy]"
and I am trying to create out.csv:
Box,Color,Contents
1,Blue,thing one [version 1] [dd/mm/yyyy]
1,Blue,thing two [version 1] [dd/mm/yyyy]
1,Blue,thing three [version 1] [dd/mm/yyyy]
2,Red,thing four [version 1] [dd/mm/yyyy]
3,Green,thing five [version 1] [dd/mm/yyyy]
3,Green,thing six version 1] [dd/mm/yyyy]
I can use str.split, like this:
df = pd.DataFrame(df['Contents'].str.split(' ').values.tolist())
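That handles only one delimiter. I need to split on both a double space and end of line (EOL), but everything I found while searching on regular expressions says I need to use re.split instead. My syntax doesn't work; instead, I get: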
df = pd.DataFrame(df['Contents'].re.split('\n' , ' ').values.tolist())
AttributeError: 'Series' object has no attribute 're'
My searching has gotten out of hand. Can anyone help? Thanks.
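For reference, the AttributeError arises because pandas exposes string methods through the .str accessor, and a Series has no .re accessor. Series.str.split treats a multi-character pattern as a regular expression by default, so a single pattern can cover both delimiters. A minimal sketch, assuming a made-up Series (the sample values below are hypothetical, not taken from in.csv):

```python
import pandas as pd

# Hypothetical sample: lines separated by '\n', fields by two spaces
s = pd.Series(["a  b\nc  d", "e  f"])

# One regex covers both delimiters: a newline (optionally preceded by '\r')
# or a run of two or more spaces. A pattern longer than one character is
# interpreted as a regular expression by default.
parts = s.str.split(r"\r?\n| {2,}")
print(parts.tolist())
```

Note that this yields one flat list per cell, so it splits the text but does not by itself produce one row per line.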
Answer 0 (score: 0)
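You want to create the new rows first (splitting on '\r\n'), and then the new columns (splitting on the double space).
This may be a tedious way of doing it; if you have a more Pythonic one, please let me know.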
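import pandas as pd

df = pd.read_csv('in.csv')
# number of lines in each Contents cell (use '\n' instead if the file has Unix line endings)
df['repeats'] = df['Contents'].str.split('\r\n').str.len()
# create new df, repeating each row once per line
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
rows = []
for i in range(len(df)):
    rows.extend([df.iloc[[i]]] * int(df.iloc[i]['repeats']))
df1 = pd.concat(rows).reset_index(drop=True)
df1['Contents'] = sum(df['Contents'].str.split('\r\n').values, [])  # flatten the list of lists
# split each line on the double space into three columns
df1[['thing', 'version', 'date']] = pd.DataFrame(df1['Contents'].str.split('  ').values.tolist())
df1 = df1[['Box', 'Color', 'thing', 'version', 'date']]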
Output:
   Box  Color        thing      version          date
0    1   Blue    thing one  [version 1]  [dd/mm/yyyy]
1    1   Blue    thing two  [version 1]  [dd/mm/yyyy]
2    1   Blue  thing three  [version 1]  [dd/mm/yyyy]
3    2    Red   thing four  [version 1]  [dd/mm/yyyy]
4    3  Green   thing five  [version 1]  [dd/mm/yyyy]
5    3  Green    thing six   version 1]  [dd/mm/yyyy]
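A possibly more compact variant of the same two steps (rows first, then columns) is sketched below. It is only a sketch under assumptions: it needs pandas 0.25 or later for DataFrame.explode, the inline csv_text is a hypothetical stand-in for in.csv, and the fields inside Contents are assumed to be separated by two spaces.

```python
import io
import pandas as pd

# Hypothetical stand-in for in.csv: lines inside a quoted cell are separated
# by '\n', fields inside a line by two spaces (both are assumptions)
csv_text = (
    'Box,Color,Contents\n'
    '1,Blue,"thing one  [version 1]  [dd/mm/yyyy]\n'
    'thing two  [version 1]  [dd/mm/yyyy]"\n'
    '2,Red,thing four  [version 1]  [dd/mm/yyyy]\n'
)
df = pd.read_csv(io.StringIO(csv_text))

# Step 1: one row per line of Contents
long_df = (
    df.assign(Contents=df["Contents"].str.split(r"\r?\n"))
      .explode("Contents")
      .reset_index(drop=True)
)

# Step 2: one column per field, splitting on runs of two or more spaces
fields = long_df["Contents"].str.split(r" {2,}", expand=True)
fields.columns = ["thing", "version", "date"]

out = pd.concat([long_df[["Box", "Color"]], fields], axis=1)
print(out)
```

The pattern r"\r?\n" accepts both Windows and Unix line endings, so it does not matter which one the file uses.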