Question

I have the following data frame:

 index    Index_Date    A    B    C    D
 ===========================================
 1        2015-01-31    10   10   we   10
 2        2015-02-01     2    3   jk   22 and 23 and 24 
 3        2015-02-02    10   60   nm   280 and 284
 4        2015-02-03    10  100   oi   250
 5        2015-02-03    10  100   yh  Egyptian and Hittite

I want to get this:

 index    Index_Date    A    B    C    D
 ===========================================
 1        2015-01-31    10   10   we  10
 2        2015-02-01     2    3   jk  22
 3        2015-02-01     2    3   jk  23
 4        2015-02-01     2    3   jk  24
 5        2015-02-02    10   60   nm  280
 6        2015-02-02    10   60   nm  284
 7        2015-02-03    10  100   oi  250
 8        2015-02-03    10  100   yh  Egyptian
 9        2015-02-03    10  100   yh  Hittite

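Basically, the program should look for the word "and" in column D. If it finds one, it should duplicate that row, keep the part before the first "and" (22) in the original row, and put each following part (23, 24) into its own duplicated row, with all other columns copied unchanged.

I started from this, but I don't know where to go from here.

I also asked a simpler version of this before. Again, I'm not sure whether this is too hard or too easy.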

Answer 1

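Here's one way:

import pandas as pd

# Copy the table below to the clipboard first; columns are separated by two or more spaces.
df = pd.read_clipboard(sep=r'\s\s+')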

Index_Date    A    B    C    D
2015-01-31    10   10   we  10
2015-02-01     2    3   jk  22 and 23 and 24 
2015-02-02    10   60   nm  280
2015-02-03    10  100   oi  250


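# Keep the other columns in the index, split D on ' and ' (with the surrounding
# spaces, so no stray whitespace is left in the pieces), then stack the pieces into rows.
df.set_index(['Index_Date', 'A', 'B', 'C']).D.str.split(' and ', expand=True)\
.stack().reset_index(4, drop=True).reset_index(name='D')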

    Index_Date  A   B   C   D
0   2015-01-31  10  10  we  10
1   2015-02-01  2   3   jk  22
2   2015-02-01  2   3   jk  23
3   2015-02-01  2   3   jk  24
4   2015-02-02  10  60  nm  280
5   2015-02-03  10  100 oi  250
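
The chain above may be easier to follow when split into steps. This is only a sketch: it assumes the frame is built inline instead of read from the clipboard and that column D already holds strings. The names `keys`, `wide`, `long_` and `out` are illustrative and not part of the answer.

```python
import pandas as pd

# Same sample data as above, built inline (assumption: D is a string column).
df = pd.DataFrame({
    'Index_Date': ['2015-01-31', '2015-02-01', '2015-02-02', '2015-02-03'],
    'A': [10, 2, 10, 10],
    'B': [10, 3, 60, 100],
    'C': ['we', 'jk', 'nm', 'oi'],
    'D': ['10', '22 and 23 and 24', '280', '250'],
})

keys = ['Index_Date', 'A', 'B', 'C']

# 1. Move the columns to keep into the index, then split D into one column per piece.
wide = df.set_index(keys)['D'].str.split(' and ', expand=True)

# 2. stack() turns those columns into rows. dropna() removes the empty cells of
#    rows with fewer pieces; older pandas versions already drop them inside stack(),
#    so the extra call is harmless there.
long_ = wide.stack().dropna()

# 3. Drop the helper index level created by stack() and restore the key columns.
out = long_.reset_index(level=-1, drop=True).reset_index(name='D')
print(out)
```

Note that D stays a string column here; add `out['D'] = out['D'].astype(int)` if every piece is numeric.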

Answer 2

Here's one way:

import pandas as pd

df = pd.DataFrame([['2015-01-31', 10, 10, 'we', 10],
                   ['2015-02-01', 2, 3, 'jk', '22 and 23 and 24'],
                   ['2015-02-02', 10, 60, 'nm', 280],
                   ['2015-02-03', 10, 100, 'oi', 250]],
                  columns=['Index_Date', 'A', 'B', 'C', 'D'])

df.loc[df.D.astype(str).str.contains('and').fillna(False), 'D'] = df.D.str.split('and')

res = df.set_index(['Index_Date', 'A', 'B', 'C'])['D'].apply(pd.Series).stack().reset_index()
res = res.rename(columns={0: 'D'})
res.D = res.D.astype(int)
res = res[['Index_Date', 'A', 'B', 'C', 'D']]

#    Index_Date   A    B   C    D
# 0  2015-01-31  10   10  we   10
# 1  2015-02-01   2    3  jk   22
# 2  2015-02-01   2    3  jk   23
# 3  2015-02-01   2    3  jk   24
# 4  2015-02-02  10   60  nm  280
# 5  2015-02-03  10  100  oi  250

Answer 3

This question and its variants have been asked many times, and there are many ways to do it.

import numpy as np

D = df.D.astype(str).str.split(' and ')   # one list of pieces per row
idx = df.index.repeat(D.str.len())        # repeat each index label once per piece
df.loc[idx].assign(D=np.concatenate(D).astype(int))

   Index_Date   A    B   C    D
0  2015-01-31  10   10  we   10
1  2015-02-01   2    3  jk   22
1  2015-02-01   2    3  jk   23
1  2015-02-01   2    3  jk   24
2  2015-02-02  10   60  nm  280
3  2015-02-03  10  100  oi  250
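
On pandas 0.25 or later, `DataFrame.explode` offers another route. This is a different technique from the answers above, not a rewrite of them. It also copes with the non-numeric values in the original question (`Egyptian and Hittite`), where `astype(int)` would raise. A minimal sketch, assuming the separator is always the word `and` with a single space on each side:

```python
import pandas as pd

# The full data from the question, including the text-only row.
df = pd.DataFrame({
    'Index_Date': ['2015-01-31', '2015-02-01', '2015-02-02',
                   '2015-02-03', '2015-02-03'],
    'A': [10, 2, 10, 10, 10],
    'B': [10, 3, 60, 100, 100],
    'C': ['we', 'jk', 'nm', 'oi', 'yh'],
    'D': [10, '22 and 23 and 24 ', '280 and 284', 250, 'Egyptian and Hittite'],
})

# Normalise D to stripped strings, then split each value into a list of pieces.
pieces = df['D'].astype(str).str.strip().str.split(' and ')

# explode() emits one row per list element and repeats the other columns.
out = df.assign(D=pieces).explode('D').reset_index(drop=True)
print(out)
```

The result keeps D as strings, since the column mixes numbers and words; converting the numeric pieces back is left out of this sketch.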
