Hi, I have a df like the one below:
index a b c d
0 xx aa av NaN
1 pp as ka [1,2,3,4]
2 pa aj q 1234
3 xq aq aq NaN
4 pn an kn [10,20,30,40]
5 px ax kx "00012"
I want to convert it into the following:
index a b c d d-separated
0 xx aa av NaN NaN
1 pp as ka [1,2,3,4] 1
2 pp as ka [1,2,3,4] 2
3 pp as ka [1,2,3,4] 3
4 pp as ka [1,2,3,4] 4
5 pa aj q 1234 1234
6 xq aq aq NaN NaN
7 pn an kn [10,20,30,40] 10
8 pn an kn [10,20,30,40] 20
9 pn an kn [10,20,30,40] 30
10 pn an kn [10,20,30,40] 40
11 px ax kx "00012" "00012"
I have looked at
"pandas: When cell contents are lists, create a row for each element in the list" and
"Split (explode) pandas dataframe string entry to separate rows",
but my case differs from theirs, so those solutions do not work on my example. Thanks for your help.
Answer 0 (score: 0)
Setup
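import numpy as np
import pandas as pd
df = pd.DataFrame({'a': ['xx', 'pp', 'pa', 'xq', 'pn', 'px'],
'b': ['aa', 'as', 'aj', 'aq', 'an', 'ax'],
'c': ['av', 'ka', 'q', 'aq', 'kn', 'kx'],
'd': [np.nan, [1,2,3,4], 1234, np.nan, [10, 20, 30, 40], '00012']})
This is a tricky problem, mainly because of the NaN values, so I first replace them with a fill value and then change them back at the end:
# dropna() does nothing where stack() already drops the NaN padding; it keeps the result the same on pandas versions where stack() keeps it
(df.join(df.fillna(-999)
.d.apply(pd.Series))
.drop(columns='d').set_index(['a', 'b', 'c'])
.stack().dropna().reset_index()
.drop(columns='level_3')
.replace(-999, np.nan).rename(columns={0: 'd-separated'})
)
a b c d-separated
0 xx aa av NaN
1 pp as ka 1
2 pp as ka 2
3 pp as ka 3
4 pp as ka 4
5 pa aj q 1234
6 xq aq aq NaN
7 pn an kn 10
8 pn an kn 20
9 pn an kn 30
10 pn an kn 40
11 px ax kx 00012
This does lose the original d column: it contains unhashable types (lists), so it cannot be set as an index level.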
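The answer above loses column d because lists cannot go into the index. One possible workaround, sketched here under assumptions and not part of the original answer, is to stack on the original row labels instead of on a, b and c, and then repeat the source rows by label. The names SENTINEL, rows, wide, long_d and result are made up for illustration, and -999 is assumed never to occur as a real value in d.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': ['xx', 'pp', 'pa', 'xq', 'pn', 'px'],
    'b': ['aa', 'as', 'aj', 'aq', 'an', 'ax'],
    'c': ['av', 'ka', 'q', 'aq', 'kn', 'kx'],
    'd': [np.nan, [1, 2, 3, 4], 1234, np.nan, [10, 20, 30, 40], '00012'],
})

SENTINEL = -999  # assumption: this value never appears in column d

# Wrap scalars (and the sentinel standing in for NaN) in one-element lists.
rows = [x if isinstance(x, list) else [x] for x in df['d'].fillna(SENTINEL)]

# One column per list position; shorter rows are padded with missing values.
wide = pd.DataFrame(rows, index=df.index, dtype=object)

# Long format keyed by the original row label; dropna() removes the padding,
# then the sentinel is turned back into NaN.
long_d = (wide.stack()
              .dropna()
              .replace(SENTINEL, np.nan)
              .reset_index(level=1, drop=True))

# Repeat each source row once per element and attach the separated values.
result = (df.loc[long_d.index]
            .assign(**{'d-separated': long_d.values})
            .reset_index(drop=True))
print(result)
```

Because the rows are selected by their original labels, column d is carried along unchanged, lists included.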
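Answer 1 (score: 0)
This is possible, but not trivial. For the column that has to go into the index, convert each list to a tuple so that it is hashable, and wrap each scalar in a one-element list so that a DataFrame can be built with the constructor:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': ['xx', 'pp', 'pa', 'xq', 'pn', 'px'],
'b': ['aa', 'as', 'aj', 'aq', 'an', 'ax'],
'c': ['av', 'ka', 'q', 'aq', 'kn', 'kx'],
'd': [np.nan, [1,2,3,4], '1234', np.nan, [10, 20, 30, 40], '00012']})
# d1 holds a list for every row (NaN becomes the placeholder 'NANval'); d holds the hashable version used in the index
s = (df.assign(d1=df['d'].fillna('NANval').apply(lambda x: x if isinstance(x, list) else [x]),
d=df['d'].apply(lambda x: tuple(x) if isinstance(x, list) else x))
.set_index(['a','b','c','d'])['d1']
)
print(s)
a b c d
xx aa av NaN [NANval]
pp as ka (1, 2, 3, 4) [1, 2, 3, 4]
pa aj q 1234 [1234]
xq aq aq NaN [NANval]
pn an kn (10, 20, 30, 40) [10, 20, 30, 40]
px ax kx 00012 [00012]
Name: d1, dtype: object
Then build a DataFrame from the lists, stack it into one value per row, and turn the placeholder back into NaN:
df = (pd.DataFrame(s.values.tolist(), index=s.index)
.stack()
.reset_index(4, drop=True)
.reset_index(name='d-separated')
.replace('NANval', np.nan)
)
Finally, if necessary, convert the tuples in column d back to lists.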
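The code for that last step did not survive in this copy of the answer. A minimal sketch of what it could look like follows; it is an assumption, not the answerer's original code. The small frame out is a hypothetical stand-in for the reshaped result, with tuples in column d.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the reshaped frame: column d holds tuples
# where the source data had lists.
out = pd.DataFrame({
    'd': [np.nan, (1, 2, 3, 4), (1, 2, 3, 4), '1234'],
    'd-separated': [np.nan, 1, 2, '1234'],
})

# Turn tuples back into lists; leave scalars and NaN untouched.
out['d'] = out['d'].apply(lambda x: list(x) if isinstance(x, tuple) else x)
print(out)
```

The isinstance check matters because calling list() on a string such as '1234' would split it into characters.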
Answer 2 (score: 0)
First expand the DataFrame to the required size by repeating each row as many times as needed:
df1 = df.loc[df.index.repeat([len(x) if isinstance(x, list) else 1 for x in df.d])]
Now flatten column d and concatenate it with df1 from above:
d_sep = pd.DataFrame({'d_Sep': sum([x if isinstance(x, list) else [x] for x in df.d], [])})
df2 = pd.concat([df1.reset_index(drop=True), d_sep], axis=1)
a b c d d_Sep
0 xx aa av NaN NaN
1 pp as ka [1, 2, 3, 4] 1
2 pp as ka [1, 2, 3, 4] 2
3 pp as ka [1, 2, 3, 4] 3
4 pp as ka [1, 2, 3, 4] 4
5 pa aj q 1234 1234
6 xq aq aq NaN NaN
7 pn an kn [10, 20, 30, 40] 10
8 pn an kn [10, 20, 30, 40] 20
9 pn an kn [10, 20, 30, 40] 30
10 pn an kn [10, 20, 30, 40] 40
11 px ax kx 00012 00012
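For comparison, pandas also provides DataFrame.explode (added in 0.25; the ignore_index parameter needs 1.1 or later), which none of the answers above use. The sketch below assumes such a pandas version and copies d into a new column before exploding, so the original d stays intact. The name result is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': ['xx', 'pp', 'pa', 'xq', 'pn', 'px'],
    'b': ['aa', 'as', 'aj', 'aq', 'an', 'ax'],
    'c': ['av', 'ka', 'q', 'aq', 'kn', 'kx'],
    'd': [np.nan, [1, 2, 3, 4], 1234, np.nan, [10, 20, 30, 40], '00012'],
})

# Copy d, then explode only the copy: each list element gets its own row,
# while scalars, strings and NaN are kept as single rows.
result = (df.assign(**{'d-separated': df['d']})
            .explode('d-separated', ignore_index=True))
print(result)
```

Strings are not split by explode, so '00012' stays a single value, and no fill value is needed for NaN.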