Pandas: when cell contents are a list / NaN / string, create a row for each element

Time: 2018-07-19 04:24:22

Tags: python python-3.x pandas

Hi, I have a df like the one below:

index a  b  c  d
0     xx aa av NaN
1     pp as ka [1,2,3,4]
2     pa aj q  1234
3     xq aq aq NaN
4     pn an kn [10,20,30,40]
5     px ax kx "00012" 

I want to convert it into the following:

index a  b  c  d              d-separated
0     xx aa av NaN            NaN
1     pp as ka [1,2,3,4]      1
2     pp as ka [1,2,3,4]      2
3     pp as ka [1,2,3,4]      3
4     pp as ka [1,2,3,4]      4
5     pa aj q  1234           1234
6     xq aq aq NaN            NaN
7     pn an kn [10,20,30,40]  10
8     pn an kn [10,20,30,40]  20
9     pn an kn [10,20,30,40]  30
10    pn an kn [10,20,30,40]  40
11    px ax kx "00012"        "00012"

I have looked at:

pandas: When cell contents are lists, create a row for each element in the list

Split (explode) pandas dataframe string entry to separate rows

However, my case is different from theirs, so those solutions do not work on my example. Thank you for your help.

3 answers:

Answer 0: (score: 0)

Setup

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['xx', 'pp', 'pa', 'xq', 'pn', 'px'],
                   'b': ['aa', 'as', 'aj', 'aq', 'an', 'ax'],
                   'c': ['av', 'ka', 'q', 'aq', 'kn', 'kx'],
                   'd': [np.nan, [1,2,3,4], 1234, np.nan, [10, 20, 30, 40], '00012']})

This is a tricky problem, mainly because of the NaN values, so I first replaced them with a fill value and then changed them back at the end:

(df.join(df.fillna(-999).d.apply(pd.Series))   # one column per list element
   .drop(columns='d').set_index(['a', 'b', 'c'])
   .stack().reset_index()                      # one row per element
   .drop(columns='level_3')
   .replace(-999, np.nan).rename(columns={0: 'd-separated'})
)

     a   b   c d-separated
0   xx  aa  av         NaN
1   pp  as  ka           1
2   pp  as  ka           2
3   pp  as  ka           3
4   pp  as  ka           4
5   pa  aj   q        1234
6   xq  aq  aq         NaN
7   pn  an  kn          10
8   pn  an  kn          20
9   pn  an  kn          30
10  pn  an  kn          40
11  px  ax  kx       00012

This does lose the original d column, because it contains unhashable types and therefore cannot be set as an index level.
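The point about unhashable types can be checked in plain Python. The sketch below uses a hypothetical helper, `is_hashable` (not a pandas function), to show which of the values found in column d could serve as index labels:

```python
def is_hashable(value):
    """Return True if hash(value) succeeds (illustrative helper, not a pandas API)."""
    try:
        hash(value)
    except TypeError:
        return False
    return True

# Lists are mutable and therefore unhashable; tuples, floats and strings are hashable.
print(is_hashable([1, 2, 3, 4]))   # False
print(is_hashable((1, 2, 3, 4)))   # True
print(is_hashable(float('nan')))   # True
print(is_hashable('00012'))        # True
```

This is why a column holding lists cannot be passed to set_index as-is, while the same data stored as tuples can.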

Answer 1: (score: 0)

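This is possible, but not trivial. The column that becomes an index level needs its lists converted to tuples, which are hashable, and for the DataFrame constructor the scalars need to be wrapped in one-element lists: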

df = pd.DataFrame({'a': ['xx', 'pp', 'pa', 'xq', 'pn', 'px'], 
                   'b': ['aa', 'as', 'aj', 'aq', 'an', 'ax'], 
                   'c': ['av', 'ka', 'q', 'aq', 'kn', 'kx'], 
                   'd': [np.nan, [1,2,3,4], '1234', np.nan, [10, 20, 30, 40], '00012']})


s = (df.assign(d1=df['d'].fillna('NANval').apply(lambda x: x if isinstance(x, list) else [x]),
               d = df['d'].apply(lambda x: tuple(x) if isinstance(x, list) else x))
       .set_index(['a','b','c','d'])['d1']
       )
print (s)
a   b   c   d               
xx  aa  av  NaN                         [NANval]
pp  as  ka  (1, 2, 3, 4)            [1, 2, 3, 4]
pa  aj  q   1234                          [1234]
xq  aq  aq  NaN                         [NANval]
pn  an  kn  (10, 20, 30, 40)    [10, 20, 30, 40]
px  ax  kx  00012                        [00012]
Name: d1, dtype: object

df = (pd.DataFrame(s.values.tolist(), index=s.index)
        .stack()
        .reset_index(level=4, drop=True)
        .reset_index(name='d-separated')
        .replace('NANval', np.nan)
      )

Finally, if necessary, convert the tuples in column d back to lists.
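The code for that last conversion is not shown in this answer. A minimal sketch of one way to do it, assuming a hypothetical helper named `tuple_to_list` and a small stand-in Series in place of the real d column:

```python
import pandas as pd

def tuple_to_list(value):
    """Turn tuples back into lists; leave every other value unchanged."""
    return list(value) if isinstance(value, tuple) else value

# Stand-in for column 'd' after the reshaping step (assumed values).
d = pd.Series([float('nan'), (1, 2, 3, 4), '1234'], dtype=object)
restored = d.apply(tuple_to_list)
print(restored.tolist())
```

On the real frame this would presumably be applied as `df['d'] = df['d'].apply(tuple_to_list)`.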

Answer 2: (score: 0)

First expand the DataFrame to the required size, repeating each row as many times as needed:

df1 = df.loc[df.index.repeat([len(x) if isinstance(x,list) else 1 for x in df.d])]
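To see what the repeat step does, the sketch below rebuilds only column d as a standalone Series (an assumption made to keep the example short) and prints the per-row counts and the repeated index labels that `.loc` then selects:

```python
import numpy as np
import pandas as pd

d = pd.Series([np.nan, [1, 2, 3, 4], '1234', np.nan, [10, 20, 30, 40], '00012'])

# One copy per list element; scalars, strings and NaN count as a single row.
counts = [len(x) if isinstance(x, list) else 1 for x in d]
print(counts)                           # [1, 4, 1, 1, 4, 1]

# Index.repeat turns those counts into the row labels used for selection.
print(d.index.repeat(counts).tolist())  # [0, 1, 1, 1, 1, 2, 3, 4, 4, 4, 4, 5]
```

Strings are deliberately treated as scalars here, which is why '00012' stays on a single row.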

Now unnest column d and concatenate it with df1 from above:

d_sep= pd.DataFrame({'d_Sep':sum([x if isinstance(x,list) else [x] for x in df.d],[])})

df2 = pd.concat([df1.reset_index(drop=True),d_sep],axis=1)

   a   b   c                 d  d_Sep
0   xx  aa  av               NaN    NaN
1   pp  as  ka      [1, 2, 3, 4]      1
2   pp  as  ka      [1, 2, 3, 4]      2
3   pp  as  ka      [1, 2, 3, 4]      3
4   pp  as  ka      [1, 2, 3, 4]      4
5   pa  aj   q              1234   1234
6   xq  aq  aq               NaN    NaN
7   pn  an  kn  [10, 20, 30, 40]     10
8   pn  an  kn  [10, 20, 30, 40]     20
9   pn  an  kn  [10, 20, 30, 40]     30
10  pn  an  kn  [10, 20, 30, 40]     40
11  px  ax  kx             00012  00012
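As an alternative that none of the answers above use: pandas 0.25 and later (released after this question was asked) provide `DataFrame.explode`, which leaves scalars, strings and NaN unchanged and gives each list element its own row. A minimal sketch, assuming such a pandas version and the string variant of the sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['xx', 'pp', 'pa', 'xq', 'pn', 'px'],
                   'b': ['aa', 'as', 'aj', 'aq', 'an', 'ax'],
                   'c': ['av', 'ka', 'q', 'aq', 'kn', 'kx'],
                   'd': [np.nan, [1, 2, 3, 4], '1234', np.nan, [10, 20, 30, 40], '00012']})

out = (df.assign(**{'d-separated': df['d']})   # copy d so the original column is kept
         .explode('d-separated')               # one row per list element
         .reset_index(drop=True))              # renumber the rows 0..11
print(out)
```

Copying d into a second column before exploding keeps the original lists alongside the separated values, matching the layout requested in the question.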