Pandas内部合并产生新的重复行

时间:2017-08-05 00:47:51

标签: python pandas

我有一个最佳女演员数据框,如下所示。

Year    Ceremony    Award       Winner  Name            Film        Ceremony_Date
1927/1928   1   Best Actress    NaN Louise Dresser  A Ship Comes In 1929-05-16
1927/1928   1   Best Actress    1.0 Janet Gaynor    7th Heaven      1929-05-16
1927/1928   1   Best Actress    NaN Gloria Swanson  Sadie Thompson  1929-05-16
1928/1929   2   Best Actress    NaN Ruth Chatterton Madame X        1930-04-03
1928/1929   2   Best Actress    NaN Betty Compson   The Barker      1930-04-03

我与以下最佳女演员进行了内部联接(合并)'名称列上的出生日期数据框,因为我希望上面的数据框也有出生日期信息。

    Name           DOB
0   Janet Gaynor    1906-10-06
1   Louise Dresser  1878-10-17
2   Gloria Swanson  1899-03-27
3   Mary Pickford   1892-04-08
4   Ruth Chatterton 1892-12-24
5   Betty Compson   1897-03-19 

编辑 - > ba_dob = pd.merge(ba, df_birthdays, how='inner', on='Name')

结果是一个包含重复行的数据框。例如(见下文),梅丽尔斯特里普一次获得一部电影的提名,并且该唱片(加入后)神秘地复制了不合时宜的次数。我认为内部联接只是将出生日期与两个数据框之间的名称列上匹配的名称相关联,而不是重新复制整个记录。我尝试了一个左边连接,最好的女演员表作为左表,并得到了类似的重复记录。任何洞察到正在发生的事情将不胜感激。

Year    Ceremony    Award   Winner  Name    Film       Ceremony_Date    DOB
    1102    1981    54  Best Actress    NaN Meryl Streep    The French Lieutenant's Woman   1982-03-29  1949-06-22
    1103    1981    54  Best Actress    NaN Meryl Streep    The French Lieutenant's Woman   1982-03-29  1949-06-22
    1104    1981    54  Best Actress    NaN Meryl Streep    The French Lieutenant's Woman   1982-03-29  1949-06-22
    1105    1981    54  Best Actress    NaN Meryl Streep    The French Lieutenant's Woman   1982-03-29  1949-06-22
    1106    1981    54  Best Actress    NaN Meryl Streep    The French Lieutenant's Woman   1982-03-29  1949-06-22
    1107    1981    54  Best Actress    NaN Meryl Streep    The French Lieutenant's Woman   1982-03-29  1949-06-22
    1108    1981    54  Best Actress    NaN Meryl Streep    The French Lieutenant's Woman   1982-03-29  1949-06-22

编辑以下是dict格式的上述数据框头(按要求):

最佳女主角

{'Award': {2: 'Best Actress',
  3: 'Best Actress',
  4: 'Best Actress',
  40: 'Best Actress',
  41: 'Best Actress'},
 'Ceremony': {2: 1, 3: 1, 4: 1, 40: 2, 41: 2},
 'Ceremony_Date': {2: Timestamp('1929-05-16 00:00:00'),
  3: Timestamp('1929-05-16 00:00:00'),
  4: Timestamp('1929-05-16 00:00:00'),
  40: Timestamp('1930-04-03 00:00:00'),
  41: Timestamp('1930-04-03 00:00:00')},
 'Film': {2: 'A Ship Comes In',
  3: '7th Heaven',
  4: 'Sadie Thompson',
  40: 'Madame X',
  41: 'The Barker'},
 'Name': {2: 'Louise Dresser',
  3: 'Janet Gaynor',
  4: 'Gloria Swanson',
  40: 'Ruth Chatterton',
  41: 'Betty Compson'},
 'Winner': {2: nan, 3: 1.0, 4: nan, 40: nan, 41: nan},
 'Year': {2: '1927/1928',
  3: '1927/1928',
  4: '1927/1928',
  40: '1928/1929',
  41: '1928/1929'}}

出生日期

{'DOB': {0: Timestamp('1906-10-06 00:00:00'),
  1: Timestamp('1878-10-17 00:00:00'),
  2: Timestamp('1899-03-27 00:00:00'),
  3: Timestamp('1892-04-08 00:00:00'),
  4: Timestamp('1892-12-24 00:00:00')},
 'Name': {0: 'Janet Gaynor',
  1: 'Louise Dresser',
  2: 'Gloria Swanson',
  3: 'Mary Pickford',
  4: 'Ruth Chatterton'}}

合并(内部联接)数据框

{'Award': {0: 'Best Actress',
  1: 'Best Actress',
  2: 'Best Actress',
  3: 'Best Actress',
  4: 'Best Actress'},
 'Ceremony': {0: 1, 1: 1, 2: 1, 3: 10, 4: 10},
 'Ceremony_Date': {0: Timestamp('1929-05-16 00:00:00'),
  1: Timestamp('1929-05-16 00:00:00'),
  2: Timestamp('1929-05-16 00:00:00'),
  3: Timestamp('1938-03-10 00:00:00'),
  4: Timestamp('1938-03-10 00:00:00')},
 'DOB': {0: Timestamp('1878-10-17 00:00:00'),
  1: Timestamp('1906-10-06 00:00:00'),
  2: Timestamp('1906-10-06 00:00:00'),
  3: Timestamp('1906-10-06 00:00:00'),
  4: Timestamp('1906-10-06 00:00:00')},
 'Film': {0: 'A Ship Comes In',
  1: '7th Heaven',
  2: '7th Heaven',
  3: 'A Star Is Born',
  4: 'A Star Is Born'},
 'Name': {0: 'Louise Dresser',
  1: 'Janet Gaynor',
  2: 'Janet Gaynor',
  3: 'Janet Gaynor',
  4: 'Janet Gaynor'},
 'Winner': {0: nan, 1: 1.0, 2: 1.0, 3: nan, 4: nan},
 'Year': {0: '1927/1928',
  1: '1927/1928',
  2: '1927/1928',
  3: '1937',
  4: '1937'}}

修改

Meryl Streep从最佳女演员数据框架中获得

{'Award': {5957: 'Best Actress',
  6061: 'Best Actress',
  6172: 'Best Actress',
  6389: 'Best Actress',
  6606: 'Best Actress',
  6708: 'Best Actress',
  6922: 'Best Actress',
  7483: 'Best Actress',
  7835: 'Best Actress',
  7950: 'Best Actress',
  8748: 'Best Actress',
  8983: 'Best Actress',
  9098: 'Best Actress',
  9347: 'Best Actress',
  9599: 'Best Actress'},
 'Ceremony': {5957: 54,
  6061: 55,
  6172: 56,
  6389: 58,
  6606: 60,
  6708: 61,
  6922: 63,
  7483: 68,
  7835: 71,
  7950: 72,
  8748: 79,
  8983: 81,
  9098: 82,
  9347: 84,
  9599: 86},
 'Ceremony_Date': {5957: Timestamp('1982-03-29 00:00:00'),
  6061: Timestamp('1983-04-11 00:00:00'),
  6172: Timestamp('1984-04-09 00:00:00'),
  6389: Timestamp('1986-03-24 00:00:00'),
  6606: Timestamp('1988-04-11 00:00:00'),
  6708: Timestamp('1989-03-29 00:00:00'),
  6922: Timestamp('1991-03-25 00:00:00'),
  7483: Timestamp('1996-03-25 00:00:00'),
  7835: Timestamp('1999-03-21 00:00:00'),
  7950: Timestamp('2000-03-26 00:00:00'),
  8748: Timestamp('2007-02-25 00:00:00'),
  8983: Timestamp('2009-02-22 00:00:00'),
  9098: Timestamp('2010-03-07 00:00:00'),
  9347: Timestamp('2012-02-26 00:00:00'),
  9599: Timestamp('2014-03-02 00:00:00')},
 'Film': {5957: "The French Lieutenant's Woman",
  6061: "Sophie's Choice",
  6172: 'Silkwood',
  6389: 'Out of Africa',
  6606: 'Ironweed',
  6708: 'A Cry in the Dark',
  6922: 'Postcards from the Edge',
  7483: 'The Bridges of Madison County',
  7835: 'One True Thing',
  7950: 'Music of the Heart',
  8748: 'The Devil Wears Prada',
  8983: 'Doubt',
  9098: 'Julie & Julia',
  9347: 'The Iron Lady',
  9599: 'August: Osage County'},
 'Name': {5957: 'Meryl Streep',
  6061: 'Meryl Streep',
  6172: 'Meryl Streep',
  6389: 'Meryl Streep',
  6606: 'Meryl Streep',
  6708: 'Meryl Streep',
  6922: 'Meryl Streep',
  7483: 'Meryl Streep',
  7835: 'Meryl Streep',
  7950: 'Meryl Streep',
  8748: 'Meryl Streep',
  8983: 'Meryl Streep',
  9098: 'Meryl Streep',
  9347: 'Meryl Streep',
  9599: 'Meryl Streep'},
 'Winner': {5957: nan,
  6061: 1.0,
  6172: nan,
  6389: nan,
  6606: nan,
  6708: nan,
  6922: nan,
  7483: nan,
  7835: nan,
  7950: nan,
  8748: nan,
  8983: nan,
  9098: nan,
  9347: 1.0,
  9599: nan},
 'Year': {5957: '1981',
  6061: '1982',
  6172: '1983',
  6389: '1985',
  6606: '1987',
  6708: '1988',
  6922: '1990',
  7483: '1995',
  7835: '1998',
  7950: '1999',
  8748: '2006',
  8983: '2008',
  9098: '2009',
  9347: '2011',
  9599: '2013'}}

3 个答案:

答案 0 :(得分:1)

当我跑步时,我没有重复:

df = pd.merge(df1, df2, how='inner', on='Name')

打印:

print(json.dumps(json.loads(df.to_json()),indent=4))

我明白了:

{
    "Winner": {
        "0": null,
        "1": 1.0,
        "3": null,
        "2": null
    },
    "Ceremony": {
        "0": 1,
        "1": 1,
        "3": 2,
        "2": 1
    },
    "Year": {
        "0": "1927/1928",
        "1": "1927/1928",
        "3": "1928/1929",
        "2": "1927/1928"
    },
    "Film": {
        "0": "A Ship Comes In",
        "1": "7th Heaven",
        "3": "Madame X",
        "2": "Sadie Thompson"
    },
    "Name": {
        "0": "Louise Dresser",
        "1": "Janet Gaynor",
        "3": "Ruth Chatterton",
        "2": "Gloria Swanson"
    },
    "Award": {
        "0": "Best Actress",
        "1": "Best Actress",
        "3": "Best Actress",
        "2": "Best Actress"
    },
    "DOB": {
        "0": -2878243200000,
        "1": -1995667200000,
        "3": -2430518400000,
        "2": -2233180800000
    },
    "Ceremony_Date": {
        "0": -1282176000000,
        "1": -1282176000000,
        "3": -1254355200000,
        "2": -1282176000000
    }
}



<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>Award</th>      <th>Ceremony</th>      <th>Ceremony_Date</th>      <th>Film</th>      <th>Name</th>      <th>Winner</th>      <th>Year</th>      <th>DOB</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>Best Actress</td>      <td>1</td>      <td>1929-05-16</td>      <td>A Ship Comes In</td>      <td>Louise Dresser</td>      <td>NaN</td>      <td>1927/1928</td>      <td>1878-10-17</td>    </tr>    <tr>      <th>1</th>      <td>Best Actress</td>      <td>1</td>      <td>1929-05-16</td>      <td>7th Heaven</td>      <td>Janet Gaynor</td>      <td>1.0</td>      <td>1927/1928</td>      <td>1906-10-06</td>    </tr>    <tr>      <th>2</th>      <td>Best Actress</td>      <td>1</td>      <td>1929-05-16</td>      <td>Sadie Thompson</td>      <td>Gloria Swanson</td>      <td>NaN</td>      <td>1927/1928</td>      <td>1899-03-27</td>    </tr>    <tr>      <th>3</th>      <td>Best Actress</td>      <td>2</td>      <td>1930-04-03</td>      <td>Madame X</td>      <td>Ruth Chatterton</td>      <td>NaN</td>      <td>1928/1929</td>      <td>1892-12-24</td>    </tr>  </tbody></table>
&#13;
&#13;
&#13;

df = pd.merge(df1, df2, how='outer', on='Name')

&#13;
&#13;
<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>Award</th>      <th>Ceremony</th>      <th>Ceremony_Date</th>      <th>Film</th>      <th>Name</th>      <th>Winner</th>      <th>Year</th>      <th>DOB</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>Best Actress</td>      <td>1.0</td>      <td>1929-05-16</td>      <td>A Ship Comes In</td>      <td>Louise Dresser</td>      <td>NaN</td>      <td>1927/1928</td>      <td>1878-10-17</td>    </tr>    <tr>      <th>1</th>      <td>Best Actress</td>      <td>1.0</td>      <td>1929-05-16</td>      <td>7th Heaven</td>      <td>Janet Gaynor</td>      <td>1.0</td>      <td>1927/1928</td>      <td>1906-10-06</td>    </tr>    <tr>      <th>2</th>      <td>Best Actress</td>      <td>1.0</td>      <td>1929-05-16</td>      <td>Sadie Thompson</td>      <td>Gloria Swanson</td>      <td>NaN</td>      <td>1927/1928</td>      <td>1899-03-27</td>    </tr>    <tr>      <th>3</th>      <td>Best Actress</td>      <td>2.0</td>      <td>1930-04-03</td>      <td>Madame X</td>      <td>Ruth Chatterton</td>      <td>NaN</td>      <td>1928/1929</td>      <td>1892-12-24</td>    </tr>    <tr>      <th>4</th>      <td>Best Actress</td>      <td>2.0</td>      <td>1930-04-03</td>      <td>The Barker</td>      <td>Betty Compson</td>      <td>NaN</td>      <td>1928/1929</td>      <td>NaT</td>    </tr>    <tr>      <th>5</th>      <td>NaN</td>      <td>NaN</td>      <td>NaT</td>      <td>NaN</td>      <td>Mary Pickford</td>      <td>NaN</td>      <td>NaN</td>      <td>1892-04-08</td>    </tr>  </tbody></table>
&#13;
&#13;
&#13;

答案 1 :(得分:1)

使用drop_duplicates()方法。例如:

merged_df = pd.merge(left_df, right_df, on='common_key', how='inner').drop_duplicates()

答案 2 :(得分:0)

为了回答我自己的问题,请允许我查明问题所在并找出解决方法。

如上所述,最佳女演员数据框ba很好。实际上,所有数据帧都是有序的。最初的问题是关于如何执行内连接以及可能出错的方式(即,提示合并过程创建重复记录)。

Meryl Streep将成为我们的指南,如上所述。在数据集中,她获得了16项最佳女演员提名(对于任何保持分数的人,数据不包括她最近的提名)。当在ba和出生日期DOB,数据帧之间执行内连接时,她被提名的每部电影都重复了16次,这不是我想要的结果(见错误的结果)以上)。事实上,她的姓名和出生日期在DOB数据框中出现了16次。这与我编写的scrape代码一致,而不是意外的结果或bug。

当我在两个框架之间进行内部连接时,我(错误地)认为Pandas会看到她的提名,例如“朱莉和朱莉娅”,与她的生日匹配一次,并完成它。显然,内连接意味着如果两个表中的连接列都匹配,则每行将匹配可能的最大次数。因此,对于每一部电影,合并后的表格有16个记录(其中一个是她最佳女演员提名的16个,这相当于她的生日出现在被提名生日的网页上,成为数据帧的次数)。我不确定这是否正确,但它描述了我在前面看到的内容。我欢迎澄清这一点。

解决方法只是从DOB数据框中删除重复的名称并重新合并。这是代码和输出,使用Meryl作为示例。

ba_dob_revised = df_birthdays.drop_duplicates('Name')
ba_dob = pd.merge(ba, ba_dob_revised, on='Name')
ba_dob[ba_dob.Name=="Meryl Streep"] 

{'Award': {282: 'Best Actress',
  283: 'Best Actress',
  284: 'Best Actress',
  285: 'Best Actress',
  286: 'Best Actress',
  287: 'Best Actress',
  288: 'Best Actress',
  289: 'Best Actress',
  290: 'Best Actress',
  291: 'Best Actress',
  292: 'Best Actress',
  293: 'Best Actress',
  294: 'Best Actress',
  295: 'Best Actress',
  296: 'Best Actress'},
 'Ceremony': {282: 54,
  283: 55,
  284: 56,
  285: 58,
  286: 60,
  287: 61,
  288: 63,
  289: 68,
  290: 71,
  291: 72,
  292: 79,
  293: 81,
  294: 82,
  295: 84,
  296: 86},
 'Ceremony_Date': {282: Timestamp('1982-03-29 00:00:00'),
  283: Timestamp('1983-04-11 00:00:00'),
  284: Timestamp('1984-04-09 00:00:00'),
  285: Timestamp('1986-03-24 00:00:00'),
  286: Timestamp('1988-04-11 00:00:00'),
  287: Timestamp('1989-03-29 00:00:00'),
  288: Timestamp('1991-03-25 00:00:00'),
  289: Timestamp('1996-03-25 00:00:00'),
  290: Timestamp('1999-03-21 00:00:00'),
  291: Timestamp('2000-03-26 00:00:00'),
  292: Timestamp('2007-02-25 00:00:00'),
  293: Timestamp('2009-02-22 00:00:00'),
  294: Timestamp('2010-03-07 00:00:00'),
  295: Timestamp('2012-02-26 00:00:00'),
  296: Timestamp('2014-03-02 00:00:00')},
 'DOB': {282: Timestamp('1949-06-22 00:00:00'),
  283: Timestamp('1949-06-22 00:00:00'),
  284: Timestamp('1949-06-22 00:00:00'),
  285: Timestamp('1949-06-22 00:00:00'),
  286: Timestamp('1949-06-22 00:00:00'),
  287: Timestamp('1949-06-22 00:00:00'),
  288: Timestamp('1949-06-22 00:00:00'),
  289: Timestamp('1949-06-22 00:00:00'),
  290: Timestamp('1949-06-22 00:00:00'),
  291: Timestamp('1949-06-22 00:00:00'),
  292: Timestamp('1949-06-22 00:00:00'),
  293: Timestamp('1949-06-22 00:00:00'),
  294: Timestamp('1949-06-22 00:00:00'),
  295: Timestamp('1949-06-22 00:00:00'),
  296: Timestamp('1949-06-22 00:00:00')},
 'Film': {282: "The French Lieutenant's Woman",
  283: "Sophie's Choice",
  284: 'Silkwood',
  285: 'Out of Africa',
  286: 'Ironweed',
  287: 'A Cry in the Dark',
  288: 'Postcards from the Edge',
  289: 'The Bridges of Madison County',
  290: 'One True Thing',
  291: 'Music of the Heart',
  292: 'The Devil Wears Prada',
  293: 'Doubt',
  294: 'Julie & Julia',
  295: 'The Iron Lady',
  296: 'August: Osage County'},
 'Name': {282: 'Meryl Streep',
  283: 'Meryl Streep',
  284: 'Meryl Streep',
  285: 'Meryl Streep',
  286: 'Meryl Streep',
  287: 'Meryl Streep',
  288: 'Meryl Streep',
  289: 'Meryl Streep',
  290: 'Meryl Streep',
  291: 'Meryl Streep',
  292: 'Meryl Streep',
  293: 'Meryl Streep',
  294: 'Meryl Streep',
  295: 'Meryl Streep',
  296: 'Meryl Streep'},
 'Winner': {282: nan,
  283: 1.0,
  284: nan,
  285: nan,
  286: nan,
  287: nan,
  288: nan,
  289: nan,
  290: nan,
  291: nan,
  292: nan,
  293: nan,
  294: nan,
  295: 1.0,
  296: nan},
 'Year': {282: '1981',
  283: '1982',
  284: '1983',
  285: '1985',
  286: '1987',
  287: '1988',
  288: '1990',
  289: '1995',
  290: '1998',
  291: '1999',
  292: '2006',
  293: '2008',
  294: '2009',
  295: '2011',
  296: '2013'}}

关键点:虽然内连接是合适的(切换连接类型肯定没有解决问题),但我没有充分考虑机器/ Pandas如何考虑内部连接。最终,确定错误结果中的模式并在其中一个数据框中找到类似的模式,这两个数据框都已经过检查过错误,证明是最有帮助的。