I have a Best Actress dataframe that looks like this.
Year Ceremony Award Winner Name Film Ceremony_Date
1927/1928 1 Best Actress NaN Louise Dresser A Ship Comes In 1929-05-16
1927/1928 1 Best Actress 1.0 Janet Gaynor 7th Heaven 1929-05-16
1927/1928 1 Best Actress NaN Gloria Swanson Sadie Thompson 1929-05-16
1928/1929 2 Best Actress NaN Ruth Chatterton Madame X 1930-04-03
1928/1929 2 Best Actress NaN Betty Compson The Barker 1930-04-03
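I did an inner join (merge) on the 'Name' column with the following Best Actress date-of-birth dataframe, because I want the dataframe above to carry date-of-birth information as well.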
Name DOB
0 Janet Gaynor 1906-10-06
1 Louise Dresser 1878-10-17
2 Gloria Swanson 1899-03-27
3 Mary Pickford 1892-04-08
4 Ruth Chatterton 1892-12-24
5 Betty Compson 1897-03-19
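Edit: ba_dob = pd.merge(ba, df_birthdays, how='inner', on='Name')
The result is a dataframe with duplicated rows. For example (see below), Meryl Streep was nominated once for a given film, yet after the join that record is mysteriously repeated many times over. I thought an inner join would simply attach the date of birth to each row whose Name matches between the two dataframes, not replicate the entire record. I also tried a left join, with the Best Actress table as the left table, and got similar duplicated records. Any insight into what is happening would be appreciated.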
Year Ceremony Award Winner Name Film Ceremony_Date DOB
1102 1981 54 Best Actress NaN Meryl Streep The French Lieutenant's Woman 1982-03-29 1949-06-22
1103 1981 54 Best Actress NaN Meryl Streep The French Lieutenant's Woman 1982-03-29 1949-06-22
1104 1981 54 Best Actress NaN Meryl Streep The French Lieutenant's Woman 1982-03-29 1949-06-22
1105 1981 54 Best Actress NaN Meryl Streep The French Lieutenant's Woman 1982-03-29 1949-06-22
1106 1981 54 Best Actress NaN Meryl Streep The French Lieutenant's Woman 1982-03-29 1949-06-22
1107 1981 54 Best Actress NaN Meryl Streep The French Lieutenant's Woman 1982-03-29 1949-06-22
1108 1981 54 Best Actress NaN Meryl Streep The French Lieutenant's Woman 1982-03-29 1949-06-22
Edit: here are the heads of the dataframes above in dict format (as requested):
Best Actress
{'Award': {2: 'Best Actress',
3: 'Best Actress',
4: 'Best Actress',
40: 'Best Actress',
41: 'Best Actress'},
'Ceremony': {2: 1, 3: 1, 4: 1, 40: 2, 41: 2},
'Ceremony_Date': {2: Timestamp('1929-05-16 00:00:00'),
3: Timestamp('1929-05-16 00:00:00'),
4: Timestamp('1929-05-16 00:00:00'),
40: Timestamp('1930-04-03 00:00:00'),
41: Timestamp('1930-04-03 00:00:00')},
'Film': {2: 'A Ship Comes In',
3: '7th Heaven',
4: 'Sadie Thompson',
40: 'Madame X',
41: 'The Barker'},
'Name': {2: 'Louise Dresser',
3: 'Janet Gaynor',
4: 'Gloria Swanson',
40: 'Ruth Chatterton',
41: 'Betty Compson'},
'Winner': {2: nan, 3: 1.0, 4: nan, 40: nan, 41: nan},
'Year': {2: '1927/1928',
3: '1927/1928',
4: '1927/1928',
40: '1928/1929',
41: '1928/1929'}}
Dates of birth
{'DOB': {0: Timestamp('1906-10-06 00:00:00'),
1: Timestamp('1878-10-17 00:00:00'),
2: Timestamp('1899-03-27 00:00:00'),
3: Timestamp('1892-04-08 00:00:00'),
4: Timestamp('1892-12-24 00:00:00')},
'Name': {0: 'Janet Gaynor',
1: 'Louise Dresser',
2: 'Gloria Swanson',
3: 'Mary Pickford',
4: 'Ruth Chatterton'}}
Merged (inner join) dataframe
{'Award': {0: 'Best Actress',
1: 'Best Actress',
2: 'Best Actress',
3: 'Best Actress',
4: 'Best Actress'},
'Ceremony': {0: 1, 1: 1, 2: 1, 3: 10, 4: 10},
'Ceremony_Date': {0: Timestamp('1929-05-16 00:00:00'),
1: Timestamp('1929-05-16 00:00:00'),
2: Timestamp('1929-05-16 00:00:00'),
3: Timestamp('1938-03-10 00:00:00'),
4: Timestamp('1938-03-10 00:00:00')},
'DOB': {0: Timestamp('1878-10-17 00:00:00'),
1: Timestamp('1906-10-06 00:00:00'),
2: Timestamp('1906-10-06 00:00:00'),
3: Timestamp('1906-10-06 00:00:00'),
4: Timestamp('1906-10-06 00:00:00')},
'Film': {0: 'A Ship Comes In',
1: '7th Heaven',
2: '7th Heaven',
3: 'A Star Is Born',
4: 'A Star Is Born'},
'Name': {0: 'Louise Dresser',
1: 'Janet Gaynor',
2: 'Janet Gaynor',
3: 'Janet Gaynor',
4: 'Janet Gaynor'},
'Winner': {0: nan, 1: 1.0, 2: 1.0, 3: nan, 4: nan},
'Year': {0: '1927/1928',
1: '1927/1928',
2: '1927/1928',
3: '1937',
4: '1937'}}
Edit
Meryl Streep's rows from the Best Actress dataframe
{'Award': {5957: 'Best Actress',
6061: 'Best Actress',
6172: 'Best Actress',
6389: 'Best Actress',
6606: 'Best Actress',
6708: 'Best Actress',
6922: 'Best Actress',
7483: 'Best Actress',
7835: 'Best Actress',
7950: 'Best Actress',
8748: 'Best Actress',
8983: 'Best Actress',
9098: 'Best Actress',
9347: 'Best Actress',
9599: 'Best Actress'},
'Ceremony': {5957: 54,
6061: 55,
6172: 56,
6389: 58,
6606: 60,
6708: 61,
6922: 63,
7483: 68,
7835: 71,
7950: 72,
8748: 79,
8983: 81,
9098: 82,
9347: 84,
9599: 86},
'Ceremony_Date': {5957: Timestamp('1982-03-29 00:00:00'),
6061: Timestamp('1983-04-11 00:00:00'),
6172: Timestamp('1984-04-09 00:00:00'),
6389: Timestamp('1986-03-24 00:00:00'),
6606: Timestamp('1988-04-11 00:00:00'),
6708: Timestamp('1989-03-29 00:00:00'),
6922: Timestamp('1991-03-25 00:00:00'),
7483: Timestamp('1996-03-25 00:00:00'),
7835: Timestamp('1999-03-21 00:00:00'),
7950: Timestamp('2000-03-26 00:00:00'),
8748: Timestamp('2007-02-25 00:00:00'),
8983: Timestamp('2009-02-22 00:00:00'),
9098: Timestamp('2010-03-07 00:00:00'),
9347: Timestamp('2012-02-26 00:00:00'),
9599: Timestamp('2014-03-02 00:00:00')},
'Film': {5957: "The French Lieutenant's Woman",
6061: "Sophie's Choice",
6172: 'Silkwood',
6389: 'Out of Africa',
6606: 'Ironweed',
6708: 'A Cry in the Dark',
6922: 'Postcards from the Edge',
7483: 'The Bridges of Madison County',
7835: 'One True Thing',
7950: 'Music of the Heart',
8748: 'The Devil Wears Prada',
8983: 'Doubt',
9098: 'Julie & Julia',
9347: 'The Iron Lady',
9599: 'August: Osage County'},
'Name': {5957: 'Meryl Streep',
6061: 'Meryl Streep',
6172: 'Meryl Streep',
6389: 'Meryl Streep',
6606: 'Meryl Streep',
6708: 'Meryl Streep',
6922: 'Meryl Streep',
7483: 'Meryl Streep',
7835: 'Meryl Streep',
7950: 'Meryl Streep',
8748: 'Meryl Streep',
8983: 'Meryl Streep',
9098: 'Meryl Streep',
9347: 'Meryl Streep',
9599: 'Meryl Streep'},
'Winner': {5957: nan,
6061: 1.0,
6172: nan,
6389: nan,
6606: nan,
6708: nan,
6922: nan,
7483: nan,
7835: nan,
7950: nan,
8748: nan,
8983: nan,
9098: nan,
9347: 1.0,
9599: nan},
'Year': {5957: '1981',
6061: '1982',
6172: '1983',
6389: '1985',
6606: '1987',
6708: '1988',
6922: '1990',
7483: '1995',
7835: '1998',
7950: '1999',
8748: '2006',
8983: '2008',
9098: '2009',
9347: '2011',
9599: '2013'}}
Answer 0 (score: 1)
When I run it, I get no duplicates:
df = pd.merge(df1, df2, how='inner', on='Name')
Printing:
print(json.dumps(json.loads(df.to_json()),indent=4))
I get:
{
"Winner": {
"0": null,
"1": 1.0,
"3": null,
"2": null
},
"Ceremony": {
"0": 1,
"1": 1,
"3": 2,
"2": 1
},
"Year": {
"0": "1927/1928",
"1": "1927/1928",
"3": "1928/1929",
"2": "1927/1928"
},
"Film": {
"0": "A Ship Comes In",
"1": "7th Heaven",
"3": "Madame X",
"2": "Sadie Thompson"
},
"Name": {
"0": "Louise Dresser",
"1": "Janet Gaynor",
"3": "Ruth Chatterton",
"2": "Gloria Swanson"
},
"Award": {
"0": "Best Actress",
"1": "Best Actress",
"3": "Best Actress",
"2": "Best Actress"
},
"DOB": {
"0": -2878243200000,
"1": -1995667200000,
"3": -2430518400000,
"2": -2233180800000
},
"Ceremony_Date": {
"0": -1282176000000,
"1": -1282176000000,
"3": -1254355200000,
"2": -1282176000000
}
}
<table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Award</th> <th>Ceremony</th> <th>Ceremony_Date</th> <th>Film</th> <th>Name</th> <th>Winner</th> <th>Year</th> <th>DOB</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>Best Actress</td> <td>1</td> <td>1929-05-16</td> <td>A Ship Comes In</td> <td>Louise Dresser</td> <td>NaN</td> <td>1927/1928</td> <td>1878-10-17</td> </tr> <tr> <th>1</th> <td>Best Actress</td> <td>1</td> <td>1929-05-16</td> <td>7th Heaven</td> <td>Janet Gaynor</td> <td>1.0</td> <td>1927/1928</td> <td>1906-10-06</td> </tr> <tr> <th>2</th> <td>Best Actress</td> <td>1</td> <td>1929-05-16</td> <td>Sadie Thompson</td> <td>Gloria Swanson</td> <td>NaN</td> <td>1927/1928</td> <td>1899-03-27</td> </tr> <tr> <th>3</th> <td>Best Actress</td> <td>2</td> <td>1930-04-03</td> <td>Madame X</td> <td>Ruth Chatterton</td> <td>NaN</td> <td>1928/1929</td> <td>1892-12-24</td> </tr> </tbody></table>
and
df = pd.merge(df1, df2, how='outer', on='Name')
<table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Award</th> <th>Ceremony</th> <th>Ceremony_Date</th> <th>Film</th> <th>Name</th> <th>Winner</th> <th>Year</th> <th>DOB</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>Best Actress</td> <td>1.0</td> <td>1929-05-16</td> <td>A Ship Comes In</td> <td>Louise Dresser</td> <td>NaN</td> <td>1927/1928</td> <td>1878-10-17</td> </tr> <tr> <th>1</th> <td>Best Actress</td> <td>1.0</td> <td>1929-05-16</td> <td>7th Heaven</td> <td>Janet Gaynor</td> <td>1.0</td> <td>1927/1928</td> <td>1906-10-06</td> </tr> <tr> <th>2</th> <td>Best Actress</td> <td>1.0</td> <td>1929-05-16</td> <td>Sadie Thompson</td> <td>Gloria Swanson</td> <td>NaN</td> <td>1927/1928</td> <td>1899-03-27</td> </tr> <tr> <th>3</th> <td>Best Actress</td> <td>2.0</td> <td>1930-04-03</td> <td>Madame X</td> <td>Ruth Chatterton</td> <td>NaN</td> <td>1928/1929</td> <td>1892-12-24</td> </tr> <tr> <th>4</th> <td>Best Actress</td> <td>2.0</td> <td>1930-04-03</td> <td>The Barker</td> <td>Betty Compson</td> <td>NaN</td> <td>1928/1929</td> <td>NaT</td> </tr> <tr> <th>5</th> <td>NaN</td> <td>NaN</td> <td>NaT</td> <td>NaN</td> <td>Mary Pickford</td> <td>NaN</td> <td>NaN</td> <td>1892-04-08</td> </tr> </tbody></table>
Answer 1 (score: 1)
Use the drop_duplicates() method. For example:
merged_df = pd.merge(left_df, right_df, on='common_key', how='inner').drop_duplicates()
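A minimal sketch of what this does, using small hypothetical frames (left_df, right_df and common_key are the placeholder names from the line above; the values are invented for illustration). Note that drop_duplicates() only removes rows that are identical in every column, so it helps when the repeated right-hand rows carry the same values, and it does not help when they conflict:

```python
import pandas as pd

# Hypothetical data: key "a" appears twice in the right frame with identical values.
left_df = pd.DataFrame({"common_key": ["a", "b"], "x": [1, 2]})
right_df = pd.DataFrame({"common_key": ["a", "a", "b"], "y": [10, 10, 20]})

# The merge pairs the single left "a" row with both right "a" rows.
merged_df = pd.merge(left_df, right_df, on="common_key", how="inner")

# The two "a" rows are identical in every column, so one of them is dropped.
deduped_df = merged_df.drop_duplicates()

# If the repeated keys carry different values, nothing is identical and nothing is dropped.
right_conflict = pd.DataFrame({"common_key": ["a", "a"], "y": [10, 11]})
still_two = pd.merge(left_df, right_conflict, on="common_key", how="inner").drop_duplicates()

print(len(merged_df), len(deduped_df), len(still_two))
```

Deduplicating the lookup table before the merge avoids building the larger intermediate result in the first place.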
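Answer 2 (score: 0)
To answer my own question, let me pinpoint the problem and describe the workaround.
As noted above, the Best Actress dataframe ba is fine. In fact, all the dataframes are in order. The original question was about how an inner join is performed and how it can go wrong (i.e., what prompts the merge to create duplicate records).
Meryl Streep will be our guide, as above. In the dataset she has 16 Best Actress nominations (for anyone keeping score, the data does not include her most recent nominations). When an inner join is performed between ba and the date-of-birth dataframe, DOB, every film she was nominated for is repeated 16 times, which is not the result I wanted (see the erroneous result above). It turns out that her name and date of birth appear 16 times in the DOB dataframe. That is consistent with the scraping code I wrote, not an unexpected result or a bug.
When I did the inner join between the two frames, I (wrongly) assumed that pandas would see one of her nominations, say "Julie & Julia", match it to her birthday once, and be done. Evidently, an inner join means that whenever the join column matches in both tables, each row is matched as many times as possible. So for each film the merged table has 16 records (one for each of her 16 Best Actress nominations, which equals the number of times her birthday appears on the nominee-birthday web page that became the dataframe). I'm not sure this is exactly right, but it describes what I saw. I welcome clarification on this point.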
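The row multiplication described above can be reproduced on a tiny scale. This is only a sketch with made-up miniature frames (nominations and birthdays are hypothetical stand-ins for ba and the DOB dataframe): every left row is paired with every right row that has the same Name, so the row count per key is the product of its counts in the two frames.

```python
import pandas as pd

# Hypothetical miniature versions of the two frames.
nominations = pd.DataFrame({
    "Name": ["Meryl Streep", "Meryl Streep", "Janet Gaynor"],
    "Film": ["Doubt", "Julie & Julia", "7th Heaven"],
})
birthdays = pd.DataFrame({
    # The lookup table repeats Meryl Streep three times, as a scrape might.
    "Name": ["Meryl Streep"] * 3 + ["Janet Gaynor"],
    "DOB": pd.to_datetime(["1949-06-22"] * 3 + ["1906-10-06"]),
})

merged = pd.merge(nominations, birthdays, how="inner", on="Name")

# 2 Streep nominations x 3 Streep birthday rows + 1 x 1 for Gaynor = 7 rows.
rows_per_film = merged.groupby("Film").size().to_dict()
print(len(merged))
print(rows_per_film)
```

Each Streep film shows up three times here for the same reason each film showed up 16 times in the real merge.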
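The workaround is simply to drop the duplicate names from the DOB dataframe and merge again. Here is the code and the output, using Meryl as the example.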
ba_dob_revised = df_birthdays.drop_duplicates('Name')
ba_dob = pd.merge(ba, ba_dob_revised, on='Name')
ba_dob[ba_dob.Name=="Meryl Streep"]
{'Award': {282: 'Best Actress',
283: 'Best Actress',
284: 'Best Actress',
285: 'Best Actress',
286: 'Best Actress',
287: 'Best Actress',
288: 'Best Actress',
289: 'Best Actress',
290: 'Best Actress',
291: 'Best Actress',
292: 'Best Actress',
293: 'Best Actress',
294: 'Best Actress',
295: 'Best Actress',
296: 'Best Actress'},
'Ceremony': {282: 54,
283: 55,
284: 56,
285: 58,
286: 60,
287: 61,
288: 63,
289: 68,
290: 71,
291: 72,
292: 79,
293: 81,
294: 82,
295: 84,
296: 86},
'Ceremony_Date': {282: Timestamp('1982-03-29 00:00:00'),
283: Timestamp('1983-04-11 00:00:00'),
284: Timestamp('1984-04-09 00:00:00'),
285: Timestamp('1986-03-24 00:00:00'),
286: Timestamp('1988-04-11 00:00:00'),
287: Timestamp('1989-03-29 00:00:00'),
288: Timestamp('1991-03-25 00:00:00'),
289: Timestamp('1996-03-25 00:00:00'),
290: Timestamp('1999-03-21 00:00:00'),
291: Timestamp('2000-03-26 00:00:00'),
292: Timestamp('2007-02-25 00:00:00'),
293: Timestamp('2009-02-22 00:00:00'),
294: Timestamp('2010-03-07 00:00:00'),
295: Timestamp('2012-02-26 00:00:00'),
296: Timestamp('2014-03-02 00:00:00')},
'DOB': {282: Timestamp('1949-06-22 00:00:00'),
283: Timestamp('1949-06-22 00:00:00'),
284: Timestamp('1949-06-22 00:00:00'),
285: Timestamp('1949-06-22 00:00:00'),
286: Timestamp('1949-06-22 00:00:00'),
287: Timestamp('1949-06-22 00:00:00'),
288: Timestamp('1949-06-22 00:00:00'),
289: Timestamp('1949-06-22 00:00:00'),
290: Timestamp('1949-06-22 00:00:00'),
291: Timestamp('1949-06-22 00:00:00'),
292: Timestamp('1949-06-22 00:00:00'),
293: Timestamp('1949-06-22 00:00:00'),
294: Timestamp('1949-06-22 00:00:00'),
295: Timestamp('1949-06-22 00:00:00'),
296: Timestamp('1949-06-22 00:00:00')},
'Film': {282: "The French Lieutenant's Woman",
283: "Sophie's Choice",
284: 'Silkwood',
285: 'Out of Africa',
286: 'Ironweed',
287: 'A Cry in the Dark',
288: 'Postcards from the Edge',
289: 'The Bridges of Madison County',
290: 'One True Thing',
291: 'Music of the Heart',
292: 'The Devil Wears Prada',
293: 'Doubt',
294: 'Julie & Julia',
295: 'The Iron Lady',
296: 'August: Osage County'},
'Name': {282: 'Meryl Streep',
283: 'Meryl Streep',
284: 'Meryl Streep',
285: 'Meryl Streep',
286: 'Meryl Streep',
287: 'Meryl Streep',
288: 'Meryl Streep',
289: 'Meryl Streep',
290: 'Meryl Streep',
291: 'Meryl Streep',
292: 'Meryl Streep',
293: 'Meryl Streep',
294: 'Meryl Streep',
295: 'Meryl Streep',
296: 'Meryl Streep'},
'Winner': {282: nan,
283: 1.0,
284: nan,
285: nan,
286: nan,
287: nan,
288: nan,
289: nan,
290: nan,
291: nan,
292: nan,
293: nan,
294: nan,
295: 1.0,
296: nan},
'Year': {282: '1981',
283: '1982',
284: '1983',
285: '1985',
286: '1987',
287: '1988',
288: '1990',
289: '1995',
290: '1998',
291: '1999',
292: '2006',
293: '2008',
294: '2009',
295: '2011',
296: '2013'}}
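Key takeaway: while an inner join was appropriate (switching join types certainly did not fix the problem), I had not thought carefully enough about how the machine/pandas treats an inner join. In the end, what helped most was identifying the pattern in the erroneous result and finding a similar pattern in one of the dataframes, both of which had previously been checked for errors.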
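One way to catch this kind of pattern earlier, sketched here with small hypothetical stand-ins for ba and df_birthdays: count how often each name occurs in the lookup table, and pass validate='many_to_one' to pd.merge so that pandas raises a MergeError instead of silently multiplying rows when the right-hand keys are not unique.

```python
import pandas as pd

# Hypothetical stand-ins for the real frames.
ba = pd.DataFrame({
    "Name": ["Meryl Streep", "Meryl Streep"],
    "Film": ["Doubt", "Silkwood"],
})
df_birthdays = pd.DataFrame({
    "Name": ["Meryl Streep", "Meryl Streep"],
    "DOB": pd.to_datetime(["1949-06-22"] * 2),
})

# Which names occur more than once in the lookup table?
name_counts = df_birthdays["Name"].value_counts()
repeated = name_counts[name_counts > 1]

# validate="many_to_one" asserts that keys are unique in the right frame.
try:
    pd.merge(ba, df_birthdays, on="Name", how="inner", validate="many_to_one")
    merge_error = None
except pd.errors.MergeError as exc:
    merge_error = exc

# After de-duplicating the lookup table the same merge passes the check.
clean = df_birthdays.drop_duplicates("Name")
ba_dob = pd.merge(ba, clean, on="Name", how="inner", validate="many_to_one")
print(repeated.to_dict(), merge_error is not None, len(ba_dob))
```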