Python / R: Removing duplicate rows - keeping unique author pairs

Time: 2017-11-23 18:15:44

Tags: python r duplicates mapping data-manipulation

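This is a sample I extracted from a database. I am building a visualization of collaborations, so based on this sample I need to keep only one relationship between any two authors. For example, I have to remove either Brian Norton --- Maria Roo Ons or Maria Roo Ons --- Brian Norton so that each relationship is unique.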

-------------------------------------------------------------------------------------------------
|              article_title                                | author_name     |   coauthor_name |
-------------------------------------------------------------------------------------------------
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Brian Norton    | Maria Roo Ons
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Brian Norton    | Max Ammann
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Brian Norton    | S. Shynu
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Brian Norton    | Sarah McCormack
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Maria Roo Ons   | Brian Norton
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Maria Roo Ons   | Max Ammann
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Maria Roo Ons   | S. Shynu
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Maria Roo Ons   | Sarah McCormack
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Max Ammann      | Brian Norton
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Max Ammann      | Maria Roo Ons
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Max Ammann      | S. Shynu
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Max Ammann      | Sarah McCormack
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | S. Shynu        | Brian Norton
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | S. Shynu        | Maria Roo Ons
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | S. Shynu        | Max Ammann
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | S. Shynu        | Sarah McCormack
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Sarah McCormack | Brian Norton
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Sarah McCormack | Maria Roo Ons
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Sarah McCormack | Max Ammann
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Sarah McCormack | S. Shynu
-------------------------------------------------------------------------------------------------

The desired final output is as follows.

-------------------------------------------------------------------------------------------------
|              article_title                                | author_name     |   coauthor_name |
-------------------------------------------------------------------------------------------------
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Brian Norton    | Maria Roo Ons
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Brian Norton    | Max Ammann
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Brian Norton    | S. Shynu
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Brian Norton    | Sarah McCormack
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Maria Roo Ons   | Max Ammann
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Maria Roo Ons   | S. Shynu
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Maria Roo Ons   | Sarah McCormack
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Max Ammann      | S. Shynu
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Max Ammann      | Sarah McCormack
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | S. Shynu        | Sarah McCormack
-------------------------------------------------------------------------------------------------

In this case, I want to keep only one row for each pair of authors. How can I do this in R or Python? Thank you very much for your help.

2 answers:

Answer 0 (score: 1)

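I assume you have a separate database and connect to it using Python.

Possible approaches:

1) You can add a row number based on the article column and then deduplicate. See this answer for how to do it in SQL.

You can then run the query using a Python database connector.

2) You can pull the records into a pandas DataFrame and do the analysis there. Pandas is well suited to processing and manipulating data.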
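As a rough illustration of the pandas route, here is a minimal sketch. It assumes the records are already loaded into a DataFrame with the question's column names (article_title, author_name, coauthor_name); the sample rows and the helper column name pair are only for illustration. The idea is to build an order-independent key from the two names and drop duplicates on it.

```python
import pandas as pd

title = "A Metal Plate Solar Antenna for UMTS Pico-cell Base Station"

# Small illustrative subset of the question's data: each pair appears in both orders.
rows = [
    (title, "Brian Norton", "Maria Roo Ons"),
    (title, "Maria Roo Ons", "Brian Norton"),
    (title, "Brian Norton", "Max Ammann"),
    (title, "Max Ammann", "Brian Norton"),
    (title, "Maria Roo Ons", "Max Ammann"),
    (title, "Max Ammann", "Maria Roo Ons"),
]
df = pd.DataFrame(rows, columns=["article_title", "author_name", "coauthor_name"])

# Order-independent key: the two names sorted alphabetically, stored as a tuple.
# "pair" is a hypothetical helper column, removed again below.
keyed = df.assign(
    pair=[tuple(sorted(p)) for p in zip(df["author_name"], df["coauthor_name"])]
)

# Keep the first row seen for each (article, unordered pair) combination.
deduped = (
    keyed.drop_duplicates(subset=["article_title", "pair"])
    .drop(columns="pair")
    .reset_index(drop=True)
)

print(deduped[["author_name", "coauthor_name"]])
```

Because the key is built per row, this should also work when a pair appears in only one direction, or more than twice.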

Answer 1 (score: 0)

I assume your data frame looks like the one shown below, since you have not shared what other cases might occur.

article author1 author2
A       a       b
A       b       a
A       a       a
A       b       b

In R, this is how I would get the rows you are looking for. I assume your data frame is called df1.

# This will create a new dataframe df2 with only those rows where author1 and author2 are different

df2 <- df1[df1$author1 != df1$author2, ]

The output should look like what you provided in the question.

article author1 author2
  A       a       b
  A       b       a

Please let me know if this is what you need.
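Note that the `!=` filter only drops rows where both columns hold the same name, so a/b and b/a both remain. The following pandas sketch mirrors that filter and then shows one possible variant that also keeps a single direction per pair. It assumes, as in the question's sample, that every pair appears in both orders; the frame df1 is rebuilt here from the toy table above.

```python
import pandas as pd

# Toy data frame matching the assumed layout above.
df1 = pd.DataFrame({
    "article": ["A", "A", "A", "A"],
    "author1": ["a", "b", "a", "b"],
    "author2": ["b", "a", "a", "b"],
})

# pandas equivalent of the R filter: removes self-pairs only.
df2 = df1[df1["author1"] != df1["author2"]]

# Strict alphabetical ordering removes self-pairs and keeps one
# direction of each pair (valid only if both directions are present).
df3 = df1[df1["author1"] < df1["author2"]]

print(df2)
print(df3)
```

If some pairs might appear in only one direction, sorting the two names into a canonical key and dropping duplicates on it would be the safer choice.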