Check whether records that share the same value in one column are identical throughout the whole period

Time: 2019-07-24 19:50:18

Tags: python pandas

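I have a pandas DataFrame called df with about 1 million records. df has 80+ columns, one of which is asset_id. I want to create a subset of all records that have a duplicated asset_id but differ in at least one other column.

Example: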

import pandas as pd

df = pd.DataFrame({
    "asset_id": [1, 1, 1, 2, 2, 3, 4, 5, 5],
    "Name": ["Canola", "Canola", "Canola", "Precision", "Precision", "Explore", "Testing", "Conda", "Conda Inc"],
    "Country": ["CAN", "CAN", "USA", "CAN", "CAN", "USA", "CAN", "USA", "USA"],
})

asset_id    Name    Country 
  1       Canola     CAN
  1       Canola     CAN
  1       Canola     USA
  2       Precision  CAN
  2       Precision  CAN
  3       Explore    USA
  4       Testing    CAN
  5       Conda      USA
  5       Conda Inc  USA

I would like the resulting table to look like this:

asset_id    Name    Country
  1       Canola     CAN
  1       Canola     USA
  5       Conda      USA
  5       Conda Inc  USA

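Thanks in advance for any help!

3 Answers:

Answer 0: (score: 2)

You can keep only the groups that have more than one Name or more than one Country, then drop the remaining duplicate rows: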

df.groupby('asset_id').filter(lambda x: (x.Name.nunique()>1) | (x.Country.nunique()>1)).drop_duplicates()

Output:

   asset_id       Name Country
0         1     Canola     CAN
2         1     Canola     USA
7         5      Conda     USA
8         5  Conda Inc     USA
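
The filter above names Name and Country explicitly. With 80+ columns, as in the question, a column-agnostic variant may be easier to maintain. The following is a minimal sketch, not part of the original answer: it drops exact duplicate rows first, then keeps every asset_id that still occurs more than once. The names deduped and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "asset_id": [1, 1, 1, 2, 2, 3, 4, 5, 5],
    "Name": ["Canola", "Canola", "Canola", "Precision", "Precision",
             "Explore", "Testing", "Conda", "Conda Inc"],
    "Country": ["CAN", "CAN", "USA", "CAN", "CAN", "USA", "CAN", "USA", "USA"],
})

# Step 1: remove rows that are identical in every column.
deduped = df.drop_duplicates()

# Step 2: an asset_id that still appears more than once must differ
# in at least one other column; keep=False flags all of its rows.
result = deduped[deduped.duplicated("asset_id", keep=False)]

print(result)
```

Because both steps are vectorized, this avoids calling a Python lambda once per group, which may matter at around a million rows.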

Answer 1: (score: 1)

You can create a custom function to do the selection and use it with groupby and apply.

def selecting(x):
    lencol = set(len(x[col].unique()) for col in x.columns)
    if len(lencol) == 1:
        return pd.DataFrame(columns=x.columns) #empty dataframe
    else:
        return x[~x.duplicated()]

ddf = df.groupby('asset_id').apply(selecting)

If you drop the index created by groupby, you get:

ddf.reset_index(drop=True)

  asset_id       Name Country
0        1     Canola     CAN
1        1     Canola     USA
2        5      Conda     USA
3        5  Conda Inc     USA

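Explanation

lencol is a set that stores how many unique elements each column has. Because it is a set, columns with the same number of unique elements collapse into a single entry.
So if len(lencol) is 1 (the set has one element), every column in the group has the same number of unique values, and an empty dataframe is returned. Otherwise, the dataframe without duplicated rows is returned. Check the duplicated method to see how it works.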
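
To make the set logic concrete, here is a small sketch that computes the same set for two groups of the example data. The helper name unique_counts and the variables g1 and g2 are illustrative and do not appear in the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "asset_id": [1, 1, 1, 2, 2, 3, 4, 5, 5],
    "Name": ["Canola", "Canola", "Canola", "Precision", "Precision",
             "Explore", "Testing", "Conda", "Conda Inc"],
    "Country": ["CAN", "CAN", "USA", "CAN", "CAN", "USA", "CAN", "USA", "USA"],
})

def unique_counts(group):
    # Set of "number of distinct values" taken over all columns of one group.
    return set(len(group[col].unique()) for col in group.columns)

# asset_id 1: asset_id and Name have 1 distinct value, Country has 2.
g1 = unique_counts(df[df["asset_id"] == 1])
# asset_id 2: every column has exactly 1 distinct value.
g2 = unique_counts(df[df["asset_id"] == 2])

print(g1)  # {1, 2} -> more than one element, the group is kept
print(g2)  # {1}    -> one element, the group is dropped
```

Note that the check relies on asset_id itself being one of the columns of the group, since it always contributes 1 to the set. If the grouping column were not passed to the function, a group in which Name and Country both had two distinct values would produce {2} and be dropped by mistake. Whether apply passes the grouping column can depend on the pandas version, so this is worth verifying.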

Answer 2: (score: 1)

Use drop_duplicates(). It does the job.

import pandas as pd

df = pd.DataFrame(
    {
        "asset_id": [1, 1, 1, 2, 2, 3, 4, 5, 5],
        "Name": [
            "Canola",
            "Canola",
            "Canola",
            "Precision",
            "Precision",
            "Explore",
            "Testing",
            "Conda",
            "Conda Inc",
        ],
        "Country": ["CAN", "CAN", "USA", "CAN", "CAN", "USA", "CAN", "USA", "USA"],
    }
)
df = df.drop_duplicates()
x = df["asset_id"].value_counts()
data = []
for elem, elem1 in zip(x.index, x):
    if elem1 > 1:
        y = df.loc[df["asset_id"] == elem]
        print(y.values)

It produces a list of what is wanted (the code above prints it).
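
The loop in this answer prints each group separately and never fills data. A loop-free sketch of the same idea, which collects the wanted rows into a list, might look as follows. The names deduped, counts, repeated_ids and result are illustrative and not taken from the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "asset_id": [1, 1, 1, 2, 2, 3, 4, 5, 5],
    "Name": ["Canola", "Canola", "Canola", "Precision", "Precision",
             "Explore", "Testing", "Conda", "Conda Inc"],
    "Country": ["CAN", "CAN", "USA", "CAN", "CAN", "USA", "CAN", "USA", "USA"],
})

# Remove rows that are identical in every column.
deduped = df.drop_duplicates()

# Count how often each asset_id remains and keep the ids seen more than once.
counts = deduped["asset_id"].value_counts()
repeated_ids = counts[counts > 1].index

# Select all rows of those ids in one step instead of looping.
result = deduped[deduped["asset_id"].isin(repeated_ids)]

# Plain Python list of rows, in the original row order.
data = result.values.tolist()
print(data)
```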