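I have a pandas DataFrame called df with about 1 million records. df has more than 80 columns, one of which is asset_id. I want to create a subset containing all records that have a duplicated asset_id but differ in at least one other column.
Example: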
df = pd.DataFrame({"asset_id": [1,1,1,2,2,3,4,5,5], "Name":["Canola", "Canola", "Canola", "Precision", "Precision", "Explore", "Testing", "Conda", "Conda Inc"], "Country":["CAN", "CAN", "USA", "CAN", "CAN", "USA", "CAN", "USA", "USA"]})
asset_id  Name       Country
1         Canola     CAN
1         Canola     CAN
1         Canola     USA
2         Precision  CAN
2         Precision  CAN
3         Explore    USA
4         Testing    CAN
5         Conda      USA
5         Conda Inc  USA
I would like the resulting table to look like this:
asset_id  Name       Country
1         Canola     CAN
1         Canola     USA
5         Conda      USA
5         Conda Inc  USA
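Thanks in advance for any help!
Answer 0: (score: 2)
You can keep only the groups that have more than one distinct Name or more than one distinct Country, and then drop all remaining duplicate rows with: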
df.groupby('asset_id').filter(lambda x: (x.Name.nunique()>1) | (x.Country.nunique()>1)).drop_duplicates()
Output:
   asset_id       Name Country
0         1     Canola     CAN
2         1     Canola     USA
7         5      Conda     USA
8         5  Conda Inc     USA
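With 80+ columns, listing every column inside the lambda gets unwieldy. One possible generalization, sketched here under the assumption that NaN values can be ignored (nunique skips them by default), counts the distinct values of every non-key column per asset_id and keeps the ids where any count exceeds 1. The names counts, varying, ids and result are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "asset_id": [1, 1, 1, 2, 2, 3, 4, 5, 5],
    "Name": ["Canola", "Canola", "Canola", "Precision", "Precision",
             "Explore", "Testing", "Conda", "Conda Inc"],
    "Country": ["CAN", "CAN", "USA", "CAN", "CAN", "USA", "CAN", "USA", "USA"],
})

# Number of distinct values per column within each asset_id group
counts = df.groupby("asset_id").nunique()

# True for asset_ids where at least one column has more than one distinct value
varying = counts.gt(1).any(axis=1)
ids = varying[varying].index

# Keep the rows of those asset_ids, then drop fully identical rows
result = df[df["asset_id"].isin(ids)].drop_duplicates()
print(result)
```

This avoids calling a Python function once per group, which may matter with around a million rows.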
Answer 1: (score: 1)
You can create a custom function to do the selection and use it together with groupby and apply.
def selecting(x):
    # Set of unique-value counts, one count per column of the group
    lencol = set(len(x[col].unique()) for col in x.columns)
    if len(lencol) == 1:
        return pd.DataFrame(columns=x.columns)  # empty DataFrame
    else:
        return x[~x.duplicated()]  # group without duplicated rows

ddf = df.groupby('asset_id').apply(selecting)
If you drop the index created by groupby, you get:
ddf.reset_index(drop=True)
   asset_id       Name Country
0         1     Canola     CAN
1         1     Canola     USA
2         5      Conda     USA
3         5  Conda Inc     USA
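lencol is a set that stores how many unique elements each column has. Because it is a set, columns with the same number of unique elements collapse into a single entry.
Within a group, asset_id always has exactly one unique value (this assumes the grouping column is passed to the function, as in the pandas versions this was written for). So if len(lencol) is 1 (the set has one element), every column has a single unique value and an empty DataFrame is returned. Otherwise, the DataFrame without duplicated rows is returned. Check the duplicated method to see how it works.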
Answer 2: (score: 1)
Use drop_duplicates(). It does the job:
import pandas as pd
df = pd.DataFrame(
    {
        "asset_id": [1, 1, 1, 2, 2, 3, 4, 5, 5],
        "Name": [
            "Canola",
            "Canola",
            "Canola",
            "Precision",
            "Precision",
            "Explore",
            "Testing",
            "Conda",
            "Conda Inc",
        ],
        "Country": ["CAN", "CAN", "USA", "CAN", "CAN", "USA", "CAN", "USA", "USA"],
    }
)
# Drop fully identical rows first
df = df.drop_duplicates()
# Count how many distinct rows remain for each asset_id
x = df["asset_id"].value_counts()
for elem, elem1 in zip(x.index, x):
    # An asset_id that still occurs more than once differs in some column
    if elem1 > 1:
        y = df.loc[df["asset_id"] == elem]
        print(y.values)
The code above produces a list of what is wanted.
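The same idea can probably be written without the explicit loop. The sketch below is an alternative formulation, not the answer's own code: after dropping fully identical rows, any asset_id that still appears more than once must differ in at least one other column, and duplicated with keep=False flags all of those rows. The names deduped and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "asset_id": [1, 1, 1, 2, 2, 3, 4, 5, 5],
    "Name": ["Canola", "Canola", "Canola", "Precision", "Precision",
             "Explore", "Testing", "Conda", "Conda Inc"],
    "Country": ["CAN", "CAN", "USA", "CAN", "CAN", "USA", "CAN", "USA", "USA"],
})

# Step 1: remove rows that are identical in every column
deduped = df.drop_duplicates()

# Step 2: keep every row whose asset_id still occurs more than once
result = deduped[deduped.duplicated(subset="asset_id", keep=False)]
print(result)
```

This returns a DataFrame with the original index rather than printed arrays, which is usually easier to work with downstream.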