Remove partial duplicates in pandas, keeping the row with the most information

Time: 2020-05-22 18:09:44

Tags: pandas dataframe

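I'm new to pandas and would like to know whether I can use it for a specific purpose. I want to drop the rows of a DataFrame that have partially duplicated entries in a particular column.

For example, if there are two rows, and one has "Verrucomicrobia;phylum;" in a given column while the second has "Verrucomicrobia;phylum;Opitutae;class;" in that column, I want to drop the first row.

I want to go from this: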

                                        0   1
0                 Verrucomicrobia;phylum;  10
1  Verrucomicrobia;phylum;Opitutae;class;   5

to this:

                                        0  1
0  Verrucomicrobia;phylum;Opitutae;class;  5

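I'm not looking for unique strings, though; I only want to get rid of rows that are partial duplicates of another row but carry less information. The above is just one example.

There may also be three such rows, in which case I want to get rid of the two with the least information.

In that case, I want to go from this: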

                                                   0   1
0                            Verrucomicrobia;phylum;  10
1             Verrucomicrobia;phylum;Opitutae;class;   5
2  Verrucomicrobia;phylum;Opitutae;class;Puniceic...   3

to this:

                                                   0  1
0  Verrucomicrobia;phylum;Opitutae;class;Puniceic...  3

1 answer:

Answer 0: (score: 0)

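import pandas as pd
from random import randint

# Generate some sample data: 10,000 strings of 1 to 5 random integers joined by ';'
df = pd.DataFrame({
    0: [';'.join([str(randint(0, 100)) for _ in range(randint(1, 5))]) for __ in range(10000)]
})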

print(df)

                   0
0              71;39
1           72;75;92
2              45;74
3         55;94;95;3
4      27;93;4;33;52
...              ...
9995              64
9996  36;71;36;69;74
9997           53;30
9998            8;35
9999  47;63;68;99;18

[10000 rows x 1 columns]

With a progress bar:

import itertools as it
from tqdm.autonotebook import tqdm  # for jupyter notebook, else just "from tqdm import tqdm" 

# Turn the semicolon-separated strings into lists
df[0] = df[0].str.split(';')

# The real data has trailing semicolons, so remove the resulting empty strings
df[0] = df[0].apply(lambda lst: [item for item in lst if item != ''])

# We make a list of values (lists) we want to drop
to_drop = []

# Get the total number of comparisons, for the progress bar
n = (len(df) * (len(df)-1)) // 2 

# Compare every pair of lists; if a pair shares any element,
# mark the later list (b) for dropping
for a, b in tqdm(it.combinations(df[0], 2), total=n):
    if not set(a).isdisjoint(b):
        to_drop.append(b)

# Drop duplicates, like list(set(lst)) but for a list of lists.
# groupby only merges adjacent equal items, so sort first.
to_drop = [k for k, _ in it.groupby(sorted(to_drop))]

# Filter
res = df[df[0].apply(lambda lst: lst not in to_drop)]
print(res)
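
A side note on the deduplication step: itertools.groupby only merges runs of adjacent equal items, so a list generally has to be sorted before groupby can deduplicate it. A minimal sketch with made-up data (the sample lists here are purely illustrative):

```python
import itertools as it

# Illustrative data: the same lists appear more than once, but not next to each other
to_drop = [["1", "2"], ["3"], ["1", "2"], ["3"]]

# Without sorting, only runs of adjacent equal items are collapsed
unsorted_result = [k for k, _ in it.groupby(to_drop)]

# Sorting first puts equal lists next to each other, so each survives once
sorted_result = [k for k, _ in it.groupby(sorted(to_drop))]

print(len(unsorted_result), len(sorted_result))
```

The final filter only tests membership in to_drop, so leftover duplicates do not change the result; they only make the lookup slower.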

Without a progress bar:

import itertools as it

# Turn the semicolon-separated strings into lists
df[0] = df[0].str.split(';')

# The real data has trailing semicolons, so remove the resulting empty strings
df[0] = df[0].apply(lambda lst: [item for item in lst if item != ''])

# We make a list of values (lists) we want to drop
to_drop = []

# Compare every pair of lists; if a pair shares any element,
# mark the later list (b) for dropping
for a, b in it.combinations(df[0], 2):
    if not set(a).isdisjoint(b):
        to_drop.append(b)

# Drop duplicates, like list(set(lst)) but for a list of lists.
# groupby only merges adjacent equal items, so sort first.
to_drop = [k for k, _ in it.groupby(sorted(to_drop))]

# Filter
res = df[df[0].apply(lambda lst: lst not in to_drop)]

Output:

                       0
0               [71, 39]
1           [72, 75, 92]
2               [45, 74]
3        [55, 94, 95, 3]
4    [27, 93, 4, 33, 52]
5   [49, 91, 28, 28, 20]
6               [31, 69]
8       [51, 59, 41, 17]
10          [79, 19, 62]
21      [89, 48, 56, 34]
52             [100, 50]
85              [77, 97]
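
Note that the pairwise loop above marks the later row of any overlapping pair, so on the question's example it would keep the shortest string rather than the longest. If "partial duplicate" is taken to mean that one string is a prefix of another (an assumption based on the question's examples), a sort-based sketch could look like the following. The helper name drop_prefix_rows, the "ExampleOrder;order;" suffix and the "Firmicutes" row are hypothetical and introduced only for illustration; this is not the approach used in the answer above.

```python
import pandas as pd


def drop_prefix_rows(df, col):
    """Drop rows whose value in `col` is a strict prefix of another row's value.

    Hypothetical helper: assumes "less information" means "is a prefix of".
    """
    values = df[col].astype(str)
    # In lexicographic order, every string starting with s sits directly after s,
    # so comparing each distinct value with its successor is sufficient.
    ordered = sorted(set(values))
    redundant = {
        cur for cur, nxt in zip(ordered, ordered[1:]) if nxt.startswith(cur)
    }
    return df[~values.isin(redundant)].reset_index(drop=True)


df = pd.DataFrame({
    0: [
        "Verrucomicrobia;phylum;",
        "Verrucomicrobia;phylum;Opitutae;class;",
        # Placeholder third level, not taken from the question's truncated data
        "Verrucomicrobia;phylum;Opitutae;class;ExampleOrder;order;",
        # Unrelated lineage that should be kept
        "Firmicutes;phylum;",
    ],
    1: [10, 5, 3, 7],
})

res = drop_prefix_rows(df, 0)
print(res)
```

This relies on the trailing semicolons to keep token boundaries intact (so "A;phylum;" cannot match "A;phylumX;"), and it needs one sort instead of comparing every pair of rows. Exact duplicate rows are left untouched.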