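I'm new to Pandas and would like to know whether I can use it for a specific purpose. I want to drop the rows of a DataFrame that have partially duplicated entries in a particular column.
For example, if there are two rows, and one of them has "Verrucomicrobia;phylum;" in a given column while the second has "Verrucomicrobia;phylum;Opitutae;class;" in that same column, I want to drop the first row.
I want to go from this: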
                                        0   1
0                 Verrucomicrobia;phylum;  10
1  Verrucomicrobia;phylum;Opitutae;class;   5
to this:
                                        0  1
0  Verrucomicrobia;phylum;Opitutae;class;  5
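However, I'm not looking for unique strings. I only want to get rid of rows that are partial duplicates and carry less information. The above is just one example.
There may also be three such rows, in which case I want to get rid of the two that carry the least information.
In that case, I want to go from this: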
                                                   0   1
0                            Verrucomicrobia;phylum;  10
1             Verrucomicrobia;phylum;Opitutae;class;   5
2  Verrucomicrobia;phylum;Opitutae;class;Puniceic...   3
to this:
                                                   0  1
0  Verrucomicrobia;phylum;Opitutae;class;Puniceic...  3
Answer 0: (score: 0)
# Generate some sample data
import pandas as pd
from random import randint

df = pd.DataFrame({
    0: [';'.join([str(randint(0, 100)) for _ in range(randint(1, 5))]) for __ in range(10000)]
})
print(df)
0
0 71;39
1 72;75;92
2 45;74
3 55;94;95;3
4 27;93;4;33;52
... ...
9995 64
9996 36;71;36;69;74
9997 53;30
9998 8;35
9999 47;63;68;99;18
[10000 rows x 1 columns]
import itertools as it
from tqdm.autonotebook import tqdm # for jupyter notebook, else just "from tqdm import tqdm"
# Turn your semicolon-separated strings into lists
df[0] = df[0].str.split(';')
# You have trailing semi-colons, so we remove empty strings from the column
df[0] = df[0].apply(lambda lst: [item for item in lst if not item == ''])
# We make a list of values (lists) we want to drop
to_drop = []
# Get the total number of comparisons, for the progress bar
n = (len(df) * (len(df)-1)) // 2
# Comparing all lists with all other lists, we append them to to_drop if
# they contain any of the same elements
for a, b in tqdm(it.combinations(df[0], 2), total=n):
    if not set(a).isdisjoint(b):
        to_drop.append(b)
# Drop duplicates, like list(set(lst)) but for a list of lists;
# groupby only merges adjacent equal items, so sort first
to_drop = [k for k, _ in it.groupby(sorted(to_drop))]
# Filter
res = df[df[0].apply(lambda lst: lst not in to_drop)]
print(res)
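# The same approach without the progress bar. This is an alternative to the
# block above, not a continuation: run it on the original string column.
import itertools as it
# Turn your semicolon-separated strings into lists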
df[0] = df[0].str.split(';')
# You have trailing semi-colons, so we remove empty strings from the column
df[0] = df[0].apply(lambda lst: [item for item in lst if not item == ''])
# We make a list of values (lists) we want to drop
to_drop = []
# Comparing all lists with all other lists, we append them to to_drop if
# they contain any of the same elements
for a, b in it.combinations(df[0], 2):
    if not set(a).isdisjoint(b):
        to_drop.append(b)
# Drop duplicates, like list(set(lst)) but for a list of lists;
# groupby only merges adjacent equal items, so sort first
to_drop = [k for k, _ in it.groupby(sorted(to_drop))]
# Filter
res = df[df[0].apply(lambda lst: lst not in to_drop)]
Output
0
0 [71, 39]
1 [72, 75, 92]
2 [45, 74]
3 [55, 94, 95, 3]
4 [27, 93, 4, 33, 52]
5 [49, 91, 28, 28, 20]
6 [31, 69]
8 [51, 59, 41, 17]
10 [79, 19, 62]
21 [89, 48, 56, 34]
52 [100, 50]
85 [77, 97]
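
Note that the pairwise check above flags the later row of any pair that shares at least one element, which is broader than the prefix relationship described in the question. If the goal is strictly to drop values that are the leading part of a longer value, one possible sketch is below. It is not part of the original answer: `redundant_prefixes` is a hypothetical helper name, the third sample value is made up for illustration, and the sketch assumes every value ends with a trailing semicolon so that string prefixes line up with whole fields.

```python
def redundant_prefixes(values):
    """Return the values that are a proper prefix of some other value."""
    # Sort the distinct values; if a value is a prefix of any other value,
    # it is also a prefix of its immediate successor in sorted order.
    ordered = sorted(set(values))
    return {cur for cur, nxt in zip(ordered, ordered[1:]) if nxt.startswith(cur)}


# Sample values: the first two come from the question, the third is made up
values = [
    "Verrucomicrobia;phylum;",
    "Verrucomicrobia;phylum;Opitutae;class;",
    "Bacteroidetes;phylum;",
]
to_drop = redundant_prefixes(values)
kept = [v for v in values if v not in to_drop]
print(kept)
```

With a DataFrame this could be applied as `df[~df[0].isin(redundant_prefixes(df[0]))]` on the original string column, which needs one sort rather than a comparison of every pair of rows.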