I have an array that looks like this:
a = ['UCI_99648;102568', 'UCI_99648;102568', 'UCI_99648;102568;99651', 'UCI_99651', 'UCI_99652', 'SIB_99658;102568;506010;706080', NaN]
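I want to find how many strings contain a single number, for example UCI_99651 and UCI_99652, so the expected result is 2.
How can I do this in Python?
Note: my actual data is very large, the numbers can be anything, and, as the example shows, missing values may be present.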
Answer 0 (score: 4)
Assuming all strings follow the structure of the example above, a list comprehension does it:
l = ['UCI_99648;102568', 'UCI_99648;102568', 'UCI_99648;102568;99651',
     'UCI_99651', 'UCI_99652', 'SIB_99658;102568;506010;706080', 'NaN']
[i for i in l if ';' not in i and i != 'NaN']
Output
['UCI_99651', 'UCI_99652']
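The list in this answer holds the string 'NaN', while the question's data has a real missing value. Here is a minimal sketch that returns the count directly, assuming missing entries are non-string values such as float('nan') or None. The name count_single is made up for illustration.

```python
def count_single(items):
    """Count entries that are strings and contain no ';' separator."""
    # Non-string entries (float NaN, None) are skipped by the isinstance check.
    return sum(1 for x in items if isinstance(x, str) and ';' not in x)

data = ['UCI_99648;102568', 'UCI_99648;102568', 'UCI_99648;102568;99651',
        'UCI_99651', 'UCI_99652', 'SIB_99658;102568;506010;706080', float('nan')]
print(count_single(data))  # 2
```

A generator inside sum() avoids building an intermediate list, which may matter for very large inputs.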
Answer 1 (score: 2)
Since you tagged pandas, here is another way:
import pandas as pd
# a is the list from the question
s = pd.Series(a).dropna()
s[s.str.split(';').str.len().eq(1)]
3    UCI_99651
4    UCI_99652
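Answer 2 (score: 2)
You can try the following. I hope it solves your problem.
from collections import Counter
# uci is the input list, with missing values written as the string 'NaN'
p = [word.split(";")[0] for word in uci if word != 'NaN']
print(Counter(p))
# Counter({'UCI_99648': 3, 'UCI_99651': 1, 'UCI_99652': 1, 'SIB_99658': 1})
# To keep only the entries that occur once, you can try the following.
# Note: this also keeps 'SIB_99658', because its first field occurs only once.
b = [word for word in p if p.count(word) == 1]
print(b)
For more information, see the documentation on list comprehensions.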
Answer 3 (score: 0)
Implement the NaN check as needed, using numpy or pandas.
a = ['UCI_99648;102568', 'UCI_99648;102568', 'UCI_99648;102568;99651', 'UCI_99651', 'UCI_99652', 'SIB_99658;102568;506010;706080', 'NaN']
# Keep the first field of every 'UCI_' string, then keep the fields that occur once
b = [i.split(';')[0] for i in a if i != 'NaN' and i.startswith('UCI_')]
b = [x for x in b if b.count(x) == 1]
print(b)
# ['UCI_99651', 'UCI_99652']
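The NaN check this answer leaves to the reader could also be written with the standard library alone. This is only a sketch: is_missing is a hypothetical helper, and it assumes a missing value is None, a float NaN, or the literal string 'NaN'.

```python
import math

def is_missing(value):
    """Return True for None, a float NaN, or the literal string 'NaN'."""
    if value is None:
        return True
    if isinstance(value, float):
        return math.isnan(value)
    return value == 'NaN'

a = ['UCI_99648;102568', 'UCI_99648;102568', 'UCI_99648;102568;99651',
     'UCI_99651', 'UCI_99652', 'SIB_99658;102568;506010;706080', float('nan')]
# Same filtering as in the answer, with the missing-value test swapped in
b = [i.split(';')[0] for i in a if not is_missing(i) and i.startswith('UCI_')]
b = [x for x in b if b.count(x) == 1]
print(b)  # ['UCI_99651', 'UCI_99652']
```

Because is_missing runs first in the condition, startswith is never called on a float.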
Answer 4 (score: 0)
You can use a regular expression to extract the numbers. For example:
import re
import numpy as np
from collections import Counter
def count_strings_with_unq_nums(list_of_strings):
    # Two lists: one records which numbers occur in each string,
    # the other collects every number so occurrences can be counted
    all_nums_nested = []
    all_nums_flat = []
    # Loop through all strings and extract the integers within
    for s in list_of_strings:
        try:
            nums = re.findall(r'\d+', s)
        except TypeError:
            # Non-string entries such as NaN: store an empty list so that
            # indices stay aligned with list_of_strings
            nums = []
        all_nums_nested.append(nums)
        all_nums_flat.extend(nums)
    # Count occurrences of all extracted numbers
    num_counter = Counter(all_nums_flat)
    # Loop through the nested list to find strings where unique numbers occur
    unq_items = []
    for key, val in num_counter.items():
        if val == 1:
            for n, num_list in enumerate(all_nums_nested):
                if key in num_list:
                    unq_items.append(list_of_strings[n])
    # Return the number of strings containing unique numbers
    return len(set(unq_items))

if __name__ == '__main__':
    # np.nan is used because the np.NaN alias was removed in NumPy 2.0
    a = ['UCI_99648;102568', 'UCI_99648;102568', 'UCI_99648;102568;99651', 'UCI_99651', 'UCI_99652', 'SIB_99658;102568;506010;706080', np.nan]
    print(count_strings_with_unq_nums(a))
    # 2
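The function in this answer counts strings that hold a number appearing nowhere else in the data. If "a single number" instead means that the string contains exactly one number, a shorter regex-based sketch might look like the one below. The function name is hypothetical, and non-string entries are assumed to be missing values.

```python
import re

def count_single_number_strings(items):
    """Count strings that contain exactly one run of digits."""
    count = 0
    for item in items:
        if not isinstance(item, str):
            continue  # skip NaN / None entries
        if len(re.findall(r'\d+', item)) == 1:
            count += 1
    return count

a = ['UCI_99648;102568', 'UCI_99648;102568', 'UCI_99648;102568;99651',
     'UCI_99651', 'UCI_99652', 'SIB_99658;102568;506010;706080', float('nan')]
print(count_single_number_strings(a))  # 2
```

With this reading the two matching strings are 'UCI_99651' and 'UCI_99652', as in the question.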