How to calculate the frequency of elements for pairwise comparisons of lists in Python?

Date: 2016-10-16 03:11:34

Tags: python list for-loop dictionary frequency

I have the samples stored in the following list:

 sample = ['AAAA', 'CGCG', 'TTTT', 'AT-T', 'CATC']

To illustrate the problem, I have denoted them as "Sets" below:

Set1 AAAA
Set2 CGCG
Set3 TTTT
Set4 AT-T
Set5 CATC
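
1. Eliminate all sets in which every element is identical to the others.

   Output:

     Set2 CGCG
     Set4 AT-T
     Set5 CATC

2. Perform pairwise comparisons between the sets (Set2 v Set4, Set2 v Set5, Set4 v Set5).

3. Each pairwise comparison may contain only two types of combinations; otherwise that comparison is eliminated. For example,

     Set2    Set5
     C       C
     G       A
     C       T
     G       C

   Here there are more than two types of pairs: (CC), (GA), (CT) and (GC). So this pairwise comparison cannot be used.

   Every comparison may contain only 2 combinations out of (AA, GG, CC, TT, AT, TA, AC, CA, AG, GA, GC, CG, GT, TG, CT, TC), i.e. all possible combinations of ACGT where order matters.

   In the given example, more than 2 such combinations are found.

   Hence Set2 v Set5 and Set4 v Set5 cannot be considered. The only pair that remains is:

     Output
     Set2 CGCG
     Set4 AT-T

4. In this pairwise comparison, remove any "-" element together with its corresponding element in the other set.

     Output
     Set2 CGG
     Set4 ATT

5. Calculate the frequency of the elements in Set2 and Set4, and the frequency with which each type of pair occurs across the two sets (CA and GT pairs).

     Output
     Set2 (C = 1/3, G = 2/3)
     Set4 (A = 1/3, T = 2/3)
     Pairs (CA = 1/3, GT = 2/3)

6. Calculate float(a) = (Pairs) - (Set2) * (Set4) for the corresponding elements (any one pair is sufficient).

     e.g. for CA pairs, float(a) = (freq of CA pairs) - (freq of C) * (freq of A)

   NOTE: if the pair is AAAC and CCCA, the frequency of C is 1/4, i.e. it is the frequency of the base within one member of the pair.

7. Calculate

     float(b) = float(a) / (freq of C in CGG) * (freq of G in CGG) * (freq of A in ATT) * (freq of T in ATT)

8. Repeat this for all pairwise comparisons, e.g.

     Set2 CGCG
     Set4 AT-T
     Set6 GCGC

     Set2 v Set4, Set2 v Set6, Set4 v Set6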

My half-baked code so far: **I would prefer it if all suggested code used standard for-loops rather than comprehensions.**

    # Step 1: drop every sample whose elements are all identical
    sample1 = []
    for seq in sample:
        all_same = True
        for base in seq:
            if base != seq[0]:
                all_same = False
        if not all_same:
            sample1.append(seq)

    # Step 2: pairwise comparison of the remaining samples
    for i in range(len(sample1)):
        for j in range(i + 1, len(sample1)):
            set_a = sample1[i]
            set_b = sample1[j]
            # Step 3
            # TODO: only two types of pairs can be included; if yes continue, else skip
            # Step 4: delete "-" and the corresponding element in the other set
            kept_a = []
            kept_b = []
            for k in range(len(set_a)):
                if set_a[k] != "-" and set_b[k] != "-":
                    kept_a.append(set_a[k])
                    kept_b.append(set_b[k])
            # Step 5: frequencies of individual alleles (shown for set_a only)
            count_dict = {}
            for base in kept_a:
                if base in count_dict:
                    count_dict[base] += 1
                else:
                    count_dict[base] = 1
            freq_dict = {}
            for allele in count_dict:
                freq_dict[allele] = count_dict[allele] / float(len(kept_a))
            # TODO: calculate frequency of pairs
            # Step 6
            # No code yet
          

1 answer:

Answer 0 (score: 2)

I think this is what you want:

from collections import Counter

# Remove elements where all nucleobases are the same.
for index in range(len(sample) - 1, -1, -1):
    if sample[index][:1] * len(sample[index]) == sample[index]:
        del sample[index]

for indexA, setA in enumerate(sample):
    for indexB, setB in enumerate(sample):
        # Don't compare samples with themselves nor compare same pair twice.
        if indexA <= indexB:
            continue

        # Calculate number of unique pairs
        pair_count = Counter()
        for pair in zip(setA, setB):
            if '-' not in pair:
                pair_count[pair] += 1

        # Only analyse pairs of sets with 2 unique pairs.
        if len(pair_count) != 2:
            continue

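        # Count the bases of each set separately (gap positions are already dropped),
        # because the question's note defines a base frequency within one set.
        base_count_a = Counter()
        base_count_b = Counter()
        for pair, count in pair_count.items():
            base_count_a[pair[0]] += count
            base_count_b[pair[1]] += count

        # Number of positions left once the gaps are removed.
        sequence_length = sum(pair_count.values())

        # Convert counts to frequencies.
        base_freq_a = {}
        for base, count in base_count_a.items():
            base_freq_a[base] = count / float(sequence_length)
        base_freq_b = {}
        for base, count in base_count_b.items():
            base_freq_b[base] = count / float(sequence_length)

        # Examine a pair from the two unique pairs to calculate float_a.
        pair = list(pair_count)[0]
        float_a = (pair_count[pair] / float(sequence_length)) - base_freq_a[pair[0]] * base_freq_b[pair[1]]

        # Step 7: divide by the product of every base frequency observed in both sets.
        # Only observed bases enter the product, so the denominator is never zero.
        denominator = 1.0
        for freq in list(base_freq_a.values()) + list(base_freq_b.values()):
            denominator *= freq
        float_b = float_a / denominator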
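
As a hand check of the numbers, here is a minimal sketch of the arithmetic for the question's Set2/Set4 example. It assumes the gap column has already been removed (leaving CGG and ATT) and that the denominator in step 7 is meant to be the product of the four per-set frequencies; the variable names are illustrative only. Exact fractions are used so the result can be compared with the 1/3 and 2/3 values in the question.

```python
from fractions import Fraction

# Set2 and Set4 from the question after the gap column is dropped.
set_a = "CGG"
set_b = "ATT"
n = len(set_a)

# Per-set base frequencies as exact fractions.
freq_c = Fraction(set_a.count("C"), n)   # 1/3
freq_g = Fraction(set_a.count("G"), n)   # 2/3
freq_a = Fraction(set_b.count("A"), n)   # 1/3
freq_t = Fraction(set_b.count("T"), n)   # 2/3

# Frequency of the (C, A) pair across the aligned positions.
ca_pairs = 0
for base_a, base_b in zip(set_a, set_b):
    if base_a == "C" and base_b == "A":
        ca_pairs += 1
freq_ca = Fraction(ca_pairs, n)          # 1/3

# Step 6 and step 7 (denominator assumed to be the full product).
float_a = freq_ca - freq_c * freq_a      # 1/3 - 1/9 = 2/9
float_b = float_a / (freq_c * freq_g * freq_a * freq_t)  # (2/9) / (4/81) = 9/2
print(float_a, float_b)                  # prints: 2/9 9/2
```

Using the GT pair instead gives 2/3 - (2/3)(2/3) = 2/9 as well, which matches the remark that any one pair is sufficient.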

Alternatively, a more Pythonic version of the same procedure (using the list/dict comprehensions you did not want):

from collections import Counter

BASES = 'ATCG'

# Remove elements where all nucleobases are the same.
sample = [item for item in sample if item[:1] * len(item) != item]

for indexA, setA in enumerate(sample):
    for indexB, setB in enumerate(sample):
        # Don't compare samples with themselves nor compare same pair twice.
        if indexA <= indexB:
            continue

        # Calculate number of unique pairs
        relevant_pairs = [(elA, elB) for (elA, elB) in zip(setA, setB) if elA != '-' and elB != '-']
        pair_count = Counter(relevant_pairs)

        # Only analyse pairs of sets with 2 unique pairs.
        if len(pair_count) != 2:
            continue

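        # Bases of setA and setB with the pairs involving '-' removed.
        # New names are used so the loop variables setA and setB stay intact
        # for the remaining iterations.
        basesA, basesB = zip(*relevant_pairs)

        # Number of positions left once the gaps are removed.
        seq_length = len(basesA)

        # Convert counts to frequencies, separately for each set.
        freqA = {base: count / float(seq_length) for (base, count) in Counter(basesA).items()}
        freqB = {base: count / float(seq_length) for (base, count) in Counter(basesB).items()}

        # Examine a pair from the two unique pairs to calculate float_a.
        pair = list(pair_count)[0]
        float_a = (pair_count[pair] / float(seq_length)) - freqA[pair[0]] * freqB[pair[1]]

        # Step 7: multiply the frequencies of the bases present in each set.
        # A base missing from a set contributes 1, so the denominator is never zero.
        denominator = 1.0
        for base in BASES:
            denominator *= freqA.get(base, 1) * freqB.get(base, 1)

        float_b = float_a / denominator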
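
Neither snippet above stores or prints float_a and float_b, so here is a rough for-loop sketch that wraps the same steps in a function and returns the results. The name `pairwise_stats` and the returned tuple layout are hypothetical, not from the question. It also assumes that base frequencies are taken per set (as in the question's AAAC/CCCA note), that all sequences have equal length, and that step 7 divides by the product of all observed frequencies.

```python
from collections import Counter
from itertools import combinations


def pairwise_stats(sample):
    """Return a list of (seq_a, seq_b, float_a, float_b) for every usable pair."""
    # Step 1: drop sequences made of a single repeated element.
    kept = []
    for seq in sample:
        if len(set(seq)) > 1:
            kept.append(seq)

    results = []
    # Step 2: visit each unordered pair of sequences exactly once.
    for seq_a, seq_b in combinations(kept, 2):
        # Step 4: ignore positions where either sequence has a gap.
        pairs = []
        for base_a, base_b in zip(seq_a, seq_b):
            if base_a != '-' and base_b != '-':
                pairs.append((base_a, base_b))

        # Step 3: keep only comparisons with exactly two pair types.
        pair_count = Counter(pairs)
        if len(pair_count) != 2:
            continue

        # Step 5: per-sequence base counts over the gap-free positions.
        length = float(len(pairs))
        count_a = Counter()
        count_b = Counter()
        for base_a, base_b in pairs:
            count_a[base_a] += 1
            count_b[base_b] += 1

        # Step 6: float_a from one of the two pair types (sorted for determinism).
        pair = sorted(pair_count)[0]
        float_a = (pair_count[pair] / length
                   - (count_a[pair[0]] / length) * (count_b[pair[1]] / length))

        # Step 7: divide by the product of all observed base frequencies.
        denominator = 1.0
        for counts in (count_a, count_b):
            for count in counts.values():
                denominator *= count / length
        float_b = float_a / denominator

        results.append((seq_a, seq_b, float_a, float_b))
    return results


sample = ['AAAA', 'CGCG', 'TTTT', 'AT-T', 'CATC']
for seq_a, seq_b, float_a, float_b in pairwise_stats(sample):
    # Expected, up to floating-point rounding: CGCG AT-T 0.2222 4.5
    print(seq_a, seq_b, round(float_a, 4), round(float_b, 4))
```

With the question's sample, only Set2 v Set4 survives the two-pair-type filter, so a single result is returned. Returning a list rather than a dict keyed by sequence keeps duplicate sequences in `sample` from overwriting each other.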