My data looks like:
SNP Name ss715583617 ss715592335 ss715591044 ss715598181
4 PI081762 T A A T
5 PI101404A T A A T
6 PI101404B T A A T
7 PI135624 T A A T
8 PI326581 T A A T
9 PI326582A T A A T
10 PI326582B T A A T
11 PI339732 T A A T
12 PI339735A T A A T
13 PI339735B T A A T
14 PI342618A T A A T
In reality I have a dataset of 50,000 columns of 479 rows. My objective is to go through each column with characters and convert the data to integers depending on which is the most abundant character.
Right now I have the data input, and I have more or less written the function I would like to use to analyze each column separately. However, I can't quite understand how to use a forloop or use the apply function through all of the columns in the dataset. I would prefer not to hardcode the columns because I will have 40,000~50,000 columns to analyze.
My code so far is:
import pandas as pd
df = pd.read_csv("/home/dfreese/Desktop/testSNPtext", delimiter='\t')
df.head() # check that the file format fits
# ncol df
df2 = df.iloc[4:-1] # Select the rows you want to analyze in a subset df
print(df2)
My function:
def countAlleles(N):
# N is just suppose to be the column, ideally once I've optimized the function
# I need to analyze every column
# Will hold the counts of each letter in the column
letterCount = []
# This is a parallel array to know the order
letterOrder = {'T','A','G','C','H','U'}
# Boolean to use which one is the maximum
TFlag = None
AFlag = None
GFlag = None
CFlag = None
HFlag = None
UFlag = None
# Loop through the column to determine which one is the maximum
for i in range(len(N)): # How do I get index information of the column?
if(N[i] == 'T'): # If the element in the column is T
letterCount[0] = letterCount[0] + 1
elif(N[i] == 'A'):
letterCount[1] = letterCount [1] + 1
elif (N[i] == 'G'):
letterCount[2] = letterCount [2] + 1
elif (N[i] == 'C'):
lettercount[3] = letterCount[3] + 1
elif(N[i] == 'H'):
letterCount[4] = letterCount[4] + 1
else:
letterCount[5] = letterCount[5] + 1
max = letterCount[0] # This will hold the value of maximum
mIndex = 0 # This holds the index position with the max value
# Determine which one is max
for i in range(len(letterCount)):
if (letterCount[i] > max):
max = letterCount[i]
mIndex = i
So I designed the function to input the column, in hopes to be able to iterate through all the columns of the dataframe. My main question is:
1) How would I pass each in each column as a parameter to the for loop through the elements of each column?
My major source of confusion is how indexes are being used in pandas. I'm familiar with 2-dimensional array in C++ and Java and that is most of where my knowledge stems from.
I'm attempting to use the apply function:
df2 = df2.apply(countAlleles('ss715583617'), axis=2)
but it doesn't seem that my application is correct.
答案 0 :(得分:2)
更新的答案:现在根据每列等位基因的出现情况分析数据帧并用int值替换。如果一个等位基因的出现次数与另一个等位基因的出现次数相同,则会发生什么问题仍然是相同的 - 分配将不是唯一的。
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict({"ss1": ["T", "T", "T", "G"],
"ss2": ["G", "G", "T", "A"],
"ss3": ["C", "H", "C", "H"]})
letterOrder = np.array(['T', 'A', 'G', 'C', 'H', 'U'])
for col in df:
alleles = list()
for num, allele in enumerate(letterOrder):
alleles.append(df[col].str.count(allele).sum())
# dictionary with full sorted keys
repl = letterOrder[np.argsort(alleles)][::-1]
# directly replace chars by value
for num, char in enumerate(repl):
df[col].replace(char, num+1, inplace=True)
print(df)
这将更改初始数据框
ss1 ss2 ss3
0 T G C
1 T G H
2 T T C
3 G A H
到新数据框,并根据出现次数排序:
ss1 ss2 ss3
0 1 1 2
1 1 1 1
2 1 3 2
3 2 2 1
供参考旧答案,该答案提供最大列索引:
import pandas as pd
import numpy as np
from collections import OrderedDict
df = pd.DataFrame.from_dict({"ss1": ["T", "T", "T", "G"],
"ss2": ["G", "G", "T", "A"],
"ss3": ["C", "H", "C", "H"]})
letterOrder = ['T', 'A', 'G', 'C', 'H', 'U']
full_results = OrderedDict()
for col in df:
alleles = list()
for num, allele in enumerate(letterOrder):
alleles.append(df[col].str.count(allele).sum())
full_results[col] = [letterOrder[np.argmax(alleles)], np.max(alleles)]
print(full_results)
这将给出:
OrderedDict([('ss1', ['T', 3]), ('ss2', ['G', 2]), ('ss3', ['C', 2])])
dict中的键是列的名称,值是带有[allele,number_of_occurences]的列表。
我使用OrderedDict
来保持列的顺序和名称,但如果您不需要订单,则可以使用dict
,或者如果您不需要列名称(隐含ID足够),使用list
。
但要小心:如果在一列中两个(或更多)字符具有相同的计数,则只返回其中一个。您需要为此添加一个额外的测试。
答案 1 :(得分:2)
迭代例如列中的列一个for循环,使用list(df)
。无论如何,您可以轻松地使用collections.Counter
假设数据框df
df
# Name ss715583617 ss715592335 ss715591044 ss715598181
#0 PI081762 T A A T
#1 PI101404A T A A T
#2 PI101404B T A A T
#3 PI135624 T A A T
#4 PI326581 T A F D
#5 PI326582A G A F T
#6 PI326582B G A A T
#7 PI339732 D H A T
#8 PI339735A D A A T
#9 PI339735B A A A T
#10 PI342618A D A A T
我从评论部分和您的原始帖子收集的内容,您希望根据其出现频率替换每列中的每个字符。这是一种方法:
from collections import Counter
cols = [ col for col in list(df) if col not in ['Name'] ] # all the column you want to operate on
col_counters = { col: Counter( df[col] ) for col in cols }
#{'ss715583617': Counter({'T': 5, 'D': 3, 'G': 2, 'A': 1}),
# 'ss715591044': Counter({'A': 9, 'F': 2}),
# 'ss715592335': Counter({'A': 10, 'H': 1}),
# 'ss715598181': Counter({'T': 10, 'D': 1})}
sort_func = lambda items: sorted(items, key=lambda x:x[1], reverse=True ) # sort a nested list according to second element in each sublist
sort_result = { col: sort_func(counter.items()) for col,counter in col_counters.iteritems() }
#{'ss715583617': [('T', 5), ('D', 3), ('G', 2), ('A', 1)],
# 'ss715591044': [('A', 9), ('F', 2)],
# 'ss715592335': [('A', 10), ('H', 1)],
# 'ss715598181': [('T', 10), ('D', 1)]}
在这里,我们将使用enumerate
来获取每个排序结果的位置
mapper = { col: {letter:i+1 for i,letter in enumerate(sort_result[col]) } for col in sort_result }
#{'ss715583617': {'A': 4, 'D': 2, 'G': 3, 'T': 1},
# 'ss715591044': {'A': 1, 'F': 2},
# 'ss715592335': {'A': 1, 'H': 2},
# 'ss715598181': {'D': 2, 'T': 1}}
df.replace( to_replace=mapper, inplace=True)
# Name ss715583617 ss715592335 ss715591044 ss715598181
#0 PI081762 1 1 1 1
#1 PI101404A 1 1 1 1
#2 PI101404B 1 1 1 1
#3 PI135624 1 1 1 1
#4 PI326581 1 1 2 2
#5 PI326582A 3 1 2 1
#6 PI326582B 3 1 1 1
#7 PI339732 2 2 1 1
#8 PI339735A 2 1 1 1
#9 PI339735B 4 1 1 1
#10 PI342618A 2 1 1 1
这应该足以让你继续前进。我不确定你想如何处理重复的元素,例如,如果一个列具有相同的数字,如果T和G.