I want to create a pandas DataFrame with the following columns:
my_cols = ['chrom', 'len_of_PIs']
and the following values in the respective columns:
chrom = pd.Series(['chr1', 'chr2', 'chr3'])
len_of_PIs = pd.Series([[np.random.randint(15, 59, 86)],
[np.random.randint(18, 55, 92)],
[np.random.randint(25, 61, 98)]])
I expect the output to look simply like this:
chrom len_PIs
chr1 49, 32, 30, 27, 52, 52,.....
chr2 27, 20, 40, 41, 44, 50,.....
chr3 35, 45, 56, 42, 58, 50,.....
where len_PIs can be a list or a str, so that I can easily do downstream analysis. However, when I try this, I don't get the data as expected:
new_df = pd.DataFrame()
new_df['chrom'] = chrom
# this code is giving me an output like
new_df['len_PIs'] = len_of_PIs.astype(str)
chrom len_PIs
0 chr1 [array([49, 32, 30, 27, 52, 52, 33, 51, 36, 47, 34, ...
1 chr2 [array([27, 20, 40, 41, 44, 50, 40, 34, 36, 33, 23, ...
2 chr3 [array([35, 45, 56, 42, 58, 50, 42, 27, 53, 57, 40, ...
# and each one of these below codes are giving me an output like
new_df['len_PIs'] = len_of_PIs.as_matrix()  # note: as_matrix() was removed in pandas 1.0; .to_numpy() is the current equivalent
new_df.insert(loc=1, value=len_of_PIs.astype(list) , column='len_PIs')
new_df['len_PIs'] = pd.DataFrame(len_of_PIs, columns=['len_PIs'], index=len_of_PIs.index)
chrom len_PIs
0 chr1 [[49, 32, 30, 27, 52, 52, 33, 51, 36, 47, 34, ...
1 chr2 [[27, 20, 40, 41, 44, 50, 40, 34, 36, 33, 23, ...
2 chr3 [[35, 45, 56, 42, 58, 50, 42, 27, 53, 57, 40, ...
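How can I fix this approach? If there is an alternative, more comprehensive approach that starts from the column and data preparation, that would be great too.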
Answer 0 (score: 2)
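I don't believe you need the inner lists in the len_of_PIs series. You may also find it convenient to instantiate pd.DataFrame from a dictionary. The following produces the output you want.
Converting numeric data to strings is generally not good practice unless you absolutely have to, so I have kept the array data numeric.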
import pandas as pd, numpy as np
my_cols = ['chrom', 'len_of_PIs']
chrom = pd.Series(['chr1', 'chr2', 'chr3'])
len_of_PIs = pd.Series([np.random.randint(15, 59, 86),
np.random.randint(18, 55, 92),
np.random.randint(25, 61, 98)])
df = pd.DataFrame({'chrom': chrom,
'len_of_PIs': len_of_PIs},
columns=my_cols)
# chrom len_of_PIs
# 0 chr1 [17, 52, 48, 22, 27, 49, 26, 18, 46, 16, 22, 1...
# 1 chr2 [39, 52, 53, 29, 38, 51, 30, 44, 47, 49, 28, 4...
# 2 chr3 [46, 37, 46, 29, 49, 39, 56, 48, 29, 46, 28, 2...
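As a rough illustration of why keeping the arrays numeric helps downstream, here is a minimal sketch that computes per-row statistics directly on the cells. The seeded generator and the column names n_PIs and mean_len are my own assumptions for the example, not part of the question.

```python
import numpy as np
import pandas as pd

# Seeded generator so the sketch is reproducible (the seed is arbitrary).
rng = np.random.default_rng(0)

chrom = pd.Series(['chr1', 'chr2', 'chr3'])
len_of_PIs = pd.Series([rng.integers(15, 59, 86),
                        rng.integers(18, 55, 92),
                        rng.integers(25, 61, 98)])

df = pd.DataFrame({'chrom': chrom, 'len_of_PIs': len_of_PIs})

# Each cell holds a NumPy array, so statistics need no string parsing.
# 'n_PIs' and 'mean_len' are hypothetical column names for this sketch.
df['n_PIs'] = df['len_of_PIs'].apply(len)
df['mean_len'] = df['len_of_PIs'].apply(np.mean)

print(df[['chrom', 'n_PIs', 'mean_len']])
```

The same pattern works for any reduction (np.median, np.max, and so on), which is the main practical argument for not converting to str.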
Answer 1 (score: 1)
If you want strings, use a list comprehension to extract the inner array, convert its elements to string, and finally join them:
chrom = pd.Series(['chr1', 'chr2', 'chr3'])
len_of_PIs = pd.Series([[np.random.randint(15, 59, 86)],
[np.random.randint(18, 55, 92)],
[np.random.randint(25, 61, 98)]])
a = [', '.join(x[0].astype(str)) for x in len_of_PIs]
df1 = pd.DataFrame({'len_PIs':a, 'chrom':chrom})
print (df1)
chrom len_PIs
0 chr1 57, 32, 44, 29, 38, 40, 19, 34, 24, 38, 42, 46...
1 chr2 19, 32, 36, 21, 44, 33, 53, 36, 21, 18, 43, 30...
2 chr3 27, 58, 60, 39, 54, 53, 32, 43, 33, 36, 60, 39...
If you want a column of lists instead, strip the extra nesting level with a list comprehension or str[0]:
df1 = pd.DataFrame({'len_PIs':[x[0] for x in len_of_PIs], 'chrom':chrom})
#alternative solution
#df1 = pd.DataFrame({'len_PIs':len_of_PIs.str[0], 'chrom':chrom})
print (df1)
chrom len_PIs
0 chr1 [18, 42, 34, 31, 57, 49, 56, 28, 56, 40, 19, 5...
1 chr2 [48, 29, 23, 21, 54, 28, 23, 27, 44, 51, 18, 3...
2 chr3 [47, 53, 57, 26, 49, 39, 37, 41, 29, 36, 36, 5...
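To make both variants easier to check by eye, here is a small sketch with hand-written values in place of the random data. The names nested, inner, joined and df_demo are hypothetical and only used for this illustration.

```python
import numpy as np
import pandas as pd

# Tiny hand-written stand-in for the random data (values are made up).
nested = pd.Series([[np.array([49, 32, 30])],
                    [np.array([27, 20, 40, 41])]])

# Variant 1: keep the numbers, just drop the extra list level.
inner = nested.str[0]

# Variant 2: build one comma-separated string per row.
joined = [', '.join(x[0].astype(str)) for x in nested]

df_demo = pd.DataFrame({'chrom': ['chr1', 'chr2'], 'len_PIs': joined})
print(df_demo)
```

Variant 1 leaves each cell as a NumPy array, while variant 2 produces plain strings that print without brackets.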
Answer 2 (score: 1)
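Note that 49, 32, 30 on its own is not a proper type in Python. If it is a list or tuple, it should have brackets or parentheses, like [49, 32, 30]; if it is a string, it should have quotes, like "49, 32, 30". The latter, however, can be printed without quotes and gives you exactly what you asked for, but it will be harder to work with later. The following modification of jpp's code gives a result that looks exactly like what you want; given that you will be working with this DataFrame, though, you should stick with his answer.
import pandas as pd, numpy as np
my_cols = ['chrom', 'len_of_PIs']
chrom = pd.Series(['chr1', 'chr2', 'chr3'])
len_of_PIs = pd.Series([", ".join(np.random.randint(15, 59, 86).astype(str)),
                        ", ".join(np.random.randint(18, 55, 92).astype(str)),
                        ", ".join(np.random.randint(25, 61, 98).astype(str))])
df = pd.DataFrame({'chrom': chrom,
                   'len_of_PIs': len_of_PIs},
                  columns=my_cols)
print(df) returns:
chrom len_of_PIs
0 chr1 17, 37, 38, 25, 51, 39, 26, 24, 38, 44, 51, 21...
1 chr2 23, 33, 20, 48, 22, 45, 51, 45, 20, 39, 29, 25...
2 chr3 49, 42, 35, 46, 25, 52, 57, 39, 26, 29, 58, 26...
The difficulty of working with this result is as follows. Take the first row of the len_of_PIs column as an example. It has to be processed before it can be used as a collection of numbers:
[float(e) for e in df.len_of_PIs[0].split(", ")]
This is a pain. So, yeah, there you go.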
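For what it's worth, the extra parsing step applied to a whole column might look like the sketch below. The two-row frame df_str and its values are made up for illustration, and I parse to int rather than float since the data are integers.

```python
import pandas as pd

# Hypothetical two-row frame holding the numbers as comma-separated strings.
df_str = pd.DataFrame({'chrom': ['chr1', 'chr2'],
                       'len_of_PIs': ['49, 32, 30', '27, 20, 40']})

# Every cell has to be split and converted before any numeric work.
parsed = df_str['len_of_PIs'].apply(lambda s: [int(e) for e in s.split(', ')])

print(parsed.apply(max).tolist())  # [49, 40]
```

This conversion has to be repeated (or its result stored) every time the numbers are needed, which is the overhead the numeric version avoids.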