Question

I want to build a pandas DataFrame with the following columns.

my_cols = ['chrom', 'len_of_PIs']

and the following values for those columns:

chrom = pd.Series(['chr1', 'chr2', 'chr3'])
len_of_PIs = pd.Series([[np.random.randint(15, 59, 86)],
                    [np.random.randint(18, 55, 92)],
                    [np.random.randint(25, 61, 98)]])

I expect output as simple as this:

chrom    len_PIs
chr1     49, 32, 30, 27, 52, 52,.....
chr2     27, 20, 40, 41, 44, 50,.....
chr3     35, 45, 56, 42, 58, 50,.....

where len_PIs can be a list or a str, so that I can easily do downstream analysis. However, when I try the following, I do not get the data as expected:

new_df = pd.DataFrame()
new_df['chrom'] = chrom

# this code is giving me an output like
new_df['len_PIs'] = len_of_PIs.astype(str)

      chrom                                            len_PIs
0  chr1  [array([49, 32, 30, 27, 52, 52, 33, 51, 36, 47, 34, ...
1  chr2  [array([27, 20, 40, 41, 44, 50, 40, 34, 36, 33, 23, ...
2  chr3  [array([35, 45, 56, 42, 58, 50, 42, 27, 53, 57, 40, ...

# and each of the lines below gives me an output like
new_df['len_PIs'] = len_of_PIs.as_matrix()
new_df.insert(loc=1, value=len_of_PIs.astype(list) , column='len_PIs')
new_df['len_PIs'] = pd.DataFrame(len_of_PIs, columns=['len_PIs'], index=len_of_PIs.index)

      chrom                                            len_PIs
0  chr1  [[49, 32, 30, 27, 52, 52, 33, 51, 36, 47, 34, ...
1  chr2  [[27, 20, 40, 41, 44, 50, 40, 34, 36, 33, 23, ...
2  chr3  [[35, 45, 56, 42, 58, 50, 42, 27, 53, 57, 40, ...

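How can I fix this approach? If there is an alternative, more comprehensive method that starts from the column and data preparation, that would be great too.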

Answer 1

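I don't believe you need the inner list in your len_of_PIs series. You may also find it convenient to instantiate pd.DataFrame from a dictionary. The code below produces your desired output.

Converting numeric data to strings is generally not good practice unless you absolutely have to, so I have kept the array data numeric.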

import pandas as pd, numpy as np

my_cols = ['chrom', 'len_of_PIs']

chrom = pd.Series(['chr1', 'chr2', 'chr3'])
len_of_PIs = pd.Series([np.random.randint(15, 59, 86),
                        np.random.randint(18, 55, 92),
                        np.random.randint(25, 61, 98)])

df = pd.DataFrame({'chrom': chrom,
                   'len_of_PIs': len_of_PIs},
                  columns=my_cols)

#   chrom                                         len_of_PIs
# 0  chr1  [17, 52, 48, 22, 27, 49, 26, 18, 46, 16, 22, 1...
# 1  chr2  [39, 52, 53, 29, 38, 51, 30, 44, 47, 49, 28, 4...
# 2  chr3  [46, 37, 46, 29, 49, 39, 56, 48, 29, 46, 28, 2...
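
As a rough illustration of why keeping the arrays numeric helps downstream, the sketch below computes per-row summaries and a long-format table from the same kind of DataFrame. The names n_values, mean_len, long_df and max_per_chrom are made up for this example and are not part of the question.

```python
import numpy as np
import pandas as pd

chrom = pd.Series(['chr1', 'chr2', 'chr3'])
len_of_PIs = pd.Series([np.random.randint(15, 59, 86),
                        np.random.randint(18, 55, 92),
                        np.random.randint(25, 61, 98)])
df = pd.DataFrame({'chrom': chrom, 'len_of_PIs': len_of_PIs})

# Per-row summaries: every cell is a NumPy array, so array methods work directly.
# (n_values and mean_len are illustrative column names.)
df['n_values'] = [len(arr) for arr in df['len_of_PIs']]
df['mean_len'] = [arr.mean() for arr in df['len_of_PIs']]

# Long format: one row per (chrom, value) pair, which is convenient for groupby.
long_df = (df[['chrom', 'len_of_PIs']]
           .explode('len_of_PIs')
           .astype({'len_of_PIs': int}))
max_per_chrom = long_df.groupby('chrom')['len_of_PIs'].max()

print(df[['chrom', 'n_values', 'mean_len']])
print(max_per_chrom)
```

None of this requires any parsing step, which is the main reason to avoid the string representation unless it is needed for display or export.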

Answer 2

If you want strings, use a list comprehension to extract the inner array, convert it to strings, and finally join:

import pandas as pd, numpy as np

chrom = pd.Series(['chr1', 'chr2', 'chr3'])

len_of_PIs = pd.Series([[np.random.randint(15, 59, 86)],
                    [np.random.randint(18, 55, 92)],
                    [np.random.randint(25, 61, 98)]])

a = [', '.join(x[0].astype(str)) for x in len_of_PIs]
df1 = pd.DataFrame({'len_PIs':a, 'chrom':chrom})
print (df1)
  chrom                                            len_PIs
0  chr1  57, 32, 44, 29, 38, 40, 19, 34, 24, 38, 42, 46...
1  chr2  19, 32, 36, 21, 44, 33, 53, 36, 21, 18, 43, 30...
2  chr3  27, 58, 60, 39, 54, 53, 32, 43, 33, 36, 60, 39...

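If you want a column of lists instead, extract the inner array from each nested list with a list comprehension or with str[0]: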

df1 = pd.DataFrame({'len_PIs':[x[0] for x in len_of_PIs], 'chrom':chrom})
#alternative solution
#df1 = pd.DataFrame({'len_PIs':len_of_PIs.str[0], 'chrom':chrom})
print (df1)
  chrom                                            len_PIs
0  chr1  [18, 42, 34, 31, 57, 49, 56, 28, 56, 40, 19, 5...
1  chr2  [48, 29, 23, 21, 54, 28, 23, 27, 44, 51, 18, 3...
2  chr3  [47, 53, 57, 26, 49, 39, 37, 41, 29, 36, 36, 5...
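
To make the role of the extra brackets concrete, here is a small sketch that inspects one cell before and after unwrapping. It assumes the same nested construction as in the question; nested and flat are names chosen only for this illustration.

```python
import numpy as np
import pandas as pd

# Same construction as in the question: each array is wrapped in a list.
nested = pd.Series([[np.random.randint(15, 59, 86)],
                    [np.random.randint(18, 55, 92)],
                    [np.random.randint(25, 61, 98)]])

# Each cell is a Python list holding exactly one array,
# which is where the "[array([...])]" display comes from.
first_cell = nested.iloc[0]
print(type(first_cell).__name__, len(first_cell))

# Unwrapping with x[0] leaves the bare array in each cell.
flat = pd.Series([cell[0] for cell in nested])
print(type(flat.iloc[0]).__name__, flat.iloc[0].shape)
```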

Answer 3

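Note that "49, 32, 30" is not a proper type in Python. If it is a list or tuple, it should have brackets or parentheses, as in [49, 32, 30]; if it is a string, it should have quotation marks, as in "49, 32, 30". However, the latter can be printed without the quotation marks and give you exactly what you asked for, though it will be hard to work with later. The following modification of jpp's code (Answer 1 above) gives a result that looks exactly like what you want; but given that you are going to work with this DataFrame, you should stick with his answer.

import pandas as pd, numpy as np

my_cols = ['chrom', 'len_of_PIs']

chrom = pd.Series(['chr1', 'chr2', 'chr3'])
len_of_PIs = pd.Series([", ".join(np.random.randint(15, 59, 86).astype(str)),
                        ", ".join(np.random.randint(18, 55, 92).astype(str)),
                        ", ".join(np.random.randint(25, 61, 98).astype(str))])

df = pd.DataFrame({'chrom': chrom,
                   'len_of_PIs': len_of_PIs},
                  columns=my_cols)
print(df)

returns:

  chrom                                         len_of_PIs
0  chr1  17, 37, 38, 25, 51, 39, 26, 24, 38, 44, 51, 21...
1  chr2  23, 33, 20, 48, 22, 45, 51, 45, 20, 39, 29, 25...
2  chr3  49, 42, 35, 46, 25, 52, 57, 39, 26, 29, 58, 26...

The difficulty of working with that result is the following. Take the first row of the len_of_PIs column as an example. It has to be processed before it can be used as a collection of numbers:

[float(e) for e in df.len_of_PIs[0].split(", ")]

It's a pain. So, yeah, there you go.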
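
The parsing cost can be seen in a short round-trip sketch: the arrays are joined into strings for display, and every row then has to be split and converted again before any numeric work. The names arrays and parsed are assumptions introduced for this example only.

```python
import numpy as np
import pandas as pd

# Keep the original arrays so the round trip can be checked.
arrays = [np.random.randint(15, 59, 86),
          np.random.randint(18, 55, 92),
          np.random.randint(25, 61, 98)]

df = pd.DataFrame({'chrom': ['chr1', 'chr2', 'chr3'],
                   'len_of_PIs': [', '.join(a.astype(str)) for a in arrays]})

# Every row must be split and converted before it is numeric again.
parsed = [np.array([int(e) for e in s.split(', ')]) for s in df['len_of_PIs']]

# The parsed values match the arrays we started from.
print(all(np.array_equal(p, a) for p, a in zip(parsed, arrays)))
```

This works, but the conversion has to be repeated wherever the numbers are needed, which is why storing the arrays themselves is usually the easier choice.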

How to merge multiple pandas Series into one DataFrame when the Series contain lists of values

3 answers: