Question

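I am working with 521 text files, each containing gene names and expression values. The gene names are the same across all files, so only the expression values differ. I tried to put everything into a single DataFrame but could not get it to work. What approach can I use?

I tried using pandas and DataFrames. How do I create a loop so that the expression values are appended only after the gene names have been matched?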

import os
import pandas as pd
os.chdir('F:\study\TCGA\COAD\pcfiles\info')
f1=open('file1.txt').read().split('\n')
f2=open('file2FPKM.txt').read().split('\n')
df=pd.DataFrame(f1,f2)
print(df)

The output should look like this:

Gene name     p1  ....................pn
gene1          x                      xn
gene2          x                      xn
gene3          x                      xn
.
.
.
.
.
gene19250      n                      xn

p is the patient's name.

x stands for a number.

I want all of this written to a single text file.

Answer 1

This can be done as follows:

import glob
import os

path = 'C:\\tmmp'  # Directory where all the .txt files are stored
# Open master.txt in append mode; it is created in the current directory if it does not exist
with open("master.txt", "a+") as mastertext:
    mastertext.write("header1    header2\n")  # Write the header first, assuming 4 spaces between headers
    # sorted() keeps the order of the input files deterministic
    for filename in sorted(glob.glob(os.path.join(path, '*.txt'))):
        with open(filename) as f:
            lines = f.readlines()  # Read the file into a list, one line per element
        # Write the 2nd line (lines[1], the data row); strip it first so exactly one newline is written
        mastertext.write(lines[1].rstrip("\n") + "\n")

For example, I have 4 text files (assuming 4 spaces between the elements of each line), as shown below:

1st.txt:

header1     header2
1stdata1    1stdata2

2nd.txt:

header1     header2
2nddata1    2nddata2

3rd.txt:

header1     header2
3rddata1    3rddata2

4th.txt:

header1     header2
4thdata1    4thdata2

Running the code above produces master.txt:

header1     header2
1stdata1    1stdata2
2nddata1    2nddata2
3rddata1    3rddata2
4thdata1    4thdata2

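This may not be the best solution. In general, files in .csv format are much easier to work with, because pandas has a read_csv function with a skiprows parameter that sets which lines to skip; for example, skiprows=1 skips the first line (the header). Also, because these are .txt files, the spacing between the entries on each line may cause some difficulty. In the example above I assumed the spacing is 4 spaces. Have a nice day.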
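For the gene-by-patient table the question asks for, pandas can also do the matching on gene names itself. The following is only a minimal sketch under assumptions that the question does not confirm: each file holds two whitespace-separated columns (gene name, then expression value) with no header row, each gene appears once per file, and the file name without its extension serves as the patient label. The function name build_expression_matrix is made up for illustration.

```python
import glob
import os

import pandas as pd


def build_expression_matrix(folder, pattern="*.txt"):
    """Combine per-patient files into one gene-by-patient table.

    Assumed layout (not stated in the question): two whitespace-separated
    columns per file, gene name then expression value, and no header row.
    """
    columns = []
    for filename in sorted(glob.glob(os.path.join(folder, pattern))):
        # Assumption: the file name without its extension identifies the patient.
        patient = os.path.splitext(os.path.basename(filename))[0]
        series = pd.read_csv(
            filename,
            sep=r"\s+",      # any run of spaces or tabs separates the columns
            header=None,     # use header=0 instead if the files have a header line
            names=["gene", patient],
            index_col="gene",
        )[patient]
        columns.append(series)
    # axis=1 places each patient in its own column and aligns rows on gene name
    return pd.concat(columns, axis=1)
```

Calling build_expression_matrix(folder).to_csv("expression_matrix.txt", sep="\t", index_label="Gene name") would then write the combined table to a single tab-separated text file; the output file name is illustrative, and it is safer to write it outside the input folder so that a later run does not read it back in as a patient file.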
