I have many files to read into a single pandas DataFrame. An example file might look like this:
variable_1_name
variable_2_name
...
variable_n_name
0.0 0.5 0.3 ... 0.8
...
1.0 4.5 6.5 ... 1.0
So the file has a list of variable names at the top (one per line), and then the data appears as a space-separated table with n values per row.
There are a couple of issues:
1) There is a different number of variables in each file; not every variable is present in every file.
2) The variables may be in a different order from file to file.
How do I read all of this data into a pandas DataFrame while matching up the correct data across files?
Answer 0 (score: 2)
Extending Pal's answer: the best way is to read the data from a CSV file. So why not convert each file to a CSV file (or, even better, a CSV file-like object living in memory) and let pandas do the dirty work?
try:
    import io  # Python 3
except ImportError:
    import cStringIO as io  # Python 2

import pandas as pd

DELIMITER = ','

def pd_read_chunk(file):
    """
    Reads the file contents, converts them to a CSV file in memory
    and imports a dataframe from it.
    """
    with open(file) as f:
        content = [line.strip() for line in f.readlines()]
    cols = [line for line in content if ' ' not in line]
    vals = [line for line in content if ' ' in line]
    csv_header = DELIMITER.join(cols)
    csv_body = '\n'.join(DELIMITER.join(line.split()) for line in vals)
    stream = io.StringIO(csv_header + '\n' + csv_body)
    return pd.read_csv(stream, sep=DELIMITER)

if __name__ == '__main__':
    files = ('file1', 'file2', )
    # read a dataframe from each file and concat all resulting dataframes
    df_chunks = [pd_read_chunk(file) for file in files]
    df = pd.concat(df_chunks)
    print(df)
If you try it on the sample files from Thom Ives' answer, the script returns:
A B C D E
0 1.0 2.0 3.0 NaN NaN
1 1.1 2.1 3.1 NaN NaN
0 NaN 2.2 NaN 4.2 5.2
1 NaN 2.3 NaN 4.3 5.3
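Note that the row index repeats (0, 1, 0, 1) because each chunk keeps its own index; if a continuous index is preferred, `pd.concat` accepts `ignore_index=True`. A minimal sketch of the alignment behaviour using in-memory frames (the frames below just mimic two input files, they are not part of the original answer):

```python
import pandas as pd

# Two frames with partially overlapping columns, mimicking two input files
df1 = pd.DataFrame({'A': [1.0, 1.1], 'B': [2.0, 2.1], 'C': [3.0, 3.1]})
df2 = pd.DataFrame({'B': [2.2, 2.3], 'D': [4.2, 4.3], 'E': [5.2, 5.3]})

# concat aligns on column names; columns missing from a chunk become NaN
df = pd.concat([df1, df2], ignore_index=True)
print(df)
```

With `ignore_index=True` the result is indexed 0 through 3 instead of restarting at 0 for each chunk.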
Edit: actually, we don't need a comma delimiter at all; we can reuse the space as the separator, which lets us both simplify and speed up the conversion. Here is an updated version of the script above, with less code and faster execution:
try:
    import io  # Python 3
except ImportError:
    import cStringIO as io  # Python 2

import pandas as pd

def pd_read_chunk(file):
    """
    Reads the file contents, converts them to a CSV file in memory
    and imports a dataframe from it.
    """
    with open(file) as f:
        content = [line.strip() for line in f.readlines()]
    cols = [line for line in content if ' ' not in line]
    vals = [line for line in content if ' ' in line]
    csv_header = ' '.join(cols)
    csv_lines = [csv_header] + vals
    stream = io.StringIO('\n'.join(csv_lines))
    return pd.read_csv(stream, sep=' ')

if __name__ == '__main__':
    files = ('file1', 'file2', )
    # read a dataframe from each file and concat all resulting dataframes
    df_chunks = [pd_read_chunk(file) for file in files]
    df = pd.concat(df_chunks)
    print(df)
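One caveat worth noting: `sep=' '` expects exactly one space between values, so it breaks if the source files pad columns with extra spaces. The regex separator `sep=r'\s+'` is more forgiving. A small sketch of the difference (the sample data here is made up for illustration):

```python
import io
import pandas as pd

# Values padded with a varying number of spaces
raw = 'A B C\n1.0   2.0 3.0\n1.1 2.1   3.1'

# sep=r'\s+' treats any run of whitespace as one delimiter
df = pd.read_csv(io.StringIO(raw), sep=r'\s+')
print(df)
```

With `sep=' '` the padded rows would produce spurious empty columns; with `sep=r'\s+'` they parse cleanly.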
Answer 1 (score: 1)
The simple solution is to edit the text files into the following form (note that the data rows must be comma-separated as well, not just the header) and use read_csv:
variable_1_name, variable_2_name, ..., variable_n_name
0.0, 0.5, 0.3, ..., 0.8
...
1.0, 4.5, 6.5, ..., 1.0
df = pd.read_csv('filename')
Answer 2 (score: 1)
Assuming that what Pal suggests in his good advice is not easy to do, say you have two simplified data files:
data1.txt:
A
B
C
1.0 2.0 3.0
1.1 2.1 3.1
and data2.txt:
B
D
E
2.2 4.2 5.2
2.3 4.3 5.3
Use something like the following two functions to 1) get the files you want and 2) load them into pandas DataFrames:
import pandas as pd
import os

def Get_Filtered_File_List(topDirectory, checkString=None):
    fileList = []
    fileNamesList = os.listdir(topDirectory)
    for fileName in fileNamesList:
        if checkString == None or checkString in fileName:
            fileList.append(fileName)
    return fileList

def Load_And_Condition_Files_Into_DF(fileList):
    header = []
    arrayOfValues = []
    arrayOfDicts = []
    for file in fileList:
        thisHeader = []
        with open(file, 'r') as f:
            arrayOfLines = f.readlines()
            for line in arrayOfLines:
                lineArray = line.split()
                if len(lineArray) == 1:
                    # single token: a variable name from the header section
                    thisHeader.append(lineArray[0])
                else:
                    # data row: store values keyed by this file's header
                    arrayOfDicts.append({})
                    for i in range(len(lineArray)):
                        arrayOfDicts[-1][thisHeader[i]] = lineArray[i]
        header += thisHeader
    # print(arrayOfDicts)
    header = sorted(list(set(header)))
    for dict in arrayOfDicts:
        arrayOfValues.append([])
        for name in header:
            try:
                val = dict[name]
                # print('\t', name, val)
                arrayOfValues[-1].append(val)
            except KeyError:
                # print('\t', name, None)
                arrayOfValues[-1].append(None)
    table = [header] + arrayOfValues
    # print(table)
    return pd.DataFrame(table, columns=table.pop(0))

fileList = Get_Filtered_File_List('./', 'data')
print(Load_And_Condition_Files_Into_DF(fileList))
Which outputs:
A B C D E
0 1.0 2.0 3.0 None None
1 1.1 2.1 3.1 None None
2 None 2.2 None 4.2 5.2
3 None 2.3 None 4.3 5.3
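One thing to keep in mind with this approach: the values come straight from `str.split()`, so every column holds strings (dtype `object`) rather than floats. If numeric columns are needed downstream, something like `df.apply(pd.to_numeric)` converts them, with `None` entries becoming `NaN`. A minimal sketch (the frame below just mirrors the shape of the output above, it is not produced by the answer's code):

```python
import pandas as pd

# String values with None gaps, as produced by the row-dict approach above
df = pd.DataFrame([['1.0', '2.0', None], ['1.1', None, '5.2']],
                  columns=['A', 'B', 'E'])

# to_numeric is applied column by column; None becomes NaN, strings become floats
numeric = df.apply(pd.to_numeric)
print(numeric.dtypes)
```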