Pandas: reading from multiple files with different variable orderings

Date: 2017-10-30 21:38:04

Tags: python pandas

I have a number of files that I want to read into a single pandas DataFrame. An example file might look like this:

variable_1_name
variable_2_name
...
variable_n_name
0.0  0.5  0.3  ...  0.8
...
1.0  4.5  6.5  ...  1.0

So each file has a list of variable names at the top (one per line), and the data then follows as a whitespace-separated table with n values per row.

There are a couple of complications:

1) Each file contains a different number of variables; not every variable is present in every file.

2) The variables may appear in a different order from file to file.

How can I read all of this data into a single pandas DataFrame while matching up the correct data across files?

3 Answers:

Answer 0 (score: 2)

Expanding on Pal's answer: the easiest thing for pandas to read is a csv file, so why not convert each file to csv (or better yet, to a csv file-like object living in memory) and let pandas do the dirty work?

try:
    from cStringIO import StringIO  # python2
except ImportError:
    from io import StringIO  # python3
import pandas as pd

DELIMITER = ','

def pd_read_chunk(file):
    """
    Reads file contents, converts it to a csv file in memory
    and imports a dataframe from it.
    """
    with open(file) as f:
        content = [line.strip() for line in f.readlines()]
        cols = [line for line in content if ' ' not in line]
        vals = [line for line in content if ' ' in line]
        csv_header = DELIMITER.join(cols)
        csv_body = '\n'.join(DELIMITER.join(line.split()) for line in vals)
        stream = StringIO(csv_header + '\n' + csv_body)
        return pd.read_csv(stream, sep=DELIMITER)


if __name__ == '__main__':
    files = ('file1', 'file2', )
    # read dataframe from each file and concat all resulting dataframes
    df_chunks = [pd_read_chunk(file) for file in files]
    df = pd.concat(df_chunks)
    print(df)

If you try the script on the example files from Thom Ives' answer, it returns

     A    B    C    D    E
0  1.0  2.0  3.0  NaN  NaN
1  1.1  2.1  3.1  NaN  NaN
0  NaN  2.2  NaN  4.2  5.2
1  NaN  2.3  NaN  4.3  5.3

EDIT: Actually, we don't need the comma delimiter at all; we can reuse the space as the separator, which both shortens the code and speeds up the conversion. Here is an updated version of the above that is shorter and runs faster:

try:
    from cStringIO import StringIO  # python2
except ImportError:
    from io import StringIO  # python3
import pandas as pd


def pd_read_chunk(file):
    """
    Reads file contents, converts it to a csv file in memory
    and imports a dataframe from it.
    """
    with open(file) as f:
        content = [line.strip() for line in f.readlines()]
        cols = [line for line in content if ' ' not in line]
        vals = [line for line in content if ' ' in line]
        csv_header = ' '.join(cols)
        csv_lines = [csv_header] + vals
        stream = StringIO('\n'.join(csv_lines))
        return pd.read_csv(stream, sep=' ')


if __name__ == '__main__':
    files = ('file1', 'file2', )
    # read dataframe from each file and concat all resulting dataframes
    df_chunks = [pd_read_chunk(file) for file in files]
    df = pd.concat(df_chunks)
    print(df)
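
Note that, as the output above shows, pd.concat keeps each file's own row index, which is why the indices 0 and 1 repeat. If a single continuous index is preferred, pandas' ignore_index option renumbers the rows; a minimal variant of the final lines, reusing the same pd_read_chunk helper and file names:

df_chunks = [pd_read_chunk(file) for file in files]
df = pd.concat(df_chunks, ignore_index=True)  # renumber rows 0..n-1 across all files
print(df)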

Answer 1 (score: 1)

A simple solution is to edit the text files into the following shape and then use read_csv:

variable_1_name variable_2_name ... variable_n_name
0.0  0.5  0.3  ...  0.8
...
1.0  4.5  6.5  ...  1.0

df = pd.read_csv('filename', sep=r'\s+')
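
Pandas will then match columns by name when the edited files are concatenated, filling missing variables with NaN. A small sketch under that assumption (file1.txt and file2.txt are hypothetical names for two files that have already been edited this way):

import pandas as pd

# hypothetical names for files whose variable names have already been
# collapsed onto a single whitespace-separated header line
files = ['file1.txt', 'file2.txt']

frames = [pd.read_csv(f, sep=r'\s+') for f in files]
df = pd.concat(frames, ignore_index=True)  # columns aligned by name; missing ones become NaN
print(df)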

Answer 2 (score: 1)

Assuming that what Pal suggests in his good advice is not easy to do, say you have these two simplified data files:

data1.txt

A
B
C
1.0 2.0 3.0
1.1 2.1 3.1

and data2.txt

B
D
E
2.2 4.2 5.2
2.3 4.3 5.3

Use something like the following two functions to 1) get the files you need and 2) load them into a pandas DataFrame:

import pandas as pd
import os

def Get_Filtered_File_List(topDirectory, checkString = None):
    fileList = []

    fileNamesList = os.listdir(topDirectory)
    for fileName in fileNamesList:
        if checkString is None or checkString in fileName:
            fileList.append(fileName)

    return fileList

def Load_And_Condition_Files_Into_DF(fileList):
    header = []
    arrayOfValues = []
    arrayOfDicts = []

    for file in fileList:
        thisHeader = []
        with open(file,'r') as f:
            arrayOfLines = f.readlines()
            for line in arrayOfLines:

                lineArray = line.split()
                if len(lineArray) == 1:
                    thisHeader.append(lineArray[0])
                else:
                    arrayOfDicts.append({})
                    for i in range(len(lineArray)):
                        arrayOfDicts[-1][thisHeader[i]] = lineArray[i]

            header += thisHeader

    # print arrayOfDicts
    header = sorted(list(set(header)))
    for rowDict in arrayOfDicts:
        arrayOfValues.append([])
        for name in header:
            try:
                val = rowDict[name]
                arrayOfValues[-1].append(val)
            except KeyError:
                # variable missing from this file's rows
                arrayOfValues[-1].append(None)
    return pd.DataFrame(arrayOfValues, columns=header)


fileList = Get_Filtered_File_List('./','data')
print(Load_And_Condition_Files_Into_DF(fileList))

Which outputs:

      A    B     C     D     E
0   1.0  2.0   3.0  None  None
1   1.1  2.1   3.1  None  None
2  None  2.2  None   4.2   5.2
3  None  2.3  None   4.3   5.3
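
Since the values are collected as strings and the missing entries as None, the resulting columns have object dtype. If numeric columns are wanted, a small follow-up sketch using pandas' to_numeric (anything unparseable, including None, becomes NaN):

df = Load_And_Condition_Files_Into_DF(fileList)
df = df.apply(pd.to_numeric, errors='coerce')  # strings -> floats, None -> NaN
print(df.dtypes)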