Question

I have the following text file.

A1 1234 56
B2 1234 56
C3 2345167

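I also have a table of start positions and lengths. Each row gives the position (1-based) at which a field starts in the data above, and the length of that field.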

start length
1      1
2      1
3      1
4      2
6      2
8      2
10     1

I want to read the file according to these start positions and lengths, like this:

A 1 nan 12 34  5 6
B 2 nan 12 34  5 6
C 3 nan 23 45 16 7

First, I tried

pd.read_csv("file.txt", sep=" ")

but I couldn't figure out how to split the fields.

How can I read the file and split it into a DataFrame?

Answer 1

As mentioned in the comments, this is not CSV format, so I had to come up with a workaround.

def get_row_format(length_file):

    with open(length_file, 'r') as fd_len:

        #Read in the file, not a CSV!
        #this double list-comprehension produces a list of lists
        rows = [[x.strip() for x in y.split()] for y in fd_len.readlines()]

        #determine the row-format from the rows lists
        row_form = {int(x[0]): int(x[1]) for x in rows[1:]} #idx 1: to skip header

    return row_form

def read_with_row_format(data_file, rform):

    with open(data_file, 'r') as fd_data:

        for row in fd_data.readlines():

            #Get the formatted output: start k is 1-based, so each field is row[k-1 : k-1+v]
            #Python 3: use .items() and print(); dicts keep insertion order from 3.7 on
            formatted_output = [row[k-1:k+v-1] for k, v in rform.items()]
            print(formatted_output)

The first function reads the row format; the second applies that row format to every line of the data file.

Usage:

rform = get_row_format('lengths.csv')
read_with_row_format('data.csv', rform)

Output:

['A', '1', ' ', '12', '34', ' 5', '6']
['B', '2', ' ', '12', '34', ' 5', '6']
['C', '3', ' ', '23', '45', '16', '7']
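
For anyone who wants to try the same slicing without creating files, here is a minimal sketch that takes file-like objects and returns the rows instead of printing them. The function names parse_row_format and slice_rows are made up for this example, and the in-memory strings stand in for lengths.csv and data.csv. It keeps the format as a list of (start, length) pairs so the field order does not depend on dict ordering.

```python
import io

def parse_row_format(length_file):
    """Return a list of (start, length) pairs from a whitespace-separated table."""
    lines = [ln.split() for ln in length_file if ln.strip()]
    # lines[0] is the "start length" header; start values are 1-based
    return [(int(start), int(length)) for start, length in lines[1:]]

def slice_rows(data_file, row_format):
    """Return one list of raw (unstripped) field strings per non-empty line."""
    rows = []
    for line in data_file:
        line = line.rstrip("\n")
        if not line:
            continue
        # 1-based start -> 0-based slice [start-1 : start-1+length]
        rows.append([line[start - 1:start - 1 + length]
                     for start, length in row_format])
    return rows

# In-memory stand-ins for lengths.csv and data.csv (assumed contents)
lengths = io.StringIO("start length\n1 1\n2 1\n3 1\n4 2\n6 2\n8 2\n10 1\n")
data = io.StringIO("A1 1234 56\nB2 1234 56\nC3 2345167\n")

rows = slice_rows(data, parse_row_format(lengths))
for row in rows:
    print(row)
```

The fields are left unstripped on purpose, so the blank third column and the padded ' 5' stay visible.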

Answer 2

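Since you have the start position and length of each field, use them. Here is code that does so. Each line is processed in turn. Each field is the slice that runs from the start column to that same position plus the field's length.

I'll leave the type conversion to you.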

data = [
    "A1 1234 56",
    "B2 1234 56",
    "C3 2345167"
]

table = [
    [1, 1],
    [2, 1],
    [3, 1],
    [4, 2],
    [6, 2],
    [8, 2],
    [10, 1]
]

for line in data:
    # start is 1-based, so the 0-based slice is [start-1 : start-1+length]
    fields = [line[start - 1 : start - 1 + length] for start, length in table]
    print(fields)
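
As for the conversion, one possible approach is sketched below. It assumes that blank fields should become None and that all-digit fields should become int, which the question does not spell out. The helper name convert is made up for this example.

```python
def convert(field):
    """Strip padding; blank -> None, all digits -> int, anything else stays a string."""
    text = field.strip()
    if not text:
        return None
    return int(text) if text.isdigit() else text

# Same start/length table and data as above, written as (start, length) pairs
table = [(1, 1), (2, 1), (3, 1), (4, 2), (6, 2), (8, 2), (10, 1)]
data = ["A1 1234 56", "B2 1234 56", "C3 2345167"]

converted = [
    [convert(line[start - 1:start - 1 + length]) for start, length in table]
    for line in data
]

for row in converted:
    print(row)
```

A list of lists in this shape can be passed straight to pd.DataFrame if a DataFrame is the goal.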

Answer 3

This is a fixed-width file, so you can use pandas.read_fwf:

import pandas as pd
from io import StringIO

s = StringIO("""A1 1234 56
B2 1234 56
C3 2345167""")

pd.read_fwf(s, widths = widths.length, header=None)

#   0   1   2   3   4   5   6
#0  A   1   NaN 12  34  5   6
#1  B   2   NaN 12  34  5   6
#2  C   3   NaN 23  45  16  7

Here widths is the following DataFrame, which has to be defined before the read_fwf call:

widths = pd.read_csv(StringIO("""start length
1      1
2      1
3      1
4      2
6      2
8      2
10     1"""), sep=r"\s+")
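
Passing only the lengths works here because the fields are contiguous and start at column 1. If the table had gaps, the start column would matter too. read_fwf also accepts colspecs, a list of half-open (from, to) intervals, so a minimal sketch that uses both columns could look like this. It assumes start is 1-based, as in the question.

```python
import io
import pandas as pd

widths = pd.read_csv(io.StringIO("""start length
1      1
2      1
3      1
4      2
6      2
8      2
10     1"""), sep=r"\s+")

# Convert 1-based (start, length) rows to 0-based half-open (from, to) intervals
colspecs = [
    (start - 1, start - 1 + length)
    for start, length in zip(widths["start"].tolist(), widths["length"].tolist())
]

s = io.StringIO("A1 1234 56\nB2 1234 56\nC3 2345167")
df = pd.read_fwf(s, colspecs=colspecs, header=None)
print(df)
```

The third column comes back as NaN because it is blank in every row.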
