Question

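I have a file in the following format:

I would like to create a data frame from this file (skipping the first 5 lines) that looks like this: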

x1   x2    y1  y2
0.00 0.00  0   1
0.00 0.01  0   1

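So the lines are converted into columns (where every third line is additionally split into two columns, y1 and y2).

In R I did this as follows: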

df = as.data.frame(scan(".../test.txt", what=list(x1=0, x2=0, y1=0, y2=0), skip=5))

I am looking for a Python alternative (pandas?) to this scan(file, what=list(...)) function. Does one exist, or do I have to write a more extended script?

Answer 1

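You could skip the first 5 lines, then take groups of 4 lines to build a Python list, and then put that into pandas as a start... I wouldn't be surprised if pandas offered something better: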

from itertools import islice, zip_longest  # izip_longest on Python 2

with open('input') as fin:
    # Skip header(s) at start
    after5 = islice(fin, 5, None)
    # Take remaining data and group it into groups of 4 lines each... The
    # first 2 are float data, the 3rd is two integers together, and the 4th
    # is the blank line between groups... We use zip_longest to ensure we
    # always have 4 items (padded with None if need be)...
    for lines in zip_longest(*[iter(after5)] * 4):
        # Convert first two lines to float, and take 3rd line, split it and
        # convert to integers
        print(list(map(float, lines[:2])) + list(map(int, lines[2].split())))

#[0.0, 0.0, 0, 1]
#[0.0, 0.01, 0, 1]
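
The "put it into pandas" step is not shown above. A minimal sketch of it follows, assuming the layout the comments describe (5 header lines, then repeating groups of x1, x2, "y1 y2" and a blank line). The SAMPLE text and the read_groups helper are hypothetical names used only for illustration; the real file would be opened with open(...) instead of io.StringIO.

```python
import io
from itertools import islice, zip_longest

import pandas as pd

# Hypothetical sample standing in for the real file (assumed layout):
# 5 header lines, then groups of x1, x2, "y1 y2", blank line.
SAMPLE = "h1\nh2\nh3\nh4\nh5\n0.00\n0.00\n0 1\n\n0.00\n0.01\n0 1\n"


def read_groups(fin, skip=5, group_size=4):
    """Build a DataFrame with columns x1, x2, y1, y2 from an open text file."""
    rows = []
    data_lines = islice(fin, skip, None)  # drop the header lines
    for lines in zip_longest(*[iter(data_lines)] * group_size):
        if lines[2] is None:  # incomplete trailing group: stop reading
            break
        y1, y2 = lines[2].split()
        rows.append((float(lines[0]), float(lines[1]), int(y1), int(y2)))
    return pd.DataFrame(rows, columns=["x1", "x2", "y1", "y2"])


df = read_groups(io.StringIO(SAMPLE))
print(df)
```

Collecting typed tuples first and constructing the DataFrame once keeps the numeric conversion in one place, so x1 and x2 come out as floats and y1 and y2 as integers.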

Answer 2

As far as I can tell, there is no option listed at http://pandas.pydata.org/pandas-docs/stable/io.html that organizes the DataFrame the way you want;

but you can implement it easily:

import re
import pandas as pd

with open('YourDataFile.txt') as f:
    lines = f.read()                       # read the whole file
elems = re.split('\n| ', lines)[5:]        # split on newlines and spaces, exclude the first 5 elements
grouped = list(zip(*[iter(elems)] * 4))    # group them 4 by 4 (materialize the zip iterator on Python 3)
df = pd.DataFrame(grouped)                 # construct DataFrame
df.columns = ['x1', 'x2', 'y1', 'y2']      # column names

It is not concise or elegant, but it is clear what it does...
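
The zip(*[iter(...)] * 4) grouping idiom can be puzzling at first sight, so here is a small standalone sketch of how it behaves. The elems list is made-up sample data, not the contents of the real file.

```python
# Made-up tokens for illustration; "extra" is a leftover that does not
# fill a complete group of four.
elems = ["0.00", "0.00", "0", "1", "0.00", "0.01", "0", "1", "extra"]

it = iter(elems)
# [it] * 4 is a list holding the SAME iterator four times, so zip pulls
# four consecutive items from it to build each tuple.
grouped = list(zip(*[it] * 4))
print(grouped)
# zip stops as soon as the iterator runs out, so the incomplete
# trailing group ("extra") is silently dropped.
```

If an incomplete final group should be kept instead, itertools.zip_longest pads it with None, which is what Answer 1 relies on.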

Answer 3

OK, this is how I did it (actually a combination of Jon's and Giupo's answers, thanks guys!):

import pandas as pd

with open('myfile.txt') as f:
    data = f.read().split()[5:]          # whitespace-separated tokens, minus the first 5
grouped = list(zip(*[iter(data)] * 4))   # groups of 4 tokens; list() for Python 3
df = pd.DataFrame(grouped)
df.columns = ['x1', 'x2', 'y1', 'y2']
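
One thing to watch with this approach: split() returns strings, so the resulting columns hold text rather than numbers. A possible follow-up, sketched here with made-up tokens in place of the file contents, is to cast each column explicitly with DataFrame.astype.

```python
import pandas as pd

# Made-up tokens, shaped like what read().split()[5:] would return
# (every element is a string).
data = ["0.00", "0.00", "0", "1", "0.00", "0.01", "0", "1"]

grouped = list(zip(*[iter(data)] * 4))
df = pd.DataFrame(grouped, columns=["x1", "x2", "y1", "y2"])

# Cast column by column: x1 and x2 become floats, y1 and y2 integers.
df = df.astype({"x1": float, "x2": float, "y1": int, "y2": int})
print(df.dtypes)
```

This mirrors what the R call does with what=list(x1=0, x2=0, y1=0, y2=0), where the list elements declare each field as numeric.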

Python alternative to scan('file', what=list(...)) in R
