I have a file in the following format:
10000
2
2
2
2
0.00
0.00
0 1
0.00
0.01
0 1
...
I want to create a data frame from this file (skipping the first 5 lines) that looks like this:
x1 x2 y1 y2
0.00 0.00 0 1
0.00 0.01 0 1
So the lines are converted into columns (where every third line is also split into two columns, y1 and y2).
In R I did this as follows:
df = as.data.frame(scan(".../test.txt", what=list(x1=0, x2=0, y1=0, y2=0), skip=5))
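I am looking for a Python alternative (pandas?) to this scan(file, what=list(...)) function. Does one exist, or do I have to write a more extensive script?
Answer 0 (score: 3)
You could skip the first 5 lines, then take the rest in groups of 4 to build a Python list, and then put that into pandas as a start... I wouldn't be surprised if pandas offers something better: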
from itertools import islice, zip_longest

with open('input') as fin:
    # Skip header(s) at start
    after5 = islice(fin, 5, None)
    # Take remaining data and group it into groups of 4 lines each... The
    # first 2 are float data, the 3rd is two integers together, and the 4th
    # is the blank line between groups... We use zip_longest to ensure we
    # always have 4 items (padded with None if needs be)...
    for lines in zip_longest(*[iter(after5)] * 4):
        # Convert first two lines to float, and take 3rd line, split it and
        # convert to integers
        print(list(map(float, lines[:2])) + list(map(int, lines[2].split())))
        # [0.0, 0.0, 0, 1]
        # [0.0, 0.01, 0, 1]
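The loop above only prints each record. To collect the records for pandas instead, something along these lines might work. It is a rough sketch, not a tested recipe for your real file: the helper name `rows_from_lines` and the inline sample data are made up for illustration, and it assumes Python 3. Blank lines are dropped first, so the records are grouped three lines at a time whether or not a blank line separates them.

```python
import io
from itertools import islice


def rows_from_lines(lines, skip=5):
    """Yield [x1, x2, y1, y2] records from an iterable of text lines."""
    # Drop the header lines, then strip and ignore blank lines
    data = (ln.strip() for ln in islice(lines, skip, None))
    data = (ln for ln in data if ln)
    # Group the remaining lines three at a time: x1, x2, "y1 y2"
    # (the same iterator is passed three times, so zip consumes it in chunks)
    for x1, x2, ys in zip(*[data] * 3):
        y1, y2 = ys.split()
        yield [float(x1), float(x2), int(y1), int(y2)]


# Made-up sample in the layout from the question, with one blank separator
sample = io.StringIO("10000\n2\n2\n2\n2\n0.00\n0.00\n0 1\n\n0.00\n0.01\n0 1\n")
rows = list(rows_from_lines(sample))
print(rows)
```

With pandas imported as pd, the list can then be handed over as pd.DataFrame(rows, columns=['x1', 'x2', 'y1', 'y2']). For a real file, pass the open file object instead of the sample.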
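Answer 1 (score: 0)
As far as I can tell, none of the options listed at http://pandas.pydata.org/pandas-docs/stable/io.html will organize the DataFrame the way you want,
but you can easily implement it yourself: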
import re
import pandas as pd

with open('YourDataFile.txt') as f:
    lines = f.read()                      # read the whole file
elems = re.split('\n| ', lines)[5:]       # split on newlines and spaces, drop the first 5 elements
grouped = list(zip(*[iter(elems)] * 4))   # group them 4 by 4 (list() because zip is lazy on Python 3)
df = pd.DataFrame(grouped)                # construct the DataFrame (values are still strings)
df.columns = ['x1', 'x2', 'y1', 'y2']     # column names
It's not concise or elegant, but it's clear what it does...
Answer 2 (score: 0)
OK, here is how I did it (actually a combination of Jon's and Giupo's answers, thanks guys!):
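import pandas as pd

with open('myfile.txt') as file:
    # split on any whitespace and drop the 5 header values
    data = file.read().split()[5:]
# group the tokens 4 by 4; list() because zip is lazy on Python 3
grouped = list(zip(*[iter(data)] * 4))
df = pd.DataFrame(grouped)
df.columns = ['x1', 'x2', 'y1', 'y2']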
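One caveat: the values end up as strings, and [5:] drops five tokens rather than five lines, which only matches R's skip=5 because each header line holds a single value. A closer analogue of scan(file, what=list(...), skip=5) could look roughly like the sketch below. The `scan` helper here is hypothetical (it is not part of pandas or the standard library), the sample data is made up, and Python 3 is assumed.

```python
import io


def scan(fileobj, what, skip=0):
    """Rough analogue of R's scan(file, what=list(...), skip=n).

    `what` maps column names to converter callables such as float or int.
    """
    # Skip whole header lines, like R's `skip`, instead of slicing tokens
    for _ in range(skip):
        next(fileobj, None)
    # Split everything that is left on any whitespace (spaces and newlines)
    tokens = fileobj.read().split()
    names = list(what)
    columns = {name: [] for name in names}
    # Only complete records are kept; a trailing partial record is dropped
    n_records = len(tokens) // len(names)
    for i in range(n_records * len(names)):
        name = names[i % len(names)]
        columns[name].append(what[name](tokens[i]))
    return columns


# Made-up sample in the layout from the question
sample = io.StringIO("10000\n2\n2\n2\n2\n0.00\n0.00\n0 1\n0.00\n0.01\n0 1\n")
cols = scan(sample, {"x1": float, "x2": float, "y1": int, "y2": int}, skip=5)
print(cols)
```

The result is a dict of typed columns, so with pandas imported as pd, pd.DataFrame(cols) gives a frame with two float columns and two integer columns.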