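I have a text file with many rows. Each row has 6 columns, but there is a \n after the fourth column and after the sixth column of every row, for example:
Row 1 ---> 1 2 3 4\n 5 6\n
Row 2 ---> 7 8 9 10\n 11 12\n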
I am creating a DataFrame from the file with this command:
df = pd.read_csv('info.txt', header=None, delimiter=r"\s+", names=cols, lineterminator='\n')
However, even though I explicitly pass six column names through the names parameter of read_csv, pandas reads the data above as 4 rows:
   col1  col2  col3  col4  col5  col6
0     1     2     3     4   NaN   NaN
1     5     6   NaN   NaN   NaN   NaN
2     7     8     9    10   NaN   NaN
3    11    12   NaN   NaN   NaN   NaN
How can I read the data as:
   col1  col2  col3  col4  col5  col6
0     1     2     3     4     5     6
1     7     8     9    10    11    12
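Answer 0 (score: 0)
Taking inspiration from @gold_cy's answer, I was able to solve the problem by extending the last element of the list on every alternate line instead of appending a new row to the list: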
def strip_newlines(fp):
    file_data_without_line_breaks = []
    with open(fp, "r") as fin:
        for val, line in enumerate(fin):
            fields = line.split()
            if val % 2 == 1:
                # odd physical line: continuation of the previous record
                file_data_without_line_breaks[-1].extend(fields)
            else:
                # even physical line: start of a new record
                file_data_without_line_breaks.append(fields)
    return file_data_without_line_breaks
However, this may not work well for large data, because the whole list of rows is built in memory.
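One way to limit memory use is to merge the line pairs lazily with a generator, so that only one record is held at a time. The following is a minimal sketch, not part of the original answer; the name merge_line_pairs is hypothetical, and it assumes every record spans exactly two physical lines:

```python
from itertools import zip_longest


def merge_line_pairs(lines):
    """Yield one merged record for every two consecutive physical lines.

    `lines` can be any iterable of strings, e.g. an open file object.
    (Hypothetical helper; assumes each record spans exactly two lines.)
    """
    it = iter(lines)
    # Passing the same iterator twice makes zip_longest consume lines in pairs.
    for first, second in zip_longest(it, it, fillvalue=""):
        yield " ".join(first.split() + second.split())


sample = ["1 2 3 4\n", " 5 6\n", "7 8 9 10\n", " 11 12\n"]
merged = list(merge_line_pairs(sample))
print(merged)  # ['1 2 3 4 5 6', '7 8 9 10 11 12']
```

When given an open file instead of a list, the merged records could be written to a second file one at a time, and that file could then be read with pd.read_csv in the usual way.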
Answer 1 (score: 0)
You can create a file-like object with custom reading logic. A file-like object must provide the __iter__ and read methods.
Test data:
echo -en '1 2 3 4\n 5 6\n 7 8 9 10\n 11 12\n' > info.txt
import pandas as pd

class MultiLineReader:
    def __init__(self, filename):
        self.filename = filename
        self.fd = None

    # use as context manager in order to open and close file correctly
    def __enter__(self):
        self.fd = open(self.filename, 'r')
        return self

    def __exit__(self, type, value, traceback):
        self.fd.close()

    # file-like object must have this method
    def __iter__(self):
        while True:
            line = self.readline()
            if not line:
                break
            yield line

    # file-like object must have this method
    # just read a line (the size argument is ignored)
    def read(self, size=-1):
        return self.readline()

    # read two lines at a time and merge them into one record
    def readline(self):
        first = self.fd.readline()
        if not first:
            return ''  # end of file
        # join with one space so fields never run together
        return first.rstrip() + ' ' + self.fd.readline().lstrip()

# example usage
with MultiLineReader("info.txt") as f:
    df = pd.read_csv(f, sep=r'\s+', header=None)