我有一个文本文件,其中某些行在一个选项卡上移动,表示它们是主行的子类别。我需要用唯一的标签标记每一行,以表示它们属于哪个“组”或类别,以便对每个类别进行统计分析。
我不太确定如何在Python中自动执行此操作,因为文件中有数千行。我无法将文件转换为csv,而不先移开所有缩进的行,这显然会失去分辨每行所属类别的能力。
编辑:
Here is a screenshot of a sample text file that shows what I am talking about visually.
前3行将具有组标签“ 1”,后2行将具有组标签“ 2”,后2行将分别具有组标签“ 3”和“ 4”。
答案 0 :(得分:1)
IIUC,这是使用Pandas和itertools.chain
的相对简单的方法:
import pandas as pd
import numpy as np
import io, re, itertools
data = '''
ratio, weight, flags, initialTime, finalTime
15, 0.44, 3, 300500, 400000
42, .31, 2, 900432, 100304
41, .2, 7, 192834, 200048
17, 0.234, 2, 2019481, 2182350
8, .1, 9, 1092834, 1190924
8, 0.241, 2, 2115120, 2159009
10, 0.092, 9, 2011234, 5001234
'''
data = re.sub(r'\t+', '*', data)
df = pd.read_csv(io.StringIO(data))
df.columns = [i.strip('*') for i in df.columns]
df.loc[df['ratio'].str.contains(r'\*'), 'ratio'] = np.nan
df['ratio'] = df['ratio'].ffill()
group_numbers = itertools.chain.from_iterable([[i+1]*v.shape[0] for i, (name,v) in enumerate(df.groupby('ratio', sort=False))])
df['ratio'] = list(group_numbers)
df = df.replace(r'\*', '', regex=True).set_index('ratio')
收益:
weight flags initialTime finalTime
ratio
1 0.44 3 300500 400000
1 .31 2 900432 100304
1 .2 7 192834 200048
2 0.234 2 2019481 2182350
2 .1 9 1092834 1190924
3 0.241 2 2115120 2159009
4 0.092 9 2011234 5001234
答案 1 :(得分:0)
这不是最好的方法,但是应该可以。首先,您需要从文件接收所有数据。让我们只打开文件并逐行读取所有数据到名为data的字符串:
with open("your_text_file.txt", "r") as f:
data = f.read().split('\n')
为了模拟您的文件,我假装所有数据已存储在data变量中:
data = """ratio, weight, flags, initialTime, finalTime
15, 0.44, 3, 300500, 400000
42, 0.31, 2, 900432, 100304
22, 1.31, 3, 200432, 100304
52, 0.11, 4, 922432, 111304 """
group = 0
data = data.split('\n')
#insert name group in firts line of your data
data[0] = f"group, {data[0]}"
for count,items in enumerate(data[1:]):
#if line do not start with tab increment group number by 1
if not items.startswith(" "):
group += 1
#insert first raw as group
items = f"{group}, {items}"
#deleta all ugly tabs
items = ' '.join(items.split())
#rewrite dataline with new beatiful string
data[count+1] = items
#now create csv file with beautiful data
with open("new_file.csv", "w") as f:
for items in data:
f.write(f"{items}\n")
所以,我的第一个数据如下:
ratio, weight, flags, initialTime, finalTime
15, 0.44, 3, 300500, 400000
42, 0.31, 2, 900432, 100304
22, 1.31, 3, 200432, 100304
52, 0.11, 4, 922432, 111304
,然后运行如下脚本:
group, ratio, weight, flags, initialTime, finalTime
1, 15, 0.44, 3, 300500, 400000
1, 42, 0.31, 2, 900432, 100304
2, 22, 1.31, 3, 200432, 100304
2, 52, 0.11, 4, 922432, 111304
希望,这就是您所需要的
答案 2 :(得分:0)
如果要构建数据框,可以使用numpy genfromtext
和生成器来添加新字段。
def add_group(fd):
"""generator that prepends each line with a group field (returns byte strings)"""
digit = re.compile(r'^\d')
line = next(fd) # process header line
yield('group,'+line).encode()
group=0
for line in fd:
if digit.match(line): # increment group when first char is a digit
group += 1
yield "{},{}".format(group, line).encode()
df = pd.DataFrame(np.genfromtxt(add_group(open('file.txt')), delimiter=',', names=True,
autostrip=True, dtype=None)
如果您只想构建一个csv,则更加简单:
with open('file.txt') as fd, open('file.csv', 'w') as fdout):
digit = re.compile(r'^\d')
fdout.write('group,'+next(fd)) # process header line
group = 0
for line in fd:
if digit.match(line): # increment group when first char is a digit
group += 1
fdout.write("{},{}".format(group, line).encode())