在行的选项卡上添加唯一标签

时间:2019-07-01 21:10:44

标签: python pandas csv dataframe

我有一个文本文件,其中某些行在一个选项卡上移动,表示它们是主行的子类别。我需要用唯一的标签标记每一行,以表示它们属于哪个“组”或类别,以便对每个类别进行统计分析。

我不太确定如何在Python中自动执行此操作,因为文件中有数千行。我无法将文件转换为csv,而不先移开所有缩进的行,这显然会失去分辨每行所属类别的能力。

编辑:
Here is a screenshot of a sample text file that shows what I am talking about visually.

前3行将具有组标签“ 1”,后2行将具有组标签“ 2”,后2行将分别具有组标签“ 3”和“ 4”。

3 个答案:

答案 0 :(得分:1)

IIUC,这是使用Pandas和itertools.chain的相对简单的方法:

import pandas as pd
import numpy as np
import io, re, itertools

data = '''
ratio,  weight, flags,  initialTime,    finalTime
15,     0.44,   3,      300500,         400000
        42,     .31,    2,      900432,         100304
        41,     .2,     7,      192834,         200048
17,     0.234,  2,      2019481,        2182350
        8,      .1,     9,      1092834,        1190924
8,      0.241,  2,      2115120,        2159009
10,     0.092,  9,      2011234,        5001234
'''

data = re.sub(r'\t+', '*', data)
df = pd.read_csv(io.StringIO(data))

df.columns = [i.strip('*') for i in df.columns]
df.loc[df['ratio'].str.contains(r'\*'), 'ratio'] = np.nan
df['ratio'] = df['ratio'].ffill()
group_numbers = itertools.chain.from_iterable([[i+1]*v.shape[0] for i, (name,v) in enumerate(df.groupby('ratio', sort=False))])
df['ratio'] = list(group_numbers)
df = df.replace(r'\*', '', regex=True).set_index('ratio')

收益:

      weight  flags initialTime finalTime
ratio                                    
1       0.44      3      300500    400000
1        .31      2      900432    100304
1         .2      7      192834    200048
2      0.234      2     2019481   2182350
2         .1      9     1092834   1190924
3      0.241      2     2115120   2159009
4      0.092      9     2011234   5001234

答案 1 :(得分:0)

这不是最好的方法,但是应该可以。首先,您需要从文件接收所有数据。让我们只打开文件并逐行读取所有数据到名为data的字符串:

with open("your_text_file.txt", "r") as f:
    data = f.read().split('\n')

为了模拟您的文件,我假装所有数据已存储在data变量中:

data = """ratio, weight, flags, initialTime, finalTime
15,    0.44,    3,    300500,    400000
    42,    0.31,   2,   900432,   100304
22,    1.31,   3,   200432,   100304
    52,    0.11,   4,   922432,   111304 """
group = 0
data = data.split('\n')
#insert name group in firts line of your data
data[0] = f"group, {data[0]}"
for count,items in enumerate(data[1:]):
    #if line do not start with tab increment group number by 1
    if not items.startswith("    "):
        group += 1
    #insert first raw as group
    items = f"{group}, {items}"
    #deleta all ugly tabs
    items = ' '.join(items.split())
    #rewrite dataline with new beatiful string
    data[count+1] = items
#now create csv file with beautiful data
with open("new_file.csv", "w") as f:
    for items in data:
        f.write(f"{items}\n")

所以,我的第一个数据如下:

ratio, weight, flags, initialTime, finalTime
15,    0.44,    3,    300500,    400000
    42,    0.31,   2,   900432,   100304
22,    1.31,   3,   200432,   100304
    52,    0.11,   4,   922432,   111304 

,然后运行如下脚本:

group, ratio, weight, flags, initialTime, finalTime
1, 15, 0.44, 3, 300500, 400000
1, 42, 0.31, 2, 900432, 100304
2, 22, 1.31, 3, 200432, 100304
2, 52, 0.11, 4, 922432, 111304

希望,这就是您所需要的

答案 2 :(得分:0)

如果要构建数据框,可以使用numpy genfromtext和生成器来添加新字段。

def add_group(fd):
    """generator that prepends each line with a group field (returns byte strings)"""
    digit = re.compile(r'^\d')
    line = next(fd)                     # process header line
    yield('group,'+line).encode()
    group=0
    for line in fd:
        if digit.match(line):           # increment group when first char is a digit
            group += 1
        yield "{},{}".format(group, line).encode()

df = pd.DataFrame(np.genfromtxt(add_group(open('file.txt')), delimiter=',', names=True,
                  autostrip=True, dtype=None)

如果您只想构建一个csv,则更加简单:

with open('file.txt') as fd, open('file.csv', 'w') as fdout):
    digit = re.compile(r'^\d')
    fdout.write('group,'+next(fd))     # process header line
    group = 0
    for line in fd:
        if digit.match(line):           # increment group when first char is a digit
            group += 1
        fdout.write("{},{}".format(group, line).encode())