熊猫:推断列名称的智能方法

时间:2019-11-21 14:12:39

标签: python pandas

我希望在这里得到一些补充。我的问题是外部程序的输出文件会生成带有相当复杂的标头的文本文件。 该标题看起来像这样(有一些示例行):

* NAME                 KEYWORD                                    S                    BETX                    ALFX                     MUX                    BETY                    ALFY                     MUY                       X                      PX                       Y                      PY                       T                      PT                      DX                     DPX                      DY                     DPY                       L                    LRAD                   ANGLE                     K1L                     K1S                     K2L                     K2S                    TILT                      E1                      E2                    FINT                   FINTX APERTYPE                              APER_1                  APER_2                  APER_3                  APER_4 COMMENTS                                 KSI                   HKICK                   VKICK                    VOLT                     LAG                    FREQ                  HARMON                    RE11                    RE12                    RE13                    RE14                    RE15                    RE16                    RE21                    RE22                    RE23                    RE24                    RE25                    RE26                    RE31                    RE32                    RE33                    RE34                    RE35                    RE36                    RE41                    RE42                    RE43                    RE44                    RE45                    RE46                    RE51                    RE52                    RE53                    RE54                    RE55                    RE56                    RE61                    RE62                    RE63                    RE64                    RE65                    RE66 
$ %s                   %s                                       %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le %s                                       %le                     %le                     %le                     %le %s                                       %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le 
 "L000013$START"       "MARKER"                                    0     0.99997544084968948 -2.4868542792772026e-05                       0   0.0016028062114609705  -0.0015362599226402803                       0 -3.9960208164599792e-12 -2.3117945993838543e-07 -3.1289252451959499e-21 -3.5787461173940813e-18                       0                       0 -8.0262719745944669e-08  9.9999822209857522e-07 -2.5072600388022476e-21  1.5546880440816971e-19                       0                       0                       0                       0                       0                       0                       0                       0                       0                       0                       0                       0 "CIRCLE"                                    0                       0                       0                       0 ""                                          0                       0                       0                       0                       0                       0                       0     0.78625931963285645     0.61786193233791331  8.4376949871511897e-15 -5.7137454489986084e-17                       0 -6.3501624209261554e-07    -0.61789228216621062     0.78629005103940219  1.5432100042289676e-14 -2.9815559743351372e-17                       0  1.6411585361703815e-07  5.0306980803327406e-17 -7.3725747729014302e-18     0.42113408349746351   0.0014527043440388243                       0  6.7762635780344027e-21 -2.5909612996755094e-14  1.6239180139487885e-14     -565.47866828536485     0.42391886365021597                       0  2.3716922523120409e-19  2.6333401523484748e-07 -6.0070789235093852e-07  6.9388939039072284e-18                       0                       1     -0.7100070931506306                       0                       0                       0                       0                       0                       1
 "IP.1"                "MARKER"                                    0     0.99997544084968948 -2.4868542792772026e-05                       0   0.0016028062114609705  -0.0015362599226402803                       0 -3.9960208164599792e-12 -2.3117945993838543e-07 -3.1289252451959499e-21 -3.5787461173940813e-18                       0                       0 -8.0262719745944669e-08  9.9999822209857522e-07 -2.5072600388022476e-21  1.5546880440816971e-19                       0                       0                       0                       0                       0                       0                       0                       0                       0                       0                       0                       0 "CIRCLE"                 0.014999999999999999                       0                       0                       0 ""                                          0                       0                       0                       0                       0                       0                       0     0.78625931963285645     0.61786193233791331  8.4376949871511897e-15 -5.7137454489986084e-17                       0 -6.3501624209261554e-07    -0.61789228216621062     0.78629005103940219  1.5432100042289676e-14 -2.9815559743351372e-17                       0  1.6411585361703815e-07  5.0306980803327406e-17 -7.3725747729014302e-18     0.42113408349746351   0.0014527043440388243                       0  6.7762635780344027e-21 -2.5909612996755094e-14  1.6239180139487885e-14     -565.47866828536485     0.42391886365021597                       0  2.3716922523120409e-19  2.6333401523484748e-07 -6.0070789235093852e-07  6.9388939039072284e-18                       0                       1     -0.7100070931506306                       0                       0                       0                       0                       0                       1
 "SOL1R1.1"            "SOLENOID"                 1.0001125105478401      2.0002711841856438     -1.0001574437785969      0.1250089995511387      624.05172971938396     -623.97782494270291     0.24950043267581581 -2.5820171273456538e-07 -2.7166326049300749e-07 -2.4644694700727079e-05 -2.4641900061065806e-05 -6.0732928168700292e-10                       0  1.2320286106432364e-06   1.080963267922299e-06  4.9287833861482926e-05  2.4640168503266348e-05      1.0001125105478401                       0                       0                       0                       0                       0                       0                       0                       0                       0                       0                       0 "CIRCLE"                 0.014999999999999999                       0                       0                       0 ""                      0.0032857664417938401                       0                       0                       0                       0                       0                       0     0.16677454852979498      1.2374450524095637    -0.92870122060954885      0.9271984583845112                       0  2.2616270165159844e-05    -0.61941520497814673      1.4057728585838938    -0.92799611647829161     0.92750603109268515                       0  2.3209525218003638e-05    -0.92870122060959848     0.92719845838457737     -565.11861313300869      565.60812571287204                       0    0.013966222544609984    -0.92799611647833302      0.9275060310927431      -565.4756174247633      565.96366804371507                       0    0.013950409023346324 -2.2718014995738109e-05  2.2030728038876717e-05    -0.01392194968977813    0.013909296330422054                       1     -0.7100067496917396                       0                       0                       0                       0                       0                       1

可以看出,第一个字符是一个星号,后跟实际的第一列名称。此外,并非所有列都由相同数量的空格分隔。我通过指定分隔符为read_table()使用大熊猫r'\s+(宁愿保留它而不是正则表达式,因为它会退回到python中的pandas引擎):

df = read_table( filename, sep = r'\s+', index_col = False )

但是,结果是,DataFrame的列不匹配,因为分隔符还会计算* NAME中的第一个空格。通过使用提供的代码段作为MWE可以看到它。这会导致很多问题(例如,为每列指定dtypes)。因此,最后一列用NaN填充,以使较长的标题与列匹配。

编辑1 使用以下内容可以解决问题,但需要一些体操:

df = read_table( filename, sep=r'\s+', skiprows=[1] )
names = df.columns[1:]
df.drop( df.columns[len(df.columns)-1], axis = 1, inplace = True )
df.columns = names

首先,提取列名称,剥离*,用NaN删除最后一列,然后替换名称。

另一种方法是手动删除文件中的星号,然后将其读取到DataFrame中。可行,但由于多种原因,它不是最佳解决方案。

也许有人可以看看上面的方法,并告诉我这是否是一个好的解决方案,或者是否有更好的方法来实现。

谢谢!

2 个答案:

答案 0 :(得分:0)

您可以尝试:

df= pd.read_csv(filename,sep=r"\s{2,}",engine="python",skiprows=[1])

并修改列名称:

df.columns=[ colname.replace("*","").strip() for colname in df.columns]

编辑: 您也可以使用“重命名”:

df= df.rename(columns=lambda cname: cname.replace("*","").strip())

df= df.rename(columns={"* Name":"Name"})

答案 1 :(得分:0)

因此,如果我对您的理解正确,如果文本的第一行(假设为.txt文件)中有一个星号,是否需要将其删除?

在这种情况下,您可以通过创建新的清理文件来“清理”文件,如下所示。如果将其包装到函数中,则可以对程序输出的所有文件执行此操作

import re

with open('test_text.txt', 'r') as input_file: # Assuming that this is the name of your text file
    i = 1
    with open('clean_test.txt', 'w') as output: #clean_tst.txt will be the name of the output, choose any name you like here
        for line in input_file:
            if i == 1: # This if is a bit ugly, but makes sure you're only removing the asterisk from the first row
                line  = re.sub('^\*\s+', '', line)
                i += 1
            output.write(line)

-编辑,以就地编辑文件,以下功能可以解决问题:

import re
import pandas as pd
import io

def clean_file(input_file):
    output_rep = io.StringIO()
    with open(input_file, 'r') as input_file: 
        i = 1
        for line in input_file:
            if i == 1: # This if is a bit ugly, but makes sure you're only removing the asterisk from the first row, NOT any other ones
                line  = re.sub('^\*\s+', '', line)
                i += 1
            output_rep.write(line)
    output_rep.seek(0)
    return output_rep

input_file = 'test_text.txt'
test = clean_file(input_file)

df = pd.read_table(test, sep = r'\s+', index_col = False)