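I am hoping for some input here. My problem is that an external program writes its output as text files with a fairly complicated header. The header looks like this (with a few sample rows):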
* NAME KEYWORD S BETX ALFX MUX BETY ALFY MUY X PX Y PY T PT DX DPX DY DPY L LRAD ANGLE K1L K1S K2L K2S TILT E1 E2 FINT FINTX APERTYPE APER_1 APER_2 APER_3 APER_4 COMMENTS KSI HKICK VKICK VOLT LAG FREQ HARMON RE11 RE12 RE13 RE14 RE15 RE16 RE21 RE22 RE23 RE24 RE25 RE26 RE31 RE32 RE33 RE34 RE35 RE36 RE41 RE42 RE43 RE44 RE45 RE46 RE51 RE52 RE53 RE54 RE55 RE56 RE61 RE62 RE63 RE64 RE65 RE66
$ %s %s %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %s %le %le %le %le %s %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le
"L000013$START" "MARKER" 0 0.99997544084968948 -2.4868542792772026e-05 0 0.0016028062114609705 -0.0015362599226402803 0 -3.9960208164599792e-12 -2.3117945993838543e-07 -3.1289252451959499e-21 -3.5787461173940813e-18 0 0 -8.0262719745944669e-08 9.9999822209857522e-07 -2.5072600388022476e-21 1.5546880440816971e-19 0 0 0 0 0 0 0 0 0 0 0 0 "CIRCLE" 0 0 0 0 "" 0 0 0 0 0 0 0 0.78625931963285645 0.61786193233791331 8.4376949871511897e-15 -5.7137454489986084e-17 0 -6.3501624209261554e-07 -0.61789228216621062 0.78629005103940219 1.5432100042289676e-14 -2.9815559743351372e-17 0 1.6411585361703815e-07 5.0306980803327406e-17 -7.3725747729014302e-18 0.42113408349746351 0.0014527043440388243 0 6.7762635780344027e-21 -2.5909612996755094e-14 1.6239180139487885e-14 -565.47866828536485 0.42391886365021597 0 2.3716922523120409e-19 2.6333401523484748e-07 -6.0070789235093852e-07 6.9388939039072284e-18 0 1 -0.7100070931506306 0 0 0 0 0 1
"IP.1" "MARKER" 0 0.99997544084968948 -2.4868542792772026e-05 0 0.0016028062114609705 -0.0015362599226402803 0 -3.9960208164599792e-12 -2.3117945993838543e-07 -3.1289252451959499e-21 -3.5787461173940813e-18 0 0 -8.0262719745944669e-08 9.9999822209857522e-07 -2.5072600388022476e-21 1.5546880440816971e-19 0 0 0 0 0 0 0 0 0 0 0 0 "CIRCLE" 0.014999999999999999 0 0 0 "" 0 0 0 0 0 0 0 0.78625931963285645 0.61786193233791331 8.4376949871511897e-15 -5.7137454489986084e-17 0 -6.3501624209261554e-07 -0.61789228216621062 0.78629005103940219 1.5432100042289676e-14 -2.9815559743351372e-17 0 1.6411585361703815e-07 5.0306980803327406e-17 -7.3725747729014302e-18 0.42113408349746351 0.0014527043440388243 0 6.7762635780344027e-21 -2.5909612996755094e-14 1.6239180139487885e-14 -565.47866828536485 0.42391886365021597 0 2.3716922523120409e-19 2.6333401523484748e-07 -6.0070789235093852e-07 6.9388939039072284e-18 0 1 -0.7100070931506306 0 0 0 0 0 1
"SOL1R1.1" "SOLENOID" 1.0001125105478401 2.0002711841856438 -1.0001574437785969 0.1250089995511387 624.05172971938396 -623.97782494270291 0.24950043267581581 -2.5820171273456538e-07 -2.7166326049300749e-07 -2.4644694700727079e-05 -2.4641900061065806e-05 -6.0732928168700292e-10 0 1.2320286106432364e-06 1.080963267922299e-06 4.9287833861482926e-05 2.4640168503266348e-05 1.0001125105478401 0 0 0 0 0 0 0 0 0 0 0 "CIRCLE" 0.014999999999999999 0 0 0 "" 0.0032857664417938401 0 0 0 0 0 0 0.16677454852979498 1.2374450524095637 -0.92870122060954885 0.9271984583845112 0 2.2616270165159844e-05 -0.61941520497814673 1.4057728585838938 -0.92799611647829161 0.92750603109268515 0 2.3209525218003638e-05 -0.92870122060959848 0.92719845838457737 -565.11861313300869 565.60812571287204 0 0.013966222544609984 -0.92799611647833302 0.9275060310927431 -565.4756174247633 565.96366804371507 0 0.013950409023346324 -2.2718014995738109e-05 2.2030728038876717e-05 -0.01392194968977813 0.013909296330422054 1 -0.7100067496917396 0 0 0 0 0 1
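As you can see, the first character is an asterisk, followed by the actual first column name. In addition, the columns are not all separated by the same number of spaces. I read the file with pandas `read_table()`, specifying the separator as `r'\s+'` (I would rather keep that than use a more general regular expression, because the latter makes pandas fall back to the python engine):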
df = read_table( filename, sep = r'\s+', index_col = False )
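The result, however, is that the columns of the `DataFrame` do not line up, because the separator also matches the first space in `* NAME`, so the `*` is counted as a column name of its own. You can see this by using the snippet above as an MWE. This causes a lot of problems (for example when specifying dtypes per column). As a consequence, the last column is filled with `NaN` so that the longer header matches the data columns.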
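A minimal sketch that reproduces the misalignment, assuming a trimmed three-column version of the file (the `sample` string is made up for illustration and omits the `$ %s ...` format line):

```python
import io
import pandas as pd

# Trimmed, hypothetical stand-in for the real file: the header has one
# more whitespace-separated token (the '*') than each data row has fields.
sample = (
    '* NAME KEYWORD S\n'
    '"L000013$START" "MARKER" 0\n'
    '"IP.1"  "MARKER"   0\n'
)

df = pd.read_csv(io.StringIO(sample), sep=r"\s+", index_col=False)

# Every name is shifted one position to the left of its data, and the
# last column has no data left to hold.
print(df.columns.tolist())
print(df)
```

Here the element names end up under `*`, the keywords under `NAME`, and `S` holds only `NaN`.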
Edit 1: The following solves the problem, but it takes some gymnastics:
df = read_table( filename, sep=r'\s+', skiprows=[1] )
names = df.columns[1:]
df.drop( df.columns[len(df.columns)-1], axis = 1, inplace = True )
df.columns = names
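First the column names are extracted without the leading `*`, then the last column (the one filled with `NaN`) is dropped, and finally the names are reassigned.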
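Another option is to remove the asterisk from the file by hand and then read it into a `DataFrame`. That works, but for several reasons it is not the best solution.
Perhaps someone could look at the approach above and tell me whether it is a good solution, or whether there is a better way to do this.
Thanks!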
Answer 0 (score: 0)
You can try:
df= pd.read_csv(filename,sep=r"\s{2,}",engine="python",skiprows=[1])
and then fix up the column names:
df.columns=[ colname.replace("*","").strip() for colname in df.columns]
Edit: you can also use `rename`:
df= df.rename(columns=lambda cname: cname.replace("*","").strip())
or
df = df.rename(columns={"* NAME": "NAME"})
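A minimal sketch of this approach on a small sample, assuming every column gap in the file is at least two spaces wide and the header starts with `* NAME` (single space). The `sample` string is made up for illustration. The pandas documentation notes that regex delimiters are prone to ignoring quoted data, so the sample uses unquoted names.

```python
import io
import pandas as pd

# Hypothetical sample: columns are padded with two or more spaces,
# while '* NAME' contains only a single space.
sample = (
    "* NAME            KEYWORD    S\n"
    "$ %s              %s         %le\n"
    "START             MARKER     0\n"
    "SOL1R1.1          SOLENOID   1.0001125105478401\n"
)

# r"\s{2,}" does not split '* NAME', so the header and data stay aligned;
# skiprows=[1] drops the '$ %s ...' format line.
df = pd.read_csv(io.StringIO(sample), sep=r"\s{2,}", engine="python", skiprows=[1])

# Strip the leading asterisk from the first column name.
df = df.rename(columns=lambda cname: cname.replace("*", "").strip())
print(df.columns.tolist())
```

If any two columns are separated by a single space, this separator merges them into one field, so the approach only fits files with consistently padded columns.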
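Answer 1 (score: 0)
So, if I understand you correctly: if the first line of the text (assuming a .txt file) starts with an asterisk, you need it removed?
In that case you can "clean" the file by writing a new, cleaned copy, as shown below. If you wrap this in a function, you can apply it to every file the program outputs.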
import re

with open('test_text.txt', 'r') as input_file:  # Assuming that this is the name of your text file
    i = 1
    with open('clean_test.txt', 'w') as output:  # clean_test.txt will be the name of the output, choose any name you like here
        for line in input_file:
            if i == 1:  # This if is a bit ugly, but makes sure you're only removing the asterisk from the first row
                line = re.sub(r'^\*\s+', '', line)  # raw string so the regex escapes are not treated as string escapes
            i += 1
            output.write(line)
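Edit: to avoid writing a separate cleaned file to disk, the following function does the trick. It returns the cleaned content as an in-memory buffer that pandas can read directly: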
import re
import io
import pandas as pd

def clean_file(input_file):
    # Collect the cleaned text in memory instead of writing a second file
    output_rep = io.StringIO()
    with open(input_file, 'r') as infile:
        i = 1
        for line in infile:
            if i == 1:  # This if is a bit ugly, but makes sure you're only removing the asterisk from the first row, NOT any other ones
                line = re.sub(r'^\*\s+', '', line)
            i += 1
            output_rep.write(line)
    output_rep.seek(0)  # rewind so the buffer can be read from the start
    return output_rep

input_file = 'test_text.txt'
test = clean_file(input_file)
df = pd.read_table(test, sep=r'\s+', index_col=False)
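A variant of the same idea that skips the copy altogether: read the header line yourself, drop the `*`, and hand the names to pandas. This is only a sketch. `read_starred_table` is a hypothetical helper name, and it assumes the first line is the `*` header and the second line is the `$ %s %le ...` format line, as in the sample from the question.

```python
import io
import pandas as pd

def read_starred_table(handle):
    """Hypothetical helper: parse a table whose header line starts with '*'."""
    names = handle.readline().split()
    if names and names[0] == "*":
        names = names[1:]  # drop the asterisk token
    handle.readline()  # skip the '$ %s %le ...' format line (assumed present)
    # The handle now points at the first data row; since the names are
    # supplied explicitly, tell pandas there is no header left to parse.
    return pd.read_csv(handle, sep=r"\s+", names=names, header=None)

# Usage on an in-memory sample; an open file object works the same way.
sample = io.StringIO(
    '* NAME KEYWORD S BETX\n'
    '$ %s %s %le %le\n'
    '"L000013$START" "MARKER" 0 0.99997544084968948\n'
    '"IP.1"   "MARKER"  0   0.99997544084968948\n'
)
df = read_starred_table(sample)
print(df.dtypes)
```

Because the separator stays `r'\s+'`, pandas can keep using its default C engine, and the column names match the data without dropping or renaming anything afterwards.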