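I have a text file in which the useful data, which I need to convert into a DataFrame, sits on a few lines among other content.
I iterate over the file line by line and capture the useful data with a regular expression,
like this: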
import re

pattern = r'^(\s)(\d+)(\s+)(\d)(\s+)(\w+)(\s+)(\w+)(.*)'
capture_data = []
with open(file, 'r') as file_obj:
    lineList = file_obj.readlines()
for line in lineList:
    info_list = re.search(pattern, line)
    if info_list is not None:
        capture_data.append(line)
The captured data looks like this:
' 100 0 PASS Continuity_PPMU_mV XSCI 140 -1.0000 V -427.9508 mV -300.0000 mV -100.0000 uA 0 \n'
' 100 1 PASS Continuity_PPMU_mV XSCI 12 -1.0000 V -430.3089 mV -300.0000 mV -100.0000 uA 0 \n'
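I want to iterate over each captured line and split it on whitespace, but the problem is that there is a space between each value and its unit, for example:
-300.0000 mV, -100.0000 uA, etc.
Another problem is the trailing newline, which .split(" ") also treats as a separate element.
Can someone suggest a smarter way to do this?
All I want is to get these values as separate column values.
For example, in the first string,
100 becomes column 1, 0 the 2nd, PASS the 3rd, Continuity_PPMU_mV the 4th, and so on.
Thanks.
Edit:
The raw data looks something like this: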
Site Number:
0, 1, 2, 3
Device#: 1-4
*********************************************************************
FT45434HAP PQF64 Test @ RHC
*********************************************************************
---------------------------Continuity Test---------------------------
Number Site Result Test Name Pin Channel Low Measured High Force Loc
100 0 PASS Continuity_PPMU_mV XSCI 140 -1.0000 V -427.9508 mV -300.0000 mV -100.0000 uA 0
100 1 PASS Continuity_PPMU_mV XSCI 12 -1.0000 V -430.3089 mV -300.0000 mV -100.0000 uA 0
100 2 PASS Continuity_PPMU_mV XSCI 76 -1.0000 V -430.7492 mV -300.0000 mV -100.0000 uA 0
100 3 PASS Continuity_PPMU_mV XSCI 204 -1.0000 V -431.0482 mV -300.0000 mV -100.0000 uA 0
101 0 PASS Continuity_PPMU_mV XSCO 139 -1.0000 V -456.0359 mV -300.0000 mV -100.0000 uA 0
101 1 PASS Continuity_PPMU_mV XSCO 11 -1.0000 V -458.0605 mV -300.0000 mV -100.0000 uA 0
101 2 PASS Continuity_PPMU_mV XSCO 75 -1.0000 V -457.8564 mV -300.0000 mV -100.0000 uA 0
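Edit:
The top lines are not fixed; they are generated dynamically. In addition, other text can appear in between the relevant data, for example between two useful lines. So I don't think skipping lines will work here.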
Answer 0 (score: 1)
Find the line that starts with 'Number', then append the lines after it to data.
import pandas as pd
import seaborn as sns

# read the file in
data = list()
with open('test.txt', 'r') as f:
    rows = f.readlines()

flag = False  # set to True once the header row starting with 'Number' is found
for row in rows:
    row = row.strip()  # drops leading spaces and the trailing newline
    if row.startswith('Number'):
        flag = True
        continue  # after the header row is found, skip it
    if flag and row:  # ignore blank lines
        data.append(row.split())  # split() with no argument splits on any run of whitespace

# create a custom header where the units have been added as column headers
header = ['Number', 'Site', 'Result', 'Test_Name', 'Pin', 'Channel', 'Low', 'U1', 'Measured', 'U2', 'High', 'U3', 'Force', 'U4', 'Loc']

# create the dataframe
df = pd.DataFrame(data, columns=header)

# save to csv
df.to_csv('file.csv', index=False)

# convert columns to numeric dtypes where possible
# (errors='ignore' is deprecated in recent pandas, so catch the exception instead)
def to_numeric_if_possible(col):
    try:
        return pd.to_numeric(col)
    except (ValueError, TypeError):
        return col

df = df.apply(to_numeric_if_possible)

# scale the columns as per their units
df.Measured = df.Measured.div(1000)  # mV -> V
df.High = df.High.div(1000)  # mV -> V
df.Force = df.Force.div(100000)  # factor as in the output below; uA -> A would be 1_000_000
# display(df)
   Number  Site  Result           Test_Name   Pin  Channel   Low  U1   Measured  U2  High  U3   Force  U4  Loc
0     100     0    PASS  Continuity_PPMU_mV  XSCI      140  -1.0   V  -0.427951  mV  -0.3  mV  -0.001  uA    0
1     100     1    PASS  Continuity_PPMU_mV  XSCI       12  -1.0   V  -0.430309  mV  -0.3  mV  -0.001  uA    0
2     100     2    PASS  Continuity_PPMU_mV  XSCI       76  -1.0   V  -0.430749  mV  -0.3  mV  -0.001  uA    0
3     100     3    PASS  Continuity_PPMU_mV  XSCI      204  -1.0   V  -0.431048  mV  -0.3  mV  -0.001  uA    0
4     101     0    PASS  Continuity_PPMU_mV  XSCO      139  -1.0   V  -0.456036  mV  -0.3  mV  -0.001  uA    0
5     101     1    PASS  Continuity_PPMU_mV  XSCO       11  -1.0   V  -0.458060  mV  -0.3  mV  -0.001  uA    0
6     101     2    PASS  Continuity_PPMU_mV  XSCO       75  -1.0   V  -0.457856  mV  -0.3  mV  -0.001  uA    0
# plot
ax = sns.lineplot(data=df.iloc[:, 6:-2])
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
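The question's edit mentions that unrelated text can show up between data rows, and the header-flag loop above would append those lines too. Here is a minimal sketch of one way around that, assuming every data row has exactly the layout shown in the sample (six leading fields, four value/unit pairs, then Loc): match each line against a row regex with named groups and keep only the matches. ROW_RE, parse_rows and the *_unit column names are made up for this illustration and are not part of the code above.

```python
import re
import pandas as pd

# Assumed row layout: Number, Site, Result, Test name, Pin, Channel,
# then four "value unit" pairs (Low, Measured, High, Force), then Loc.
ROW_RE = re.compile(
    r"^\s*(?P<Number>\d+)\s+(?P<Site>\d+)\s+(?P<Result>\w+)\s+"
    r"(?P<Test_Name>\w+)\s+(?P<Pin>\w+)\s+(?P<Channel>\d+)\s+"
    r"(?P<Low>-?\d+\.\d+)\s+(?P<Low_unit>\w+)\s+"
    r"(?P<Measured>-?\d+\.\d+)\s+(?P<Measured_unit>\w+)\s+"
    r"(?P<High>-?\d+\.\d+)\s+(?P<High_unit>\w+)\s+"
    r"(?P<Force>-?\d+\.\d+)\s+(?P<Force_unit>\w+)\s+"
    r"(?P<Loc>\d+)\s*$"
)

def parse_rows(lines):
    """Build a DataFrame from the lines that match ROW_RE; other lines are ignored."""
    records = [m.groupdict() for m in map(ROW_RE.match, lines) if m]
    df = pd.DataFrame(records)
    numeric = ["Number", "Site", "Channel", "Low", "Measured", "High", "Force", "Loc"]
    df[numeric] = df[numeric].apply(pd.to_numeric)
    return df

# Shortened stand-in for the real file, with an unrelated line between two rows.
sample = [
    "Site Number:\n",
    "Number Site Result Test Name Pin Channel Low Measured High Force Loc\n",
    " 100 0 PASS Continuity_PPMU_mV XSCI 140 -1.0000 V -427.9508 mV -300.0000 mV -100.0000 uA 0 \n",
    "some unrelated text\n",
    " 101 2 PASS Continuity_PPMU_mV XSCO 75 -1.0000 V -457.8564 mV -300.0000 mV -100.0000 uA 0 \n",
]
df = parse_rows(sample)
print(df)
```

Because each value and its unit are captured by separate named groups, the space between them never causes a bad split, and the trailing newline is absorbed by the final \s*.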
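Answer 1 (score: 0)
You can simply skip the leading lines and specify the separator as \s\s+ (two or more whitespace characters):
import pandas as pd

pd.read_csv('file.txt', skiprows=10, sep=r'\s\s+', engine='python')
Output: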
   Number  Site  Result           Test Name   Pin  Channel        Low      Measured          High         Force  Loc
0     100     0    PASS  Continuity_PPMU_mV  XSCI      140  -1.0000 V  -427.9508 mV  -300.0000 mV  -100.0000 uA    0
1     100     1    PASS  Continuity_PPMU_mV  XSCI       12  -1.0000 V  -430.3089 mV  -300.0000 mV  -100.0000 uA    0
2     100     2    PASS  Continuity_PPMU_mV  XSCI       76  -1.0000 V  -430.7492 mV  -300.0000 mV  -100.0000 uA    0
3     100     3    PASS  Continuity_PPMU_mV  XSCI      204  -1.0000 V  -431.0482 mV  -300.0000 mV  -100.0000 uA    0
4     101     0    PASS  Continuity_PPMU_mV  XSCO      139  -1.0000 V  -456.0359 mV  -300.0000 mV  -100.0000 uA    0
5     101     1    PASS  Continuity_PPMU_mV  XSCO       11  -1.0000 V  -458.0605 mV  -300.0000 mV  -100.0000 uA    0
6     101     2    PASS  Continuity_PPMU_mV  XSCO       75  -1.0000 V  -457.8564 mV  -300.0000 mV  -100.0000 uA    0
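With this separator, the Low, Measured, High and Force columns are still strings such as '-427.9508 mV'. As a rough sketch, assuming every cell is a number followed by an optional SI prefix and a unit, str.extract can separate the parts and a small lookup table can scale the value to base units. PREFIX, QUANTITY_RE and split_quantity are illustrative names, and the prefix table only covers a few common prefixes.

```python
import pandas as pd

# Hypothetical multipliers to base units; extend as needed.
PREFIX = {"": 1.0, "m": 1e-3, "u": 1e-6, "n": 1e-9, "k": 1e3}

# number, optional one-letter prefix, then the unit letters
QUANTITY_RE = r"^\s*(?P<value>[-+]?\d*\.?\d+)\s*(?P<prefix>[munk]?)(?P<unit>[A-Za-z]+)\s*$"

def split_quantity(col):
    """Split strings like '-427.9508 mV' into a base-unit value and its unit."""
    parts = col.str.extract(QUANTITY_RE)
    scale = parts["prefix"].map(PREFIX).astype(float)
    value = pd.to_numeric(parts["value"]) * scale
    return pd.DataFrame({"value": value, "unit": parts["unit"]})

# Small stand-in for the frame returned by read_csv above.
df = pd.DataFrame({
    "Low": ["-1.0000 V", "-1.0000 V"],
    "Measured": ["-427.9508 mV", "-430.3089 mV"],
    "Force": ["-100.0000 uA", "-100.0000 uA"],
})
for name in ["Low", "Measured", "Force"]:
    q = split_quantity(df[name])
    df[name] = q["value"]            # numeric value in base units (V, A)
    df[name + "_unit"] = q["unit"]   # unit without its prefix
print(df)
```

Keeping the unit in its own column makes it possible to check afterwards that a column really holds a single unit before plotting or comparing values.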
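Also, if you are not sure how many leading lines should be ignored, you can try to find a pattern that identifies them. For example, if the layout of your data is consistent, you can read lines until one starts with the name of the first column (in this case, 'Number'):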
# Identify how many rows we need to skip (avoiding reading the whole file)
skiplines = 0
with open('file.txt') as file:
    line = file.readline()
    # stop at the header line, or at end of file (readline() then returns '')
    while line and not line.lstrip().startswith('Number'):
        skiplines += 1
        line = file.readline()

# Then read it with pandas
pd.read_csv('file.txt', skiprows=skiplines, sep=r'\s\s+', engine='python')
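In any case, following the same logic, the code block above is easy to adapt to other file layouts. For example, does the output always contain the 'Continuity Test' line? If the data always appears after that line, that is the pattern to look for.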
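The 'Continuity Test' idea could be sketched like this, assuming the column header is always the line right after the marker and that columns in the real file are separated by two or more spaces (as the \s\s+ separator requires). lines_before_header is a hypothetical helper, and the sample text is a shortened stand-in for the real file.

```python
import io
import pandas as pd

def lines_before_header(handle, marker):
    """Count the lines up to and including the first one that contains `marker`."""
    for count, line in enumerate(handle, start=1):
        if marker in line:
            return count
    raise ValueError(f"marker {marker!r} not found")

# Shortened stand-in for the real file; columns are separated by two spaces.
text = (
    "Site Number:\n"
    "0, 1, 2, 3\n"
    "-----Continuity Test-----\n"
    "Number  Site  Result  Test Name  Pin  Low\n"
    "100  0  PASS  Continuity_PPMU_mV  XSCI  -1.0000 V\n"
    "100  1  PASS  Continuity_PPMU_mV  XSCI  -1.0000 V\n"
)

# Skip everything up to and including the marker line, so the next line is the header.
skip = lines_before_header(io.StringIO(text), "Continuity Test")
df = pd.read_csv(io.StringIO(text), skiprows=skip, sep=r"\s\s+", engine="python")
print(df)
```

Raising an error when the marker is missing avoids silently reading the file from the wrong line.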