将文本数据转换为数据框的问题

时间:2021-02-03 05:50:28

标签: python regex pandas

我有一个文本文件,其中有几行,它们之间有一些我需要转换为数据框的数据(有用的数据)。

我逐行迭代了文本文件,并在正则表达式的帮助下捕获了有用的数据。

像这样,

pattern = r'^(\s)(\d+)(\s+)(\d)(\s+)(\w+)(\s+)(\w+)(.*)'
capture_data = []

with open(file,'r') as file_obj:
            lineList = file_obj.readlines()
            for line in lineList:
                info_list = re.search(pattern, line)
                if info_list is not None:
                    capture_data.append(line)

捕获的数据如下

' 100         0    PASS     Continuity_PPMU_mV        XSCI         140       -1.0000 V      -427.9508 mV   -300.0000 mV   -100.0000 uA   0         \n'

' 100         1    PASS     Continuity_PPMU_mV        XSCI         12        -1.0000 V      -430.3089 mV   -300.0000 mV   -100.0000 uA   0         \n'

我想迭代每个捕获的行并在空格的基础上拆分,但问题是,单位和值之间有空格,例如....

-300.0000 mV、-100.0000 uA 等

还有一个问题是尾随换行符,它也被视为.split(" ")中的一个新元素。

有人可以帮忙找到一些更聪明的方法来做到这一点吗?

我想要的只是将这些值作为单独的列值。

例如在第一个字符串中,

100 变为第 1 列,0 - 2nd,PASS - 3rd,Continuity_PPMU_mV - 4th,等等...

谢谢。

编辑:

原始数据有点像这样 -

Site Number:    
     0,  1,  2,  3

Device#: 1-4

********************************************************************* 
FT45434HAP PQF64 Test @ RHC 
********************************************************************* 
---------------------------Continuity Test--------------------------- 

 Number     Site  Result   Test Name                 Pin          Channel   Low            Measured       High           Force          Loc
 100         0    PASS     Continuity_PPMU_mV        XSCI         140       -1.0000 V      -427.9508 mV   -300.0000 mV   -100.0000 uA   0         
 100         1    PASS     Continuity_PPMU_mV        XSCI         12        -1.0000 V      -430.3089 mV   -300.0000 mV   -100.0000 uA   0         
 100         2    PASS     Continuity_PPMU_mV        XSCI         76        -1.0000 V      -430.7492 mV   -300.0000 mV   -100.0000 uA   0         
 100         3    PASS     Continuity_PPMU_mV        XSCI         204       -1.0000 V      -431.0482 mV   -300.0000 mV   -100.0000 uA   0         
 101         0    PASS     Continuity_PPMU_mV        XSCO         139       -1.0000 V      -456.0359 mV   -300.0000 mV   -100.0000 uA   0         
 101         1    PASS     Continuity_PPMU_mV        XSCO         11        -1.0000 V      -458.0605 mV   -300.0000 mV   -100.0000 uA   0         
 101         2    PASS     Continuity_PPMU_mV        XSCO         75        -1.0000 V      -457.8564 mV   -300.0000 mV   -100.0000 uA   0         

编辑

顶行不是固定的,它们是动态生成的。此外,一些其他文本数据可以出现在相关数据之间,例如两个有用的行之间。所以,我不认为在这里跳过行会起作用。

2 个答案:

答案 0 :(得分:1)

  • 读取文件并查找以 'Number' 开头的行,然后将其后的行附加到 data
  • 在数据行中,只有单位之间用空格分隔。
  • 最好将单位与数值分开,这样我们就可以在空格上分割行。
  • 创建一个新标题,为单位添加新列。
  • 这将允许将数值解释为浮点数。
import pandas as pd
import seaborn as sns

# read the file in
data = list()
with open('test.txt', 'r') as f:
    rows = f.readlines()
    flag = False  # flag to True once the header row with Number is found
    for row in rows:
        row = row.strip()
        if row.startswith('Number'):
            flag = True
            continue  # after the header row is found, skip it
        if flag:
            data.append(row.split())  # append rows after the header to data

# create a custom header where the unites have been added as column headers
header = ['Number', 'Site', 'Result', 'Test_Name', 'Pin', 'Channel', 'Low', 'U1', 'Measured', 'U2', 'High', 'U3', 'Force', 'U4', 'Loc']

# create the dataframe
df = pd.DataFrame(data, columns=header)

# save to csv
df.to_csv('file.csv', index=False)

# convert columns to numeric dtypes
df = df.apply(pd.to_numeric, errors='ignore')

# scale the columns as per their units
df.Measured = df.Measured.div(1000)
df.High = df.High.div(1000)
df.Force = df.Force.div(100000)

# display(df)
   Number  Site Result           Test_Name   Pin  Channel  Low U1  Measured  U2  High  U3  Force  U4  Loc
0     100     0   PASS  Continuity_PPMU_mV  XSCI      140 -1.0  V -0.427951  mV  -0.3  mV -0.001  uA    0
1     100     1   PASS  Continuity_PPMU_mV  XSCI       12 -1.0  V -0.430309  mV  -0.3  mV -0.001  uA    0
2     100     2   PASS  Continuity_PPMU_mV  XSCI       76 -1.0  V -0.430749  mV  -0.3  mV -0.001  uA    0
3     100     3   PASS  Continuity_PPMU_mV  XSCI      204 -1.0  V -0.431048  mV  -0.3  mV -0.001  uA    0
4     101     0   PASS  Continuity_PPMU_mV  XSCO      139 -1.0  V -0.456036  mV  -0.3  mV -0.001  uA    0
5     101     1   PASS  Continuity_PPMU_mV  XSCO       11 -1.0  V -0.458060  mV  -0.3  mV -0.001  uA    0
6     101     2   PASS  Continuity_PPMU_mV  XSCO       75 -1.0  V -0.457856  mV  -0.3  mV -0.001  uA    0

# plot
ax = sns.lineplot(data=df.iloc[:, 6:-2])
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left')

enter image description here

答案 1 :(得分:0)

您可以简单地跳过第一行并将分隔符指定为 \s\s+

pd.read_csv('file.txt', skiprows=10, sep='\s\s+', engine='python')

输出:

   Number  Site Result           Test Name   Pin  Channel        Low      Measured          High         Force  Loc
0     100     0   PASS  Continuity_PPMU_mV  XSCI      140  -1.0000 V  -427.9508 mV  -300.0000 mV  -100.0000 uA    0
1     100     1   PASS  Continuity_PPMU_mV  XSCI       12  -1.0000 V  -430.3089 mV  -300.0000 mV  -100.0000 uA    0
2     100     2   PASS  Continuity_PPMU_mV  XSCI       76  -1.0000 V  -430.7492 mV  -300.0000 mV  -100.0000 uA    0
3     100     3   PASS  Continuity_PPMU_mV  XSCI      204  -1.0000 V  -431.0482 mV  -300.0000 mV  -100.0000 uA    0
4     101     0   PASS  Continuity_PPMU_mV  XSCO      139  -1.0000 V  -456.0359 mV  -300.0000 mV  -100.0000 uA    0
5     101     1   PASS  Continuity_PPMU_mV  XSCO       11  -1.0000 V  -458.0605 mV  -300.0000 mV  -100.0000 uA    0
6     101     2   PASS  Continuity_PPMU_mV  XSCO       75  -1.0000 V  -457.8564 mV  -300.0000 mV  -100.0000 uA    0

此外,如果您不确定应该忽略多少起始行,您可能会尝试找到一种模式来忽略第一行。例如,如果您的数据模式是一致的,您可以读取第一行直到匹配第一列(在本例中为“数字”):

# Identify how many rows we need to skip (avoiding reading the whole file)
skiplines=0
with open('file.txt') as file:
    line = file.readline()
    while not line.lstrip().startswith('Number'):
        skiplines += 1
        line = file.readline()

# Then read it with pandas
pd.read_csv('file.txt', skiprows=skiplines, sep='\s\s+', engine='python')

无论如何,利用其逻辑,很容易修改上面的代码块以匹配不同的文件模式。例如,输出将始终显示“连续性测试”行?如果数据总是显示在该行之后,这就是您要寻找的模式。