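Extracting specific information from a text file / text

Date: 2019-07-12 18:54:52

Tags: python regex

I am trying to extract specific information from a text file, and I am not sure how to go about it. I would appreciate your help.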

text ="65097    3785    <00>    tag KV-C203 fmt 65  typ KVMAxLOG:WM_area_results_table  dat <0A>    
 <0B>   stroke  0   area_results    <0B>    area_centre_xy  <0B>    x   -0.1279636  y   0.0819952   
 <00>   plane_deviation 0   area_id 10  area_measurement_ok TRUE    plane_deviation_check_done  
 FALSE  plane_deviation_check_ok    FALSE   FSM_check_ok    FALSE   FSM_check_done  FALSE   
 leveling_method    LEVELING_METHOD_TRADITIONAL <00>    x_gridlines_shift   0   nr_of_x_gridlines 
 5  nr_of_y_gridlines   38  <00>    <0B>    stroke  0   area_results    <0B>    area_centre_xy  <0B>
 x  -0.1279636  y   0.04919712  <00>    plane_deviation 0   area_id 9   area_measurement_ok TRUE    
 plane_deviation_check_done FALSE   plane_deviation_check_ok    FALSE   FSM_check_ok    FALSE   
 FSM_check_done FALSE   leveling_method LEVELING_METHOD_TRADITIONAL <00>    
 x_gridlines_shift  0   nr_of_x_gridlines   9   nr_of_y_gridlines   61  <00>    <0B>    
 stroke 0   area_results    <0B>    area_centre_xy  <0B>    x   -0.1279636  y   0.01639904  <00>    
 plane_deviation    0   area_id 8   area_measurement_ok TRUE    plane_deviation_check_done  FALSE   
 plane_deviation_check_ok   FALSE   FSM_check_ok    FALSE   FSM_check_done  FALSE   leveling_method 
 LEVELING_METHOD_TRADITIONAL    <00>    x_gridlines_shift   0   nr_of_x_gridlines   9   
 nr_of_y_gridlines  61  <00>    <0B>    stroke  0   area_results    <0B>    area_centre_xy  <0B>    
 x  -0.1279636  y   -0.01639904 <00>    plane_deviation 0   area_id 7   area_measurement_ok TRUE    
 plane_deviation_check_done FALSE   plane_deviation_check_ok    FALSE   FSM_check_ok    FALSE   
 FSM_check_done FALSE   leveling_method LEVELING_METHOD_TRADITIONAL <00>    x_gridlines_shift   
 0  nr_of_x_gridlines   9   nr_of_y_gridlines   61  <00>    <0B>    stroke  0   area_results    
 <0B>   area_centre_xy  <0B>    x   -0.1279636  y   -0.04919712 <00>    plane_deviation 0   
 area_id    6   area_measurement_ok TRUE    plane_deviation_check_done  FALSE   
 plane_deviation_check_ok   FALSE   FSM_check_ok    FALSE   FSM_check_done  FALSE   
 leveling_method    LEVELING_METHOD_TRADITIONAL <00>    x_gridlines_shift   0   nr_of_x_gridlines   
 9  nr_of_y_gridlines   61  <00>    <0B>    stroke  0   area_results    <0B>    area_centre_xy  
 <0B>   x   -0.1279636  y   -0.0819952  <00>    plane_deviation 0   area_id 5   
 area_measurement_ok    TRUE    plane_deviation_check_done  FALSE   plane_deviation_check_ok    
 FALSE  FSM_check_ok    FALSE   FSM_check_done  FALSE   leveling_method 
 LEVELING_METHOD_TRADITIONAL    <00>    x_gridlines_shift   0   nr_of_x_gridlines   
 5  nr_of_y_gridlines   38  <00>    <00>    <00>    \n  None    None    None    None    
 None   None    None    None    None    None    None    None    None    None    None"

Expected output

x             y
-0.1279636   0.0819952
-0.1279636   0.04919712
-0.1279636   0.01639904
-0.1279636  -0.01639904
-0.1279636  -0.04919712
-0.1279636  -0.0819952

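4 answers:

Answer 0 (score: 1)

I am not sure what the data structure is here, but this code extracts the values from this particular string. It should also work for other instances that follow the same layout.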

xvals = []
yvals = []
# Records are delimited by the <00> and <0B> markers.
for item1 in text.split("<00>"):
    for item2 in item1.split("<0B>"):
        # split() with no argument splits on any run of spaces, tabs or newlines
        tokens = item2.split()
        if "x" in tokens and "y" in tokens:
            # Each value is the token right after its label.
            xvals.append(float(tokens[tokens.index("x") + 1]))
            yvals.append(float(tokens[tokens.index("y") + 1]))

print(xvals)
print(yvals)

Output:

[-0.1279636, -0.1279636, -0.1279636, -0.1279636, -0.1279636, -0.1279636] #x_vals
[0.0819952, 0.04919712, 0.01639904, -0.01639904, -0.04919712, -0.0819952] #y_vals
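
The same idea can also be written as a single pass over whitespace-separated tokens. This is only a sketch, assuming every coordinate pair appears as the four consecutive tokens `x <number> y <number>`; the function name `extract_xy_tokens` and the shortened sample string are made up for illustration.

```python
def extract_xy_tokens(text):
    """Return (x, y) float pairs found as consecutive tokens: x <num> y <num>."""
    tokens = text.split()  # splits on spaces, tabs and newlines
    pairs = []
    for i in range(len(tokens) - 3):
        if tokens[i] == "x" and tokens[i + 2] == "y":
            try:
                pairs.append((float(tokens[i + 1]), float(tokens[i + 3])))
            except ValueError:
                # The tokens after the labels were not numeric, so skip them.
                continue
    return pairs


# Shortened, hypothetical sample in the same layout as the question's text.
sample = "area_centre_xy <0B> x -0.1279636 y 0.0819952 <00> plane_deviation 0"
print(extract_xy_tokens(sample))
```

Keeping x and y together as pairs avoids the two lists drifting out of step if one of the labels is ever missing from a record.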

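Answer 1 (score: 1)

To be clear, this answer only addresses the text posted in the question. The OP will have to think carefully about how to generalize this regular expression to the variants of the input it is meant to run on.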

import re
x = re.findall( r' x *?([\-0-9\.]+)', text )
y = re.findall( r' y *?([\-0-9\.]+)', text )
print( x )
print( y )

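Essentially, re.findall returns every match of the parenthesized group in the pattern. Since the values in the sample text always seem to be preceded by "[space]x[space]..." or "[space]y[space]...", you can build a pattern that looks for that prefix and captures only numeric characters (the minus sign, the digits 0-9, and the decimal point).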
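
To illustrate that behaviour: when a pattern contains exactly one capturing group, `re.findall` returns a list holding only the text captured by that group, not the whole match. A minimal sketch on a shortened, made-up sample, using the same patterns as above:

```python
import re

# Shortened, hypothetical sample in the same layout as the question's text.
sample = "area_centre_xy <0B> x -0.1279636 y 0.0819952 <00> area_id 10"

# One capturing group: findall returns only what the group captured.
x_values = re.findall(r' x *?([\-0-9\.]+)', sample)
y_values = re.findall(r' y *?([\-0-9\.]+)', sample)

print(x_values)  # ['-0.1279636']
print(y_values)  # ['0.0819952']

# The results are strings; convert them if numbers are needed.
x_floats = [float(v) for v in x_values]
```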

Note that you can enclose the large text block in triple quotes (""") so that you do not have to deal with the line breaks. For example:

text = """start of text
words on new line
more words on new line"""

Answer 2 (score: 1)

If a hard-coded approach that only looks for the x's and y's is acceptable, this is easy to do, as follows:

import re
import pandas as pd

df = pd.DataFrame()

# Raw strings keep \s from being treated as a string escape.
df['x'] = re.findall(r'x\s+([+-]?[0-9]*[.]?[0-9]+)', text)
df['y'] = re.findall(r'y\s+([+-]?[0-9]*[.]?[0-9]+)', text)

Answer 3 (score: 1)

import re

for x,y in re.findall(r'x\s+(-?\d\.\d+)\s+y\s+(-?\d\.\d+)',text):
   print(x, y)

Result:

-0.1279636 0.0819952
-0.1279636 0.04919712
-0.1279636 0.01639904
-0.1279636 -0.01639904
-0.1279636 -0.04919712
-0.1279636 -0.0819952
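
Note that `-?\d\.\d+` only matches numbers with exactly one digit before the decimal point, which fits the posted data. If other files may contain values such as `12.5`, `-3` or `1.2e-05`, a more permissive number pattern could look like the sketch below. The names `NUMBER` and `PAIR` and the sample string are assumptions for illustration, not part of the original answer.

```python
import re

# Optional sign, digits with an optional fraction (or a bare fraction),
# and an optional exponent.
NUMBER = r"[-+]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][-+]?\d+)?"
# \b keeps the 'x' label from matching at the end of a longer word.
PAIR = re.compile(rf"\bx\s+({NUMBER})\s+y\s+({NUMBER})")

# Hypothetical sample with values the stricter pattern would miss.
sample = "<0B> x 12.5 y -3 <00> area_id 1 <0B> x -0.1279636 y 1.2e-05 <00>"

# With two capturing groups, findall returns a list of (x, y) tuples.
pairs = [(float(x), float(y)) for x, y in PAIR.findall(sample)]
print(pairs)
```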

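If you read the text line by line into sample and want to store the data in a DataFrame:

import re
import pandas as pd

rows = []
for text in sample:
   # findall returns a list of (x, y) tuples for each line
   rows.extend(re.findall(r'x\s+(-?\d\.\d+)\s+y\s+(-?\d\.\d+)', text))

# DataFrame.append was removed in pandas 2.0, so build the frame in one step
df = pd.DataFrame(rows, columns=['x','y'])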

findall returns strings; if you need numbers, you have to convert the columns to a numeric dtype:

rows = []
for text in sample:
   rows.extend(re.findall(r'x\s+(-?\d\.\d+)\s+y\s+(-?\d\.\d+)', text))

# astype(float) converts the captured strings to floating-point numbers
df = pd.DataFrame(rows, columns=['x','y']).astype(float)
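
If a record may wrap across lines in the file, matching line by line can miss a pair whose `x` and `y` end up on different lines. One option, sketched here under the assumption that the file fits in memory, is to read the whole file at once. The file below is a temporary one created just for this example, and `extract_xy` is a hypothetical helper name.

```python
import os
import re
import tempfile

PAIR = re.compile(r"x\s+(-?\d\.\d+)\s+y\s+(-?\d\.\d+)")


def extract_xy(path):
    """Read the whole file and return a list of (x, y) string pairs."""
    with open(path, encoding="utf-8") as handle:
        # \s+ also matches line breaks, so wrapped pairs are still found.
        return PAIR.findall(handle.read())


# Write a small, made-up file in which the second pair is split across lines.
content = "<0B> x -0.1279636 y 0.0819952 <00>\n<0B> x -0.1279636\n y 0.04919712 <00>\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write(content)

rows = extract_xy(tmp.name)
os.remove(tmp.name)

print("x\ty")
for x, y in rows:
    print(f"{x}\t{y}")
```

The resulting `rows` list has the same shape as the one collected line by line above, so it can be passed to `pd.DataFrame(rows, columns=['x','y'])` in the same way.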