Question

I am trying to extract specific information from a text file, but I am not sure how to go about it. Any help would be appreciated.

text ="65097    3785    <00>    tag KV-C203 fmt 65  typ KVMAxLOG:WM_area_results_table  dat <0A>    
 <0B>   stroke  0   area_results    <0B>    area_centre_xy  <0B>    x   -0.1279636  y   0.0819952   
 <00>   plane_deviation 0   area_id 10  area_measurement_ok TRUE    plane_deviation_check_done  
 FALSE  plane_deviation_check_ok    FALSE   FSM_check_ok    FALSE   FSM_check_done  FALSE   
 leveling_method    LEVELING_METHOD_TRADITIONAL <00>    x_gridlines_shift   0   nr_of_x_gridlines 
 5  nr_of_y_gridlines   38  <00>    <0B>    stroke  0   area_results    <0B>    area_centre_xy  <0B>
 x  -0.1279636  y   0.04919712  <00>    plane_deviation 0   area_id 9   area_measurement_ok TRUE    
 plane_deviation_check_done FALSE   plane_deviation_check_ok    FALSE   FSM_check_ok    FALSE   
 FSM_check_done FALSE   leveling_method LEVELING_METHOD_TRADITIONAL <00>    
 x_gridlines_shift  0   nr_of_x_gridlines   9   nr_of_y_gridlines   61  <00>    <0B>    
 stroke 0   area_results    <0B>    area_centre_xy  <0B>    x   -0.1279636  y   0.01639904  <00>    
 plane_deviation    0   area_id 8   area_measurement_ok TRUE    plane_deviation_check_done  FALSE   
 plane_deviation_check_ok   FALSE   FSM_check_ok    FALSE   FSM_check_done  FALSE   leveling_method 
 LEVELING_METHOD_TRADITIONAL    <00>    x_gridlines_shift   0   nr_of_x_gridlines   9   
 nr_of_y_gridlines  61  <00>    <0B>    stroke  0   area_results    <0B>    area_centre_xy  <0B>    
 x  -0.1279636  y   -0.01639904 <00>    plane_deviation 0   area_id 7   area_measurement_ok TRUE    
 plane_deviation_check_done FALSE   plane_deviation_check_ok    FALSE   FSM_check_ok    FALSE   
 FSM_check_done FALSE   leveling_method LEVELING_METHOD_TRADITIONAL <00>    x_gridlines_shift   
 0  nr_of_x_gridlines   9   nr_of_y_gridlines   61  <00>    <0B>    stroke  0   area_results    
 <0B>   area_centre_xy  <0B>    x   -0.1279636  y   -0.04919712 <00>    plane_deviation 0   
 area_id    6   area_measurement_ok TRUE    plane_deviation_check_done  FALSE   
 plane_deviation_check_ok   FALSE   FSM_check_ok    FALSE   FSM_check_done  FALSE   
 leveling_method    LEVELING_METHOD_TRADITIONAL <00>    x_gridlines_shift   0   nr_of_x_gridlines   
 9  nr_of_y_gridlines   61  <00>    <0B>    stroke  0   area_results    <0B>    area_centre_xy  
 <0B>   x   -0.1279636  y   -0.0819952  <00>    plane_deviation 0   area_id 5   
 area_measurement_ok    TRUE    plane_deviation_check_done  FALSE   plane_deviation_check_ok    
 FALSE  FSM_check_ok    FALSE   FSM_check_done  FALSE   leveling_method 
 LEVELING_METHOD_TRADITIONAL    <00>    x_gridlines_shift   0   nr_of_x_gridlines   
 5  nr_of_y_gridlines   38  <00>    <00>    <00>    \n  None    None    None    None    
 None   None    None    None    None    None    None    None    None    None    None"

Expected output

x             y
-0.1279636   0.0819952
-0.1279636   0.04919712
-0.1279636   0.01639904
-0.1279636  -0.01639904
-0.1279636  -0.04919712
-0.1279636  -0.0819952

Answer 1

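I'm not sure what the underlying data structure is, but this code extracts the values from this particular string. It should also work for other inputs that follow the same layout.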

xvals = []
yvals = []
split1 = text.split("<00>")
for item1 in split1:
    split2 = item1.split("<0B>")
    for item2 in split2:
        split3 = [x for x in item2.split(" ") if x != ""]
        if "x" in split3 and "y" in split3:
            xvals.append(float(split3[split3.index("x")+1]))
            yvals.append(float(split3[split3.index("y")+1]))

print(xvals)
print(yvals)

Output:

[-0.1279636, -0.1279636, -0.1279636, -0.1279636, -0.1279636, -0.1279636] #x_vals
[0.0819952, 0.04919712, 0.01639904, -0.01639904, -0.04919712, -0.0819952] #y_vals
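To get from the two lists to the two-column layout shown under "Expected output", the values can be paired with zip. This is only a sketch: the lists below are copied from the output above so the snippet runs on its own, and the column width of 12 is an arbitrary choice.

```python
# Values copied from the output above; in practice these would be the
# xvals and yvals lists built by the loop in this answer.
xvals = [-0.1279636, -0.1279636, -0.1279636, -0.1279636, -0.1279636, -0.1279636]
yvals = [0.0819952, 0.04919712, 0.01639904, -0.01639904, -0.04919712, -0.0819952]

# Pair each x with its y and left-align x in a 12-character column
rows = [f"{str(x):<12} {y}" for x, y in zip(xvals, yvals)]

print(f"{'x':<12} y")
print("\n".join(rows))
```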

Answer 2

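To be clear, this answer only addresses the text posted in the question. The OP will need to think carefully about how to generalise the regular expression to whatever variations of the input it is meant to run on.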

import re
x = re.findall( r' x *?([\-0-9\.]+)', text )
y = re.findall( r' y *?([\-0-9\.]+)', text )
print( x )
print( y )

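Essentially, re.findall returns every match of the part of the pattern inside the parentheses. Since the values in the sample text always appear as "[space]x[space]..." and "[space]y[space]...", you can build a pattern that searches for that prefix and captures only the numeric characters (minus sign, digits 0-9, and decimal point).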
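As a small illustration of the capture-group behaviour, here are the same patterns run on a one-record excerpt of the question's text. The second half is an assumption on my part: if the real file is tab-separated, a literal space in the pattern will not match, so a variant using \s+ and \b is sketched on a made-up tabbed string.

```python
import re

# Short excerpt of the question's text (one record only)
excerpt = ("area_centre_xy  <0B>    x   -0.1279636  y   0.0819952   <00>  "
           "x_gridlines_shift   0   nr_of_x_gridlines 5")

# Only the parenthesised group is returned, not the leading " x" / " y"
x_matches = re.findall(r' x *?([\-0-9\.]+)', excerpt)
y_matches = re.findall(r' y *?([\-0-9\.]+)', excerpt)
print(x_matches)  # ['-0.1279636']
print(y_matches)  # ['0.0819952']

# Assumption: the real file may use tabs. \s+ matches any whitespace, and
# \b keeps the pattern from matching an x or y that ends a longer name.
tabbed = "x\t-0.5\ty\t0.25"  # hypothetical tab-separated sample
x_tab = re.findall(r'\bx\s+(-?\d+(?:\.\d+)?)', tabbed)
y_tab = re.findall(r'\by\s+(-?\d+(?:\.\d+)?)', tabbed)
print(x_tab, y_tab)  # ['-0.5'] ['0.25']
```

Names such as x_gridlines_shift and nr_of_x_gridlines are skipped because the x in them is not followed by whitespace and a number.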

Note that you can enclose a large block of text in triple quotes (""") so that you don't have to deal with the line breaks. For example:

text = """start of text
words on new line
more words on new line"""

Answer 3

If you only need a hard-coded way to find the x and y values, it can be done easily as follows:

import re
import pandas as pd

df = pd.DataFrame()

# Raw strings keep \s from being treated as a string escape
df['x'] = re.findall(r'x\s+([+-]?[0-9]*[.]?[0-9]+)', text)
df['y'] = re.findall(r'y\s+([+-]?[0-9]*[.]?[0-9]+)', text)

Answer 4

import re

# Match each "x <number> y <number>" pair in one pass (raw string for \s, \d)
for x, y in re.findall(r'x\s+(-?\d\.\d+)\s+y\s+(-?\d\.\d+)', text):
   print(x, y)

Result:

-0.1279636 0.0819952
-0.1279636 0.04919712
-0.1279636 0.01639904
-0.1279636 -0.01639904
-0.1279636 -0.04919712
-0.1279636 -0.0819952

If you read the text line by line into sample and want to store the data in a DataFrame:

import re
import pandas as pd

frames = []

for text in sample:
   a = re.findall(r'x\s+(-?\d\.\d+)\s+y\s+(-?\d\.\d+)', text)
   frames.append(pd.DataFrame(a, columns=['x','y']))

# DataFrame.append was removed in pandas 2.0; concatenate once instead
df = pd.concat(frames, ignore_index=True)

findall returns strings; if you need numbers, you have to specify the dtype:

frames = []

for text in sample:
   a = re.findall(r'x\s+(-?\d\.\d+)\s+y\s+(-?\d\.\d+)', text)
   frames.append(pd.DataFrame(a, columns=['x','y'], dtype=float))

# DataFrame.append was removed in pandas 2.0; concatenate once instead
df = pd.concat(frames, ignore_index=True)
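Since the question starts from a text file, here is a minimal sketch of the same idea without pandas, reading lines from a file-like object and converting the matches to floats. The helper name extract_xy is made up for this example, and io.StringIO stands in for a real file so the snippet is self-contained.

```python
import io
import re

# Same pattern as in this answer, compiled once for reuse
XY_PATTERN = re.compile(r'x\s+(-?\d\.\d+)\s+y\s+(-?\d\.\d+)')

def extract_xy(lines):
    """Return a list of (x, y) float tuples found in an iterable of lines."""
    pairs = []
    for line in lines:
        for x, y in XY_PATTERN.findall(line):
            pairs.append((float(x), float(y)))
    return pairs

# io.StringIO stands in for an open file; with a real file this would be
# something like: with open(path) as fh: pairs = extract_xy(fh)
fake_file = io.StringIO(
    "area_centre_xy  <0B>    x   -0.1279636  y   0.0819952   <00>\n"
    "area_centre_xy  <0B>    x   -0.1279636  y   -0.0819952  <00>\n"
)
pairs = extract_xy(fake_file)
print(pairs)
```

One caveat: matching line by line will miss a pair whose x and y end up on different lines. If that can happen in your file, read the whole file into one string first and run findall on that.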

Extracting specific information from a text file / text
