I am trying to extract specific information from a text file, and I'm not sure how to go about it, so I'm asking for help here.
text ="65097 3785 <00> tag KV-C203 fmt 65 typ KVMAxLOG:WM_area_results_table dat <0A>
<0B> stroke 0 area_results <0B> area_centre_xy <0B> x -0.1279636 y 0.0819952
<00> plane_deviation 0 area_id 10 area_measurement_ok TRUE plane_deviation_check_done
FALSE plane_deviation_check_ok FALSE FSM_check_ok FALSE FSM_check_done FALSE
leveling_method LEVELING_METHOD_TRADITIONAL <00> x_gridlines_shift 0 nr_of_x_gridlines
5 nr_of_y_gridlines 38 <00> <0B> stroke 0 area_results <0B> area_centre_xy <0B>
x -0.1279636 y 0.04919712 <00> plane_deviation 0 area_id 9 area_measurement_ok TRUE
plane_deviation_check_done FALSE plane_deviation_check_ok FALSE FSM_check_ok FALSE
FSM_check_done FALSE leveling_method LEVELING_METHOD_TRADITIONAL <00>
x_gridlines_shift 0 nr_of_x_gridlines 9 nr_of_y_gridlines 61 <00> <0B>
stroke 0 area_results <0B> area_centre_xy <0B> x -0.1279636 y 0.01639904 <00>
plane_deviation 0 area_id 8 area_measurement_ok TRUE plane_deviation_check_done FALSE
plane_deviation_check_ok FALSE FSM_check_ok FALSE FSM_check_done FALSE leveling_method
LEVELING_METHOD_TRADITIONAL <00> x_gridlines_shift 0 nr_of_x_gridlines 9
nr_of_y_gridlines 61 <00> <0B> stroke 0 area_results <0B> area_centre_xy <0B>
x -0.1279636 y -0.01639904 <00> plane_deviation 0 area_id 7 area_measurement_ok TRUE
plane_deviation_check_done FALSE plane_deviation_check_ok FALSE FSM_check_ok FALSE
FSM_check_done FALSE leveling_method LEVELING_METHOD_TRADITIONAL <00> x_gridlines_shift
0 nr_of_x_gridlines 9 nr_of_y_gridlines 61 <00> <0B> stroke 0 area_results
<0B> area_centre_xy <0B> x -0.1279636 y -0.04919712 <00> plane_deviation 0
area_id 6 area_measurement_ok TRUE plane_deviation_check_done FALSE
plane_deviation_check_ok FALSE FSM_check_ok FALSE FSM_check_done FALSE
leveling_method LEVELING_METHOD_TRADITIONAL <00> x_gridlines_shift 0 nr_of_x_gridlines
9 nr_of_y_gridlines 61 <00> <0B> stroke 0 area_results <0B> area_centre_xy
<0B> x -0.1279636 y -0.0819952 <00> plane_deviation 0 area_id 5
area_measurement_ok TRUE plane_deviation_check_done FALSE plane_deviation_check_ok
FALSE FSM_check_ok FALSE FSM_check_done FALSE leveling_method
LEVELING_METHOD_TRADITIONAL <00> x_gridlines_shift 0 nr_of_x_gridlines
5 nr_of_y_gridlines 38 <00> <00> <00> \n None None None None
None None None None None None None None None None None"
Expected output
x y
-0.1279636 0.0819952
-0.1279636 0.04919712
-0.1279636 0.01639904
-0.1279636 -0.01639904
-0.1279636 -0.04919712
-0.1279636 -0.0819952
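Answer 0: (score: 1)
I'm not sure what the data structure here is, but this code extracts the values from this particular string. I'm fairly confident it will also work on other inputs that look similar.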
xvals = []
yvals = []
# Split on the <00> markers first, then on the <0B> markers
split1 = text.split("<00>")
for item1 in split1:
    split2 = item1.split("<0B>")
    for item2 in split2:
        # split() with no argument splits on any whitespace, including newlines
        split3 = item2.split()
        if "x" in split3 and "y" in split3:
            # The value is the token right after the "x" / "y" key
            xvals.append(float(split3[split3.index("x")+1]))
            yvals.append(float(split3[split3.index("y")+1]))
print(xvals)
print(yvals)
Output:
[-0.1279636, -0.1279636, -0.1279636, -0.1279636, -0.1279636, -0.1279636] #x_vals
[0.0819952, 0.04919712, 0.01639904, -0.01639904, -0.04919712, -0.0819952] #y_vals
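If you want the two lists printed in the "x y" layout from the question, one possible sketch is to zip them together. The short `xvals` and `yvals` lists below are only a stand-in for the full lists produced above.

```python
# Stand-in for the lists built by the loop above (first two pairs only).
xvals = [-0.1279636, -0.1279636]
yvals = [0.0819952, 0.04919712]

# Pair each x with its y and format one pair per line under a header.
lines = ["x y"] + [f"{x} {y}" for x, y in zip(xvals, yvals)]
print("\n".join(lines))
```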
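Answer 1: (score: 1)
To be clear, this answer only addresses the text posted in the question. The OP will have to think carefully about how to generalize this regular expression to whatever variations of the input they want to run it on.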
import re
x = re.findall( r' x *?([\-0-9\.]+)', text )
y = re.findall( r' y *?([\-0-9\.]+)', text )
print( x )
print( y )
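Basically, re.findall returns every match of the parenthesized group in the pattern. Since the values in the sample text always appear as "[space]x[space]..." and "[space]y[space]...", you can build a pattern that searches for that prefix and captures only the numeric characters (minus sign, digits 0-9, and decimal point).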
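As a small illustration of the capture-group behaviour, here is the same pattern run on a shortened string. The `sample` string is made up for this example and is not the full text from the question.

```python
import re

# Hypothetical shortened sample, not the full text from the question.
sample = "<0B> x -0.1279636 y 0.0819952 <00> area_id 10 <0B> x -0.1279636 y 0.04919712 <00>"

# With a single capture group, findall returns only the captured part of each match.
x_strings = re.findall(r' x *?([\-0-9\.]+)', sample)
y_strings = re.findall(r' y *?([\-0-9\.]+)', sample)

print(x_strings)  # ['-0.1279636', '-0.1279636']
print(y_strings)  # ['0.0819952', '0.04919712']
```

Note that the results are strings, not floats.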
Note that you can wrap the large text block in triple quotes (""") so that you don't have to deal with the line breaks. For example:
text = """start of text
words on new line
more words on new line"""
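One thing to keep in mind: a triple-quoted string keeps its line breaks, so a key that starts a new line is preceded by a newline rather than a space. The sketch below uses a made-up two-line string (`wrapped`) to show the difference. The `\b` and `\s+` variant is my own suggestion, not part of the pattern above.

```python
import re

# Hypothetical sample: the second "x" starts a new line, as in a wrapped log.
wrapped = """<0B> x -0.1279636 y 0.0819952 <00> <0B>
x -0.1279636 y 0.04919712 <00>"""

# A literal-space pattern misses the "x" that directly follows a newline.
space_only = re.findall(r' x *?([\-0-9\.]+)', wrapped)

# \b plus \s+ accepts any whitespace, including newlines, around the key.
any_space = re.findall(r'\bx\s+(-?[0-9.]+)', wrapped)

print(len(space_only), len(any_space))  # 1 2
```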
Answer 2: (score: 1)
If all you need is a hard-coded way to find the x's and y's, it can be done easily as follows:
import re
import pandas as pd
df = pd.DataFrame()
df['x'] = re.findall(r'x\s+([+-]?[0-9]*[.]?[0-9]+)', text)
df['y'] = re.findall(r'y\s+([+-]?[0-9]*[.]?[0-9]+)', text)
Answer 3: (score: 1)
import re
for x, y in re.findall(r'x\s+(-?\d\.\d+)\s+y\s+(-?\d\.\d+)', text):
    print(x, y)
Result:
-0.1279636 0.0819952
-0.1279636 0.04919712
-0.1279636 0.01639904
-0.1279636 -0.01639904
-0.1279636 -0.04919712
-0.1279636 -0.0819952
If you read the text line by line into sample and want to store the data in a DataFrame:
import re
import pandas as pd
# DataFrame.append was removed in pandas 2.0, so collect the matches
# first and build the DataFrame once at the end
rows = []
for text in sample:
    rows.extend(re.findall(r'x\s+(-?\d\.\d+)\s+y\s+(-?\d\.\d+)', text))
df = pd.DataFrame(rows, columns=['x','y'])
findall returns strings; if you need numbers, you have to specify the dtype:
rows = []
for text in sample:
    rows.extend(re.findall(r'x\s+(-?\d\.\d+)\s+y\s+(-?\d\.\d+)', text))
# Convert the matched strings to floats in one step
df = pd.DataFrame(rows, columns=['x','y']).astype(float)
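If you don't need pandas, the same line-by-line idea can be sketched with the standard library only. The `extract_pairs` helper and the sample lines are made up for illustration, with `io.StringIO` standing in for an open file. I also used `\d+` before the decimal point so values with more than one leading digit would match, which is an assumption beyond the data shown in the question.

```python
import io
import re

# Same paired pattern as above, but allowing several digits before the point.
PAIR = re.compile(r'x\s+(-?\d+\.\d+)\s+y\s+(-?\d+\.\d+)')

def extract_pairs(lines):
    """Return (x, y) float tuples found in an iterable of text lines."""
    pairs = []
    for line in lines:
        for x, y in PAIR.findall(line):
            pairs.append((float(x), float(y)))
    return pairs

# io.StringIO stands in for an open file; the content is a made-up sample.
sample = io.StringIO(
    "<0B> x -0.1279636 y 0.0819952 <00> area_id 10\n"
    "<0B> x -0.1279636 y -0.0819952 <00> area_id 5\n"
)
pairs = extract_pairs(sample)
print(pairs)
```

Because each line is searched on its own, an x/y pair that is wrapped across two lines would be missed. In that case it is probably simpler to read the whole file into one string and run findall once.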