How do I remove characters from the substrings in a list generated by an XML xpath search?

Time: 2012-08-16 19:33:41

Tags: python xpath

This question is a follow-up to an earlier one. If you need more background, you can find the original question here:

Populating Python list using data obtained from lxml xpath command

I have incorporated @ihor-kaharlichenko's excellent suggestion (from my original question) into the revised code, shown here:

from lxml import etree as ET
from datetime import datetime

xmlDoc = ET.parse('http://192.168.1.198/Bench_read_scalar.xml')

response = xmlDoc.getroot()
tags = (
'address',
'status',
'flow',
'dp',
'inPressure',
'actVal',
'temp',
'valveOnPercent',
)

dmtVal = []

for dmt in response.iter('dmt'):
    val = [str(dmt.xpath('./%s/text()' % tag)) for tag in tags]
    val.insert(0, str(datetime.now())) #Add timestamp at beginning of each record
    dmtVal.append(val)

for item in dmtVal:
    str(item).strip('[')
    str(item).strip(']')
    str(item).strip('"')

This last block is where I am having trouble. The data I get for dmtVal looks like this:

[['2012-08-16 12:38:45.152222', "['0x46']", "['0x32']", "['1.234']", "['5.678']", "['9.123']", "['4.567']", "['0x98']", "['0x97']"], ['2012-08-16 12:38:45.152519', "['0x47']", "['0x33']", "['8.901']", "['2.345']", "['6.789']", "['0.123']", "['0x96']", "['0x95']"]]

However, I really want the data to look like this:

[['2012-08-16 12:38:45.152222', '0x46', '0x32', '1.234', '5.678', '9.123', '4.567', '0x98', '0x97'], ['2012-08-16 12:38:45.152519', '0x47', '0x33', '8.901', '2.345', '6.789', '0.123', '0x96', '0x95']]

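I thought this would be a fairly simple string-stripping job. I first tried the stripping code inside the original iteration (where dmtVal is initially populated), but that did not work, so I moved the stripping outside the loop, as listed above, and it still does not work. I suspect I am making some noob error, but I cannot find it. Any suggestions are welcome!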


Thanks to everyone for the prompt and helpful replies. Here is the corrected code:

from lxml import etree as ET
from datetime import datetime

xmlDoc = ET.parse('http://192.168.1.198/Bench_read_scalar.xml')

print '...Starting to parse XML nodes'

response = xmlDoc.getroot()

tags = (
'address',
'status',
'flow',
'dp',
'inPressure',
'actVal',
'temp',
'valveOnPercent',
)

dmtVal = []

for dmt in response.iter('dmt'):
    val = [' '.join(dmt.xpath('./%s/text()' % tag)) for tag in tags]
    val.insert(0, str(datetime.now())) #Add timestamp at beginning of each record
    dmtVal.append(val)

Which yields:

...Starting to parse XML nodes
[['2012-08-16 14:41:10.442776', '0x46', '0x32', '1.234', '5.678', '9.123', '4.567', '0x98', '0x97'], ['2012-08-16 14:41:10.443052', '0x47', '0x33', '8.901', '2.345', '6.789', '0.123', '0x96', '0x95']]
...Done

Thanks, everyone!

5 answers:

Answer 0 (score: 2)

Take your current data to be grps.

Solution 1 - ast.literal_eval

import ast
grps = [['2012-08-16 12:38:45.152222', "['0x46']", "['0x32']", "['1.234']", "['5.678']", "['9.123']", "['4.567']", "['0x98']", "['0x97']"], ['2012-08-16 12:38:45.152519', "['0x47']", "['0x33']", "['8.901']", "['2.345']", "['6.789']", "['0.123']", "['0x96']", "['0x95']"]]
desired_output = [[grp[0]] + [ast.literal_eval(item)[0] for item in grp[1:]] for grp in grps]

print(desired_output)

Output:

[['2012-08-16 12:38:45.152222', '0x46', '0x32', '1.234', '5.678', '9.123', '4.567', '0x98', '0x97'], ['2012-08-16 12:38:45.152519', '0x47', '0x33', '8.901', '2.345', '6.789', '0.123', '0x96', '0x95']]

Explanation:

ast.literal_eval is a safe alternative to eval. It only evaluates literal data types (strings, numbers, tuples, lists, dicts, booleans and None). In your case, it evaluates the string "['1.0']" into a list of length 1, namely ['1.0']. You may want to read up on list comprehensions and make sure you understand them.

Another way to write this is:

desired_output = []
for grp in grps:  # loop through each group
    new_grp = [grp[0]]  # start the new group with the first element (the timestamp string)
    for item in grp[1:]:  # loop over every item from index 1 to the end
        evaluated_item = ast.literal_eval(item)  # parse the string into a one-item list
        new_grp.append(evaluated_item[0])  # append the single item of that list to new_grp
    desired_output.append(new_grp)  # append the new_grp to the desired_output list
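
To make the behaviour of ast.literal_eval described above concrete, here is a minimal sketch. The variable names are only illustrative, and the rejected expression is a made-up example of input that is not a plain literal.

```python
import ast

# A string holding the repr of a one-item list, as in the question's data.
parsed = ast.literal_eval("['1.0']")
first = parsed[0]  # the bare string inside the list

# Anything that is not a plain literal is refused instead of being executed.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
```

Here parsed is a real list, so indexing it with [0] recovers the original text without any manual stripping.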

Solution 2 - regular expressions

import re
stripper = re.compile(r"[\[\]']")  # raw string: matches '[', ']' or a single quote
grps = [['2012-08-16 12:38:45.152222', "['0x46']", "['0x32']", "['1.234']", "['5.678']", "['9.123']", "['4.567']", "['0x98']", "['0x97']"], ['2012-08-16 12:38:45.152519', "['0x47']", "['0x33']", "['8.901']", "['2.345']", "['6.789']", "['0.123']", "['0x96']", "['0x95']"]]
desired_output = [[grp[0]] + [ stripper.sub('', item) for item in grp[1:]] for grp in grps]

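The problem with your solution is that strings are immutable: str(item).strip(...) returns a new string, and your loop throws that result away. The items iterated over in the for loop are not passed by reference, so changing them does not affect the original data.

Solution 3 - fixing the original code

To fix your solution, you could do this: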
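
A short sketch of why the stripping loop in the question has no visible effect. The sample values are taken from the question's data; the variable names are only illustrative.

```python
values = ["['0x46']", "['0x32']"]

for v in values:
    # strip() builds a new string; assigning it to v only rebinds the
    # loop variable and leaves the list element untouched.
    v = v.strip("[]'")

unchanged = list(values)  # snapshot: still the wrapped strings

# Building a new list (or assigning by index) is what actually keeps the result.
fixed = [v.strip("[]'") for v in values]
```

The same reasoning applies to str(item).strip('['): it works on a temporary string, and nothing stores the value it returns.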

for i, grp in enumerate(dmtVal):  # loop over the inner lists
    for j, item in enumerate(grp):
        dmtVal[i][j] = item.strip(']')  # remove the trailing bracket
        dmtVal[i][j] = dmtVal[i][j].lstrip('[')  # remove the leading bracket
        dmtVal[i][j] = dmtVal[i][j].strip("'")  # remove the surrounding quotes

Instead of assigning the value back to dmtVal[i][j] after every strip, you could work on the local value item, manipulate it, and assign it back to dmtVal[i][j] once at the end.

for i, grp in enumerate(dmtVal):  # loop over the inner lists
    for j, item in enumerate(grp):
        # Could instead be
        item = item.strip(']')
        item = item.lstrip('[')
        item = item.strip("'")
        dmtVal[i][j] = item

Or a better solution (imho):

for i, grp in enumerate(dmtVal):  # loop over the inner lists
    for j, item in enumerate(grp):
        dmtVal[i][j] = item.replace('[', '').replace(']', '').replace("'", '')

Answer 1 (score: 1)

This will do what you need, though maybe not in the best way:

new_dmt_val = []
for sublist in dmtVal:
    new_dmt_val.append([elem.strip('[\'').strip('\']') for elem in sublist])

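I tried to keep it readable; it could be shorter, but it would be more confusing.

Answer 2 (score: 1)

The answer is: don't create the strings in the first place.

Your problem is in this part of the code: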

for dmt in response.iter('dmt'):
    val = [str(dmt.xpath('./%s/text()' % tag)) for tag in tags]

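I guess you are using str() here to try to extract the string from the list that xpath() returns.
However, that is not what you get; str() just gives you the string representation of the list.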

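You have a few options for doing what you want, but given that you are parsing XML and therefore cannot be sure how many elements the list will contain, your best bet is probably to use ''.join():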

for dmt in response.iter('dmt'):
    val = [''.join(dmt.xpath('./%s/text()' % tag)) for tag in tags]


Edit: if you use this code, you don't need the final loop.
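
The difference between str() and ''.join() can be seen without lxml. In this sketch, plain lists stand in for what dmt.xpath('./tag/text()') returns; the three sample lists are assumptions for illustration, not data from the question.

```python
# Stand-ins for xpath() results: a list of text nodes, possibly empty.
one_match = ['0x46']
no_match = []
two_matches = ['1.2', '34']

as_repr = str(one_match)         # the list's printable form, brackets and quotes included
as_text = ''.join(one_match)     # just the text itself

empty_text = ''.join(no_match)   # a missing tag becomes an empty string
merged = ''.join(two_matches)    # several text nodes are concatenated
```

Because join works for any number of matches, the list comprehension never raises on a missing tag, which indexing with [0] would do.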

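Answer 3 (score: 1)

string.strip only strips leading and trailing characters. You may want to use string.replace instead. Also note that string.strip (and string.replace) return a copy of the string rather than changing it in place.

Or simply use ''.join() instead of str and drop the whole stripping business altogether: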

val = [''.join(dmt.xpath('./%s/text()' % tag)) for tag in tags]
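
For the earlier point about strip touching only the ends of a string, a small sketch may help. The strings below are illustrative; the first one is a value from the question's data.

```python
wrapped = "['0x46']"

# strip() stops at the first character on each side that is not in the set,
# so the quotes survive when only the brackets are listed.
ends_only = wrapped.strip("[]")

# replace() removes every occurrence, wherever it appears.
replaced = wrapped.replace("[", "").replace("]", "").replace("'", "")

# Characters in the middle of a string are never touched by strip().
interior = "a[b]c".strip("[]")
```

In every case wrapped itself is left as it was, since both methods return new strings.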

As a side note, you may also want to use datetime.isoformat instead of str:

val.insert(0, datetime.now().isoformat()) #Add timestamp at beginning of each record

See help(datetime) for more options.

Answer 4 (score: 1)

xml is the string from the original post... (I think this covers both approaches...)
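
As a rough comparison of the two timestamp formats, the sketch below uses a fixed datetime (taken from the question's sample output) so the results are reproducible. The timespec argument assumes Python 3.6 or later, and fromisoformat assumes Python 3.7 or later.

```python
from datetime import datetime

stamp = datetime(2012, 8, 16, 12, 38, 45, 152222)

as_str = str(stamp)        # space between date and time
as_iso = stamp.isoformat() # ISO 8601 form with a 'T' separator

# isoformat also lets you choose the separator and the precision explicitly.
to_seconds = stamp.isoformat(sep=' ', timespec='seconds')

# The ISO string can be parsed back into an equal datetime.
round_trip = datetime.fromisoformat(as_iso)
```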

from lxml import etree
from datetime import datetime
from ast import literal_eval

tree = etree.fromstring(xml).getroottree()
dmts = []
for dmt in tree.iterfind('dmt'):
    to_add = {'datetime': datetime.now()}
    to_add.update( {n.tag:literal_eval(n.text) for n in dmt} )
    dmts.append(to_add)

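You can still explicitly order the nodes later, though I find this approach cleaner because you work with names rather than indices (it depends on whether introducing or removing nodes should count as an error).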
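
A self-contained sketch of the name-based approach follows. It swaps in the standard library's xml.etree.ElementTree for lxml so it runs without extra packages, keeps the values as strings instead of passing them through literal_eval, and uses a made-up SAMPLE_XML, since the real document's layout is not shown in the question. The helper name read_records is likewise hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical sample; only the tag names come from the question.
SAMPLE_XML = """
<response>
  <dmt><address>0x46</address><status>0x32</status><flow>1.234</flow></dmt>
  <dmt><address>0x47</address><status>0x33</status><flow>8.901</flow></dmt>
</response>
""".strip()


def read_records(xml_text):
    """Return one dict per <dmt> element, keyed by child tag name."""
    root = ET.fromstring(xml_text)
    records = []
    for dmt in root.iter('dmt'):
        record = {'datetime': datetime.now().isoformat()}
        # child.text is None for empty elements, so fall back to ''.
        record.update({child.tag: (child.text or '').strip() for child in dmt})
        records.append(record)
    return records


records = read_records(SAMPLE_XML)
```

Looking a value up as records[0]['flow'] keeps working if tags are added or reordered, whereas a positional index would silently point at a different field.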