How do I find the character positions of something delimited by <tag> ... </>? (Python)

Asked: 2013-06-05 14:59:50

Tags: python xml tags

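I am trying to get the positions of the < and > that belong to a real tag when they are embedded in something like <tag "510270">calculate</>.

I have sentences like these: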

sentence1 = ("After six weeks and seventeen tentative approaches the only serious "
    "tender came from Daniel. He had offered a paltry #2 a week for the one-time "
    "woodman's home, sane enough in this, at least, to <tag \"510270\">calculate</> "
    "safety to the nearest new penny piece. ")

sentence2 = ("After six weeks and seventeen tentative approaches the only serious "
    "tender came from Daniel. He had offered a paltry #2 a week for the one-time "
    "woodman's < home, sane enough in this, at least, to <tag \"510270\">calculate</> "
    "safety to the nearest new penny > piece. ")

sentence3 = ("After six weeks and seventeen tentative approaches the only serious "
    "tender came from Daniel. He had offered a paltry #2 a week for the one-time "
    "woodman's > home, sane enough in this, at least, to <tag \"510270\">calculate</> "
    "safety to the nearest new penny < piece. ")

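I need cfrom and incfrom to be the positions of the 1st and 2nd < of <tag "XXXX">...</>, and I need cto and incto to be the positions of the 2nd and 1st > of <tag "XXXX">...</>.

How can I handle sentences like sentence2 and sentence3, where < and > also occur outside the <tag "XXXX">...</>?

For sentence1, I can do it like this: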

cfrom, cto = 0, 0
for i, c in enumerate(sentence1):
    if c == "<":
        cfrom = i  # index of the first "<"
        break

for i, c in enumerate(reversed(sentence1)):
    if c == ">":
        cto = len(sentence1) - i  # one past the index of the last ">"
        break

incfrom, incto = 0, 0
fromtrigger, totrigger = False, False
for i, c in enumerate(sentence1[cfrom:]):
    if c == ">":
        incfrom = cfrom + i  # first ">" at or after cfrom
        break

for i, c in enumerate(sentence1[incfrom:cto]):
    if c == "<":
        incto = incfrom + i  # first "<" after incfrom
        break

2 Answers:

Answer 0: (score: 1)

As shown below, you can keep track of where you are as you find the tags:

def parseSentence(sentence):
    cfrom, cto, incfrom, incto = 0, 0, 0, 0
    place = ''  # to keep track of where we are

    for i, c in enumerate(sentence):
        if c == '<':
            # check for 'cfrom'
            if sentence.startswith('<tag', i):
                cfrom = i
                place = 'botag'  # begin-open-tag
            # check for 'incfrom' (startswith avoids an IndexError on a trailing '<')
            elif sentence.startswith('</', i) and place == 'intag':
                incfrom = i
                place = 'bctag'  # begin-close-tag
        elif c == '>':
            # check for 'cto'
            if place == 'botag':  # just after '<tag...'
                cto = i
                place = 'intag'  # now within the XML tag
            # check for 'incto'
            elif place == 'bctag':
                incto = i
                place = ''
                yield (cfrom, cto, incfrom, incto)

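This should work for all of your sentences, but note that it only really works if there is just one <tag>...</> in the sentence. If there are several, it will return the positions of the last <tag>...</>.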

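Edit: if you add a yield to the function, then when there are several <tag>...</> in the sentence, it will iterate over the positions of all the <tag>...</> tags (see above).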
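A more compact way to get the same four positions might be a regular expression. The sketch below is not part of the original answer: the pattern TAG_RE and the function tag_positions are hypothetical names, and it assumes the opening tag always starts with <tag, contains no < or > of its own, and is closed by a bare </>. It uses the same meaning for the four names as the function above (cfrom and cto are the < and > of the opening tag, incfrom and incto are the < and > of the closing </>).

```python
import re

# Hypothetical pattern: an opening <tag ...>, the enclosed text, then a bare </>.
TAG_RE = re.compile(r'<tag\b[^<>]*>(.*?)</>', re.DOTALL)

def tag_positions(sentence):
    """Yield (cfrom, cto, incfrom, incto) for every <tag ...>...</> in sentence."""
    for m in TAG_RE.finditer(sentence):
        cfrom = m.start()      # '<' of the opening tag
        cto = m.start(1) - 1   # '>' that ends the opening tag
        incfrom = m.end(1)     # '<' of the closing '</>'
        incto = m.end() - 1    # '>' of the closing '</>'
        yield cfrom, cto, incfrom, incto

# Stray '<' and '>' outside the tag are ignored because they do not start with '<tag'.
example = 'a < b <tag "1">xy</> c > d'
for positions in tag_positions(example):
    print(positions)
```

Because re.finditer walks over every match, this also covers sentences with several tags, at the cost of being less explicit than the state-tracking loop.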

Answer 1: (score: 0)

If I understand correctly, this should work (assuming you don't change the variables i, c):

cfrom, cto = 0, 0
for i, c in enumerate(sentence1):
    if sentence1.startswith("<tag", i):  # a single character can never equal "<tag"
        cfrom = i
        break

for i, c in enumerate(sentence1[cfrom:]):
    if c == ">":
        cto = cfrom + i  # going forward from cfrom
        break

incfrom, incto = 0, 0
fromtrigger, totrigger = False, False
# after the tag is opened, look for the start of the closing tag
for i, c in enumerate(sentence1[cto:]):
    if sentence1.startswith("</", cto + i):
        incfrom = cto + i
        break

for i, c in enumerate(sentence1[incfrom:]):
    if c == ">":
        incto = incfrom + i  # the ">" that ends the closing tag
        break
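The same forward search can probably be written more directly with str.find, which returns -1 when nothing is found. This is only a sketch under the same assumptions as the loops above (the opening tag starts with <tag and the closer is </>). The function name find_tag and its start parameter are hypothetical, not from the original answer.

```python
def find_tag(sentence, start=0):
    """Return (cfrom, cto, incfrom, incto) for the first <tag ...>...</>
    at or after index start, or None if no complete tag is found."""
    cfrom = sentence.find('<tag', start)   # '<' of the opening tag
    if cfrom == -1:
        return None
    cto = sentence.find('>', cfrom)        # '>' that ends the opening tag
    if cto == -1:
        return None
    incfrom = sentence.find('</', cto)     # '<' of the closing tag
    if incfrom == -1:
        return None
    incto = sentence.find('>', incfrom)    # '>' of the closing tag
    if incto == -1:
        return None
    return cfrom, cto, incfrom, incto

# A stray '>' before the tag and a stray '<' after it do not affect the result,
# because every search starts from the previously found position.
print(find_tag('a > b <tag "1">xy</> c < d'))
```

Searching forward from each previous hit is what makes the stray < and > in sentence2 and sentence3 harmless. To handle several tags, one could call find_tag again with start set to incto + 1.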