Question

I'm an almost complete beginner, and I'm trying to use lxml to extract Twitter data from JSON into a pseudo-XML file.

The final structure I need looks like this:

<corpus>
   <text id="NNN" source="NNN">
      text of the message
   </text>
</corpus>

I've managed to get the structure above, but when the text contains hashtags, I need to wrap each hashtag in a new tag, like this:

<corpus>
   <text id="NNN" source="NNNN">
      text of the message with <exhashtag original="#hashtag">hashtag</exhashtag>
   </text>
</corpus>

That is, each hashtag must have its hash character removed and be enclosed in a custom <exhashtag> tag that records its original form.

This is what I've written so far, where text_field is the <text> element of the final pseudo-XML structure and json_text is the text extracted from the JSON:

if re.search(u'(?:\#+[\w_]+[\w\'_\-]*[\w_]+)', json_text) is not None:
   alltags = re.findall(u'(?:\#+[\w_]+[\w\'_\-]*[\w_]+)', json_text)
   for i in alltags:
      if i is not None:
         json_text_hashtags = i
         json_text_nohashtags = re.sub(u'(?:\#+([\w_]+[\w\'_\-]*[\w_])+)', u'\g<1>', i)
         exhashtag = etree.SubElement(text_field, "exhashtag", original=json_text_hashtags)
         exhashtag.text = json_text_nohashtags
         json_textstring_hash = text_field.insert(2,exhashtag)

But the result looks like this:

<corpus>
   <text id="NNN" source="NNNN">
      text of message with #hashtag <exhashtag original="#hashtag">hashtag</exhashtag>
   </text>
</corpus>

Any suggestions on how to place each exhashtag at the correct position within the text? Thanks a lot; I hope I've included all the necessary information.

Answer 1

Instead of text_field.insert, I would just replace text_field.text:

text_field.text = text_field.text.replace(
    json_text_hashtags,
    etree.tostring(exhashtag, encoding=str)
)

By default, etree.tostring serializes the element to a bytes object. Pass str (or unicode in Python 2) as the encoding to get a string instead.
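
One caveat with the replace approach: as far as I know, markup assigned to an element's .text is escaped on serialization, so the output may contain &lt;exhashtag ...&gt; rather than a real child element. An alternative is to build the mixed content through .text and .tail: the text before the first hashtag goes into text_field.text, and the text after each hashtag goes into that <exhashtag> element's .tail. Below is a minimal sketch of that idea, not a definitive implementation. The helper name fill_with_hashtags, the simplified pattern HASHTAG_RE, and the sample id/source values are assumptions made for illustration; the asker's own regex could be substituted. The fallback import only lets the sketch run without lxml installed, since the standard library's ElementTree exposes the same text/tail API.

```python
import re

try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml installed
    import xml.etree.ElementTree as etree

# Simplified hashtag pattern (an assumption, not the asker's exact regex):
# a single '#' followed by word characters, captured without the '#'.
HASHTAG_RE = re.compile(r"#(\w+)")


def fill_with_hashtags(text_field, json_text):
    """Hypothetical helper: put json_text into text_field, wrapping each
    hashtag in an <exhashtag original="#tag">tag</exhashtag> child."""
    last_end = 0
    previous = None  # most recently created <exhashtag>, if any
    for match in HASHTAG_RE.finditer(json_text):
        chunk = json_text[last_end:match.start()]
        if previous is None:
            text_field.text = chunk   # text before the first child
        else:
            previous.tail = chunk     # text between two children
        previous = etree.SubElement(
            text_field, "exhashtag", original=match.group(0)
        )
        previous.text = match.group(1)
        last_end = match.end()
    rest = json_text[last_end:]
    if previous is None:
        text_field.text = rest        # no hashtags at all
    else:
        previous.tail = rest          # text after the last child
    return text_field


corpus = etree.Element("corpus")
text_field = etree.SubElement(corpus, "text", id="1", source="twitter")
fill_with_hashtags(text_field, "text of the message with #hashtag and #more")

# encoding="unicode" returns a str; lxml also accepts encoding=str here.
result = etree.tostring(corpus, encoding="unicode")
print(result)
```

Because the hashtags become real child elements, no string replacement on serialized XML is needed, and any &, < or > in the tweet text is escaped by the serializer.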

