Question

text =  'the text stuff <*to test*> to find a way to extract all text'\
        'that is <*included in special tags*> less than star and greater'\
        'than star'

I have tried the approach from "Adding up re.finditer results".

I have tried numerous regular expressions with import re.

I have tried many variations of \w+.

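I can print the text with the '<* .... *>' tags and use .replace to swap '<*' and '*>' for empty strings, but I cannot use DictReader to extract only the words inside the tags, because the tags appear to be special characters in Python. With DictReader I pull the whole line of text, not just the words inside the control-character tags.

.split does work on the replaced text, but it cannot find text enclosed in tags that contain unusual characters such as <*...*>.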

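I have tried escaping the <, * and > characters, and the pattern \<|*.*?+\*\>, to find the tags or all the text inside them, but this isn't working.

Python does not seem to like these characters being escaped.

I have considered finding <, * and > by their octal codes, but that may be a distortion of how Python works.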

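I found great advice in the Python books by Wes McKinney and Beazley/Jones.

I tested for the starting and ending text, but these special characters cannot be substituted.

Apologies in advance for the complexity of the attempted solutions. I hope I am close with this approach.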

Import csv and register the dialect:

import csv

csv.register_dialect('piper', delimiter='|', quoting=csv.QUOTE_NONE)

Read each row with DictReader:

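# Python 2 code: str.decode and the print statement are Python 2 only
with open('text') as csvfile:
    for row in csv.DictReader(csvfile, dialect='piper'):
        # Strip the tag delimiters, then re-encode from windows-1252 to UTF-8
        row["specialtext"] = row["text"].replace("<*", "").replace("*>", "").decode('windows-1252').encode('utf-8').strip()
        print row['specialtext']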

All of the above works, but every attempt to find the text inside the tags has failed.

Answer 1

Consider using re.findall() to extract all matching text into a list, and escape any special characters (such as the asterisk, *) with a backslash:

import re

text =  'the text stuff <*to test*> to find a way to extract all text'\
        'that is <*included in special tags*> less than star and greater'\
        'than star'

# Raw string so the backslashes reach the regex engine unchanged
txtsearch = re.findall(r'<\*(.*?)\*>', text)

if txtsearch:
    print(txtsearch)

# ['to test', 'included in special tags']
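
To tie this back to the DictReader loop in the question, here is a minimal sketch, assuming Python 3. The in-memory sample, its id column, and the helper name extract_tagged are hypothetical and stand in for the asker's real file, which is not shown. The pattern is built with re.escape so the asterisks do not have to be escaped by hand.

```python
import csv
import io
import re

# Build the pattern from the literal delimiters; re.escape handles the asterisks.
# re.DOTALL lets a tagged span run across a line break, should one occur.
TAG_PATTERN = re.compile(re.escape('<*') + r'(.*?)' + re.escape('*>'), re.DOTALL)


def extract_tagged(value):
    """Return the stripped text of every <*...*> span found in value."""
    return [match.strip() for match in TAG_PATTERN.findall(value)]


# Hypothetical pipe-delimited sample standing in for the asker's file.
sample = io.StringIO(
    'id|text\n'
    '1|the text stuff <*to test*> to find a way\n'
    '2|that is <*included in special tags*> less than star\n'
    '3|no tags here\n'
)

csv.register_dialect('piper', delimiter='|', quoting=csv.QUOTE_NONE)

rows = list(csv.DictReader(sample, dialect='piper'))
for row in rows:
    # Keep only the text inside the tags instead of stripping the tags
    row['specialtext'] = extract_tagged(row['text'])
    print(row['id'], row['specialtext'])
```

With a real file, the io.StringIO object would be replaced by something like open('text', encoding='windows-1252', newline=''), which also covers the decoding step the question does by hand.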

Trying to extract the text inside <* word word word *> tags with Python
