Question

Here is a sample of the raw text I am reading:

ID: 00000001
SENT: to do something
to    01573831
do    02017283
something    03517283

ID: 00000002
SENT: just an example
just    06482823
an    01298744
example    01724894

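Now I am trying to split it into a list of lists.

Top-level list: split by ID, which gives 2 elements here (done)

Next level: within each ID, split by newline

Last level: within each line, split the word from its ID. For lines starting with ID or SENT, it does not matter whether they are split. The word and its ID are separated by a tab (\t).

Current code: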

f=open("text.txt","r")
raw=list(f)
text=" ".join(raw)
wordlist=text.split("\n \n ") #split by ID
toplist=wordlist[:2] #just take 2 IDs

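Edit: I intend to cross-reference these words against another text file to add their word classes, which is why I am asking for a list of lists.

Steps:

1) Use .append() to add the word class to each word

2) Use "\t".join() to join each line back together

3) Use "\n".join() to join the lines within an ID

4) Use "\n\n".join() to join all the IDs into one string

Output: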

ID: 00000001
SENT: to do something
to    01573831    prep
do    02017283    verb
something    03517283    noun

ID: 00000002
SENT: just an example
just    06482823    adverb
an    01298744    ind-art
example    01724894    noun
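
The four steps listed above could be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the function name annotate and the WORD_CLASSES dictionary are hypothetical, and the dictionary stands in for whatever lookup would really be built from the second text file.

```python
# Hypothetical lookup table; in practice it would be built from the second file.
WORD_CLASSES = {
    "to": "prep", "do": "verb", "something": "noun",
    "just": "adverb", "an": "ind-art", "example": "noun",
}

def annotate(text, word_classes):
    """Split text into a list of lists, add word classes, and join it back."""
    blocks = []
    for block in text.strip().split("\n\n"):      # top level: one block per ID
        rows = []
        for line in block.split("\n"):            # next level: one row per line
            if line.startswith(("ID:", "SENT:")):
                rows.append([line])               # header lines stay unsplit
            else:
                fields = line.split("\t")         # last level: [word, word_id]
                if fields[0] in word_classes:     # step 1: append the word class
                    fields.append(word_classes[fields[0]])
                rows.append(fields)
        blocks.append(rows)
    # steps 2-4: join fields with tabs, lines with "\n", blocks with "\n\n"
    return "\n\n".join(
        "\n".join("\t".join(fields) for fields in rows) for rows in blocks
    )

sample = (
    "ID: 00000001\nSENT: to do something\n"
    "to\t01573831\ndo\t02017283\nsomething\t03517283\n\n"
    "ID: 00000002\nSENT: just an example\n"
    "just\t06482823\nan\t01298744\nexample\t01724894"
)
result = annotate(sample, WORD_CLASSES)
print(result)
```

Words missing from the lookup are left with two fields, so the join steps still work on them.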

Answer 1

A more Pythonic version of Thorsten's answer:

from collections import namedtuple

class Element(namedtuple("ElementBase", "id sent words")):
    @classmethod
    def parse(cls, source):
        lines = source.split("\n")
        return cls(
            id=lines[0][4:],
            sent=lines[1][6:],
            words=dict(
                line.split("\t") for line in lines[2:]
            )
        )

text = """ID: 00000001
SENT: to do something
to\t01573831
do\t02017283
something\t03517283

ID: 00000002
SENT: just an example
just\t06482823
an\t01298744
example\t01724894"""

elements = [Element.parse(part) for part in text.split("\n\n")]

for el in elements:
    print(el)
    print(el.id)
    print(el.sent)
    print(el.words)
    print()
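
The slices lines[0][4:] and lines[1][6:] depend on the prefixes being exactly "ID: " and "SENT: ". If that cannot be guaranteed, a slightly more defensive variant might use str.partition instead. This is a sketch only: parse_element is a hypothetical helper name, and skipping empty lines is an added assumption about the input.

```python
from collections import namedtuple

Element = namedtuple("Element", "id sent words")

def parse_element(source):
    """Parse one ID block without hard-coded slice offsets."""
    lines = source.strip().split("\n")
    # partition(": ") returns (before, separator, after); keep the part after
    id_ = lines[0].partition(": ")[2]
    sent = lines[1].partition(": ")[2]
    # skip empty lines so a stray blank line does not break dict()
    words = dict(line.split("\t") for line in lines[2:] if line)
    return Element(id_, sent, words)

el = parse_element(
    "ID: 00000001\nSENT: to do something\nto\t01573831\ndo\t02017283"
)
print(el.id, el.sent, el.words)
```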

Answer 2

I think of each top-level section as an "object", so I created a class with attributes that correspond to each part.

class Element(object):
    def __init__(self, source):
        lines = source.split("\n")
        self._id = lines[0][4:]
        self._sent = lines[1][6:]
        self._words = {}
        for line in lines[2:]:
            word, id_ = line.split("\t")
            self._words[word] = id_

    @property
    def ID(self):
        return self._id

    @property
    def sent(self):
        return self._sent

    @property
    def words(self):
        return self._words

    def __str__(self):
        return "Element %s, containing %i words" % (self._id, len(self._words))

text = """ID: 00000001
SENT: to do something
to\t01573831
do\t02017283
something\t03517283

ID: 00000002
SENT: just an example
just\t06482823
an\t01298744
example\t01724894"""

elements = [Element(part) for part in text.split("\n\n")]

for el in elements:
    print(el)
    print(el.ID)
    print(el.sent)
    print(el.words)
    print()

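In the main code (one line, a list comprehension), the text is simply split at every double newline. All the remaining logic is delegated to the __init__ method, which keeps it very localized.

Using a class also gives you the benefit of __str__, which lets you control how the object is printed.

You could also consider rewriting the last three lines of __init__ as: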

self._words = dict([line.split("\t") for line in lines[2:]])

But I wrote a plain loop because it seemed easier to understand.
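
To make the equivalence concrete, here is a small sketch that builds the same mapping three ways. The variable names are illustrative only, and the sample lines are taken from the first block of the question.

```python
lines = ["to\t01573831", "do\t02017283", "something\t03517283"]

# 1. explicit loop, as in __init__ above
words_loop = {}
for line in lines:
    word, id_ = line.split("\t")
    words_loop[word] = id_

# 2. dict() over (word, id) pairs
words_dict = dict(line.split("\t") for line in lines)

# 3. dict comprehension
words_comp = {
    word: id_ for word, id_ in (line.split("\t") for line in lines)
}

print(words_loop == words_dict == words_comp)
```

All three raise an error if a line does not contain exactly one tab, so malformed lines are not silently accepted.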

Using a class also gives you…

Answer 3

Does this work for you?

Top level (which you have already done):

def get_parent(text, parent):
    """recursively walk through text, looking for 'ID' tag"""

    # find open_ID and close_ID
    open_ID = text.find('ID')
    close_ID = text.find('ID', open_ID + 1)

    # if there is another instance of 'ID', recursively walk again
    if close_ID != -1:
        parent.append(text[open_ID : close_ID])
        return get_parent(text[close_ID:], parent)
    # base-case 
    else:
        parent.append(text[open_ID:])
        return

Second level: split by newline:

def child_split(parent):
    index = 0
    while index < len(parent):
        parent[index] = parent[index].split('\n')
        index += 1

Third level: split the 'ID' and 'SENT' fields:

def split_field(parent, index):
    if index < len(parent):
        child = 0
        while child < len(parent[index]):
            if ':' in parent[index][child]:
                parent[index][child] = parent[index][child].split(':')
            else:
                parent[index][child] = parent[index][child].split()
            child += 1
        return split_field(parent, index + 1)
    else:
        return

Putting it all together:

def main(text):
    parent = []
    get_parent(text, parent)
    child_split(parent)
    split_field(parent, 0)
    # hand the nested list back to the caller instead of discarding it
    return parent

The result is fully nested; perhaps it could be cleaned up a bit? Or maybe the split_field() function could return a dictionary?
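
As one possible answer to that last question, a dictionary-returning variant might look like the sketch below. The name block_to_dict and the "words" key are hypothetical, and splitting the text on blank lines replaces the recursive search for 'ID'.

```python
def block_to_dict(block):
    """Turn one ID block (a string) into a dictionary."""
    record = {"words": []}
    for line in block.strip().split("\n"):
        if line.startswith(("ID:", "SENT:")):
            # partition splits at the first colon only
            key, _, value = line.partition(":")
            record[key] = value.strip()
        else:
            record["words"].append(line.split())  # [word, word_id]
    return record

text = (
    "ID: 00000001\nSENT: to do something\n"
    "to\t01573831\ndo\t02017283\nsomething\t03517283\n\n"
    "ID: 00000002\nSENT: just an example\n"
    "just\t06482823\nan\t01298744\nexample\t01724894"
)
records = [block_to_dict(block) for block in text.split("\n\n")]
print(records[0])
```

Checking the line prefix rather than looking for ':' anywhere in the line means a word line that happens to contain a colon would still be treated as a word line.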

Answer 4

I'm not sure what output you need, but you can adapt this to suit your needs (this uses the itertools grouper recipe):

>>> from itertools import zip_longest
>>> def grouper(n, iterable, fillvalue=None):
        "Collect data into fixed-length chunks or blocks"
        # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
        args = [iter(iterable)] * n
        return zip_longest(fillvalue=fillvalue, *args)

>>> with open('text.txt') as f:
        print([[x.rstrip().split(None, 1) for x in g if x.rstrip()]
               for g in grouper(6, f, fillvalue='')])


[[['ID:', '00000001'], ['SENT:', 'to do something'], ['to', '01573831'], ['do', '02017283'], ['something', '03517283']], 
 [['ID:', '00000002'], ['SENT:', 'just an example'], ['just', '06482823'], ['an', '01298744'], ['example', '01724894']]]
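
Note that grouper(6, ...) relies on every record occupying exactly six lines (five lines of data plus one blank line), as in the sample. If sentences can differ in length, a variant based on itertools.groupby could group on the blank lines instead. This is a sketch only: read_blocks is a hypothetical name, and the shortened first record below is made up to show the variable-length case.

```python
from itertools import groupby

def read_blocks(lines):
    """Group consecutive non-blank lines; blank lines act as separators."""
    blocks = []
    for is_blank, group in groupby(lines, key=lambda line: not line.strip()):
        if not is_blank:
            blocks.append([line.rstrip().split(None, 1) for line in group])
    return blocks

# Illustrative input: the first record is shortened on purpose.
lines = [
    "ID: 00000001\n", "SENT: to do\n", "to\t01573831\n", "do\t02017283\n",
    "\n",
    "ID: 00000002\n", "SENT: just an example\n", "just\t06482823\n",
    "an\t01298744\n", "example\t01724894\n",
]
blocks = read_blocks(lines)
print(blocks)
```

Passing an open file object instead of the list should work the same way, since groupby only needs an iterable of lines.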

Splitting the elements of a list into lists, then splitting them again
