Question

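I have a list of documents, labeled with their corresponding categories:

documents = [(list(corpus.words(fileid)), category)
              for category in corpus.categories()
              for fileid in corpus.fileids(category)]

This gives the following list of tuples, where the first element of each tuple is a list of words (the tokens of a sentence). For example:

[([u'A', u'pilot', u'investigation', u'of', u'a', u'multidisciplinary', 
u'quality', u'of', u'life', u'intervention', u'for', u'men', u'with', 
u'biochemical', u'recurrence', u'of', u'prostate', u'cancer', u'.'], 
'cancer'), 
([u'A', u'Systematic', u'Review', u'of', u'the', u'Effectiveness', 
u'of', u'Medical', u'Cannabis', u'for', u'Psychiatric', u',', 
u'Movement', u'and', u'Neurodegenerative', u'Disorders', u'.'], 'hd')]

I want to apply some text processing techniques, but I want to keep the list-of-tuples format.

I know that if I had just a list of words, this would do:

[w.lower() for w in words]

But in this case I want to apply .lower() to the first element (the list of strings) of each tuple in the list, and after trying various options such as:

[[x.lower() for x in element] for element in documents],
[(x.lower(), y) for x,y in documents], or
[x[0].lower() for x in documents]

I always get this error:

AttributeError: 'list' object has no attribute 'lower'

I also tried applying what I need before creating the list, but .categories() and .fileids() are attributes of the corpus, and they return the same error (they are also lists).

Any help would be deeply appreciated.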

Solved:

Both @Adam Smith's answer and @vasia's were right:

[([s.lower() for s in item[0]], item[1]) for item in documents]

@Adam's answer kept the tuple structure; @vasia did it right at the creation of the list of tuples:

documents = [([word.lower() for word in corpus.words(fileid)], category)
              for category in corpus.categories()
              for fileid in corpus.fileids(category)]

Thank you all :)

Answer 1

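So your data structure is [([str], str)]: a list of tuples, where each tuple is (list of strings, string). It's important to understand what that means before you try to pull data out of it.

That means for item in documents iterates over the list of tuples, with item bound to each tuple in turn.

That means item[0] is the list inside each tuple.

That means for item in documents: for s in item[0]: will iterate over every string inside that list. Let's try it!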

[s.lower() for item in documents for s in item[0]]

From your example data, this should give one flat list of lowercased words:

[u'a', u'pilot', u'investigation', u'of', u'a', u'multidisciplinary', ...]

If you're trying to keep the tuple format, you could do:

[([s.lower() for s in item[0]], item[1]) for item in documents]

# or perhaps more readably
[([s.lower() for s in lst], val) for lst, val in documents]

Both statements give:

[([u'a', u'pilot', u'investigation', u'of', u'a', u'multidisciplinary', ...], 'cancer'), ... ]
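
For anyone who wants to try this without a corpus at hand, here is a minimal sketch of both comprehensions. It assumes a shortened version of the question's sample data; the names flat and lowered are only illustrative.

```python
# Shortened sample in the same shape as the question's data:
# a list of (list of words, category) tuples.
documents = [
    ([u'A', u'pilot', u'investigation'], 'cancer'),
    ([u'A', u'Systematic', u'Review'], 'hd'),
]

# Flattened: one list of lowercased words, categories dropped.
flat = [s.lower() for item in documents for s in item[0]]

# Tuple-preserving: same shape as the input, words lowercased.
lowered = [([s.lower() for s in lst], val) for lst, val in documents]

print(flat)
print(lowered)
```

Note that neither comprehension touches the original documents list; both build new lists.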

Answer 2

You're close. You are looking for a construct like this:

[([s.lower() for s in ls], cat) for ls, cat in documents]

This basically puts these two together:

[[x.lower() for x in element] for element in documents],
[(x.lower(), y) for x,y in documents]
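
To see why each of those two attempts fails on its own, here is a small sketch on made-up two-word documents (the sample data and the errors list are assumptions for illustration). In both cases .lower() ends up being called on the word list rather than on a string.

```python
documents = [([u'A', u'pilot'], 'cancer'), ([u'A', u'Systematic'], 'hd')]
errors = []

# Attempt 1: `element` is a whole tuple, so the first `x` is the word list itself.
try:
    [[x.lower() for x in element] for element in documents]
except AttributeError as err:
    errors.append(str(err))

# Attempt 2: the unpacking is right, but `x` is still the list, not a string.
try:
    [(x.lower(), y) for x, y in documents]
except AttributeError as err:
    errors.append(str(err))

# Combined: unpack the tuple, then loop over the words inside the list.
result = [([s.lower() for s in ls], cat) for ls, cat in documents]

print(errors)
print(result)
```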

Answer 3

Try this:

documents = [([word.lower() for word in corpus.words(fileid)], category)
              for category in corpus.categories()
              for fileid in corpus.fileids(category)]
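
The question doesn't say which corpus object is in use, so the sketch below swaps in a tiny hypothetical stand-in, FakeCorpus, that only mimics the three calls used above (categories(), fileids(category), words(fileid)). It is not a real library class; it is just there to make the comprehension runnable.

```python
class FakeCorpus(object):
    """Hypothetical stand-in for a categorized corpus; not a real library class."""

    def __init__(self, files):
        # files maps fileid -> (category, list of words)
        self._files = files

    def categories(self):
        return sorted({cat for cat, _ in self._files.values()})

    def fileids(self, category):
        return sorted(f for f, (cat, _) in self._files.items() if cat == category)

    def words(self, fileid):
        return self._files[fileid][1]


# Made-up file names and contents, for illustration only.
corpus = FakeCorpus({
    'a.txt': ('cancer', [u'A', u'pilot', u'investigation']),
    'b.txt': ('hd', [u'A', u'Systematic', u'Review']),
})

# Lowercase each word while the list of tuples is being built.
documents = [([word.lower() for word in corpus.words(fileid)], category)
             for category in corpus.categories()
             for fileid in corpus.fileids(category)]

print(documents)
```

Doing the lowercasing at creation time avoids a second pass over the data, at the cost of never having the original-case words in documents.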

Answer 4

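Tuples are generally immutable. However, because the first element of each tuple is a list, and lists are mutable, you can modify its contents without changing the tuple's ownership of that list: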

documents = [(...what you originally posted...) ... etc. ...]

for d in documents:
    # to lowercase all strings in the list
    # trailing '[:]' is important, need to modify list in place using slice
    d[0][:] = [w.lower() for w in d[0]]

    # or to just lower-case the first element of the list (which is what you asked for)
    d[0][0] = d[0][0].lower()

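You can't just call lower() on a string and have it update in place: lower() returns a new string. So to change a string to its lowercase version, you have to assign over it. That isn't possible if the string is itself a member of a tuple, but since the strings you are modifying sit in a list inside the tuple, you can modify the list's contents without modifying the tuple's ownership of the list.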
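
A quick way to convince yourself of this is the sketch below, run on made-up sample data. The names first_list and rebind_failed are only illustrative.

```python
documents = [([u'A', u'pilot'], 'cancer'), ([u'A', u'Systematic'], 'hd')]
first_list = documents[0][0]   # keep a reference so identity can be checked later

for d in documents:
    # Slice assignment replaces the contents of the existing list object.
    d[0][:] = [w.lower() for w in d[0]]

# The tuple still holds the very same list object, now with lowercased words.
print(documents[0][0] is first_list)
print(documents)

# Rebinding the tuple slot itself is what tuples forbid.
rebind_failed = False
try:
    documents[0][0] = []
except TypeError:
    rebind_failed = True
print(rebind_failed)
```

Unlike the comprehension-based answers, this changes documents in place, so any other reference to the same lists sees the lowercased words too.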

Lowercase first element of tuple in list of tuples
