What does the "word for word" syntax mean in Python?

Asked: 2014-01-06 15:21:46

Tags: python gensim

I came across the following script snippet on the gensim tutorial page.

What is the "word for word" syntax in the following Python script?

>>> texts = [[word for word in document.lower().split() if word not in stoplist]
...          for document in documents]

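3 answers:

Answer 0 (score: 6)

This is a list comprehension. The code you posted loops over every element of document.lower().split() and builds a new list containing only the elements that satisfy the if condition. It does this for each document in documents, so the result is a list of lists.

Try it out...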

elems = [1, 2, 3, 4]
squares = [e*e for e in elems]  # square each element
big = [e for e in elems if e > 2]  # keep elements bigger than 2

As your example shows, list comprehensions can be nested.
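To see the nesting in action, here is a minimal runnable sketch of the snippet from the question. The values of documents and stoplist are made-up sample data chosen for illustration, not the corpus used in the gensim tutorial.

```python
# Hypothetical sample data (an assumption, not the gensim tutorial corpus).
documents = ["The cat sat on the mat", "A dog and a cat"]
stoplist = set("a the on and".split())

# Outer comprehension: one inner list per document.
# Inner comprehension: lowercase, split on whitespace, drop stop words.
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

print(texts)  # [['cat', 'sat', 'mat'], ['dog', 'cat']]
```

The inner comprehension is evaluated once for every document, which is why the result has one list of tokens per input string.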

Answer 1 (score: 5)

That is a list comprehension. A simpler example might be:

evens = [num for num in range(100) if num % 2 == 0]

Answer 2 (score: 4)

I'm pretty sure I have seen this line in some NLP applications.

This list comprehension:

[[word for word in document.lower().split() if word not in stoplist] for document in documents]

is the same as:

ending_list = []  # often known as a document stream in NLP
for document in documents:  # loop through the list of documents
  internal_list = []  # often known as a list of tokens
  for word in document.lower().split():
    if word not in stoplist:
      internal_list.append(word)  # this is the inner [word for word ...] part
  ending_list.append(internal_list)
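If you want to convince yourself that the two forms really are equivalent, a quick check along these lines should do it. The function names and the sample data are hypothetical, introduced only for this sketch.

```python
def tokenize_with_comprehension(documents, stoplist):
    """Nested list comprehension, as in the question."""
    return [[word for word in document.lower().split() if word not in stoplist]
            for document in documents]


def tokenize_with_loops(documents, stoplist):
    """Explicit nested loops that build the same list of token lists."""
    ending_list = []
    for document in documents:
        internal_list = []
        for word in document.lower().split():
            if word not in stoplist:
                internal_list.append(word)
        ending_list.append(internal_list)
    return ending_list


# Hypothetical sample data for the comparison.
sample_documents = ["This is a foo bar sentence .", "Another foo sentence"]
sample_stoplist = {"is", "a"}

result_comprehension = tokenize_with_comprehension(sample_documents, sample_stoplist)
result_loops = tokenize_with_loops(sample_documents, sample_stoplist)

print(result_comprehension == result_loops)  # True
```

Both versions do the same work; the comprehension is just more compact.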

Basically, you want a list of documents in which each document is a list of tokens. So you loop through the documents,

for document in documents:

then split each document into tokens,

  list_of_tokens = []
  for word in document.lower().split():

and then build a list of those tokens:

    list_of_tokens.append(word)    

For example:

>>> doc = "This is a foo bar sentence ."
>>> [word for word in doc.lower().split()]
['this', 'is', 'a', 'foo', 'bar', 'sentence', '.']

It is the same as:

>>> doc = "This is a foo bar sentence ."
>>> list_of_tokens = []
>>> for word in doc.lower().split():
...   list_of_tokens.append(word)
... 
>>> list_of_tokens
['this', 'is', 'a', 'foo', 'bar', 'sentence', '.']