I have a list of lists containing words that need to be lemmatized. I get an error saying that spaCy requires a string, not a list.
If I convert to a string, i.e. nlp(str(list_1)), the list delimiters (e.g. "," and "[") get tokenized and included in my output.
How can I lemmatize the items in a list of lists and then get the result back in the same form (i.e. a list of lists)? The words that need lemmatizing can appear anywhere in the list of lists.
I want something like this:
Input:
[["flower", "grows", "garden"], ["boy", "running", "playground"]]
Output:
[["flower", "grow", "garden"], ["boy", "run", "playground"]]
import spacy
nlp = spacy.load("en_core_web_sm")
list_1 = [["flower", "grows", "garden"], ["boy", "running", "playground"]]
for item in nlp(str(list_1)):
    print(item.lemma_)
Answer 0 (score: 1)
I would split this task into the following parts:
You have already done this, but for posterity:
import spacy

nlp = spacy.load("en_core_web_sm")
words = [["flower", "grows", "garden"], ["boy", "running", "playground"]]
We need the length of each inner list so we can iterate over them later (to restore the shape of the output). Using numpy.cumsum, we can build an array that lets us do this in O(n) time.
# remember about importing numpy
lengths = np.cumsum([0] + list(map(len, words)))
print(lengths)
This gives us the following array (in your case):
[0 3 6]
Later we will use ranges built from this array, e.g. tokens [0:3] make up the first inner list and tokens [3:6] make up the second.
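If you would rather avoid the numpy dependency, the same prefix sums can be built with itertools.accumulate from the standard library; a minimal sketch of the slicing idea, using plain strings in place of spaCy tokens:

```python
from itertools import accumulate

words = [["flower", "grows", "garden"], ["boy", "running", "playground"]]

# Prefix sums of the sub-list lengths: [0, 3, 6]
lengths = list(accumulate([0] + [len(sublist) for sublist in words]))

# Each consecutive pair of prefix sums is a slice into the flat item list
flat = [w for sublist in words for w in sublist]
slices = [flat[lengths[i - 1]:lengths[i]] for i in range(1, len(lengths))]
print(lengths)  # [0, 3, 6]
print(slices)   # round-trips back to the original list-of-lists shape
```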
flat_words = [item for sublist in words for item in sublist]
doc = spacy.tokens.Doc(nlp.vocab, words=flat_words)
It is best to pass flat_words as a list so that spacy does not have to perform unnecessary tokenization.
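As an aside, the flattening comprehension above can equivalently be written with itertools.chain.from_iterable, which some find more readable; a small sketch:

```python
from itertools import chain

words = [["flower", "grows", "garden"], ["boy", "running", "playground"]]

# Equivalent to: [item for sublist in words for item in sublist]
flat_words = list(chain.from_iterable(words))
print(flat_words)
# ['flower', 'grows', 'garden', 'boy', 'running', 'playground']
```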
Finally, iterate over the spacy.tokens.Span objects and their tokens, and add those tokens (lemmatized, of course) to a list.
lemmatized = []
# Iterate starting with 1
for index in range(1, len(lengths)):
    # Slice doc as described in the first point, so [0:3] and [3:6]
    span = doc[lengths[index - 1] : lengths[index]]
    # Add lemmatized tokens as list to the outer list
    lemmatized.append([token.lemma_ for token in span])

print(lemmatized)
The output will be just what you wanted:
[['flower', 'grow', 'garden'], ['boy', 'run', 'playground']]
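The flatten/reshape pattern above generalizes to any per-token function. Here is a sketch of a reusable helper, with a toy dictionary lemmatizer standing in for spaCy (the LEMMAS mapping is hypothetical, for illustration only):

```python
def apply_and_reshape(nested, func):
    """Flatten a list of lists, apply func to every item,
    then rebuild the original list-of-lists shape."""
    flat = [item for sublist in nested for item in sublist]
    results = [func(item) for item in flat]
    reshaped, start = [], 0
    for sublist in nested:
        reshaped.append(results[start:start + len(sublist)])
        start += len(sublist)
    return reshaped

# Toy stand-in for spaCy's lemmatizer (hypothetical mapping)
LEMMAS = {"grows": "grow", "running": "run"}
words = [["flower", "grows", "garden"], ["boy", "running", "playground"]]
print(apply_and_reshape(words, lambda w: LEMMAS.get(w, w)))
# [['flower', 'grow', 'garden'], ['boy', 'run', 'playground']]
```

In the real pipeline, `func` would be replaced by a lookup into the lemmas produced by the spaCy Doc shown above.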
To make things easier for you, here is the full code:
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")

words = [["flower", "grows", "garden"], ["boy", "running", "playground"]]

lengths = np.cumsum([0] + list(map(len, words)))
print(lengths)

flat_words = [item for sublist in words for item in sublist]
doc = spacy.tokens.Doc(nlp.vocab, words=flat_words)

lemmatized = []
# Iterate starting with 1
for index in range(1, len(lengths)):
    # Slice doc as described in the first point, so [0:3] and [3:6]
    span = doc[lengths[index - 1] : lengths[index]]
    # Add lemmatized tokens as list to the outer list
    lemmatized.append([token.lemma_ for token in span])

print(lemmatized)
Answer 1 (score: 0)
When working with a list of lists, you can join the items of each inner list and then call nlp() on the result. Next, get the lemma of each token in it. To get back a list of lists, simply assign each lemma at the index where the original item appeared.
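One caveat worth noting (an assumption this approach relies on, not something stated in the answer): the index assignment only lines up if nlp() produces exactly one token per joined word; if spaCy splits a word into multiple tokens, the indices will misalign. The round-trip idea itself can be sketched without spaCy, using whitespace splitting and a toy lemmatizer (the LEMMAS mapping is hypothetical):

```python
# Toy lemmatizer standing in for spaCy (hypothetical mapping)
LEMMAS = {"grows": "grow", "running": "run"}

list_1 = [["flower", "grows", "garden"], ["boy", "running", "playground"]]
for item in list_1:
    # Join, "tokenize" by whitespace, and write lemmas back by index
    for indexer, token in enumerate(' '.join(item).split()):
        item[indexer] = LEMMAS.get(token, token)

print(list_1)
# [['flower', 'grow', 'garden'], ['boy', 'run', 'playground']]
```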
for item in list_1:
    doc = nlp(' '.join(item))
    for indexer, i in enumerate(doc):
        item[indexer] = i.lemma_

print(list_1)
#Output:
[['flower', 'grow', 'garden'], ['boy', 'run', 'playground']]
Answer 2 (score: -1)
I don't think this is the best solution, but you can do it like this:
import spacy

nlp = spacy.load("en_core_web_sm")
list_1 = [["flower", "grows", "garden"], ["boy", "running", "playground"]]

s = ""
for item in nlp(str(list_1)):
    s += item.lemma_

ss = s[2:-2].replace('\'', '').split('],[')
l = []
for sss in ss:
    l.append(sss.split(','))
print(l)
#output
[['flower', 'grow', 'garden'], ['boy', 'run', 'playground']]
Answer 3 (score: -1)
Here: if you only want to change these specific words, you can use
main = [["flower", "grows", "garden"], ["boy", "running", "playground"]]
main[0][1] = "grow"
main[1][1] = "run"
# main = [["flower", "grow", "garden"], ["boy", "run", "playground"]]