Question

I'm trying to write Python code to tokenize strings for some NLP work, and came up with the following:

str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
s= []
a=0
for line in str:
    s.append([])
    s[a].append(line.split())
    a+=1
print(s)

The output is:

[[['I', 'am', 'Batman.']], [['I', 'loved', 'the', 'tea.']], [['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]]

As you can see, the list now has an extra dimension. For example, to get the word 'Batman.' I have to type s[0][0][2] instead of s[0][2], so I changed the code:

str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
s= []
a=0
m = []
for line in str:
    s.append([])
    m=(line.split())
    for word in m:
        s[a].append(word)
    a += 1
print(s)

This gives me the correct output:

[['I', 'am', 'Batman.'], ['I', 'loved', 'the', 'tea.'], ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]

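But I feel this could be done with a single loop, because the dataset I will be importing is very large and O(n) complexity would be much better than O(n^2). So, is there a better way to do this, or a way to do it with one loop?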

Answer 1

You should call split() on each string in the loop.

Example with a list comprehension:

str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']

[s.split() for s in str]

[['I', 'am', 'Batman.'],
 ['I', 'loved', 'the', 'tea.'],
 ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]
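The same idea can be sketched as a small reusable function. The names `tokenize_lines` and `sentences` are made up for illustration; `sentences` is used instead of `str` so the built-in `str` type is not shadowed. Each line is split exactly once, so the work should grow roughly linearly with the total amount of text.

```python
def tokenize_lines(lines):
    """Split each line on whitespace, visiting every line once."""
    return [line.split() for line in lines]


# Hypothetical variable name; avoids shadowing the built-in str type.
sentences = ['I am Batman.', 'I loved the tea.', 'I will never go to that mall again!']

tokens = tokenize_lines(sentences)
print(tokens[0][2])  # Batman.
```

With this shape, a word is reached with two indices, for example tokens[0][2], which is what the question was after.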

Answer 2

Your original code is almost there.

>>> str = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
>>> s=[]
>>> for line in str:
...   s.append(line.split())
...
>>> print(s)
[['I', 'am', 'Batman.'], ['I', 'loved', 'the', 'tea.'], ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]

line.split() already gives you a list, so append that directly inside the loop. Or go straight to a comprehension:

[line.split() for line in str]

When you write s.append([]), you put an empty list at index a, like this:

L = []

If you then append the result of split to it, as in L.append([1]), you end up with a list inside that list: [[1]]
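To make the nesting concrete, here is a small sketch comparing three ways of storing the split result. The variable names (`nested`, `flat`, `direct`) are only for illustration.

```python
line = 'I am Batman.'

# What the first attempt did: append the whole list into an inner list.
nested = []
nested.append([])
nested[0].append(line.split())   # the list of words becomes ONE element
print(nested)                    # [[['I', 'am', 'Batman.']]]

# extend() adds each word individually, so there is no extra level.
flat = []
flat.append([])
flat[0].extend(line.split())
print(flat)                      # [['I', 'am', 'Batman.']]

# Simplest: skip the empty inner list and append the split result itself.
direct = []
direct.append(line.split())
print(direct)                    # [['I', 'am', 'Batman.']]
```

The last form is the one used in the loop above; it needs neither the counter a nor the inner loop.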

Answer 3

Check this out:

>>> list1 = ['I am Batman.','I loved the tea.','I will never go to that mall again!']
>>> [i.split() for i in list1]
# split() with no arguments splits on runs of whitespace and returns a list

[['I', 'am', 'Batman.'], ['I', 'loved', 'the', 'tea.'], ['I', 'will', 'never', 'go', 'to', 'that', 'mall', 'again!']]
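As a rough illustration of that default behaviour, the sketch below compares split() with split(' ') on a made-up string containing repeated spaces, a tab, and a trailing newline.

```python
text = '  I   loved\tthe tea.\n'

# No argument: any run of whitespace is one separator, and
# leading/trailing whitespace is ignored.
default_tokens = text.split()
print(default_tokens)  # ['I', 'loved', 'the', 'tea.']

# Explicit ' ': every single space is a separator, so empty strings
# appear and the tab and newline stay inside the tokens.
space_tokens = text.split(' ')
print(space_tokens)
```

For tokenizing sentences, the no-argument form is usually the more convenient of the two.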

Is there a better way to tokenize some strings?
