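I am using the movie_reviews corpus from nltk, which contains a large number of documents. My task is to measure the predictive performance on these reviews both with and without pre-processing the data. The problem is that the lists documents and documents2 hold the same documents, and I need to shuffle them so that both lists end up in the same order. I cannot shuffle them separately, because every shuffle produces a different order. I therefore need to shuffle both at once, in the same order, because I compare them at the end (and the comparison depends on the order). I am using Python 2.7.
Example (in reality the strings are tokenized, but that is not relevant here):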
documents = [(['plot : two teen couples go to a church party , '], 'neg'),
(['drink and then drive . '], 'pos'),
(['they get into an accident . '], 'neg'),
(['one of the guys dies'], 'neg')]
documents2 = [(['plot two teen couples church party'], 'neg'),
(['drink then drive . '], 'pos'),
(['they get accident . '], 'neg'),
(['one guys dies'], 'neg')]
After shuffling both lists, I need a result like this:
documents = [(['one of the guys dies'], 'neg'),
(['they get into an accident . '], 'neg'),
(['drink and then drive . '], 'pos'),
(['plot : two teen couples go to a church party , '], 'neg')]
documents2 = [(['one guys dies'], 'neg'),
(['they get accident . '], 'neg'),
(['drink then drive . '], 'pos'),
(['plot two teen couples church party'], 'neg')]
I have this code:
import random
import nltk
from nltk.corpus import movie_reviews, stopwords

def cleanDoc(doc):
    # drop stopwords and very short tokens, then stem what remains
    stopset = set(stopwords.words('english'))
    stemmer = nltk.PorterStemmer()
    clean = [token.lower() for token in doc if token.lower() not in stopset and len(token) > 2]
    final = [stemmer.stem(word) for word in clean]
    return final

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
documents2 = [(list(cleanDoc(movie_reviews.words(fileid))), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]

# here: shuffle documents and documents2 in the same order (with random.shuffle, or somehow)
Answer 0 (score: 130)
You can do it like this:
import random
a = ['a', 'b', 'c']
b = [1, 2, 3]
c = list(zip(a, b))
random.shuffle(c)
a, b = zip(*c)
# a and b are now tuples; use list(a) and list(b) if you need lists
print(a)
print(b)
[OUTPUT]
('a', 'c', 'b')
(1, 3, 2)
Of course, this is an example with simple lists, but the adaptation to your case works the same way.
Hope it helps. Good luck.
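Applied to the lists from the question, the same idea might look like the sketch below. The helper name shuffle_together is hypothetical (it does not appear in the answer), the sample data is the shortened example from the question, and the results are converted back to lists because zip(*...) yields tuples.

```python
import random

def shuffle_together(first, second):
    """Shuffle two equal-length lists with one shared random order (hypothetical helper)."""
    if not first:
        return [], []
    paired = list(zip(first, second))   # keep matching items side by side
    random.shuffle(paired)              # a single shuffle decides the order for both
    shuffled_first, shuffled_second = zip(*paired)
    return list(shuffled_first), list(shuffled_second)

documents = [(['plot : two teen couples go to a church party , '], 'neg'),
             (['drink and then drive . '], 'pos'),
             (['they get into an accident . '], 'neg'),
             (['one of the guys dies'], 'neg')]
documents2 = [(['plot two teen couples church party'], 'neg'),
              (['drink then drive . '], 'pos'),
              (['they get accident . '], 'neg'),
              (['one guys dies'], 'neg')]

documents, documents2 = shuffle_together(documents, documents2)
```

Each raw review still sits at the same index as its pre-processed counterpart, so the two lists can be compared position by position afterwards.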
Answer 1 (score: 11)
I have a simple way to do this:
import numpy as np
a = np.array([0,1,2,3,4])
b = np.array([5,6,7,8,9])
indices = np.arange(a.shape[0])
np.random.shuffle(indices)
a = a[indices]
b = b[indices]
# a, array([3, 4, 1, 2, 0])
# b, array([8, 9, 6, 7, 5])
Answer 2 (score: 3)
Shuffle any number of lists at the same time.
from random import shuffle
def shuffle_list(*ls):
    l = list(zip(*ls))
    shuffle(l)
    return zip(*l)
a = [0,1,2,3,4]
b = [5,6,7,8,9]
a1,b1 = shuffle_list(a,b)
print(a1,b1)
a = [0,1,2,3,4]
b = [5,6,7,8,9]
c = [10,11,12,13,14]
a1,b1,c1 = shuffle_list(a,b,c)
print(a1,b1,c1)
Output:
$ (0, 2, 4, 3, 1) (5, 7, 9, 8, 6)
$ (4, 3, 0, 2, 1) (9, 8, 5, 7, 6) (14, 13, 10, 12, 11)
Note: the objects returned by shuffle_list() are tuples.
P.S. shuffle_list() can also be applied to numpy.array():
import numpy as np
a = np.array([1,2,3])
b = np.array([4,5,6])
a1,b1 = shuffle_list(a,b)
print(a1,b1)
Output:
$ (3, 1, 2) (6, 4, 5)
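One thing to watch with the zip-based approach is that zip() silently truncates to the shortest input, so lists of unequal length would lose items without any error. A minimal sketch of a stricter variant follows; the name shuffle_lists and the length check are assumptions added for illustration, not part of the answer. It also returns lists instead of tuples.

```python
from random import shuffle

def shuffle_lists(*lists):
    """Shuffle several equal-length lists in the same order and return new lists.

    Hypothetical variant of shuffle_list() from the answer above.
    """
    if not lists:
        return []
    length = len(lists[0])
    if any(len(lst) != length for lst in lists):
        # zip() would otherwise silently drop the extra items of longer lists
        raise ValueError("all lists must have the same length")
    rows = list(zip(*lists))
    shuffle(rows)
    if not rows:
        return [[] for _ in lists]
    return [list(column) for column in zip(*rows)]

a = [0, 1, 2, 3, 4]
b = [5, 6, 7, 8, 9]
a1, b1 = shuffle_lists(a, b)
```

The inputs a and b are left untouched; only the returned a1 and b1 are in shuffled order.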
Answer 3 (score: 0)
import numpy as np
from sklearn.utils import shuffle
a = ['a', 'b', 'c','d','e']
b = [1, 2, 3, 4, 5]
a_shuffled, b_shuffled = shuffle(np.array(a), np.array(b))
print(a_shuffled, b_shuffled)
#['e' 'c' 'b' 'd' 'a'] [5 3 2 4 1]
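Answer 4 (score: 0)
A quick and easy way is to combine random.seed() with random.shuffle(). It lets you reproduce the same random order as many times as you like. It looks like this: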
import random
a = [1, 2, 3, 4, 5]
b = [6, 7, 8, 9, 10]
seed = random.random()
random.seed(seed)  # re-seeding with the same value before each shuffle repeats the order
random.shuffle(a)
random.seed(seed)
random.shuffle(b)
print(a)
print(b)
>>[3, 1, 4, 2, 5]
>>[8, 6, 9, 7, 10]
This also works when memory constraints prevent you from working with both lists at the same time.
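A variation on the same idea, sketched here as an assumption rather than as part of the answer, is to use separate random.Random instances instead of re-seeding the module-level generator. This leaves the global random state alone. The seed value below is arbitrary, and the approach only yields matching orders when the lists have the same length.

```python
import random

a = [1, 2, 3, 4, 5]
b = [6, 7, 8, 9, 10]

seed = 12345  # arbitrary value chosen for illustration

# Two generators created with the same seed produce the same sequence of
# random numbers, so they apply the same permutation to equal-length lists.
random.Random(seed).shuffle(a)
random.Random(seed).shuffle(b)

print(a)
print(b)
```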
Answer 5 (score: 0)
You can store the order of the values in a variable and then reorder both arrays with it:
import random
array1 = [1, 2, 3, 4, 5]
array2 = ["one", "two", "three", "four", "five"]
order = list(range(len(array1)))  # list() so that shuffle also works where range is lazy
random.shuffle(order)
newarray1 = []
newarray2 = []
for x in range(len(order)):
    newarray1.append(array1[order[x]])
    newarray2.append(array2[order[x]])
print(newarray1)
print(newarray2)
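The same index-based approach can be written more compactly with list comprehensions. This is only a sketch of an equivalent form; the variable names follow the answer above.

```python
import random

array1 = [1, 2, 3, 4, 5]
array2 = ["one", "two", "three", "four", "five"]

# Shuffle the index positions once, then read both lists through them.
order = list(range(len(array1)))
random.shuffle(order)

newarray1 = [array1[i] for i in order]
newarray2 = [array2[i] for i in order]
```

Because order is kept in a variable, it can be reused later to put any further list of the same length into the same shuffled order.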
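Answer 6 (score: -4)
You can use the second argument of the shuffle function to fix the order of the shuffling.
Specifically, you can pass a zero-argument function that returns a value in [0, 1) as the second argument of shuffle. The return value of this function fixes the order of the shuffling. (By default, that is, if you do not pass any function as the second argument, it uses random.random(). You can see this at line 277 here.)
This example illustrates what I described: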
import random
a = ['a', 'b', 'c', 'd', 'e']
b = [1, 2, 3, 4, 5]
r = random.random() # randomly generating a real in [0,1)
# note: the second argument of random.shuffle was removed in Python 3.11
random.shuffle(a, lambda : r) # lambda : r is a zero-argument function which returns r
random.shuffle(b, lambda : r) # using the same function as in the previous line so that the shuffling order is the same
print a
print b
Output:
['e', 'c', 'd', 'a', 'b']
[5, 3, 4, 1, 2]