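I'm trying to populate a list in Python 3 with 3 random items that are read from a file using a regex, but I keep getting duplicates in the list.
Here is an example.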
import re
import random as rn
data = '/root/Desktop/Selenium[FILTERED].log'
with open(data, 'r') as inFile:
    index = inFile.read()
    URLS = re.findall(r'https://www\.\w{1,10}\.com/view\?i=\w{1,20}', index)
    list_0 = []
    for i in range(3):
        list_0.append(URLS[rn.randint(1, 30)])
    inFile.close()
for i in range(len(list_0)):
    print(list_0[i])
What is the cleanest way to prevent duplicate items from being appended to the list?
(Edit) Here is the code that I think does the job.
def random_sample(data):
    # raw string so the regex backslashes reach re unchanged
    r_e = [r'https://www\.\w{1,10}\.com/view\?i=\w{1,20}', '..']
    with open(data, 'r') as inFile:  # the with block closes the file
        urls = re.findall(r_e[0], inFile.read())
    x = list(set(urls))  # drop duplicate matches
    return x
data = '/root/Desktop/[TEMP].log'
sample = random_sample(data)
for i in range(3):
    print(sample[i])
A set is an unordered collection with no duplicate entries.
Answer 0 (score: 3)
Use the built-in random.sample.
random.sample(population, k)
Return a k length list of unique elements chosen from the population sequence or set.
Used for random sampling without replacement.
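After seeing your edit, it looks like you've made this much harder than it needs to be. I've hard-coded a list of URLS below, but where the list comes from doesn't matter. Selecting a (guaranteed unique) subset is essentially a one-liner with random.sample: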
import random
# the following two lines are easily replaced
URLS = ['url1', 'url2', 'url3', 'url4', 'url5', 'url6', 'url7', 'url8']
SUBSET_SIZE = 3
# the following one-liner yields the randomized subset as a list
urlList = [URLS[i] for i in random.sample(range(len(URLS)), SUBSET_SIZE)]
print(urlList) # produces, e.g., => ['url7', 'url3', 'url4']
Note that by using len(URLS) and SUBSET_SIZE, the one-liner that does the work isn't hard-wired to the size of the collection, nor to the desired subset size.
If the original input list contains duplicate values, the following slight modification will fix that for you:
URLS = list(set(URLS)) # this converts to a set for uniqueness, then back for indexing
urlList = [URLS[i] for i in random.sample(range(len(URLS)), SUBSET_SIZE)]
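Or even simpler, sample the de-duplicated list directly instead of sampling indices. Keep the conversion back to a list: passing a set straight to random.sample was deprecated in Python 3.9 and raises TypeError from Python 3.11 on.
URLS = list(set(URLS))
urlList = random.sample(URLS, SUBSET_SIZE)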
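Tying this back to the question, here is a minimal sketch of the whole flow: extract the matches, drop duplicates, then sample. The regex is the one from the question; the function name pick_unique_urls and the sample log text are made up for illustration, and the text is passed in as a string so the sketch does not depend on the original log file.

```python
import random
import re

# Pattern copied from the question; everything else here is illustrative.
URL_PATTERN = re.compile(r'https://www\.\w{1,10}\.com/view\?i=\w{1,20}')

def pick_unique_urls(text, k=3):
    """Return up to k distinct URLs found in text, chosen at random."""
    # dict.fromkeys drops duplicates while keeping first-seen order
    unique_urls = list(dict.fromkeys(URL_PATTERN.findall(text)))
    # random.sample raises ValueError if k exceeds the population size,
    # so cap k at the number of distinct matches
    return random.sample(unique_urls, min(k, len(unique_urls)))

# Hypothetical log contents standing in for the real file
log_text = """
GET https://www.example.com/view?i=abc123
GET https://www.example.com/view?i=abc123
GET https://www.sample.com/view?i=def456
GET https://www.demo.com/view?i=ghi789
GET https://www.test.com/view?i=jkl012
"""
picked = pick_unique_urls(log_text)
print(picked)
```

Capping k is a design choice: it returns fewer items when the file has fewer than k distinct matches instead of raising an error.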
Answer 1 (score: 1)
seen = set(list_0)
randValue = URLS[rn.randint(1, 30)]
# [...]
if randValue not in seen:
    seen.add(randValue)
    list_0.append(randValue)
Now you only need to check whether the size of list_0 equals 3 to stop the loop.
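Put together, the loop might look like the sketch below. The function name pick_unique and the placeholder data are assumptions for illustration. It also swaps rn.randint(1, 30) for random.choice, since randint(1, 30) never picks index 0 and fails with IndexError when there are fewer than 31 matches.

```python
import random

def pick_unique(items, k=3):
    """Keep drawing random items until k distinct values are collected."""
    # Guard against an endless loop when there are too few distinct values
    if k > len(set(items)):
        raise ValueError("not enough distinct items to choose from")
    seen = set()
    chosen = []
    while len(chosen) < k:
        candidate = random.choice(items)
        if candidate not in seen:
            seen.add(candidate)
            chosen.append(candidate)
    return chosen

urls = ['url1', 'url2', 'url2', 'url3', 'url4']  # placeholder data
result = pick_unique(urls)
print(result)
```

This retry loop can redraw the same value many times when k is close to the number of distinct items; random.sample avoids that.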