Question

I want to retrieve a random word from a file using Python, but I don't believe my approach below is the best or most efficient one. Please assist.

import fileinput
import _random
file = [line for line in fileinput.input("/etc/dictionaries-common/words")]
rand = _random.Random()
print file[int(rand.random() * len(file))],

Answer 1

The random module defines choice(), which does what you want:

import random

words = [line.strip() for line in open('/etc/dictionaries-common/words')]
print(random.choice(words))

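Also note that this assumes each word is on a line by itself in the file. If the file is very large, or if you perform this operation frequently, you may find that constantly re-reading the file hurts your application's performance.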
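
One way to avoid re-reading the file on every call is to load the word list once and reuse it. This is a minimal sketch, assuming the whole list fits in memory; load_words and random_word are hypothetical helper names, not part of the random module:

```python
import random
from functools import lru_cache


@lru_cache(maxsize=None)
def load_words(path):
    """Read the file once per path and return its non-empty lines as a tuple."""
    with open(path) as f:
        return tuple(line.strip() for line in f if line.strip())


def random_word(path="/etc/dictionaries-common/words"):
    """Return a random word; the file is only read on the first call for a path."""
    return random.choice(load_words(path))
```

Later calls with the same path are served from the cache, so only the first call pays for the file read.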

Answer 2

Another solution is to use linecache.getline():

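import linecache
import random
# Count the lines first; linecache numbers lines from 1, so 0 is not a valid line.
total_num_lines = sum(1 for _ in open('/etc/dictionaries-common/words'))
line_number = random.randint(1, total_num_lines)
linecache.getline('/etc/dictionaries-common/words', line_number)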

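From the documentation:

The linecache module allows one to get any line from any file, while attempting to optimize internally, using a cache, the common case where many lines are read from a single file.

Edit: You can compute the total number of lines once and store it, since the dictionary file is unlikely to change.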
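
Counting once and storing the result could look like the sketch below. The _line_counts dictionary and the count_lines and random_line helpers are hypothetical names introduced for illustration, and the cache assumes the file does not change while the program runs:

```python
import linecache
import random

_line_counts = {}  # hypothetical cache: filename -> number of lines


def count_lines(filename):
    """Count the lines of `filename` once and remember the result."""
    if filename not in _line_counts:
        with open(filename) as f:
            _line_counts[filename] = sum(1 for _ in f)
    return _line_counts[filename]


def random_line(filename):
    """Return a random line (without its newline) using linecache."""
    total = count_lines(filename)
    if total == 0:
        return None
    # linecache.getline() numbers lines from 1, and randint() is inclusive.
    return linecache.getline(filename, random.randint(1, total)).rstrip("\n")
```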

Answer 3

>>> import random
>>> random.choice(list(open('/etc/dictionaries-common/words')))
'jaundiced\n'

This is efficient in terms of human time.

By the way, your implementation coincides with the one in the stdlib's random.py:

 def choice(self, seq):
    """Choose a random element from a non-empty sequence."""
    return seq[int(self.random() * len(seq))]

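Measuring time performance

I was wondering what the relative performance of the proposed solutions is. The linecache-based one is the obvious favorite. How much slower is the random.choice one-liner than the honest algorithm implemented in select_random_line()?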

# nadia_known_num_lines   9.6e-06 seconds 1.00
# nadia                   0.056 seconds 5843.51
# jfs                     0.062 seconds 1.10
# dcrosta_no_strip        0.091 seconds 1.48
# dcrosta                 0.13 seconds 1.41
# mark_ransom_no_strip    0.66 seconds 5.10
# mark_ransom_choose_from 0.67 seconds 1.02
# mark_ransom             0.69 seconds 1.04

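(Each function is called 10 times, so the timings reflect cached performance. The last column is the ratio to the time in the preceding row.)

These results show that the simple solution (dcrosta) is faster in this case than the more deliberate one (mark_ransom).

Code used for the comparison (as a gist):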

import linecache
import random
from timeit import default_timer


WORDS_FILENAME = "/etc/dictionaries-common/words"


def measure(func):
    measure.func_to_measure.append(func)
    return func
measure.func_to_measure = []


@measure
def dcrosta():
    words = [line.strip() for line in open(WORDS_FILENAME)]
    return random.choice(words)


@measure
def dcrosta_no_strip():
    words = [line for line in open(WORDS_FILENAME)]
    return random.choice(words)


def select_random_line(filename):
    selection = None
    count = 0
    for line in open(filename, "r"):
        if random.randint(0, count) == 0:
            selection = line.strip()
        count = count + 1  # count every line, not only the selected ones
    return selection


@measure
def mark_ransom():
    return select_random_line(WORDS_FILENAME)


def select_random_line_no_strip(filename):
    selection = None
    count = 0
    for line in open(filename, "r"):
        if random.randint(0, count) == 0:
            selection = line
        count = count + 1  # count every line, not only the selected ones
    return selection


@measure
def mark_ransom_no_strip():
    return select_random_line_no_strip(WORDS_FILENAME)


def choose_from(iterable):
    """Choose a random element from a finite `iterable`.

    If `iterable` is a sequence then use `random.choice()` for efficiency.

    Return tuple (random element, total number of elements)
    """
    selection, i = None, None
    for i, item in enumerate(iterable):
        if random.randint(0, i) == 0:
            selection = item

    return selection, (i+1 if i is not None else 0)


@measure
def mark_ransom_choose_from():
    return choose_from(open(WORDS_FILENAME))


@measure
def nadia():
    global total_num_lines
    total_num_lines = sum(1 for _ in open(WORDS_FILENAME))

    line_number = random.randint(1, total_num_lines)  # linecache lines are 1-based
    return linecache.getline(WORDS_FILENAME, line_number)


@measure
def nadia_known_num_lines():
    line_number = random.randint(1, total_num_lines)  # linecache lines are 1-based
    return linecache.getline(WORDS_FILENAME, line_number)


@measure
def jfs():
    return random.choice(list(open(WORDS_FILENAME)))


def timef(func, number=1000, timer=default_timer):
    """Return number of seconds it takes to execute `func()`."""
    start = timer()
    for _ in range(number):
        func()
    return (timer() - start) / number


def main():
    # measure time
    times = dict((f.__name__, timef(f, number=10))
                 for f in measure.func_to_measure)

    # print from fastest to slowest
    maxname_len = max(map(len, times))
    last = None
    for name in sorted(times, key=times.__getitem__):
        print("%s %4.2g seconds %.2f" % (name.ljust(maxname_len), times[name],
                                         last and times[name] / last or 1))
        last = times[name]


if __name__ == "__main__":
    main()

Answer 4

Pythonizing my answer from "What's the best way to return a random line in a text file using C?":

import random

def select_random_line(filename):
    selection = None
    count = 0
    for line in open(filename, "r"):
        if random.randint(0, count) == 0:
            selection = line.strip()
        count = count + 1
    return selection

print(select_random_line("/etc/dictionaries-common/words"))

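Edit: The original version of my answer used readlines, which didn't do what I thought it did and was totally unnecessary. This version iterates through the file instead of reading it all into memory, and does so in a single pass, which should make it much more efficient than any answer I've seen so far.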
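
Why a single pass still gives a uniform pick: line number k (counting from 1) replaces the current selection with probability 1/k, and it then survives each later line j with probability (j-1)/j, so with n lines it ends up as the final choice with probability (1/k) * (k/(k+1)) * ... * ((n-1)/n) = 1/n. The sketch below checks this empirically on an in-memory list rather than a file; pick_one, frequencies, the trial count, and the seed are illustrative assumptions:

```python
import random
from collections import Counter


def pick_one(iterable, rng):
    """Single-pass selection: item number k (counting from 1) wins with probability 1/k."""
    selection = None
    for count, item in enumerate(iterable):
        if rng.randint(0, count) == 0:
            selection = item
    return selection


def frequencies(items, trials=60000, seed=12345):
    """Count how often each item is picked over `trials` independent runs."""
    rng = random.Random(seed)
    return Counter(pick_one(items, rng) for _ in range(trials))


if __name__ == "__main__":
    for word, hits in sorted(frequencies(list("abcdef")).items()):
        print(word, hits)  # each count should be close to 60000 / 6
```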

Generalized version

import random

def choose_from(iterable):
    """Choose a random element from a finite `iterable`.

    If `iterable` is a sequence then use `random.choice()` for efficiency.

    Return tuple (random element, total number of elements)
    """
    selection, i = None, None
    for i, item in enumerate(iterable):
        if random.randint(0, i) == 0:
            selection = item

    return selection, (i+1 if i is not None else 0)
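
The docstring's advice about sequences can also be folded into the function itself. This is a sketch, assuming sequences are detected with collections.abc.Sequence; choose_from_any is a hypothetical name, not part of the answer above:

```python
import random
from collections.abc import Sequence


def choose_from_any(iterable):
    """Return (random element, number of elements) for any finite iterable."""
    if isinstance(iterable, Sequence):
        # The length is known up front, so one random index is enough.
        if not iterable:
            return None, 0
        return random.choice(iterable), len(iterable)
    # Otherwise fall back to the one-pass selection used by choose_from().
    selection, n = None, 0
    for n, item in enumerate(iterable, start=1):
        if random.randint(0, n - 1) == 0:
            selection = item
    return selection, n
```

Lists, tuples, and strings take the fast path; files and generators are consumed once.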

Examples

import urllib.request  # Python 3; the original used urllib2 on Python 2

print(choose_from(open("/etc/dictionaries-common/words")))
print(choose_from(dict(a=1, b=2)))
print(choose_from(i for i in range(10) if i % 3 == 0))
print(choose_from(i for i in range(10) if i % 11 == 0 and i))  # empty
print(choose_from([0]))  # one element
chunk, n = choose_from(urllib.request.urlopen("http://google.com"))
print((chunk[:20], n))  # on Python 3 the chunk is a bytes object

Output

('yeps\n', 98569)
('a', 2)
(6, 4)
(None, 0)
(0, 1)
('window._gjp && _gjp(', 10)

Answer 5

This article may help:

http://www.bryceboe.com/2009/03/23/random-lines-from-a-file/

Answer 6

You can do this without using fileinput:

import random
data = open("/etc/dictionaries-common/words").readlines()
print(random.choice(data))

I also used data instead of file, because file is a predefined type in Python 2.

Answer 7

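I don't have code, but as far as the algorithm goes:

Find the file size
Use seek() to jump to a random position
Find the next (or previous) whitespace character
Return the word that starts after that whitespace character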
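
The steps above can be sketched as follows, assuming one word per line so that the whitespace to look for is a newline. The function name random_word_by_seek, the wrap-around at the end of the file, and the UTF-8 decoding are assumptions of this sketch. Note that the pick is not uniform: a word that follows a long word is more likely to be returned, because more byte offsets lead to it.

```python
import os
import random


def random_word_by_seek(filename):
    """Jump to a random byte offset and return the word on the following line."""
    size = os.path.getsize(filename)
    if size == 0:
        return None
    with open(filename, "rb") as f:
        f.seek(random.randrange(size))
        f.readline()         # discard the (probably partial) current line
        line = f.readline()  # the next complete line
        if not line:         # ran off the end: wrap around to the first line
            f.seek(0)
            line = f.readline()
    return line.decode("utf-8", errors="replace").strip()
```

Only a couple of lines are read regardless of the file size, which is the appeal of this approach when the bias is acceptable.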

Answer 8

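Efficiency and verbosity are not the same thing in this case. It is tempting to go for the most beautiful, pythonic approach that does everything in a line or two, but for file I/O, stick with the classic fopen-style, low-level interaction, even if it does take up a few more lines of code.

I could copy and paste some code and claim it as my own (others can, if they want), but have a look at this: http://mail.python.org/pipermail/tutor/2007-July/055635.html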

Answer 9

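There are a few different ways to optimize this problem. You can optimize for speed or for space.

If you want a fast but memory-hungry solution, read the whole file with file.readlines() and then use random.choice().

If you want a memory-efficient solution, first count the lines in the file by calling somefile.readline() repeatedly until it returns "", then generate a random number smaller than the number of lines (say, n), seek back to the beginning of the file, and finally call somefile.readline() n times. The next call to somefile.readline() will return the desired random line. This approach wastes no memory on storing "unnecessary" lines. Of course, if you plan to fetch many random lines from the file, this will be very inefficient, and you are better off keeping the whole file in memory, as in the first approach.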
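
The memory-efficient variant described above could be sketched like this; random_line_two_pass is a hypothetical name, and stripping the trailing newline is an addition of this sketch:

```python
import random


def random_line_two_pass(filename):
    """Pick a random line while holding only one line in memory at a time."""
    with open(filename) as f:
        # First pass: count lines by calling readline() until it returns "".
        total = 0
        while f.readline() != "":
            total += 1
        if total == 0:
            return None
        n = random.randrange(total)  # random number smaller than the line count
        f.seek(0)                    # back to the beginning of the file
        for _ in range(n):           # second pass: skip n lines
            f.readline()
        return f.readline().rstrip("\n")
```

Each call reads the file up to twice, which is why this only pays off when random lines are needed rarely.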
