Question

I have a text file containing several thousand words, for example:

laban
labrador
labradors
lacey
lachesis
lacy
ladoga
ladonna
lafayette
lafitte
lagos
lagrange
lagrangian
lahore
laius
lajos
lakeisha
lakewood

I want to iterate over every word and combine it with every word in the list, so that I get:

labanlaban
labanlabrador
labanlabradors
labanlacey
labanlachesis
etc...

In bash I can do the following, but it is extremely slow:

#!/bin/bash
( cat words.txt | while read word1; do
  cat words.txt | while read word2; do
    echo "$word1$word2" >> doublewords.txt
 done; done )

Is there a faster, more efficient way to do this? Also, how would I iterate over two different text files in this manner?

Answer 1

If you can fit the list into memory:

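import itertools

words_filename = 'words.txt'  # path to the word list, one word per line

# Read the whole list into memory once.
with open(words_filename, 'r') as words_file:
    words = [word.strip() for word in words_file]

# Emit every ordered pair of words, concatenated.
for pair in itertools.product(words, repeat=2):
    print(''.join(pair))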

(You could also write a double for loop, but I was feeling itertools tonight.)
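For comparison, the double loop could look like the minimal sketch below. The helper name `double_words` is made up for illustration; it yields results in the same order as `itertools.product(words, repeat=2)`.

```python
def double_words(words):
    """Yield every ordered pair of words, concatenated."""
    for first in words:        # outer loop: left half of the result
        for second in words:   # inner loop: right half of the result
            yield first + second


if __name__ == "__main__":
    for combined in double_words(["laban", "labrador", "lacey"]):
        print(combined)
```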

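I suspect the win here is that we avoid re-reading the file over and over; the inner loop in the bash example cats the file once for every iteration of the outer loop. Also, I think Python tends to execute faster than bash, IIRC.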
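The same read-once idea should extend to two different files, which the question also asks about. This is only a sketch, assuming both lists fit in memory; the function names are made up, and the file names in the usage note are placeholders.

```python
import itertools


def read_words(path):
    """Read one word per line, dropping surrounding whitespace and blank lines."""
    with open(path, "r") as handle:
        return [line.strip() for line in handle if line.strip()]


def cross_concat(first_words, second_words):
    """Yield each word of the first list followed by each word of the second."""
    for left, right in itertools.product(first_words, second_words):
        yield left + right


def write_cross(first_path, second_path, out_path):
    """Read each input file exactly once and write all combinations to out_path."""
    first_words = read_words(first_path)
    second_words = read_words(second_path)
    with open(out_path, "w") as out:
        for combined in cross_concat(first_words, second_words):
            out.write(combined + "\n")
```

A call such as `write_cross("words1.txt", "words2.txt", "doublewords.txt")` would then write one combination per line.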

You could certainly pull this off in bash (read the file into an array, write a double for loop); it's just more painful.

Answer 2

It looks like sed is pretty efficient at appending text to each line. I suggest:

#!/bin/bash

for word in $(< words.txt)
do 
    sed "s/$/$word/" words.txt;
done > doublewords.txt

(Don't confuse the $ that means end of line for sed with the $word that is a bash variable.)
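To make the loop above easier to follow, here is roughly what it does, written as a Python sketch. The function names are hypothetical and nothing here calls sed; it only mimics one pass of `s/$/word/` per word.

```python
def append_to_each_line(lines, suffix):
    """Mimic `sed "s/$/suffix/"`: add suffix at the end of every line."""
    return [line + suffix for line in lines]


def sed_style_pairs(words):
    """Mimic the shell loop: one full pass over the list per word."""
    result = []
    for word in words:  # corresponds to: for word in $(< words.txt)
        result.extend(append_to_each_line(words, word))
    return result
```

Note the ordering: the left half changes fastest here, so for a single file the output holds the same strings as the loop in the question, just in a different order.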

For a 2000-line file, this runs in about 20 seconds on my computer, compared to about 2 minutes for your solution.

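Remark: it also looks like you do slightly better by redirecting the standard output of the whole loop once rather than forcing a write at every iteration.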
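The difference between the two redirection styles can be sketched in Python as follows. Both helpers are made up for illustration and produce the same file; the first reopens the output for every line (like `>>` inside the loop), the second opens it once (like `done > file`). Unlike `>>`, the first helper empties the file up front so the two are comparable.

```python
def write_reopening(lines, out_path):
    """Like `echo ... >> file` inside the loop: reopen the file for every line."""
    open(out_path, "w").close()           # start from an empty file
    for line in lines:
        with open(out_path, "a") as out:  # one open/close per line
            out.write(line + "\n")


def write_once(lines, out_path):
    """Like `done > file`: open the output once and keep writing to it."""
    with open(out_path, "w") as out:
        for line in lines:
            out.write(line + "\n")
```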

(Warning, this is a bit off topic and personal opinion!)

If you are really going for speed, you should consider using a compiled language such as C++. For example:

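#include <fstream>
#include <string>
#include <vector>
using namespace std;

int main() {
    // Read every line (one word per line) into memory.
    vector<string> words;
    ifstream infile("words.dat");
    for (string line; getline(infile, line); )
        words.push_back(line);
    infile.close();

    // Write every ordered pair of words, concatenated.
    ofstream outfile("doublewords.dat");
    for (const auto& word1 : words)
        for (const auto& word2 : words)
            outfile << word1 << word2 << "\n";
    outfile.close();
}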

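You need to understand that both bash and Python are bad at double loops: that's why you use tricks (@Thanatos) or predefined commands (sed). Recently I ran into a double loop problem (given a set of 10000 points in 3D, compute all the distances between pairs), and I solved it successfully using C++ instead of Python or Matlab.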
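The code for that distance problem isn't shown, so the following is only a small Python sketch of the shape of the task, not the solution referred to above. The function name is made up, and treating "pairs" as unordered pairs is an assumption.

```python
import itertools
import math


def pairwise_distances(points):
    """Return the Euclidean distance for every unordered pair of points."""
    return [math.dist(p, q) for p, q in itertools.combinations(points, 2)]
```

For n points this visits n(n-1)/2 pairs, which is where the cost of the double loop comes from.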

Answer 3

If you have GHC available, Cartesian products are a cinch!

Q1: One file

-- words.hs
import Control.Applicative
main = interact f
    where f = unlines . g . words
          g x = map (++) x <*> x

This splits the file into a list of words, and then appends each word to every other word using the applicative <*>.
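For readers who don't know Haskell, the list applicative in `g` can be imitated in Python. This is only a sketch; `ap` and `prepend` are names invented here, standing in for `<*>` and the partially applied `(w ++)`.

```python
def ap(functions, values):
    """List applicative `fs <*> xs`: apply every function to every value."""
    return [f(v) for f in functions for v in values]


def prepend(prefix):
    """Return a function that puts prefix in front of its argument, like `(prefix ++)`."""
    return lambda suffix: prefix + suffix


def g(x):
    """Python counterpart of `g x = map (++) x <*> x`."""
    return ap([prepend(word) for word in x], x)
```

Here `map (++) x` becomes a list of one-argument functions, and `ap` applies each of them to every word in turn.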

Compile with GHC,

ghc words.hs

then run using IO redirection:

./words <words.txt >out

Q2: Two files

-- words2.hs
import Control.Applicative
import Control.Monad
import System.Environment
main = do
    ws <- mapM ((liftM words) . readFile) =<< getArgs
    putStrLn $ unlines $ g ws
    where g (x:y:_) = map (++) x <*> y

Compile as before and run with the two files as arguments:

./words2 words1.txt words2.txt > out

Bleh, compiling?

Want the convenience of a shell script and the performance of a compiled executable? Why not do both?

Simply wrap the Haskell program you want in a wrapper script that compiles it in /var/tmp and then replaces itself with the resulting executable:

#!/bin/bash
# wrapper.sh

cd /var/tmp
cat > c.hs <<'CODE'
# replace this comment with haskell code
CODE
ghc c.hs >/dev/null
cd - >/dev/null
exec /var/tmp/c "$@"

This handles arguments and IO redirection as if the wrapper didn't exist.

Results

Timing against some of the other answers, using two 2000-word files:

$ time ./words2 words1.txt words2.txt >out
3.75s user 0.20s system 98% cpu 4.026 total

$ time ./wrapper.sh words1.txt words2.txt > words2
4.12s user 0.26s system 97% cpu 4.485 total

$ time ./thanatos.py > out
4.93s user 0.11s system 98% cpu 5.124 total

$ time ./styko.sh
7.91s user 0.96s system 74% cpu 11.883 total

$ time ./user3552978.sh
57.16s user 29.17s system 93% cpu 1:31.97 total

Answer 4

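You can do this in a pythonic way by creating a tempfile, writing data to it while reading the existing file, and finally removing the original file and moving the new file to the original file's path.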

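from os import fdopen, remove
from shutil import move
from tempfile import mkstemp


def data_redundent(source_file_path):
    # mkstemp returns an open file descriptor and the path of the new temp file.
    fd, target_file_path = mkstemp()
    # Wrap the descriptor so it is closed when the block exits.
    with fdopen(fd, 'w') as target_file:
        with open(source_file_path, 'r') as source_file:
            for line in source_file:
                # Write each word followed by itself, e.g. "labanlaban".
                target_file.write(line.replace('\n', '') + line)
    # Replace the original file with the rewritten one.
    remove(source_file_path)
    move(target_file_path, source_file_path)

data_redundent('test_data.txt')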

Answer 5

I'm not sure how efficient this is, but a very simple way, using the Unix tool designed specifically for this sort of thing, would be

paste -d'\0' <file> <file>

The -d option specifies the delimiter to use between the joined parts, and \0 stands for the empty string (i.e. no delimiter at all).
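As far as I understand it, paste joins lines by position (line 1 with line 1, line 2 with line 2, and so on), so it produces one output line per input line rather than every combination. A rough Python sketch of that behaviour, with a made-up function name:

```python
import itertools


def paste_no_delimiter(first_lines, second_lines):
    """Join line i of the first input with line i of the second, nothing in between.

    A missing line in the shorter input is treated as empty, which is how
    paste handles inputs of different lengths.
    """
    pairs = itertools.zip_longest(first_lines, second_lines, fillvalue="")
    return [left + right for left, right in pairs]
```

With the same word list passed twice this gives `labanlaban`, `labradorlabrador`, and so on; strings such as `labanlabrador` would still need one of the double-loop approaches.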

Python or Bash - Iterating over all words in a text file

5 answers: