How can I automatically scan local documents to add words to a custom dictionary?

Time: 2015-07-28 19:04:09

Tags: macos unix dictionary scripting

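I'd like my dictionary to know more of the words I use, and I don't want to add every possible word by hand as I happen to type it (I'm a biologist/bioinformatician, so there's lots of jargon and specific software and species names). Instead, I'd like to: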

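  1. Take a directory of my existing documents. These are PDFs of scientific articles or Word/LaTeX files; I guess they could "easily" be converted to plain text.
  2. Pull out all the words that aren't in the "normal" dictionary.
  3. Add these to my local custom dictionary (on my Mac that's ~/Library/Spelling/LocalDictionary), though adding them to the LibreOffice/Word/ispell custom dictionaries would also make sense.

1 and 3 are easy. How can I do 2? Thanks!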

2 answers:

Answer 0: (score: 1)

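As I understand it, you want to remove duplicates (words that already exist in the system dictionary). You may want to ask first whether that's really necessary. I'd guess duplicates wouldn't cause any problems and wouldn't noticeably slow down spell checking, so I see no real reason for step 2.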

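I think you'll have more trouble with step 1. Extracting plain text from a PDF may sound easy, but it certainly isn't. You'll end up with lots of unknown symbols. You'll need to rejoin words that were split at the end of a line, and you'll probably want to exclude equations, links, numbers, etc. before adding all of this to the dictionary.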
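The cleanup described above could look roughly like the following. This is only a sketch: the function name `clean_extracted_text` and the filtering rules are assumptions made for illustration, and real PDF output will likely need more cases.

```python
import re

# a "word" here is letters only, optionally joined by inner hyphens or apostrophes
WORD = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def clean_extracted_text(text):
    """Return plausible word tokens from text extracted from a PDF."""
    # rejoin words hyphenated across a line break: "bioinfor-\nmatics" -> "bioinformatics"
    text = re.sub(r'(\w)-\s*\n\s*(\w)', r'\1\2', text)
    # drop links before tokenizing
    text = re.sub(r'https?://\S+', ' ', text)
    # strip surrounding punctuation, then keep only tokens that look like words
    tokens = (t.strip('.,;:!?()[]{}"') for t in text.split())
    return [t for t in tokens if WORD.fullmatch(t)]
```

Note that the rejoining step also merges genuine compounds that happen to break across lines ("well-\nknown" becomes "wellknown"), so the result is still worth a manual look.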

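But if you have a tool that can do this and can produce a few .txt files that really contain only the words/sentences you need, then I'd use something like the Python code below to "solve" the merging, for your local dictionary only. Of course you can also extend it to load the system dictionary (wherever that is?) and merge it the way I show below.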

Note that I've deliberately left out any error handling.

Save it as import_to_dict.py, adjust the paths to your needs, and run it with python import_to_dict.py:
#!/usr/bin/env python

import os
import re

# 1 - load existing dictionaries from files (adjust paths here!)
# expanduser() is needed because open() does not expand '~'
local_dictionary_file = os.path.expanduser('~/Library/Spelling/LocalDictionary')
global_dictionary_file = '/Library/Spelling/GlobalDictionary'
txt_file_folder = os.path.expanduser('~/Documents/ConvertedPapers')

reg_exp = r'[\s,.|/]+'  # add symbols here


def read_words(path):
    # split the file contents into a word-set (set guarantees no duplicates);
    # discard('') drops the empty string re.split() yields at the start/end of the text
    with open(path, 'r') as f:
        words = set(re.split(reg_exp, f.read()))
    words.discard('')
    return words


# splitting with regular expressions shouldn't really be needed for the dictionaries, but it should work
dictionary = read_words(local_dictionary_file)
global_dictionary = read_words(global_dictionary_file)

# 2 - walk over all sub-dirs in your folder
for root, dirs, files in os.walk(txt_file_folder):
    # open all files (this could easily be limited to only .txt files)
    for file in files:
        word_set = read_words(os.path.join(root, file))
        # remove any words that already exist in either dictionary
        missing_words = (word_set - dictionary) - global_dictionary
        # add missing words to dictionary
        dictionary |= missing_words

# 3 - write dictionary file (sorted, one word per line)
with open(local_dictionary_file, 'w') as f:
    f.write('\n'.join(sorted(dictionary)))
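Step 2 of the question (keeping only words that are not in a "normal" dictionary) can also be sketched as a standalone filter. The helper names `load_word_list` and `unknown_words` are hypothetical, and using /usr/share/dict/words as the "normal" dictionary is an assumption: it is a plain word list found on many Unix systems, not necessarily what the spell checker itself uses.

```python
import re


def load_word_list(path):
    """Read a one-word-per-line file into a lowercase set."""
    with open(path, encoding='utf-8') as f:
        return {line.strip().lower() for line in f if line.strip()}


def unknown_words(text, known_words):
    """Return the words in text whose lowercase form is not in known_words."""
    # runs of letters only; digits and punctuation act as separators
    tokens = set(re.findall(r"[^\W\d_]+", text))
    return {t for t in tokens if t.lower() not in known_words}
```

For example, `unknown_words(open('paper.txt').read(), load_word_list('/usr/share/dict/words'))` would give the candidate words for one converted paper (paper.txt is a placeholder name). The comparison is case-insensitive so that a capitalised sentence-initial "The" is not reported as new.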

Answer 1: (score: 0)

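Here is a basic Java program that generates a text file containing all the unique words found in a directory of plain-text files, separated by newlines.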

Replace the input directory and output file path strings with the right values for your system and run it.

import java.io.*;
import java.util.*;

public class MakeDictionary {
    public static void main(String[] args) throws IOException {
        // a sorted set keeps each word once and gives a stable output order
        Set<String> dictionary = new TreeSet<String>();

        String inputDir = "C:\\test";
        String outputFile = "C:\\out\\dictionary.txt";

        // listFiles() returns null if inputDir is not a readable directory
        File[] files = new File(inputDir).listFiles();
        if (files == null) {
            throw new IOException("Not a readable directory: " + inputDir);
        }

        for (File file : files) {
            if (!file.isFile()) {
                continue;
            }
            // try-with-resources closes the reader even if reading fails
            try (BufferedReader in = new BufferedReader(new FileReader(file))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // split on any run of whitespace, not just single spaces
                    for (String word : line.trim().split("\\s+")) {
                        if (!word.isEmpty()) {
                            dictionary.add(word);
                        }
                    }
                }
            }
        }

        // write one word per line
        try (BufferedWriter out = new BufferedWriter(new FileWriter(outputFile))) {
            for (String word : dictionary) {
                out.write(word);
                out.newLine();
            }
        }
    }
}