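I like my dictionaries to know more of the words I use, and I don't want to add every possible word by hand as I end up typing it (I'm a biologist/bioinformatician, so there is a lot of jargon and many specific software and species names). Instead, I want to:
~/Library/Spelling/LocalDictionary. It would also make sense to add them to the LibreOffice / Word / ispell custom dictionaries, though. Steps 1 and 3 are easy. How do I do step 2? Thanks!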
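Answer 0 (score: 1)
As I understand it, you want to remove duplicates (words that already exist in the system dictionary). You may first want to ask whether that is really necessary. My guess is that duplicates cause no problems and do not noticeably slow down spell checking, so I see no real reason for step 2.
I think you will have far more difficulty with step 1. Extracting plain text from a PDF may sound easy, but it certainly is not. You will end up with plenty of unknown symbols, you will need to repair words that were hyphenated at the end of a line, and you will probably want to exclude equations, links, numbers, etc. before adding all of that to your dictionary.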
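To make the clean-up problem concrete, here is a minimal sketch of the kind of filtering meant above. The function name `extract_words` and the three patterns are my own illustrative choices, not part of any PDF tool, and real extracted text will need more rules than this.

```python
import re

# Hypothetical patterns for illustration; they are not exhaustive.
HYPHEN_BREAK = re.compile(r"(\w)-\s*\n\s*(\w)")
URL = re.compile(r"(?:https?://|www\.)\S+")
WORD = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def extract_words(raw_text):
    """Return the set of word-like tokens found in text extracted from a PDF."""
    # Re-join words that were hyphenated across a line break ("bio-\ninformatics").
    text = HYPHEN_BREAK.sub(r"\1\2", raw_text)
    # Drop links before tokenizing so their fragments do not become "words".
    text = URL.sub(" ", text)
    # Keep only runs of letters (optionally joined by ' or -); this skips
    # plain numbers, equation symbols and other stray characters.
    return set(WORD.findall(text))
```

One caveat with this approach: a word that is legitimately hyphenated and happens to break at the end of a line ("well-\nknown") loses its hyphen, so the result is still worth a manual look.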
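But if you have some tool that does this and can produce a few .txt files that really contain only the words/sentences you need, then I would use something like the Python code below to "solve" the merge for your local dictionary only. Of course you could also extend it to load the system dictionary (wherever that is?) and merge it the way I show below.
Note that I intentionally left out any error handling.
Save it as import_to_dict.py, adjust the paths to your needs, and run it with python import_to_dict.py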
#!/usr/bin/env python
import os
import re

# 1 - load existing dictionaries from files (adjust paths here!)
# expanduser() turns "~" into the home directory; open() does not do that itself
local_dictionary_file = os.path.expanduser('~/Library/Spelling/LocalDictionary')
global_dictionary_file = '/Library/Spelling/GlobalDictionary'
txt_file_folder = os.path.expanduser('~/Documents/ConvertedPapers')

reg_exp = r'[\s,.|/]+'  # add symbols here

with open(local_dictionary_file, 'r') as f:
    # splitting with regular expressions shouldn't really be needed for the dictionary, but it should work
    dictionary = set(re.split(reg_exp, f.read()))

with open(global_dictionary_file, 'r') as f:
    global_dictionary = set(re.split(reg_exp, f.read()))

# 2 - walk over all sub-dirs in your folder
for root, dirs, files in os.walk(txt_file_folder):
    # open all files (this could easily be limited to only .txt files)
    for file in files:
        with open(os.path.join(root, file), 'r') as txt_f:
            # read the file contents
            words = txt_f.read()
        # split into word-set (set guarantees no duplicates)
        word_set = set(re.split(reg_exp, words))
        # remove any words that already exist in either dictionary
        missing_words = (word_set - dictionary) - global_dictionary
        # add missing words to dictionary
        dictionary |= missing_words

# 3 - write dictionary file
# discard the empty string that re.split() yields at the start or end of a text
dictionary.discard('')
with open(local_dictionary_file, 'w') as f:
    f.write('\n'.join(dictionary))
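Since the script leaves out error handling and reads every file it finds, here is a hedged sketch of how those two gaps could be closed. It assumes Python 3 and UTF-8 text files; the helper names `read_words` and `words_in_folder` are made up for this example.

```python
import os
import re

SPLIT = re.compile(r"[\s,.|/]+")  # same separators as the script above


def read_words(path):
    """Return the set of words in a file, or an empty set if it cannot be read."""
    try:
        # errors="ignore" skips bytes that are not valid UTF-8 instead of raising.
        with open(os.path.expanduser(path), encoding="utf-8", errors="ignore") as f:
            return {w for w in SPLIT.split(f.read()) if w}
    except OSError as exc:
        print("skipping {}: {}".format(path, exc))
        return set()


def words_in_folder(folder):
    """Collect the words from every .txt file below a folder."""
    words = set()
    for root, _dirs, files in os.walk(os.path.expanduser(folder)):
        for name in files:
            # Only plain-text files; anything else in the folder is ignored.
            if name.lower().endswith(".txt"):
                words |= read_words(os.path.join(root, name))
    return words
```

With these, a missing dictionary file simply counts as empty, so `words_in_folder(txt_file_folder) - read_words(global_dictionary_file)` still works on a machine where the global dictionary path does not exist.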
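Answer 1 (score: 0)
Here is a basic Java program that generates a text file containing all unique words from a directory of plain-text files, separated by newlines.
Replace the input directory and output file path strings with the correct values for your system and run it.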
import java.io.*;
import java.util.*;

public class MakeDictionary {
    public static void main(String[] args) throws IOException {
        // A set stores each word once, which is all the program needs
        Set<String> dictionary = new HashSet<String>();
        String inputDir = "C:\\test";
        String outputFile = "C:\\out\\dictionary.txt";

        // listFiles() returns null if inputDir is not a readable directory
        File[] files = new File(inputDir).listFiles();
        if (files == null) {
            throw new IOException("Cannot list directory: " + inputDir);
        }

        for (File file : files) {
            if (file.isFile()) {
                // try-with-resources closes the reader even if reading fails
                try (BufferedReader in = new BufferedReader(new FileReader(file))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        // split on any run of whitespace, not just single spaces
                        for (String word : line.trim().split("\\s+")) {
                            if (!word.isEmpty()) {
                                dictionary.add(word);
                            }
                        }
                    }
                }
            }
        }

        // write one word per line
        try (BufferedWriter out = new BufferedWriter(new FileWriter(outputFile))) {
            for (String word : dictionary) {
                out.write(word);
                out.newLine();
            }
        }
    }
}
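The word list this program writes still includes words the spell checker already knows, which is the part the question asks about. A minimal sketch of that subtraction step follows, in Python to match the other answer. It assumes both files hold one word per line, and the function names are invented for illustration.

```python
def load_word_list(path):
    """Read a file with one word per line into a set, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def new_words(candidates, known_words):
    """Return the candidates, sorted, that are not already in known_words.

    The comparison ignores case, so "Protein" is treated as known when
    "protein" is in the dictionary.
    """
    known = {w.lower() for w in known_words}
    return sorted({w for w in candidates if w and w.lower() not in known})
```

Usage would look like `new_words(load_word_list(generated_file), load_word_list(existing_dictionary))`, with both paths filled in for your system. Ignoring case is a judgment call: it avoids re-adding capitalized sentence starts, but it may be the wrong choice if case matters for your gene or species names.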