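I wrote a piece of code that basically does find and replace on a text file, driven by a list.
It maps the whole list into a dictionary. Each line of the text file is then processed and matched against every entry in the dictionary; if a match is found anywhere in the line, it is replaced with the corresponding value from the list (dictionary).
Here is the code: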
import sys
import re
#open file using open file mode
fp1 = open(sys.argv[1]) # Open file on read mode
lines = fp1.read().split("\n") # Create a list containing all lines
fp1.close() # Close file
fp2 = open(sys.argv[2]) # Open file on read mode
words = fp2.read().split("\n") # Create a list containing all lines
fp2.close() # Close file
word_hash = {}
for word in words:
    #print(word)
    if(word != ""):
        tsl = word.split("\t")
        word_hash[tsl[0]] = tsl[1]
#print(word_hash)
keys = word_hash.keys()
#skeys = sorted(keys, key=lambda x:x.split(" "),reverse=True)
#print(keys)
#print (skeys)
for line in lines:
    if(line != ""):
        for key in keys:
            #my_regex = key + r"\b"
            my_regex = r"([\"\( ])" + key + r"([ ,\.!\"।)])"
            #print(my_regex)
            if((re.search(my_regex, line, re.IGNORECASE|re.UNICODE))):
                line = re.sub(my_regex, r"\1" + word_hash[key]+r"\2",line,flags=re.IGNORECASE|re.UNICODE|re.MULTILINE)
                #print("iam :1",line)
            if((re.search(key + r"$", line, re.IGNORECASE|re.UNICODE))):
                line = re.sub(key+r"$", word_hash[key],line,flags=re.IGNORECASE|re.UNICODE|re.MULTILINE)
                #print("iam :2",line)
            if((re.search(r"^" + key, line, re.IGNORECASE|re.UNICODE))):
                #print(line)
                line = re.sub(r"^" + key, word_hash[key],line,flags=re.IGNORECASE|re.UNICODE|re.MULTILINE)
                #print("iam :",line)
        print(line)
    else:
        print(line)
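The problem is that as the list grows, execution slows down, because every line of the text file is matched against every key in the list. Where can I improve the execution of this code?
List file:
word1 ===>replaceword1
word2 ===>replaceword2
.....
The list is tab-separated. I used ===> here only to make it easier to read.
Input file: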
hello word1 I am here.
word2. how are you word1?
Expected output:
hello replaceword1 I am here.
replaceword2. how are you replaceword1?
Answer 0 (score: 2)
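If your word list is small enough, the best speed-up you can get for the match-and-replace process is to build a single large regular expression and call re.sub with a function as the replacement.
That way you make only one call to the optimized function.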
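A minimal sketch of that single-regex idea, for illustration only. The helper name `build_replacer` is hypothetical, and the `(?<!\w)` / `(?!\w)` lookarounds are an assumed stand-in for the delimiter character classes used in the question, so the matching rules are not identical to the original code. Keys are passed through `re.escape` and lower-cased so that a case-insensitive match can always be looked up in the dictionary.

```python
import re

def build_replacer(mapping):
    """Return a function that applies every replacement in one re.sub pass."""
    # Lower-case the keys so a case-insensitive match can be looked up later.
    lookup = {key.lower(): value for key, value in mapping.items()}
    if not lookup:
        # Nothing to replace: return the text unchanged.
        return lambda text: text
    # Longest keys first, so a longer phrase wins over a shorter one it contains.
    ordered = sorted(lookup, key=len, reverse=True)
    # Assumption: "not adjacent to a word character" approximates the
    # delimiter classes from the question.
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(key) for key in ordered) + r")(?!\w)",
        re.IGNORECASE,
    )

    def replace_all(text):
        # Fall back to the matched text if the lower-cased form is not a key.
        return pattern.sub(
            lambda m: lookup.get(m.group(1).lower(), m.group(0)), text
        )

    return replace_all

replace_all = build_replacer({"word1": "replaceword1", "word2": "replaceword2"})
print(replace_all("hello word1 I am here.\nword2. how are you word1?"))
```

Because the text is scanned once, already-replaced text is never scanned again, so this variant does not produce chained replacements.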
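EDIT: To preserve the replacement order (which can lead to chained replacements; I don't know whether that is the intended behaviour), we can perform the replacement in batches instead of in a single run. The batch order follows the order in the file, and each batch consists of words whose possible matches are disjoint.
The code is as follows: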
import sys
import re
word_hashes = []
def insert_word(word, replacement, hashes):
    if not hashes:
        return [{word: replacement}]
    for prev_word in hashes[0]:
        if word in prev_word or prev_word in word:
            return [hashes[0]] + insert_word(word, replacement, hashes[1:])
    hashes[0][word] = replacement
    return hashes
with open(sys.argv[2]) as fp2: # Open file on read mode
    words = fp2.readlines()
    for word in [w.strip() for w in words if w.strip()]:
        tsl = word.split("\t")
        word_hashes = insert_word(tsl[0],tsl[1], word_hashes)
#open file using open file mode
lines = []
with open(sys.argv[1]) as fp1:
    content = fp1.read()
    for word_hash in word_hashes:
        my_regex = r"([\"\( ])(" + '|'.join(word_hash.keys()) + r")([ ,\.!\"।)])"
        content = re.sub(my_regex, lambda x: x.group(1) + word_hash[x.group(2)] + x.group(3) ,content,flags=re.IGNORECASE|re.UNICODE|re.MULTILINE)
print(content)
This gives chained replacement on example data. For instance, with the following words to replace:
roses are red==>flowers are blue
are==>is
Text to parse:
roses are red and beautiful
flowers are yellow
Output:
roses is red and beautiful
flowers is yellow
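To see how the words end up in separate batches, the grouping step can be looked at in isolation. The sketch below is an iterative rewrite of `insert_word` under a hypothetical name, `group_into_batches`; it is meant to behave the same way on this example, not to replace the code above.

```python
def group_into_batches(pairs):
    """Split (word, replacement) pairs into ordered batches of non-overlapping words."""
    batches = []
    for word, replacement in pairs:
        for batch in batches:
            # A word may join a batch only if it neither contains nor is
            # contained in any word already in that batch.
            if not any(word in other or other in word for other in batch):
                batch[word] = replacement
                break
        else:
            # The word overlaps with something in every existing batch,
            # so it starts a new, later batch.
            batches.append({word: replacement})
    return batches

batches = group_into_batches([("roses are red", "flowers are blue"), ("are", "is")])
print(batches)
# [{'roses are red': 'flowers are blue'}, {'are': 'is'}]
```

With the regex used above, the first batch finds nothing in the sample text, because the pattern requires a delimiter character before the match and "roses are red" sits at the very start of the text. The second batch then rewrites both occurrences of " are ", which gives the output shown.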
Answer 1 (score: 0)
Why not read the entire file's contents as a string and just do string.replace? For example:
def find_replace():
    txt = ''
    # Read text from the file as a string
    with open('file.txt', 'r') as fp:
        txt = fp.read()
    dct = {"word1": "replaceword1", "word2": "replaceword2"}
    # Find and replace each key with its value
    for k, v in dct.items():
        txt = txt.replace(k, v)
    # Write back the modified string
    with open('file.txt', 'w') as fp:
        fp.write(txt)
If the input file is:
hello word1 I am here.
word2. how are you word1?
The output will be:
hello replaceword1 I am here.
replaceword2. how are you replaceword1?
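Two properties of the str.replace approach are worth keeping in mind: it matches substrings rather than whole words, and each pass works on the result of the previous one. The sketch below illustrates both with an in-memory helper, `replace_in_order`, which is a hypothetical name and not part of the answer.

```python
def replace_in_order(text, mapping):
    """Apply str.replace once per entry, in the mapping's insertion order."""
    for old, new in mapping.items():
        text = text.replace(old, new)
    return text

# Substrings are replaced too: "word10" contains "word1".
print(replace_in_order("word1 word10", {"word1": "replaceword1"}))
# replaceword1 replaceword10

# A later entry can rewrite the result of an earlier one.
print(replace_in_order("word1", {"word1": "word2", "word2": "done"}))
# done
```

str.replace is also case-sensitive, unlike the re.IGNORECASE matching in the question.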