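I want to replace every integer that is greater than 2147483647 and followed by ^^<int> with the first three digits of that number. For example, my original data is: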
<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact".
"Ask a Question" <at> "25500000000"^^<int> <stack_overflow> .
<basic> "language" "89028899" <html>.
I want to replace the original data with the data shown below:
<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact".
"Ask a Question" <at> "255"^^<int> <stack_overflow> .
<basic> "language" "89028899" <html>.
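The approach I implemented scans the data line by line. If I find a number greater than 2147483647, I replace it with its first 3 digits. However, I don't know how to check that the next part of the string is ^^<int>.
What I want to do is this: for numbers greater than 2147483647, for example 25500000000, replace them with the first 3 digits of the number. Since my data is 1 terabyte in size, a faster solution would be greatly appreciated.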
Answer 0: (score: 3)
Use the re module to build a regular expression:
import re

regex = r"""
(                         # Capture in group #1
    (?:"[\w\s]+"|<\w+>)   # A quoted string or an <angle-bracketed> term
    \s+                   # followed by one or more white space characters
    (?:"[\w\s]+"|<\w+>)   # and a second term of the same form
    \s+
)
"(\d{10,})"               # Match a quoted run of at least 10 digits into group #2
(\^\^<int>)               # Match two escaped circumflex characters and the
                          # literal <int> into group #3
(.*)                      # Followed by anything at all into group #4
"""
COMPILED_REGEX = re.compile(regex, re.VERBOSE)
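To see what each group captures, here is a small sketch run against the sample line from the question. The pattern is a compact variant written only for this illustration (an assumption, not necessarily the answer's exact expression): it accepts quoted or angle-bracketed terms before the number and expects the literal ^^<int> after it. The names TERM, pattern and sample are made up for the example.

```python
import re

# Hypothetical helper: one term is either a quoted string or <something>.
TERM = r'(?:"[\w\s]+"|<\w+>)'
pattern = re.compile("".join([
    r'(', TERM, r'\s+', TERM, r'\s+)',  # group 1: the two terms before the number
    r'"(\d{10,})"',                     # group 2: the quoted digits
    r'(\^\^<int>)',                     # group 3: the type marker
    r'(.*)',                            # group 4: the rest of the line
]))

sample = '"Ask a Question" <at> "25500000000"^^<int> <stack_overflow> .'
match = pattern.match(sample)
groups = match.groups()
for number, text in enumerate(groups, start=1):
    print(number, repr(text))
```

Group 2 holds only the digits, without the surrounding quotes, which is why it can be passed straight to int().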
Next, we need to define a callback function (the sub method of a compiled pattern accepts a callback) to handle the replacement:
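def replace_callback(matches):
    full_line = matches.group(0)
    number_text = matches.group(2)
    number_of_interest = int(number_text, base=10)
    if number_of_interest > 2147483647:
        # Splice the first 3 digits over group #2 by position. str.replace
        # needs strings (not the int), and a positional splice cannot
        # touch identical digits elsewhere in the match.
        start = matches.start(2) - matches.start(0)
        end = matches.end(2) - matches.start(0)
        return full_line[:start] + number_text[:3] + full_line[end:]
    else:
        return full_line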
Then find and replace:
fixed_data = COMPILED_REGEX.sub(replace_callback, YOUR_DATA)
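If you have a terabyte of data, you probably don't want to do this in memory. Instead, open the file and iterate over it, replacing the data line by line and writing it out to another file (there are undoubtedly ways to speed this up, but they would obscure the gist of the technique):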
# Given the above
def process_data():
    # The backslash continues the with statement onto the next line
    with open("path/to/your/file") as data_file, \
            open("path/to/output/file", "w") as output_file:
        for line in data_file:
            fixed_data = COMPILED_REGEX.sub(replace_callback, line)
            output_file.write(fixed_data)
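As one possible way to reduce per-line overhead on a very large file, the sketch below reads and writes bytes (skipping text decoding) and runs the regular expression only on lines that contain the ^^<int> marker. It is an illustration under assumptions, not a measured optimization: the names INT32_MAX, MARKER, BIG_INT, shorten and process_file are hypothetical, the pattern is a simplified one that looks only at the quoted number and what follows it, and any actual speedup depends on the data.

```python
import re

INT32_MAX = 2147483647
MARKER = b"^^<int>"
# Bytes pattern: a quoted run of 10+ digits directly followed by ^^<int>.
BIG_INT = re.compile(rb'"(\d{10,})"(?=\^\^<int>)')


def shorten(match):
    """Return the quoted number cut to 3 digits if it exceeds INT32_MAX."""
    digits = match.group(1)
    if int(digits) > INT32_MAX:
        return b'"' + digits[:3] + b'"'
    return match.group(0)


def process_file(src_path, dst_path):
    """Copy src_path to dst_path, shortening oversized ^^<int> literals."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            # Cheap substring test first; lines without the marker
            # are written through without touching the regex.
            if MARKER in line:
                line = BIG_INT.sub(shorten, line)
            dst.write(line)
```

Because the file is still read line by line, memory use stays bounded regardless of the file size.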
Answer 1: (score: 1)
If every line in your text file looks like your example, you can do this:
In [2078]: line = '"QuestionAndAnsweringWebsite" "fact". "Ask a Question" "25500000000"^^ . "language" "89028899"'
In [2079]: re.findall('\d+"\^\^', line)
Out[2079]: ['25500000000"^^']
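import re

with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    for line in infile:
        for found in re.findall(r'\d+"\^\^', line):
            # found looks like 25500000000"^^ ; its last 3 characters are "^^
            if int(found[:-3]) > 2147483647:
                # Keep the first 3 digits and put the trailing "^^ back
                line = line.replace(found, found[:3] + found[-3:])
        outfile.write(line)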
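Because of the inner for loop, this may be an inefficient solution. However, I can't think of a better regex right now, so this should at least get you started.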
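One way to avoid the inner loop is to let re.sub do the search and the replacement in a single pass with a callback. The sketch below is an assumption about what a tighter expression could look like, not the answerer's code: unlike the pattern above, it also requires the literal <int> after the two circumflexes, and the names QUOTED_INT and truncate_big_ints are made up for illustration.

```python
import re

INT32_MAX = 2147483647
# 10 or more digits inside quotes, matched only when ^^<int> follows.
# The lookahead checks the marker without consuming it.
QUOTED_INT = re.compile(r'"(\d{10,})"(?=\^\^<int>)')


def truncate_big_ints(line):
    """Shorten every oversized quoted integer on the line to 3 digits."""
    def shorten(match):
        digits = match.group(1)
        if int(digits) > INT32_MAX:
            return '"' + digits[:3] + '"'
        return match.group(0)  # small enough: leave the literal unchanged
    return QUOTED_INT.sub(shorten, line)


print(truncate_big_ints(
    '"Ask a Question" <at> "25500000000"^^<int> <stack_overflow> .'))
# prints: "Ask a Question" <at> "255"^^<int> <stack_overflow> .
```

Requiring at least 10 digits is safe because 2147483647 itself has 10 digits, so shorter numbers can never exceed it and are skipped without an int() conversion.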