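Please note that this is a revised and improved copy of my original question; I hope it is clearer than my first attempt. I am new to programming and am trying to write a script that performs a series of specific find-and-replace operations on a CSV file, using another CSV sheet as the correction guide (e.g., chiken becomes chicken, bcon becomes bacon).
So, in the simple case: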
chikn,1,a
bcon,2,b
egs,3,c
becomes
chicken,1,a
bacon,2,b
eggs,3,c
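So far, using the code below, I have built a dictionary from the input CSV and can apply most corrections to the target (edited) CSV in the simple case. The real challenge, however, is that cells in the actual data set usually hold one to three entries (separated by a common delimiter, ":"), and many of the entries contain spaces (i.e., they are phrases rather than single words). Building on the previous example with an updated dictionary, this would be:
Start:
chk sandwich:egs,1,a
bcon,2,b
bcon:egs,3,c
Should end as:
chicken sandwich:eggs,1,a
bacon,2,b
bacon:eggs,3,c
Instead, my current output drops the latter part of each cell and prints:
chicken sandwich,1,a
bacon,2,b
bacon,3,c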
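To make the goal concrete, here is a minimal sketch of the per-cell behavior I am after, assuming the items in a cell are separated by ":" and each item should be looked up as a whole phrase in the dictionary. The function name `replace_cell` and the sample dictionary are made up for illustration; they are not part of my script.

```python
def replace_cell(cell, reps, sep=":"):
    """Replace each sep-delimited item in a cell using an exact dictionary lookup."""
    items = [item.strip() for item in cell.split(sep)]
    # Items with no entry in reps are left unchanged.
    return sep.join(reps.get(item, item) for item in items)


# Hypothetical corrections, mirroring the example above.
reps = {"chk sandwich": "chicken sandwich", "bcon": "bacon", "egs": "eggs"}

print(replace_cell("chk sandwich:egs", reps))  # chicken sandwich:eggs
print(replace_cell("bcon:egs", reps))          # bacon:eggs
```

Because the lookup is on the whole item, phrases with spaces are handled the same way as single words.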
Code:
#!/usr/bin/env python
"""A script for finding and replacing values in CSV files.
"""
import csv
import re
import sys


def main(args):
    """Execute the transformation script.

    Args:
        args (list of `str`): The command line arguments.
    """
    transform(args[1], args[2], create_reps(args[3]), int(args[4]))


def transform(infile, outfile, reps, column):
    """Write a new CSV file with replaced text.

    Args:
        infile (str): the sheet of original text with errors
        outfile (str): the sheet with the revised text with corrections in place of errors
        reps (dict): dictionary of error word and corrected word
        column (int): the column (0 based) the word revisions will take place in
    """
    with open(infile) as csvfile:
        with open(outfile, 'w') as w:
            spamreader = csv.reader(csvfile)
            spamwriter = csv.writer(w)
            for row in spamreader:
                # Calls the updated version; the original replace_all is commented out below.
                row[column] = new_replace_all(row[column], reps)
                spamwriter.writerow(row)


def create_reps(infile):
    """Create reps object to use as reference dictionary for transform.

    Args:
        infile (str): The sheet of original and corrected words used to
            generate dictionary

    Returns:
        (dict): a dictionary listing the error words and their
            corrections
    """
    reps = {}
    with open(infile) as csvfile:
        dictreader = csv.reader(csvfile)
        for row in dictreader:
            reps[row[0]] = row[1]
    return reps


# def replace_all(text, reps):
#     """Original Version: Iterate through `reps` and replace key => value in `text`.
#
#     Args:
#         text (str): The text to search and replace.
#         reps (dict): Search for `key` and replace with `value`
#
#     Returns:
#         (str): The string with the replacements.
#     """
#     last = text
#     for i, j in reps.items():
#         text = text.replace(i, j)
#     if last != text:
#         return text


def new_replace_all(text, reps):
    """Updated Version: Do a single-pass replacement from a dictionary"""
    pattern = re.compile(r'\b(' + '|'.join(reps.keys()) + r')\b')
    return pattern.sub(lambda x: reps[x.group()], text)


if __name__ == "__main__":
    main(sys.argv)
Thank you in advance for your time and support. I look forward to your guidance!
Best.
---------------- Update 4/5/18 ----------------
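With HFBrowing's support, I have been able to modify this code to work with the sample data set I originally provided. In my real-world application, however, it still breaks when it encounters some of the more complex string matches in my data set. I welcome any suggestions on how to solve this, and have provided some examples and the error I get.
Ideally, items in a given cell joined by "|" would be kept together, while items in a given cell linked by ":" would be treated as separate strings and replaced individually.
So if:
"A|First" = "A1" and "B|First" = "B1"
then
"A|First:B|First" should be transformed into "A1:B1".
Using this more complex string data, I have provided examples of the expected and current output, along with the error I received.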
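As a rough sketch of the "|" and ":" rule described above (not the code I am running): reading the dictionary sheet with `csv.reader` keeps the "|" inside each key, and splitting a cell only on ":" lets each item be looked up whole. The helper name `replace_items` and the inline two-row dictionary sheet are assumptions made for illustration.

```python
import csv
import io


def replace_items(cell, reps):
    """Split a cell on ':' only, so '|' stays inside each item, then look up each item."""
    return ":".join(reps.get(item, item) for item in cell.split(":"))


# Hypothetical dictionary sheet: error string, corrected string.
dictionary_csv = "A|First,A1\nB|First,B1\n"
reps = {row[0]: row[1] for row in csv.reader(io.StringIO(dictionary_csv))}

print(replace_items("A|First:B|First", reps))  # A1:B1
print(replace_items("A|First:C|First", reps))  # A1:C|First
```

Items that have no entry in the dictionary pass through unchanged, as the second call shows.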
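Example dictionary:
Error Word,Corrected Word
Actuarial Science,Accounting:Actuarial Science
Anthropology,Anthropology:General
Undeclared,Undecided
Information Technology and Administrative Management|Administrative Management Specialization,Information Technology and Administrative Management:Administrative Management Specialization
Biology,Biology
Example input:
Major,ID,Last
Actuarial Science,111,Smith
Anthropology,222,Bob
Anthropology:Actuarial Science,333,Johnson
Information Technology and Administrative Management|Administrative Management Specialization,444,Frank
Undeclared,555,Timon
Current output error: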
Traceback (most recent call last):
  File "myscript3.py", line 89, in <module>
    main(sys.argv)
  File "myscript3.py", line 21, in main
    transform(args[1], args[2], create_reps(args[3]), int(args[4]))
  File "myscript3.py", line 41, in transform
    row[column] = new_replace_all(row[column], reps)
  File "myscript3.py", line 68, in new_replace_all
    return pattern.sub(lambda x: reps[x.group()], text)
  File "myscript3.py", line 68, in <lambda>
    return pattern.sub(lambda x: reps[x.group()], text)
KeyError: 'Information Technology and Administrative Management'
Current output csv:
Major,ID,Last
Accounting:Actuarial Science,111,Sumeri
Anthropology:General,222,Nelson
Anthropology:General;Accounting:Actuarial Science,333,Newman
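---------------- Update 4/6/18: Solved ----------------
Hi everyone,
Thank you all for your support. At a colleague's suggestion, I modified the original replace_all code as shown below. It now seems to work as expected in my context.
Thanks again for your time and support!
Code: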
#!/usr/bin/env python
"""A script for finding and replacing values in CSV files.

Example::

    ./myscript school-data.csv outfile-data.csv replacements.csv 4
"""
import csv
import sys


def main(args):
    """Execute the transformation script.

    Args:
        args (list of `str`): The command line arguments.
    """
    transform(args[1], args[2], create_reps(args[3]), int(args[4]))


def transform(infile, outfile, reps, column):
    """Write a new CSV file with replaced text.

    Args:
        infile (str): the sheet of original text with errors
        outfile (str): the sheet with the revised text with corrections in
            place of errors
        reps (dict): dictionary of error word and corrected word
        column (int): the column (0 based) the word revisions will take place
            in
    """
    with open(infile) as csvfile:
        with open(outfile, 'w') as w:
            spamreader = csv.reader(csvfile)
            spamwriter = csv.writer(w)
            for row in spamreader:
                row[column] = replace_all(row[column], reps)
                spamwriter.writerow(row)


def create_reps(infile):
    """Create reps object to use as reference dictionary for transform.

    Args:
        infile (str): The sheet of original and corrected words used to
            generate dictionary

    Returns:
        (dict): a dictionary listing the error words and their
            corrections
    """
    reps = {}
    with open(infile) as csvfile:
        dictreader = csv.reader(csvfile)
        for row in dictreader:
            reps[row[0]] = row[1]
    return reps


def replace_all(text, reps):
    """Iterate through `reps` and replace key => value in `text`.

    Args:
        text (str): The text to search and replace.
        reps (dict): Search for `key` and replace with `value`

    Returns:
        (str): The string with the replacements.
    """
    last = text
    for i, j in reps.items():
        text = text.replace(i, j)
    # if last != text:
    #     return text
    return text


if __name__ == "__main__":
    main(sys.argv)
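One caveat I am aware of with this approach: each replacement runs on the output of the previous one, so a corrected value can itself be matched by a later key, and a key is also replaced when it appears inside a longer word. A small, made-up illustration follows; the dictionaries are hypothetical (not my real data), and the first result assumes Python 3.7+, where dictionaries keep insertion order.

```python
def replace_all(text, reps):
    """Apply every key => value replacement in turn, as in the script above."""
    for old, new in reps.items():
        text = text.replace(old, new)
    return text


# Hypothetical entries where one correction's output matches another key.
chained = {"egs": "eggs", "eggs": "egg whites"}
print(replace_all("egs", chained))  # egg whites

# A key that appears inside a longer word is replaced as well.
print(replace_all("begs", {"egs": "eggs"}))  # beggs
```

This has not been a problem with my data so far, but it is worth checking the dictionary for keys that overlap with corrected values.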
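Answer 0 (score: 0)
I couldn't actually get your code sample to work when replacing things, so I'm sure there are some differences between how I built my CSVs and what you are doing. Still, I think the problem lies in your replace_all() function, because replacing text sequentially can be tricky. Here is the solution from that linked question, adapted as a function. Does this solve your problem?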
import re


def new_replace_all(text, reps):
    """Do a single-pass replacement from a dictionary"""
    pattern = re.compile(r'\b(' + '|'.join(reps.keys()) + r')\b')
    return pattern.sub(lambda x: reps[x.group()], text)
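One caveat with this pattern, offered as a likely explanation rather than a tested fix: the keys are joined into the regex unescaped, so a key containing a regex metacharacter such as "|" is read as alternation. The text that matches is then only part of the key, and looking it up in the dictionary can raise a KeyError. A sketch that escapes each key with `re.escape` and tries longer keys first is below; `escaped_replace_all` is a name made up for this example.

```python
import re


def escaped_replace_all(text, reps):
    """Single-pass replacement that treats every key as literal text."""
    # Longest keys first, so a longer key wins over a shorter key it starts with.
    keys = sorted(reps, key=len, reverse=True)
    pattern = re.compile(r'\b(' + '|'.join(re.escape(k) for k in keys) + r')\b')
    return pattern.sub(lambda m: reps[m.group()], text)


reps = {"A|First": "A1", "B|First": "B1"}
print(escaped_replace_all("A|First:B|First", reps))  # A1:B1
```

The `\b` anchors are kept from the original pattern, so this still assumes every key starts and ends with a word character.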
Answer 1 (score: 0)
#!/usr/bin/env python
"""A script for finding and replacing values in CSV files.

Example::

    ./myscript school-data.csv outfile-data.csv replacements.csv 4
"""
import csv
import sys


def main(args):
    """Execute the transformation script.

    Args:
        args (list of `str`): The command line arguments.
    """
    transform(args[1], args[2], create_reps(args[3]), int(args[4]))


def transform(infile, outfile, reps, column):
    """Write a new CSV file with replaced text.

    Args:
        infile (str): the sheet of original text with errors
        outfile (str): the sheet with the revised text with corrections in
            place of errors
        reps (dict): dictionary of error word and corrected word
        column (int): the column (0 based) the word revisions will take place
            in
    """
    with open(infile) as csvfile:
        with open(outfile, 'w') as w:
            spamreader = csv.reader(csvfile)
            spamwriter = csv.writer(w)
            for row in spamreader:
                row[column] = replace_all(row[column], reps)
                spamwriter.writerow(row)


def create_reps(infile):
    """Create reps object to use as reference dictionary for transform.

    Args:
        infile (str): The sheet of original and corrected words used to
            generate dictionary

    Returns:
        (dict): a dictionary listing the error words and their
            corrections
    """
    reps = {}
    with open(infile) as csvfile:
        dictreader = csv.reader(csvfile)
        for row in dictreader:
            reps[row[0]] = row[1]
    return reps


def replace_all(text, reps):
    """Iterate through `reps` and replace key => value in `text`.

    Args:
        text (str): The text to search and replace.
        reps (dict): Search for `key` and replace with `value`

    Returns:
        (str): The string with the replacements.
    """
    last = text
    for i, j in reps.items():
        text = text.replace(i, j)
    # if last != text:
    #     return text
    return text


if __name__ == "__main__":
    main(sys.argv)