Question

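I'm trying to automate finding and replacing a series of broken image links in .rst files. I have a csv file where column A holds the "old" link (the one that appears in the .rst files) and column B holds the new replacement link for that row.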

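I can't convert to HTML with pandoc first, because that "breaks" the rst file. I did this once for a set of HTML files using BeautifulSoup and regex, but that parser doesn't work on my rst files.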

A colleague suggested trying Grep, but I can't figure out how to bring in the csv file to do the "match" and make the swap.

For the HTML files, the code loops through each file, searches for img tags, and replaces the links using the csv file as a dictionary:

import csv
from bs4 import BeautifulSoup

# image_csv (path to the csv) and html (list of HTML file paths) are defined elsewhere
graph_main_nodes, graph_child_nodes = [], []
orphan_links, replaced_links = [], 0

with open(image_csv, newline='') as f:
    reader = csv.reader(f)
    next(reader, None)  # Ignore the header row
    for row in reader:
        graph_main_nodes.append(row[0])
        graph_child_nodes.append(row[1:])
graph = dict(zip(graph_main_nodes, graph_child_nodes))  # Dict with keys in correct location, vals in old locations

# Invert the mapping so each old location points to its correct location
graph = dict((v, k) for k in graph for v in graph[k])

for fixfile in html:
    try:
        with open(fixfile, 'r', encoding='utf-8') as f:
            soup = BeautifulSoup(f, 'html.parser')
            tags = soup.find_all('img')
            for tag in tags:
                print(tag['src'])
                if tag['src'] in graph:
                    tag['src'] = graph[tag['src']]
                    replaced_links += 1
                    print("Match found!")
                else:
                    orphan_links.append(tag["src"])
                    print("Ignore")
    except OSError as e:
        print(f"Could not process {fixfile}: {e}")

I'd appreciate any suggestions on how to approach this. I'd love to repurpose my BeautifulSoup code, but I'm not sure that's realistic.

Answer 1

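This question has information on parsing RST files, but I don't think that's necessary. Your problem boils down to replacing textA with textB. You already have graph loaded from the csv, so you can simply use (credit to this answer):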

# Read in the file ('fixfile' stands for the path of the .rst file)
with open('fixfile', 'r', encoding='utf-8') as file:
    filedata = file.read()

# Replace the target strings; str.replace returns a new string, so reassign it
for old, new in graph.items():
    filedata = filedata.replace(old, new)

# Write the file out again
with open('fixfile', 'w', encoding='utf-8') as file:
    file.write(filedata)
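The read, replace, write steps can also be wrapped into small functions that build the mapping straight from the csv and report how many links were swapped. This is only a sketch: the helper names `load_link_map` and `replace_links` are made up for illustration, and it assumes the layout described in the question (a header row, old link in column A, new link in column B). It also assumes no new link is itself one of the old links, since replacements are applied one after another.

```python
import csv
import io


def load_link_map(csv_file):
    """Build {old_link: new_link} from an open csv file with a header row."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row
    # Column A is the old link, column B the new one; skip short or empty rows
    return {row[0]: row[1] for row in reader if len(row) >= 2 and row[0]}


def replace_links(text, link_map):
    """Return (new_text, number_of_replacements) using plain substring replacement."""
    replaced = 0
    # Longest old links first, so a link that is a prefix of another does not clobber it
    for old in sorted(link_map, key=len, reverse=True):
        replaced += text.count(old)
        text = text.replace(old, link_map[old])
    return text, replaced


# Tiny in-memory demo; in practice pass open(image_csv, newline='') instead
demo_csv = io.StringIO("old,new\nimg/a.png,images/a.png\nimg/b.png,images/b.png\n")
demo_rst = ".. image:: img/a.png\n\nSee also img/b.png and img/a.png.\n"

link_map = load_link_map(demo_csv)
fixed_text, n_replaced = replace_links(demo_rst, link_map)
print(n_replaced)
print(fixed_text)
```

Because this is plain text replacement, it does not care whether a link sits in an `.. image::` directive, a `.. figure::` directive, or ordinary prose, which is why no RST parser is needed.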

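This is also a good candidate for sed or perl. You could use something like this answer, along with this answer to help specify an uncommon delimiter for sed. (After testing, change -n to -i and p to g so that it actually saves the file.)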

DELIM=$(printf '\001')   # an unlikely character, used as the sed delimiter
# Read the csv line by line, splitting the two fields on ","
while IFS="," read -r PATTERN REPLACEMENT
do
   # s<DELIM>old<DELIM>new<DELIM>p prints only the lines that would change
   sed -n "s${DELIM}${PATTERN}${DELIM}${REPLACEMENT}${DELIM}p" fixfile.rst
done < csvFile
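The same "preview first, then save" idea behind `sed -n ... p` can be sketched in Python as a dry run that lists the lines a replacement would alter without writing anything. The function name `preview_changes` and the sample data are hypothetical. One difference from sed worth noting: `str.replace` treats the old link literally, whereas sed reads it as a regular expression, so characters such as `.` in a link match only themselves here.

```python
def preview_changes(text, link_map):
    """Yield (line_number, old_line, new_line) for every line a replacement would alter."""
    for number, line in enumerate(text.splitlines(), start=1):
        new_line = line
        for old, new in link_map.items():
            new_line = new_line.replace(old, new)
        if new_line != line:
            yield number, line, new_line


# Hypothetical sample content and mapping, for illustration only
sample_rst = "Title\n=====\n\n.. image:: old/logo.png\n   :alt: logo\n"
sample_map = {"old/logo.png": "new/logo.png"}

changes = list(preview_changes(sample_rst, sample_map))
for number, before, after in changes:
    print(f"{number}: {before!r} -> {after!r}")
```

Once the preview looks right, the actual rewrite is the read, replace, write sequence from the Python snippet in this answer.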

How to handle find + replace in RST files using Python or Grep
