Question

I'm not sure what the best way to do this is... I thought I might need to do it in Python?

filea.html contains data-tx-text="9817db21ccc2d9acc021c4536690b90a_se"
fileb.html contains data-tx-text="0850235fcb0e503150c224dad3156312_se"

filea.html has exactly the same number of data-tx-text values (171) as fileb.html.

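I'd like to be able to use a regex pattern or a simple Python program to:

Find data-tx-text="(.*?)" in filea.html
Find data-tx-text="(.*?)" in fileb.html
Replace the value in filea.html with the value from fileb.html
Move on to the next match
Continue until the end of the file, or until every value in filea.html matches the corresponding value in fileb.html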

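I have the basics. For example, I know the regex pattern I need, and I'm guessing I need to loop over it in Python or something similar?

Maybe I could do this with sed, but I'm not that good with it, so any help is greatly appreciated.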

Answer 1

Open filea, find stringa
Open fileb, find stringb
Replace stringa with stringb
Replace stringb with stringa
Write the files back

In code:

import re

pattern = 'data-tx-text="(.*?)"'

with open('filea.html', 'r') as f:
    filea = f.read()
with open('fileb.html', 'r') as f:
    fileb = f.read()

# re.search scans the whole text; re.match would only match at position 0.
# Only the first occurrence in each file is picked up here.
stringa = re.search(pattern, filea).group()
stringb = re.search(pattern, fileb).group()

# Swap the matched attribute strings between the two files.
filea = filea.replace(stringa, stringb)
fileb = fileb.replace(stringb, stringa)

with open('filea.html', 'w') as f:
    f.write(filea)
with open('fileb.html', 'w') as f:
    f.write(fileb)
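
The snippet above works with a single match per file, while the question involves 171 values in each. One way to extend the idea to every match, in order, is sketched below. The function name `replace_all_values` and the count check are assumptions added for illustration, not part of the original answer; it only rewrites the text of filea and leaves fileb alone.

```python
import re

# Same pattern as in the question; group 1 is the attribute value.
PATTERN = re.compile(r'data-tx-text="(.*?)"')


def replace_all_values(text_a, text_b):
    """Return text_a with its n-th data-tx-text value replaced by the
    n-th data-tx-text value found in text_b."""
    new_values = PATTERN.findall(text_b)      # list of captured values
    old_count = len(PATTERN.findall(text_a))
    if old_count != len(new_values):
        raise ValueError(
            "filea has %d values but fileb has %d" % (old_count, len(new_values))
        )

    values = iter(new_values)

    def swap(match):
        # Called once per match in text_a, left to right.
        return 'data-tx-text="%s"' % next(values)

    return PATTERN.sub(swap, text_a)
```

To use it on the real files, read filea.html and fileb.html into strings, call `replace_all_values(filea, fileb)`, and write the result back to filea.html.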

Answer 2

In awk, you can use the following:

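# First file on the command line: remember each attribute value in order.
NR == FNR {
    match($0, /data-tx-text="[^"]+"/);
    if (RSTART > 0) {
        # Skip the 14 characters of data-tx-text=" and drop the closing quote.
        data[++a] = substr($0, RSTART + 14, RLENGTH - 15);
    }
    next;
}

# Second file: substitute the next stored value into the attribute.
/data-tx-text/ {
    sub(/data-tx-text="[^"]+"/, "data-tx-text=\"" data[++b] "\"");
}

# Print every line so the rest of the HTML is preserved.
{ print }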

Answer 3

With GNU awk for the 3rd arg to match():

$ cat tst.awk
match($0,/(.*)(data-tx-text="[^"]+")(.*)/,a) {
    if (NR==FNR) {
        fileb[++bcnt] = a[2]
    }
    else {
        $0 = a[1] fileb[++acnt] a[3]
    }
}
NR>FNR

$ awk -f tst.awk fileb filea
data-tx-text="0850235fcb0e503150c224dad3156312_se"

With other awks, you'd use match() followed by 3 calls to substr():

$ cat tst.awk
match($0,/data-tx-text="[^"]+"/) {
    if (NR==FNR) {
        fileb[++bcnt] = substr($0,RSTART,RLENGTH)
    }
    else {
        $0 = substr($0,1,RSTART-1) fileb[++acnt] substr($0,RSTART+RLENGTH)
    }
}
NR>FNR

$ awk -f tst.awk fileb filea
data-tx-text="0850235fcb0e503150c224dad3156312_se"

Answer 4

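So this is how I solved it in Python. It's a bit manual, since I have to change the names of filea and fileb each time, but it works.

I guess I could improve the regex with escaping?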

import re

with open('filea.html') as originalFile:
    originalFileContents = originalFile.read()

# IDs look like 32 lowercase hex digits followed by "_se".
pattern = re.compile(r'[0-9a-f]{32}_se')
originalMatches = pattern.findall(originalFileContents)

counter = 0

def replaceId(match):
    # Called once per match in fileb.html; returns the ID found at the
    # same position in filea.html.
    global counter
    value = match.group()
    newValue = originalMatches[counter]
    print(counter, '=> replacing', value, 'with', newValue)
    counter = counter + 1
    return newValue

with open('fileb.html') as targetFile:
    targetFileContents = targetFile.read()

changedTargetFileContents = pattern.sub(replaceId, targetFileContents)
print(changedTargetFileContents)

# Write the result to a separate file so neither input is overwritten.
with open("Output.html", "w") as new_file:
    new_file.write(changedTargetFileContents)

Available on GitHub: https://github.com/timm088/rehjex-py
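
Two things mentioned above could be tightened: the file names are hard-coded, and the pattern matches any ID-shaped string, not only those inside a data-tx-text attribute. The following is a rough sketch of one possible variant. The names `ID_PATTERN`, `transplant_ids` and `main`, the lookaround-based pattern, and the argument order are all assumptions made for illustration; none of them come from the linked repository.

```python
import re
import sys

# Match the ID only when it sits inside data-tx-text="...".
# The lookbehind and lookahead keep the surrounding text out of the match.
ID_PATTERN = re.compile(r'(?<=data-tx-text=")[0-9a-f]{32}_se(?=")')


def transplant_ids(source_text, target_text):
    """Replace the n-th ID in target_text with the n-th ID from source_text."""
    source_ids = iter(ID_PATTERN.findall(source_text))

    def replace_id(match):
        # If the source runs out of IDs, leave the remaining ones untouched.
        return next(source_ids, match.group())

    return ID_PATTERN.sub(replace_id, target_text)


def main(source_path, target_path, output_path):
    with open(source_path) as f:
        source_text = f.read()
    with open(target_path) as f:
        target_text = f.read()
    with open(output_path, "w") as f:
        f.write(transplant_ids(source_text, target_text))


# Runs only when exactly three paths are given on the command line.
if __name__ == "__main__" and len(sys.argv) == 4:
    main(*sys.argv[1:])
```

Saved under a hypothetical name such as swap_ids.py, it would be invoked as `python swap_ids.py filea.html fileb.html Output.html`. Using an iterator inside a closure removes the need for the global counter.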

Answer 5

Here's how I'd do it using Beautiful Soup:

from bs4 import BeautifulSoup as bs

replacements, replaced_html = [], ''

with open('fileb.html') as fileb:
    # Extract replacements
    soup = bs(fileb, 'html.parser')
    tags = soup.find_all(lambda tag: tag.get('data-tx-text'))
    replacements = [tag.get('data-tx-text') for tag in tags]

with open('filea.html') as filea:
    # Replace values
    soup = bs(filea, 'html.parser')
    tags = soup.find_all(lambda tag: tag.get('data-tx-text'))
    for tag in tags:
        tag['data-tx-text'] = replacements.pop(0)
    replaced_html = str(soup)

with open('filea.html', 'w') as new_filea:
    # Update file
    new_filea.write(replaced_html)
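
The loop above relies on both files holding the same number of data-tx-text attributes: if filea.html has more, `replacements.pop(0)` raises IndexError on an empty list, and if it has fewer, the extra values are silently ignored. A quick pre-check could look like the sketch below. It uses only the standard library's `html.parser` rather than Beautiful Soup, and the names `AttrCollector`, `collect_values` and `paired_values` are hypothetical.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collect every value of one attribute, in document order."""

    def __init__(self, attr_name):
        super().__init__()
        self.attr_name = attr_name
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lower-cased.
        for name, value in attrs:
            if name == self.attr_name and value is not None:
                self.values.append(value)


def collect_values(html_text, attr_name="data-tx-text"):
    collector = AttrCollector(attr_name)
    collector.feed(html_text)
    collector.close()
    return collector.values


def paired_values(html_a, html_b):
    """Return (old, new) pairs, or raise if the two counts differ."""
    values_a = collect_values(html_a)
    values_b = collect_values(html_b)
    if len(values_a) != len(values_b):
        raise ValueError(
            "count mismatch: %d vs %d" % (len(values_a), len(values_b))
        )
    return list(zip(values_a, values_b))
```

Calling `paired_values` on the contents of both files before rewriting anything gives an early, explicit error instead of a partial replacement.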

Compare regex matches from two separate files, and replace the values in one of them

5 answers: