Python for re.match re.sub

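Time: 2015-08-17 06:00:04

Tags: python html csv text match

I'm processing a CSV file. It contains a series of Sources (plain SSL links), Places, Websites (non-SSL links wrapped in <a></a> tags), Direcciones and Emails. When some piece of data is not available, it simply does not appear. Like this: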

httpsgoogledotcom, GooglePlace2, Direcciones, Montain View, Email, googplace@yourplace.com

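Nevertheless, the Website 'a html tag' link always appears twice, followed by several commas. The commas, in turn, are followed sometimes by Direcciones and sometimes by the next Source (https). As a result, if the process is not interrupted at EOF, it can keep 'replacing' for hours and produce an output file with gigabytes of redundant and misplaced information. Let's take four entries as an example of Reutput.csv: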

> httpsgoogledotcom, GooglePlace, Website, "<a> href='httpgoogledotcom'></a>",,,,,,,,,,,,,, 
> "<a href='httpgoogledotcom'></a>",,,,,,,,,,,,, 
> ,,Direcciones, Montain View, Email, googplace@yourplace.com
> httpsbingdotcom, BingPlace, Direcciones, MicroWorld, Email, bing@yourplace.com
> httpsgoogledotcom, GooglePlace, Website, "<a> href='httpgoogledotcom'></a>",,,,,,,,,,,,,, 
> "<a href='httpgoogledotcom'></a>",,,,,,,,,,,,, 
> httpsbingdotcom, BingPlace, Direcciones, MicroWorld, Email, bing@yourplace.com

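So the idea is to remove the unnecessary Website 'a html tag' links and the extra commas, while respecting the newline \n and without falling into a loop. Like this: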

> httpsgoogledotcom, GooglePlace, Website, "<a href='httpgoogledotcom'></a>",Direcciones, Montain View, Email, googplace@yourplace.com 
> httpsbingdotcom, BingPlace, Direcciones,MicroWorld, Email, bing@yourplace.com
> httpsgoogledotcom, GooglePlace, Website, "<a href='httpgoogledotcom'></a>"
> httpsbingdotcom, BingPlace, Direcciones, MicroWorld, Email, bing@yourplace.com

This is the latest version of the code:

with open('Reutput.csv') as reuf, open('Put.csv', 'w') as putuf:
    text = str(reuf.read())
    for lines in text:
        d = re.match('</a>".*D?',text,re.DOTALL)
        if d is not None:
            if not 'https' in d:
                replace = re.sub(d,'</a>",Direc',lines)
        h = re.match('</a>".*?http',text,re.DOTALL|re.MULTILINE)
        if h is not None:
            if not 'Direc' in h:
                replace = re.sub(h,'</a>"\nhttp',lines)
        replace = str(replace)
        putuf.write(replace)

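Now I get a Put.csv file in which the last line is repeated forever. Why does it loop? I have tried several ways of handling this code, but unfortunately I'm still stuck on it. Thanks in advance.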

2 Answers:

Answer 0: (score: 0)

If there is no match, groups will be None. You need to guard against that (or restructure the regex so that it always matches something).

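    groups = re.search('</a>".*?Direc',lines,re.DOTALL)
    if groups is not None:
        # groups is a match object; test the matched text, not the object itself
        if 'https' not in groups.group(0):
            replace = re.sub('</a>".*?Direc','</a>",Direc',lines,flags=re.DOTALL)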

Note the added not None condition and the extra indentation of the lines it governs.
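
To illustrate the guard, here is a minimal sketch. The helper name `direc_span` and the two rows are made up for the example (loosely modeled on the question's data), so treat it as an illustration rather than a drop-in fix. It relies on two documented behaviors of `re`: `re.search` returns `None` when nothing matches, and the matched text is read with `.group(0)`, since the match object is not itself a string.

```python
import re

# Hypothetical rows, loosely modeled on the data in the question.
with_direc = '"<a href=\'x\'></a>",,,\n,,Direcciones, Montain View'
without_direc = 'httpsbingdotcom, BingPlace, Email, bing@yourplace.com'


def direc_span(chunk):
    """Return the text from </a>" up to Direc, or None if it is absent."""
    found = re.search(r'</a>".*?Direc', chunk, re.DOTALL)
    if found is None:
        # Nothing matched: there is no match object to inspect.
        return None
    # group(0) is the whole matched substring.
    return found.group(0)


for row in (with_direc, without_direc):
    span = direc_span(row)
    if span is not None and 'https' not in span:
        print('would collapse:', repr(span))
    else:
        print('left alone')
```

Also worth noting: `re.match` only tries the pattern at the very start of the string, so on a row that begins with a source link a pattern starting with `</a>"` needs `re.search` to be found at all.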

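Answer 1: (score: 0)

In the end I got the code working myself. I'm posting it here in the hope that someone finds it useful. Thanks anyway for the help and for the downvotes!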

import re

# From </a>" up to Direc, but never across the next https source.
to_direc = re.compile(r'</a>"(?:(?!https).)*?Direc', re.DOTALL)
# From </a>" up to the next https source, but never across a Direc field.
to_source = re.compile(r'</a>"(?:(?!Direc).)*?https', re.DOTALL)

with open('Reutput.csv') as reuf, open('Put.csv', 'w') as putuf:
    text = reuf.read()
    # Drop the duplicate link and the extra commas before Direcciones.
    text = to_direc.sub('</a>",Direc', text)
    # Do the same before the next source, ending the record with a newline.
    # This runs on the result of the first pass, so both fixes are kept.
    text = to_source.sub('</a>"\nhttps', text)
    # Write once, after both substitutions have been applied.
    putuf.write(text)
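
The two-pass idea (collapse everything between `</a>"` and `Direc`, then everything between `</a>"` and the next `https`) can be checked in memory, without touching any files. Below is a minimal sketch under a few assumptions: the function name `clean_rows` and the pattern names are hypothetical, the "does not contain" checks are written as negative lookaheads inside the patterns, and the sample is an abridged copy of the rows from the question with fewer commas.

```python
import re

# Span from </a>" to Direc that does not cross an https source.
TO_DIREC = re.compile(r'</a>"(?:(?!https).)*?Direc', re.DOTALL)
# Span from </a>" to the next https source that does not cross Direc.
TO_SOURCE = re.compile(r'</a>"(?:(?!Direc).)*?https', re.DOTALL)


def clean_rows(text):
    """Remove duplicate links and comma runs, keeping one record per line."""
    text = TO_DIREC.sub('</a>",Direc', text)
    return TO_SOURCE.sub('</a>"\nhttps', text)


# Abridged sample: two Google records (one with Direcciones, one without)
# interleaved with two Bing records.
sample = (
    'httpsgoogledotcom, GooglePlace, Website, "<a href=\'httpgoogledotcom\'></a>",,,,\n'
    '"<a href=\'httpgoogledotcom\'></a>",,,\n'
    ',,Direcciones, Montain View, Email, googplace@yourplace.com\n'
    'httpsbingdotcom, BingPlace, Direcciones, MicroWorld, Email, bing@yourplace.com\n'
    'httpsgoogledotcom, GooglePlace, Website, "<a href=\'httpgoogledotcom\'></a>",,,,\n'
    '"<a href=\'httpgoogledotcom\'></a>",,,\n'
    'httpsbingdotcom, BingPlace, Direcciones, MicroWorld, Email, bing@yourplace.com\n'
)

cleaned = clean_rows(sample)
print(cleaned)
```

Each substitution is a single pass over the whole text, so there is no loop that could keep rewriting the same line, and the second pass works on the output of the first.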