Replace a word if it is located within six words of another word

Date: 2017-06-13 04:29:07

Tags: python regex conditional

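I have a text column named 'DESCRIPTION' in a data frame. I need to find every occurrence of the word "tile" or "tiles" that falls within 6 words of the word "roof", and then change "tile/s" to "rooftiles". I need to do the same for "floor" and "tiles" (changing "tile" to "floortiles"). This will help us tell which building trade we are looking at when certain words are used together with other words.

To illustrate what I mean, here is a data sample and my latest (incorrect) attempt: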

s1=pd.Series(["After the storm the roof was damaged and some of the tiles are missing"])
s2=pd.Series(["I dropped the saw and it fell on the floor and damaged some of the tiles"])
s3=pd.Series(["the roof was leaking and when I checked I saw that some of the tiles were cracked"])
df=pd.DataFrame([list(s1), list(s2),  list(s3)],  columns =  ["DESCRIPTION"])
df

The result I am after should look like this (in data frame format):

1.After the storm the roof was damaged and some of the rooftiles are missing      
2.I dropped the saw and it fell on the floor and damaged some of the floortiles
3.the roof was leaking and when I checked I saw that some of the tiles were cracked

Here I tried using a regex pattern to match the word "tiles", but it is completely wrong... Is there a way to do what I want? I am new to Python...

regex=r"(roof)\b\s+([^\s]+\s+){0,6}\b(.*tiles)"
replacedString=re.sub(regex, r"(roof)\b\s+([^\s]+\s+){0,6}\b(.*rooftiles)", df['DESCRIPTION'])

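Update: solution

Thanks for your help! I managed to get it working using Jan's code with a few additions/tweaks. The final working code is below (using my actual code, files and data rather than the example):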

claims_file = pd.read_csv(project_path + claims_filename) # Read input file
claims_file["LOSS_DESCRIPTION"] = claims_file["LOSS_DESCRIPTION"].fillna('NA') #get rid of encoding errors generated because some text was just 'NA' and it was read in as NaN
#create the REGEX    
rx =  re.compile(r'''
        (                      # outer group
            \b(floor|roof)     # floor or roof
            (?:\W+\w+){0,6}\s* # any six "words"
        )
        \b(tiles?)\b           # tile or tiles
        ''', re.VERBOSE)

#create the reverse REGEX
rx2 =  re.compile(r'''
        (                      # outer group
            \b(tiles?)     # tile or tiles
            (?:\W+\w+){0,6}\s* # any six "words"
        )
        \b(floor|roof)\b           # roof or floor
        ''', re.VERBOSE)
#apply it to every row of Loss Description:
claims_file["LOSS_DESCRIPTION"] = claims_file["LOSS_DESCRIPTION"].apply(lambda x: rx.sub(r'\1\2\3', x)) 

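#apply the reverse regex (here \1 = "tile(s) ... ", \2 = tile(s), \3 = floor/roof;
#\3\1\3 prefixes the tile word and keeps the trailing floor/roof, whereas \3\1\2 would overwrite it with "tile(s)"):
claims_file["LOSS_DESCRIPTION"] = claims_file["LOSS_DESCRIPTION"].apply(lambda x: rx2.sub(r'\3\1\3', x)) 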

# Write results into CSV file and check results
claims_file.to_csv(project_path + output_filename, index = False
                       , encoding = 'utf-8')

4 Answers:

Answer 0 (score: 2)

I'll show you a quick and dirty, incomplete implementation. You can certainly make it more robust and useful. Suppose s is one of your descriptions:

s = "I dropped the saw and it fell on the roof and damaged roof " +\
    "and some of the tiles"

Let's first break it into words (tokenize; you can strip punctuation if you want):

import nltk
tokens = nltk.word_tokenize(s)

Now select the tokens of interest and sort them alphabetically, but remember their original positions in s:

my_tokens = sorted((w.lower(), i) for i,w in enumerate(tokens)
                    if w.lower() in ("roof", "tiles"))
#[('roof', 9), ('roof', 12), ('tiles', 17)]

Combine identical tokens and create a dictionary with the tokens as keys and lists of their positions as values. Use a dictionary comprehension:

import itertools
token_dict = {name: [p0 for _, p0 in pos] for name,pos
              in itertools.groupby(my_tokens, key=lambda a:a[0])}
#{'roof': [9, 12], 'tiles': [17]}

Go through the list of tiles positions (if any), see whether there is a roof nearby, and if so, change the word:

for i in token_dict['tiles']:
    for j in token_dict['roof']:
        if abs(i-j) <= 6: 
            tokens[i] = 'rooftiles'

Finally, put the words back together, for example with ' '.join(tokens).
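
The token-position idea in this answer can be wrapped into one function. The following is only a minimal sketch under a few assumptions: it splits on whitespace instead of using nltk, the name relabel_tiles and its parameters are hypothetical, and it handles both "roof" and "floor". As in the loop above, window is a difference between token indices, so six intervening words correspond to window=7.

```python
def relabel_tiles(text, anchors=("roof", "floor"), window=6):
    """Prefix 'tile'/'tiles' with the closest anchor word within `window` tokens."""
    tokens = text.split()  # assumption: plain whitespace tokenization, not nltk
    # Normalise each token (lower case, surrounding punctuation stripped) for comparison.
    normalised = [t.lower().strip(".,;:!?") for t in tokens]
    anchor_positions = [(i, w) for i, w in enumerate(normalised) if w in anchors]

    for i, word in enumerate(normalised):
        if word in ("tile", "tiles"):
            # Collect (distance, anchor) pairs that are close enough.
            near = [(abs(i - j), w) for j, w in anchor_positions if abs(i - j) <= window]
            if near:
                # The smallest distance wins; the original token text is kept after the prefix.
                tokens[i] = min(near)[1] + tokens[i]
    return " ".join(tokens)


print(relabel_tiles("I dropped the saw and it fell on the roof and damaged roof and some of the tiles"))
```

Because the text is rebuilt with " ".join, runs of whitespace in the input are collapsed to single spaces.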

Answer 1 (score: 2)

You can use a solution with a regular expression here:

(                      # outer group
    \b(floor|roof)     # floor or roof
    (?:\W+\w+){1,6}\s* # any six "words"
)
\b(tiles?)\b           # tile or tiles

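See a demo for the regex on regex101.com.

Then simply combine the captured parts, put them back together with rx.sub(), and apply it to every item of the DESCRIPTION column, so that you end up with the following code: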

import pandas as pd, re

s1 = pd.Series(["After the storm the roof was damaged and some of the tiles are missing"])
s2 = pd.Series(["I dropped the saw and it fell on the floor and damaged some of the tiles"])
s3 = pd.Series(["the roof was leaking and when I checked I saw that some of the tiles were cracked"])

df = pd.DataFrame([list(s1), list(s2),  list(s3)],  columns =  ["DESCRIPTION"])

rx = re.compile(r'''
            (                      # outer group
                \b(floor|roof)     # floor or roof
                (?:\W+\w+){1,6}\s* # any six "words"
            )
            \b(tiles?)\b           # tile or tiles
            ''', re.VERBOSE)

# apply it to every row of "DESCRIPTION"
df["DESCRIPTION"] = df["DESCRIPTION"].apply(lambda x: rx.sub(r'\1\2\3', x))
print(df["DESCRIPTION"])

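Note that your original question was not entirely clear: this solution only finds tile or tiles after roof, which means that a sentence like "Can you give me the tile for the roof, please?" will not be matched (even though the word tile is within six words of roof).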
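
If both word orders matter, one option is to run a second, mirrored pattern. The sketch below is an illustration rather than part of the answer: the names AFTER, BEFORE and relabel are hypothetical, and the patterns are a slight variation on the one above (lazy quantifier, explicit separator group, allowing zero intervening words).

```python
import re

# floor/roof first, then up to six words, then tile(s)
AFTER = re.compile(r"\b(floor|roof)\b((?:\W+\w+){0,6}?\W+)(tiles?)\b", re.IGNORECASE)
# tile(s) first, then up to six words, then floor/roof
BEFORE = re.compile(r"\b(tiles?)\b((?:\W+\w+){0,6}?\W+)(floor|roof)\b", re.IGNORECASE)


def relabel(text):
    # \1 = floor/roof, \2 = the words in between, \3 = tile(s)
    text = AFTER.sub(r"\1\2\1\3", text)
    # \1 = tile(s), \2 = the words in between, \3 = floor/roof
    text = BEFORE.sub(r"\3\1\2\3", text)
    return text


print(relabel("After the storm the roof was damaged and some of the tiles are missing"))
print(relabel("Can you give me the tile for the roof, please?"))
```

The second pass cannot touch words already rewritten by the first, because \b does not match between "roof" and "tiles" inside "rooftiles".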

Answer 2 (score: 1)

I could generalize this to more substrings than "roof" and "floor", but this seems to be simpler code:

for idx,r in enumerate(df.loc[:,'DESCRIPTION']):
    if "roof" in r and "tile" in r:
        fill=r[r.find("roof")+4:]
        fill = fill[0:fill.replace(' ','_',7).find(' ')]
        sixWords = fill if fill.find('.') == -1 else ''
        df.loc[idx,'DESCRIPTION'] = r.replace(sixWords,sixWords.replace("tile", "rooftile"))
    elif "floor" in r and "tile" in r:
        fill=r[r.find("floor")+5:]
        fill = fill[0:fill.replace(' ','_',7).find(' ')]
        sixWords = fill if fill.find('.') == -1 else ''
        df.loc[idx,'DESCRIPTION'] = r.replace(sixWords,sixWords.replace("tile", "floortile"))

Note that this also includes a check for a full stop ("."). You can drop that check by removing the sixWords variable and using fill in its place.
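
For the generalization to more substrings mentioned at the top of this answer, one possible route is to build a pattern from a list of anchor words. Note that this sketch swaps in a regex instead of the str.find slicing used above; build_pattern, prefix_tiles and the extra anchor "wall" are made up for illustration.

```python
import re


def build_pattern(anchors, max_words=6):
    # Escape each anchor so it is matched literally, then join into an alternation.
    alternation = "|".join(re.escape(a) for a in anchors)
    return re.compile(
        r"\b(%s)\b((?:\W+\w+){0,%d}?\W+)(tiles?)\b" % (alternation, max_words),
        re.IGNORECASE,
    )


def prefix_tiles(text, anchors, max_words=6):
    # \1 = anchor word, \2 = words in between, \3 = tile(s)
    return build_pattern(anchors, max_words).sub(r"\1\2\1\3", text)


print(prefix_tiles("the wall is bare and two tiles fell off", ["roof", "floor", "wall"]))
```

Like the regex answers, this only looks for tile or tiles after the anchor word, and max_words counts the words allowed in between.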

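Answer 3 (score: 0)

The main problem you have is the .* in front of tiles in your regex. It lets any number of any characters go there and still match. The \b's are unnecessary, since they sit on the boundary between whitespace and non-whitespace anyway. The grouping () was not being used either, so I removed it.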

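r"(roof\s+[^\s]+\s+){0,6}tiles" will only match within 6 "words" (groups of non-whitespace characters) of tiles. To do the replacement, take the last 5 characters off the matched string, add "rooftiles", and then replace the matched string with the updated one. Alternatively, you can group everything except tiles in the regex and then replace that group with itself plus "roof". A plain re.sub with a fixed replacement string will not do for this, because it would replace the entire match from roof to tiles rather than just the word tiles.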
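
Both variants described here can be sketched as follows. This is an illustration under assumptions: the pattern is a reworked version of the one quoted above (roof moved outside the repeated group, lazy quantifier, word boundaries added), and the helper name add_prefix is hypothetical.

```python
import re

text = "After the storm the roof was damaged and some of the tiles are missing"

# Variant 1: cut the last 5 characters ("tiles") off the match and append "rooftiles".
pattern = re.compile(r"\broof\s+(?:\S+\s+){0,6}?tiles\b")


def add_prefix(match):
    whole = match.group(0)
    return whole[:-len("tiles")] + "rooftiles"


via_slicing = pattern.sub(add_prefix, text)

# Variant 2: capture everything except "tiles" and re-emit the group plus "rooftiles".
grouped = re.compile(r"(\broof\s+(?:\S+\s+){0,6}?)tiles\b")
via_group = grouped.sub(r"\1rooftiles", text)

print(via_slicing)
print(via_group)
```

Variant 1 passes a function to re.sub and variant 2 uses a backreference; in both cases the text between roof and tiles is written back unchanged.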