How to extract the common part before a specific symbol and find a specific word

Time: 2013-07-17 13:58:01

Tags: python word

Suppose I have a dictionary:

mydict = {"g18_84pp_2A_MVP1_GoodiesT0-HKJ-DFG_MIX-CMVP1_Y1000-MIX.txt" : 0,
          "g18_84pp_2A_MVP2_GoodiesT0-HKJ-DFG_MIX-CMVP2_Y1000-MIX.txt" : 1,
          "g18_84pp_2A_MVP3_GoodiesT0-HKJ-DFG_MIX-CMVP3_Y1000-MIX.txt" : 2,
          "g18_84pp_2A_MVP4_GoodiesT0-HKJ-DFG_MIX-CMVP4_Y1000-MIX.txt" : 3,
          "g18_84pp_2A_MVP5_GoodiesT0-HKJ-DFG_MIX-CMVP5_Y1000-MIX.txt" : 4,
          "g18_84pp_2A_MVP6_GoodiesT0-HKJ-DFG_MIX-CMVP6_Y1000-MIX.txt" : 5,
          "h18_84pp_3A_MVP1_GoodiesT1-HKJ-DFG-CMVP1_Y1000-FIX.txt" : 6,
          "g18_84pp_2A_MVP7_GoodiesT0-HKJ-DFG_MIX-CMVP7_Y1000-MIX.txt" : 7,
          "h18_84pp_3A_MVP2_GoodiesT1-HKJ-DFG-CMVP2_Y1000-FIX.txt" : 8,
          "h18_84pp_3A_MVP3_GoodiesT1-HKJ-DFG-CMVP3_Y1000-FIX.txt" : 9,
          "p18_84pp_2B_MVP1_GoodiesT2-HKJ-DFG-CMVP3_Y1000-FIX.txt" : 10}
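  1. I want to extract the common part before the first -, for example g18_84pp_2A_MVP_GoodiesT0.

  2. I also want to append MIX after g18_84pp_2A_MVP_GoodiesT0 when the word MIX is found in the first group. Assuming I can sort the keys of mydict into two groups according to whether they contain MIX or FIX, the final output dictionary would be:

OutputNameDict = {"g18_84pp_2A_MVP_GoodiesT0_MIX" : 0,
                  "h18_84pp_3A_MVP_GoodiesT1_FIX" : 1,
                  "p18_84pp_2B_MVP_FIX": 2}

Is there any function I can use to find the common part? How can I pick out the word before or after a specific symbol such as - and look for specific words such as MIX or FIX?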

3 Answers:

Answer 0 (score: 1)

You can use split to get the common part:

s = "g18_84pp_2A_MVP1_GoodiesT0-HKJ-DFG_MIX-CMVP1_Y1000-MIX.txt"
n = s.split('-')[0]

In fact, split gives you a list of the tokens separated by '-', so s.split('-') yields:

['g18_84pp_2A_MVP1_GoodiesT0', 'HKJ', 'DFG_MIX', 'CMVP1_Y1000', 'MIX.txt']

To check whether MIX or FIX is in the string, you can use in:

if 'MIX' in s:
    print("then MIX is in the string s")
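One caveat: in matches anywhere in the string, so a name that has DFG_MIX in the middle but ends in FIX.txt contains both words. A minimal sketch, assuming the word that matters is always the last '-'-separated token before .txt; the helper name ending_of is made up for this illustration:

```python
def ending_of(name):
    """Return the last '-'-separated token of name, without a trailing '.txt'."""
    last = name.rsplit('-', 1)[-1]        # e.g. 'FIX.txt'
    if last.endswith('.txt'):
        last = last[:-len('.txt')]        # e.g. 'FIX'
    return last

s = "h18_84pp_3A_MVP1_GoodiesT1-HKJ-DFG_MIX-CMVP1_Y1000-FIX.txt"
print('MIX' in s)       # True, because of the 'DFG_MIX' token in the middle
print(ending_of(s))     # FIX
```

Comparing ending_of(s) with 'MIX' or 'FIX' avoids the false match from the middle token.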

If you want to get rid of the digits after 'MVP', you can use the re module:

import re
s = 'g18_84pp_2A_MVP1_GoodiesT0'
s = re.sub('MVP[0-9]*','MVP',s)

Here is an example function that returns the list of common parts:

def foo(mydict):
    return [re.sub('MVP[0-9]*', 'MVP', k.split('-')[0]) for k in mydict]
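Putting these pieces together, one possible way to build a numbered output dictionary like the one in the question. This is only a sketch: the function name build_output is invented here, it assumes every key looks like <common part>-...-<MIX or FIX>.txt, and it keeps the GoodiesT* part for every group (the expected output in the question drops it for the p18 entry).

```python
import re

def build_output(names):
    """Map each distinct '<common part>_<ending>' label to a running index."""
    labels = set()
    for name in names:
        tokens = name.split('-')
        common = re.sub(r'MVP[0-9]*', 'MVP', tokens[0])  # drop the digits after MVP
        ending = tokens[-1].split('.')[0]                 # 'MIX.txt' -> 'MIX'
        labels.add(common + '_' + ending)
    # Sort the labels so the numbering is reproducible.
    return {label: i for i, label in enumerate(sorted(labels))}

# Shortened sample data, taken from the question.
sample = {"g18_84pp_2A_MVP1_GoodiesT0-HKJ-DFG_MIX-CMVP1_Y1000-MIX.txt": 0,
          "g18_84pp_2A_MVP2_GoodiesT0-HKJ-DFG_MIX-CMVP2_Y1000-MIX.txt": 1,
          "h18_84pp_3A_MVP1_GoodiesT1-HKJ-DFG-CMVP1_Y1000-FIX.txt": 6,
          "p18_84pp_2B_MVP1_GoodiesT2-HKJ-DFG-CMVP3_Y1000-FIX.txt": 10}
result = build_output(sample)
print(result)
```

For the sample keys, the g18 label maps to 0, the h18 label to 1, and the p18 label to 2.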

Answer 1 (score: 1)

You can use the index() method to find the dash and, knowing its position, take the rest of the string after that point. For example,

mydict = {"g18_84pp_2A_MVP1_GoodiesT0-HKJ-DFG_MIX-CMVP1_Y1000-MIX.txt" : 0,
          "g18_84pp_2A_MVP2_GoodiesT0-HKJ-DFG_MIX-CMVP2_Y1000-MIX.txt" : 1,
          "g18_84pp_2A_MVP3_GoodiesT0-HKJ-DFG_MIX-CMVP3_Y1000-MIX.txt" : 2,
          "g18_84pp_2A_MVP4_GoodiesT0-HKJ-DFG_MIX-CMVP4_Y1000-MIX.txt" : 3,
          "g18_84pp_2A_MVP5_GoodiesT0-HKJ-DFG_MIX-CMVP5_Y1000-MIX.txt" : 4,
          "g18_84pp_2A_MVP6_GoodiesT0-HKJ-DFG_MIX-CMVP6_Y1000-MIX.txt" : 5,
          "g18_84pp_2A_MVP7_GoodiesT0-HKJ-DFG_MIX-CMVP7_Y1000-MIX.txt" : 6,
          "h18_84pp_3A_MVP1_GoodiesT1-HKJ-DFG_MIX-CMVP1_Y1000-FIX.txt" : 7,
          "h18_84pp_3A_MVP2_GoodiesT1-HKJ-DFG_MIX-CMVP2_Y1000-FIX.txt" : 8,
          "h18_84pp_3A_MVP2_GoodiesT1-HKJ-DFG_MIX-CMVP3_Y1000-FIX.txt" : 9}

for value in sorted(mydict):
    index = value.index('-')
    extracted = value[index+1:-4] # Goes past the first occurrence of - and removes .txt from the end
    print(extracted[-3:]) # Find the last 3 letters in the string

This prints the following:

MIX
MIX
MIX
MIX
MIX
MIX
MIX
FIX
FIX
FIX

An if statement can then be used to do what you want.

If you only want to extract the common part:

index = value.index('-')
extracted = value[:index] # Will get g18_84pp_2A_MVP1_GoodiesT0
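Note that index() raises ValueError when the string contains no '-'. If that can happen in your data, str.partition is a possible alternative, since it returns the whole string as the first element when the separator is missing. A short sketch (the name no_dash is just an illustrative value):

```python
value = "g18_84pp_2A_MVP1_GoodiesT0-HKJ-DFG_MIX-CMVP1_Y1000-MIX.txt"
head, sep, tail = value.partition('-')   # splits at the first '-' only
print(head)   # g18_84pp_2A_MVP1_GoodiesT0
print(tail)   # HKJ-DFG_MIX-CMVP1_Y1000-MIX.txt

no_dash = "plainname.txt"
head2, sep2, tail2 = no_dash.partition('-')
print(repr(head2), repr(sep2), repr(tail2))   # 'plainname.txt' '' ''
```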

Then work out which ending to use. If you know the keys of mydict will always end in MIX.txt or FIX.txt, you can do this:

for value in sorted(mydict):
    ending = value[-7:-4]
    index = value.index('-')
    extracted = value[:index]
    print("%s_%s" % (extracted, ending))

which prints:

g18_84pp_2A_MVP1_GoodiesT0_MIX
g18_84pp_2A_MVP2_GoodiesT0_MIX
g18_84pp_2A_MVP3_GoodiesT0_MIX
g18_84pp_2A_MVP4_GoodiesT0_MIX
g18_84pp_2A_MVP5_GoodiesT0_MIX
g18_84pp_2A_MVP6_GoodiesT0_MIX
g18_84pp_2A_MVP7_GoodiesT0_MIX
h18_84pp_3A_MVP1_GoodiesT1_FIX
h18_84pp_3A_MVP2_GoodiesT1_FIX
h18_84pp_3A_MVP2_GoodiesT1_FIX

Then add it to the extracted dictionary.
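That last step might look like the following sketch. The dictionary name extracted_dict is hypothetical, and a shortened mydict is used to keep the example small. Because the MVP number is still part of each label here, every file keeps its own entry, and the original value is simply carried over.

```python
# Shortened sample data in the same format as above.
mydict = {"g18_84pp_2A_MVP1_GoodiesT0-HKJ-DFG_MIX-CMVP1_Y1000-MIX.txt": 0,
          "g18_84pp_2A_MVP2_GoodiesT0-HKJ-DFG_MIX-CMVP2_Y1000-MIX.txt": 1,
          "h18_84pp_3A_MVP1_GoodiesT1-HKJ-DFG_MIX-CMVP1_Y1000-FIX.txt": 7}

extracted_dict = {}
for value in sorted(mydict):
    ending = value[-7:-4]                   # 'MIX' or 'FIX', assuming a '.txt' suffix
    extracted = value[:value.index('-')]    # everything before the first '-'
    extracted_dict["%s_%s" % (extracted, ending)] = mydict[value]

print(extracted_dict)
```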

Answer 2 (score: 0)

Thanks for your answers. My complete code is below. Any suggestions for optimizing it?

import re

mydict = {"g18_84pp_2A_MVP1_GoodiesT0-HKJ-DFG_MIX-CMVP1_Y1000-MIX.txt" : 0,
          "g18_84pp_2A_MVP2_GoodiesT0-HKJ-DFG_MIX-CMVP2_Y1000-MIX.txt" : 1,
          "g18_84pp_2A_MVP3_GoodiesT0-HKJ-DFG_MIX-CMVP3_Y1000-MIX.txt" : 2,
          "g18_84pp_2A_MVP4_GoodiesT0-HKJ-DFG_MIX-CMVP4_Y1000-MIX.txt" : 3,
          "g18_84pp_2A_MVP5_GoodiesT0-HKJ-DFG_MIX-CMVP5_Y1000-MIX.txt" : 4,
          "g18_84pp_2A_MVP6_GoodiesT0-HKJ-DFG_MIX-CMVP6_Y1000-MIX.txt" : 5,    
          "h18_84pp_3A_MVP1_GoodiesT1-HKJ-DFG-CMVP1_Y1000-FIX.txt" : 6,    
          "g18_84pp_2A_MVP7_GoodiesT0-HKJ-DFG_MIX-CMVP7_Y1000-MIX.txt" : 7,
          "h18_84pp_3A_MVP2_GoodiesT1-HKJ-DFG-CMVP2_Y1000-FIX.txt" : 8,
          "h18_84pp_3A_MVP3_GoodiesT1-HKJ-DFG-CMVP3_Y1000-FIX.txt" : 9,
          "p18_84pp_2B_MVP1_GoodiesT2-HKJ-DFG-CMVP3_Y1000-FIX.txt" : 10}

ExtractDict = {}
start = 0
for stringList in sorted(mydict):
    stringList = stringList.split('.')[0]  # drop the .txt extension
    underscore = stringList.split('_')
    Area = re.split('[0-9]+', underscore[3])[0]  # MVP etc.
    CaseNameString = underscore[0] + "_" + underscore[1] + "_" + underscore[2] + "_" + Area  # g18_84pp_2A_MVP etc.
    postfix = stringList.split('-')[4]  # MIX or FIX
    Newstring = CaseNameString + "_" + postfix
    ExtractDict[Newstring] = start
    start += 1
startagain = 0
OutputNameDict = {}
for OutputNameList in sorted(ExtractDict):
    OutputNameDict[OutputNameList] = startagain
    startagain += 1

#OutputNameDict = {'h18_84pp_3A_MVP_FIX': 1, 'p18_84pp_2B_MVP_FIX': 2, 'g18_84pp_2A_MVP_MIX': 0}
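One possible tightening of the code above, offered as a sketch rather than a definitive answer: a single regular expression pulls out the prefix and the ending, a set removes the duplicates, and enumerate does the numbering. The pattern assumes every key has three '_'-separated fields followed by letters and optional digits (such as MVP1) and ends in -<ending>.txt; the names PATTERN and short_name are invented here.

```python
import re

# Group 1: first three '_' fields plus the letters of the fourth (e.g. 'g18_84pp_2A_MVP').
# Group 2: the last '-'-separated token before '.txt' (e.g. 'MIX').
PATTERN = re.compile(r'^((?:[^_]+_){3}[A-Za-z]+)\d*_.*-([^-.]+)\.txt$')

def short_name(key):
    m = PATTERN.match(key)
    if m is None:
        raise ValueError("unexpected key format: %r" % key)
    return m.group(1) + '_' + m.group(2)

# Shortened sample data, taken from the dictionary above.
mydict = {"g18_84pp_2A_MVP1_GoodiesT0-HKJ-DFG_MIX-CMVP1_Y1000-MIX.txt": 0,
          "g18_84pp_2A_MVP2_GoodiesT0-HKJ-DFG_MIX-CMVP2_Y1000-MIX.txt": 1,
          "h18_84pp_3A_MVP1_GoodiesT1-HKJ-DFG-CMVP1_Y1000-FIX.txt": 6,
          "p18_84pp_2B_MVP1_GoodiesT2-HKJ-DFG-CMVP3_Y1000-FIX.txt": 10}

names = sorted(set(short_name(k) for k in mydict))
OutputNameDict = {name: i for i, name in enumerate(names)}
print(OutputNameDict)
```

Raising on an unexpected key is a design choice for this sketch; the original code would instead fail with an IndexError or silently produce an odd label.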