Question

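My .txt file looks like this:

As you can see, it lists several relations between verbs (ignore the numbers). The file has 5,000 lines.

The data is available here, under "Download & Use VerbOcean": http://demo.patrickpantel.com/demos/verbocean/

What I want is a dictionary for each relation, so that we can say, for example: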

similar-to['anger'] = 'energize' 
happens-before['X'] = 'Y'
stronger-than ['A'] = 'B'

and so on.

So far I have only handled the [stronger-than] relation. How should I extend this so that it covers all the other relations as well?

import csv

file = open("C:\\Users\\shide\\Desktop\\Independent study\\data.txt")
counter = 1
stronger = {}
strongerverb = []
secondverb = []
term1 = "[stronger-than]" #Look for stronger-than
for line in file:  # read the file line by line
    words = line.split()  #split sentence
    if term1 in words:  # if [stronger-than] exists in the line, record both verbs
        strongerverb.append(words[0])  # add only first verb
        secondverb.append(words[2])  # add second verb

capacity = len(strongerverb)

index = 0
while index!=capacity:
    line = strongerverb[index]
    for word in line.split():
  #      print(word)
        index = index+1
#print("First verb:",firstverb)
#print("Second verb:",secondverb)
for i in range(len(strongerverb)):
    stronger[strongerverb[i]] = secondverb[i]

#Write a CSV file whose first column contains the verbs that are stronger than the verb in the second column.

with open('output.csv', 'w') as output:
    writer = csv.writer(output, lineterminator='\n')
    for strong, weak in stronger.items():  # key is the stronger verb, value the second verb
        writer.writerow([strong, weak])

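One option is to repeat the same code for every other relation, but I don't think that would be a smart approach. Any ideas? What I want is a dictionary for each relation, so that we can say: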

similar-to['anger'] = 'energize' 
happens-before['X'] = 'Y'
stronger-than ['A'] = 'B'

I'm new to Python and would really appreciate any help.

Answer 1

This can be done with a regular expression:

import re
regexp = re.compile(r'^([^\[\]\s]+)\s*\[([^\[\]\s]+)\]\s*([^\[\]\s]+)\s*.*$', re.MULTILINE)

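^ (at the start): start matching at the beginning of a line.
$ (at the end): the expression must end at the end of a line. With re.MULTILINE, ^ and $ match at every line boundary, not just at the start and end of the whole string.
[^\[\]\s]+: matches one or more characters that are not [, ] or whitespace. A ^ right after the opening bracket negates the character class, so the listed characters are excluded.
We wrap that expression in () to mark it as a group that m.groups() returns. Since we want both verbs and the relation between them, we wrap all three in ().
Between the groups, \s* matches any whitespace and .* matches the rest of the line. Both are discarded because they are not wrapped in ().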
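
To see what each group captures, here is a minimal sketch that applies the same pattern to a single line. The sample line is made up, assuming each line of the file has the form `verb1 [relation] verb2 :: score`; the score shown is invented for illustration.

```python
import re

# Same pattern as above; re.MULTILINE lets ^ and $ match at every line boundary.
regexp = re.compile(r'^([^\[\]\s]+)\s*\[([^\[\]\s]+)\]\s*([^\[\]\s]+)\s*.*$', re.MULTILINE)

# Hypothetical sample line (the number after :: is not real data).
line = "annex [stronger-than] occupy :: 11.6"

m = regexp.match(line)
verb1, relation, verb2 = m.groups()  # the three () groups, in order
print(verb1, relation, verb2)  # annex stronger-than occupy
```

Everything after the second verb is consumed by `.*` and never shows up in `m.groups()`.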

For example:

data = """
invate [happens-before] annex :: ....
annex [similar] invade :: ....
annex [opposite-of] cede :: ....
annex [stronger-than] occupy :: ....
"""
relationships = {}
for m in regexp.finditer(data):
    v1,r,v2 = m.groups()
    relationships.setdefault(r, {})[v1] = v2
print(relationships)

Output:

{'happens-before': {'invate': 'annex'},
 'opposite-of': {'annex': 'cede'},
 'similar': {'annex': 'invade'},
 'stronger-than': {'annex': 'occupy'}}

Then, to get the 'similar' relation for the verb 'annex', use:

relationships['similar']['annex']

which returns: 'invade'
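
One caveat: a plain dict keeps only one value per key, so if a verb appears several times under the same relation, later lines overwrite earlier ones. A possible variant, sketched here under the assumption that you want to keep every second verb, stores lists instead. The helper name `load_relations` is hypothetical and the sample lines are invented.

```python
import re
from collections import defaultdict

# Same pattern as above, applied one line at a time (so no re.MULTILINE needed).
regexp = re.compile(r'^([^\[\]\s]+)\s*\[([^\[\]\s]+)\]\s*([^\[\]\s]+)\s*.*$')

def load_relations(lines):
    """Map relation -> verb -> list of related verbs (hypothetical helper)."""
    relationships = defaultdict(lambda: defaultdict(list))
    for line in lines:
        m = regexp.match(line.strip())
        if m:  # skip blank or malformed lines
            v1, r, v2 = m.groups()
            relationships[r][v1].append(v2)
    return relationships

# Invented sample lines. For the real file, pass an open file object instead:
#     with open('data.txt') as f: relationships = load_relations(f)
sample = [
    "annex [stronger-than] occupy :: ....",
    "annex [stronger-than] invade :: ....",
    "annex [similar] invade :: ....",
]
relationships = load_relations(sample)
print(relationships['stronger-than']['annex'])  # ['occupy', 'invade']
```

Names like similar-to are not valid Python identifiers because of the hyphen, so indexing one outer dictionary by relation name, as in relationships['similar-to'], is the practical way to get a dictionary for each relation.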
