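I have a file with three columns (tab-separated; the first column is the word, the second is the lemma, and the third is the tag). Some lines contain only a period or a comma.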
<doc n=1 id="CMP/94/10">
<head p="80%">
Customs customs tag1
union union tag2
in in tag3
danger danger tag4
of of tag5
the the tag6
</head>
<head p="80%">
New new tag7
restrictions restriction tag8
in in tag3
the the tag6
.
Hi hi tag8
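Suppose the user searches for the lemma "in". I want the frequency of "in" and the frequencies of the lemmas that occur directly before and after "in". So I want the frequencies of "union", "danger", "restriction" and "the" in the whole corpus. The result should be: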
union 1
danger 1
restriction 1
the 2
How can I do this? I tried using lemma_counter = {}, but it doesn't work.
I'm not experienced with Python, so please correct me if I've made any mistakes.
c = open("corpus.vert")
corpus = []
for line in c:
    if not line.startswith("<"):
        corpus.append(line)
lemma = raw_input("Lemma you are looking for: ")
counter = 0
lemmas_before_after = []
for i in range(len(corpus)):
    parsed_line = corpus[i].split("\t")
    if len(parsed_line) > 1:
        if parsed_line[1] == lemma:
            counter += 1 #this counts lemma frequency
            new_list = []
            for j in range(i-1, i+2):
                if j < len(corpus) and j >= 0:
                    parsed_line_with_context = corpus[j].split("\t")
                    found_lemma = parsed_line_with_context[0].replace("\n","")
                    if len(parsed_line_with_context) > 1:
                        if lemma != parsed_line_with_context[1].replace("\n",""):
                            lemmas_before_after.append(found_lemma)
                    else:
                        lemmas_before_after.append(found_lemma)
print "list of lemmas ", lemmas_before_after
lemma_counter = {}
for i in range(len(corpus)):
    for lemma in lemmas_before_after:
        if parsed_line[1] == lemma:
            if lemma in lemma_counter:
                lemma_counter[lemma] += 1
            else:
                lemma_counter[lemma] = 1
print lemma_counter
fA = counter
print "lemma frequency: ", fA
Answer 0 (score: 0)
This should get you 80% of the way there.
# Let's use some useful pieces of the awesome standard library
from collections import namedtuple, Counter

# Define a simple structure to hold the properties of each entry in corpus
CorpusEntry = namedtuple('CorpusEntry', ['word', 'lemma', 'tag'])

# Use a context manager ("with...") to automatically close the file when we no
# longer need it
with open('corpus.vert') as c:
    corpus = []
    for line in c:
        if len(line.strip()) > 1 and not line.startswith('<'):
            # Remove the newline character and split at tabs
            word, lemma, tag = line.strip().split('\t')
            # Put the obtained values in the structure
            entry = CorpusEntry(word, lemma, tag)
            # Put the structure in the corpus list
            corpus.append(entry)

# It's practical to wrap the counting in a function
def get_frequencies(lemma):
    # Create a set of indices at which the lemma occurs in corpus. We use a
    # set because it is more efficient for the next part, checking if some
    # index is in this set
    lemma_indices = set()
    # Loop over corpus without manual indexing; enumerate provides information
    # about the current index and the value (some CorpusEntry added earlier).
    for index, entry in enumerate(corpus):
        if entry.lemma == lemma:
            lemma_indices.add(index)

    # Now that we have the indices at which the lemma occurs, we can loop over
    # corpus again and for each entry check if it is either one before or
    # one after the lemma. If so, add the entry's lemma to a new set.
    related_lemmas = set()
    for index, entry in enumerate(corpus):
        before_lemma = index+1 in lemma_indices
        after_lemma = index-1 in lemma_indices
        if before_lemma or after_lemma:
            related_lemmas.add(entry.lemma)

    # Finally, we need to count the number of occurrences of those related
    # lemmas
    counter = Counter()
    for entry in corpus:
        if entry.lemma in related_lemmas:
            counter[entry.lemma] += 1
    return counter

print get_frequencies('in')
# Counter({'the': 2, 'union': 1, 'restriction': 1, 'danger': 1})
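To try the approach above without a corpus.vert file on disk, here is a minimal sketch that parses the sample lines from the question out of a string. The names SAMPLE, parse_corpus and neighbour_frequencies are illustrative and not part of the code above. The sketch assumes that any line without exactly three tab-separated fields (markup or bare punctuation) can simply be skipped, and it is written so that it runs on both Python 2 and Python 3.

```python
from collections import Counter, namedtuple

CorpusEntry = namedtuple('CorpusEntry', ['word', 'lemma', 'tag'])

# Sample lines from the question, written with real tab separators
# (assumed layout of corpus.vert).
SAMPLE = (
    '<doc n=1 id="CMP/94/10">\n'
    '<head p="80%">\n'
    'Customs\tcustoms\ttag1\n'
    'union\tunion\ttag2\n'
    'in\tin\ttag3\n'
    'danger\tdanger\ttag4\n'
    'of\tof\ttag5\n'
    'the\tthe\ttag6\n'
    '</head>\n'
    '<head p="80%">\n'
    'New\tnew\ttag7\n'
    'restrictions\trestriction\ttag8\n'
    'in\tin\ttag3\n'
    'the\tthe\ttag6\n'
    '.\n'
    'Hi\thi\ttag8\n'
)


def parse_corpus(lines):
    """Hypothetical helper: keep only lines with three tab-separated fields."""
    entries = []
    for line in lines:
        fields = line.strip().split('\t')
        # Skip markup lines and anything that is not word/lemma/tag.
        if line.startswith('<') or len(fields) != 3:
            continue
        entries.append(CorpusEntry(*fields))
    return entries


def neighbour_frequencies(entries, lemma):
    """Count, over all entries, the lemmas found directly next to `lemma`."""
    related = set()
    for index, entry in enumerate(entries):
        if entry.lemma != lemma:
            continue
        if index > 0:
            related.add(entries[index - 1].lemma)
        if index < len(entries) - 1:
            related.add(entries[index + 1].lemma)
    counts = Counter(entry.lemma for entry in entries)
    return {name: counts[name] for name in related}


sample_entries = parse_corpus(SAMPLE.splitlines())
sample_result = neighbour_frequencies(sample_entries, 'in')
print(sample_result)
```

Checking the field count instead of the line length means a line such as "..." is skipped rather than raising a ValueError when it is unpacked into three names.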
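This can be written more concisely (below), and the algorithm can be improved too, although it is still O(n); the point was to keep it understandable.
For those who are interested: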
# Reuses the imports and CorpusEntry defined above
with open('corpus.vert') as c:
    corpus = [CorpusEntry(*line.strip().split('\t')) for line in c
              if len(line.strip()) > 1 and not line.startswith('<')]

def get_frequencies(lemma):
    lemma_indices = {index for index, entry in enumerate(corpus)
                     if entry.lemma == lemma}
    related_lemmas = {entry.lemma for index, entry in enumerate(corpus)
                      if lemma_indices & {index+1, index-1}}
    return Counter(entry.lemma for entry in corpus
                   if entry.lemma in related_lemmas)
Here is a more procedural style, which is about three times faster:
def get_frequencies(lemma):
    counter = Counter()
    related_lemmas = set()
    for index, entry in enumerate(corpus):
        # Count every lemma in a single pass over the corpus
        counter[entry.lemma] += 1
        if entry.lemma == lemma:
            # Remember the neighbours, guarding both ends of the list
            if index > 0:
                related_lemmas.add(corpus[index-1].lemma)
            if index < len(corpus)-1:
                related_lemmas.add(corpus[index+1].lemma)
    # iteritems() is Python 2; use items() on Python 3
    return {lemma: frequency for lemma, frequency in counter.iteritems()
            if lemma in related_lemmas}
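The question also asks for the frequency of the searched lemma itself, which get_frequencies does not report separately. A possible variation is sketched below. It works on a plain list of lemma strings instead of CorpusEntry objects to stay self-contained, and the function name lemma_and_neighbour_frequencies is made up for this example. It also assumes that the searched lemma should be left out of the neighbour counts, as the question's own code does.

```python
from collections import Counter


def lemma_and_neighbour_frequencies(lemmas, target):
    """Return (frequency of target, {neighbouring lemma: frequency})."""
    counts = Counter(lemmas)
    neighbours = set()
    for index, lemma in enumerate(lemmas):
        if lemma != target:
            continue
        if index > 0:
            neighbours.add(lemmas[index - 1])
        if index < len(lemmas) - 1:
            neighbours.add(lemmas[index + 1])
    # Assumption: the target is reported on its own, not as its own neighbour.
    neighbours.discard(target)
    return counts[target], {name: counts[name] for name in neighbours}


# Lemma column of the sample in the question, punctuation line omitted.
sample_lemmas = ['customs', 'union', 'in', 'danger', 'of', 'the',
                 'new', 'restriction', 'in', 'the', 'hi']
target_frequency, neighbour_counts = lemma_and_neighbour_frequencies(
    sample_lemmas, 'in')
print(target_frequency)
print(neighbour_counts)
```

With the real corpus, the list of lemmas could be built as [entry.lemma for entry in corpus].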