Rank words by their proximity to the sentence boundary

Time: 2017-03-05 16:55:42

Tags: python pandas

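My data frame consists of Word (an English word), sentence_ID (the sentence number) and Flag (whether the word is part of a sentence: Flag = 1 means the word lies inside the sentence boundaries, and Flag = 0 means the word is at the edge of a sentence).

I want to rank the words by their distance from the center of the sentence, so that words at the edges of a sentence get rank 1 and the rank increases toward the center. So, for the input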

Word    sentence_ID Flag
A   1   1
B   1   1
C   1   1
D   1   1
E   1   1
A   1   0
F   2   1
G   2   1
H   2   1
I   2   1
A   2   0
J   0   0
k   0   0
M   0   0
C   3   1
D   3   1
E   3   1
A   3   1
F   3   1
G   3   1
H   3   1
I   3   1
A   3   1
J   3   1
G   3   0
H   0   0
I   0   0
L   4   1

the output should be

Word    sentence_ID Flag    Rank
A   1   1   1
B   1   1   2
C   1   1   3
D   1   1   3
E   1   1   2
A   1   0   1
F   2   1   1
G   2   1   2
H   2   1   3
I   2   1   2
A   2   0   1
J   0   0   
k   0   0   
M   0   0   
C   3   1   1
D   3   1   2
E   3   1   3
A   3   1   4
F   3   1   5
G   3   1   6
H   3   1   5
I   3   1   4
A   3   1   3
J   3   1   2
G   3   0   1
H   0   0   
I   0   0   
L   4   1   1

2 answers:

Answer 0: (score: 0)

Try this example:

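# Each entry is [word, rank]; lists are used because tuples cannot be modified in place.
sentence = [["foo", 0], ["bar", 0], ["baz", 0], ["foo", 0], ["bar", 0]]
words = len(sentence)
# Size of the first half, including the middle word when the length is odd.
center = (words + 1) // 2

# First half: ranks count up from 1 at the left edge.
for rank, i in enumerate(range(0, center), 1):
    sentence[i][1] = rank

# Second half: ranks count down to 1 at the right edge.
for i in range(center, words):
    sentence[i][1] = words - i

print(sentence)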
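
The same ranking can also be written in closed form, as one plus the distance from a word to the nearer end of its sentence. The following is a minimal sketch, assuming that this is the intended rule (it matches the ranks in the question's expected output); `boundary_ranks` is a hypothetical helper name, not part of any library.

```python
def boundary_ranks(n):
    """Return ranks for a sentence of n words: 1 at both edges, rising toward the center."""
    # min(i, n - 1 - i) is the distance from position i to the nearer edge.
    return [min(i, n - 1 - i) + 1 for i in range(n)]


print(boundary_ranks(5))  # [1, 2, 3, 2, 1]
print(boundary_ranks(6))  # [1, 2, 3, 3, 2, 1]
```

An even-length sentence has two middle words, so the peak rank appears twice.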

Answer 1: (score: 0)

After six hours of coding, I found a solution:

    import pandas as pd

    # f_Name is the path to the semicolon-separated input file (defined elsewhere).
    df = pd.read_csv(f_Name, sep=";", index_col=False)
    # Length of each sentence, indexed by sentence_ID.
    counts = df.groupby("sentence_ID").size()

    j = 0        # running rank inside the current sentence
    start = 0    # row where the current sentence starts
    length = 0   # number of rows in the current sentence

    for index in range(len(df)):
        sentence_id = df.loc[index, "sentence_ID"]
        if sentence_id == 0:
            continue  # rows outside any sentence keep an empty rank

        # A new sentence starts at row 0 or wherever sentence_ID changes.
        if index == 0 or sentence_id != df.loc[index - 1, "sentence_ID"]:
            j = 0
            start = index
            length = counts.loc[sentence_id]

        offset = index - start
        if offset < (length + 1) // 2:
            j = j + 1  # first half: count up from the left edge
        elif offset > length // 2:
            j = j - 1  # second half: count down to the right edge
        # Otherwise this is the second middle row of an even-length
        # sentence, which repeats the peak rank.

        df.loc[index, "Gloss_Rank_On_Sentense"] = j
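
The row-by-row loop above can probably be replaced by a vectorized computation. The following is only a sketch under stated assumptions: the sample frame is a shortened copy of the question's data, each sentence occupies a contiguous run of rows, rows with sentence_ID 0 get no rank, and the output column is named `Rank` as in the question's expected output. The variable names `run`, `pos` and `size` are illustrative.

```python
import numpy as np
import pandas as pd

# Small sample in the same layout as the question's data frame.
df = pd.DataFrame({
    "Word": list("ABCDEA") + list("FGHIA") + list("JkM"),
    "sentence_ID": [1] * 6 + [2] * 5 + [0] * 3,
    "Flag": [1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0],
})

# Label each contiguous run of equal sentence_ID values,
# so that separate runs of sentence_ID 0 are not merged.
run = (df["sentence_ID"] != df["sentence_ID"].shift()).cumsum()
pos = df.groupby(run).cumcount()                   # 0-based position inside the run
size = df.groupby(run)["Word"].transform("size")   # length of the run

# Rank = 1 + distance to the nearer end of the sentence.
rank = np.minimum(pos, size - 1 - pos) + 1
# Rows with sentence_ID 0 lie outside any sentence and get a missing rank.
df["Rank"] = rank.where(df["sentence_ID"] != 0).astype("Int64")

print(df)
```

Grouping by run rather than by sentence_ID directly is a design choice: it keeps the positions correct even if the same ID value (such as 0) shows up in several separate blocks.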