Question

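I have a list containing a 'key' and a 'paragraph'. Each 'key' is associated with one 'paragraph'.

My goal is to split each paragraph into individual sentences, with every sentence assigned to the 'key' of the paragraph it originally came from. For example: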

(['2925729', 'Patrick came outside and greeted us promptly.'], ['2925729', 'Patrick did not shake our hands nor ask our names. He greeted us promptly and politely, but it seemed routine.'], ['2925728', 'Patrick sucks. He farted politely, but it seemed routine.'])

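So far I have been able to write code that splits the paragraphs into sentences and gets a hit count for each sentence against a dictionary. I would now like to associate the ID with each sentence.

Below is the code that processes the sentences without any 'key'. I have left out steps 1 and 2 to save space: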

dictionary = ['book', 'should have', 'open']

####Step3#####
#Create Blank list to append final output
final_out = []

##Find Matches
for sent in sentences:  # sentences comes from the omitted steps 1 and 2
    final_out.append((sent, sum(sent.count(col) for col in dictionary)))

#####Spit out final distinct output
##Output in dictionary structure
final_out = dict(sorted(set(final_out)))

####Get sentences and rank by max first

import operator
# .items() works on both Python 2 and 3; .iteritems() exists only on Python 2
sorted_final_out = sorted(final_out.items(), key=operator.itemgetter(1), reverse=True)

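The output of this is: (['johny ate the antelope', 80], ['sally has a friend', 20]) and so on. I then pick the top X by magnitude. What I now want to achieve is something like this: (['12222', 'johny ate the antelope', 80], [22332, 'sally has a friend', 20]). So I basically want to make sure that every sentence that is parsed out stays assigned to its 'key'. Sorry, this is complicated, which is why John's earlier solution works for the simpler case.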

Answer 1

from itertools import chain
list(chain(*[[[y[0],z] for z in y[1].split('. ')] for y in x]))

produces

[['2925729', 'Patrick came outside and greeted us promptly.'],
 ['2925729', 'Patrick did not shake our hands nor ask our names'],
 ['2925729', 'He greeted us promptly and politely, but it seemed routine.'],
 ['2925728', 'Patrick sucks'],
 ['2925728', 'He farted politely, but it seemed routine.']]

list(chain(*...)) flattens the nested list generated by [[[y[0],z] for z in y[1].split('. ')] for y in x].

If you want to make the change "in place", you can use
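As a hedged illustration of that flattening step, the sketch below builds the same nested list and flattens it two ways: with `chain.from_iterable`, which avoids unpacking with `*`, and with a single two-level comprehension. The names `nested`, `flat_chain`, and `flat_comp` are made up for this example; `x` is the sample tuple from the question.

```python
from itertools import chain

# x is the sample data from the question.
x = (['2925729', 'Patrick came outside and greeted us promptly.'],
     ['2925729', 'Patrick did not shake our hands nor ask our names. '
                 'He greeted us promptly and politely, but it seemed routine.'],
     ['2925728', 'Patrick sucks. He farted politely, but it seemed routine.'])

# One inner list of [key, sentence] pairs per original paragraph.
nested = [[[key, s] for s in para.split('. ')] for key, para in x]

# Flatten exactly one level, so each [key, sentence] pair stays intact.
flat_chain = list(chain.from_iterable(nested))

# Equivalent result without building the intermediate nested list.
flat_comp = [[key, s] for key, para in x for s in para.split('. ')]

print(len(nested), len(flat_chain))
```

Both forms should give the same five [key, sentence] pairs as the one-liner above; which one reads better is largely a matter of taste.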

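xl = list(x)  # you gave us a tuple, so copy it into a mutable list
for i, item in enumerate(xl):
    # replace the single [key, paragraph] entry at position i
    # with one [key, sentence] entry per sentence
    xl[i:i+1] = [[item[0], s] for s in item[1].split('. ')]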

I am not sure which of the two would be faster or better when the data set is very large.
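To get from [key, sentence] pairs to the [key, sentence, count] triples the question asks for, one possible sketch follows. It assumes the same substring-count scoring as the question's Step 3. The names `hit_count`, `pairs`, `scored`, and `ranked` are illustrative, and the `words` list is a hypothetical stand-in for the question's dictionary, chosen only so that the sample sentences produce some hits.

```python
import operator

# x is the sample data from the question.
x = (['2925729', 'Patrick came outside and greeted us promptly.'],
     ['2925729', 'Patrick did not shake our hands nor ask our names. '
                 'He greeted us promptly and politely, but it seemed routine.'],
     ['2925728', 'Patrick sucks. He farted politely, but it seemed routine.'])

# Hypothetical word list, not the one from the question.
words = ['greeted', 'politely', 'routine']


def hit_count(sentence, words):
    """Sum of substring occurrences of every word in the sentence."""
    return sum(sentence.count(w) for w in words)


# Split each paragraph while carrying its key along.
pairs = [[key, s] for key, para in x for s in para.split('. ')]

# Attach the hit count as a third element.
scored = [[key, s, hit_count(s, words)] for key, s in pairs]

# Rank by count, highest first (index 2 is the count).
ranked = sorted(scored, key=operator.itemgetter(2), reverse=True)

for row in ranked:
    print(row)
```

Because the key travels with each sentence, identical sentences that appear under different keys stay distinct, which a plain sentence-to-count dict would not preserve.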
