How to avoid for loops and iterate over a pandas DataFrame correctly?

Asked: 2017-08-26 13:38:40

Tags: python pandas csv optimization machine-learning

I have this code that I have been trying to optimize.

My dataframe comes from a CSV file with two columns, where the second column contains text.

I have a function summarize(text, n) that takes a text and an integer as input.


To summarize() all the texts, I first loop over my dataframe to build a list of all the texts, and then I iterate again to send them one by one to the summarize() function so I can get the summarized texts. These for loops make my code very, very slow, but I haven't found a way to make it more efficient, and I would greatly appreciate any suggestions.
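For reference, the double-loop pattern described above can be sketched as follows (the column name 'reviewText', the toy data, and the stand-in for summarize are assumptions, since the original code is not shown):

```python
import pandas as pd

# Hypothetical data; the real dataframe is loaded from a CSV file
df = pd.DataFrame({'ASIN': [0, 1],
                   'reviewText': ['First review text.', 'Second review text.']})

# First loop: collect all the texts into a list
texts = []
for i in range(len(df)):
    texts.append(df.loc[i, 'reviewText'])

# Second loop: summarize each text one by one
summaries = []
for text in texts:
    summaries.append(text.upper())  # stand-in for summarize(text, n)
```

Both passes over the data are row-by-row Python loops, which is what makes this slow on large frames.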


Edit: the other two functions are:

from collections import defaultdict
from nltk.tokenize import sent_tokenize, word_tokenize

def summarize(text, n):
    sents = sent_tokenize(text)  # text into tokenized sentences
    # Check that the review has at least as many sentences as the required summary length
    assert n <= len(sents)
    list_sentences = [word_tokenize(s.lower()) for s in sents]  # word-tokenized sentences
    frequency = calculate_freq(list_sentences)  # word frequency across all the sentences
    ranking = defaultdict(int)
    for i, sent in enumerate(list_sentences):
        for w in sent:
            if w in frequency:
                ranking[i] += frequency[w]
    # Call the rank function to get the highest-ranking sentences
    sents_idx = rank(ranking, n)
    # Return the best choices
    return [sents[j] for j in sents_idx]
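The helper functions calculate_freq and rank are not shown; a self-contained sketch with simple stand-ins (the tokenizers and both helpers below are illustrative assumptions, not the originals) behaves like this:

```python
from collections import defaultdict, Counter
from heapq import nlargest

def sent_tokenize(text):
    # Stand-in for nltk.tokenize.sent_tokenize: naive split on periods
    return [s.strip() + '.' for s in text.split('.') if s.strip()]

def word_tokenize(s):
    # Stand-in for nltk.tokenize.word_tokenize: strip periods, split on whitespace
    return s.replace('.', '').split()

def calculate_freq(list_sentences):
    # Assumed helper: word frequency over all tokenized sentences
    return Counter(w for sent in list_sentences for w in sent)

def rank(ranking, n):
    # Assumed helper: indices of the n highest-ranking sentences
    return nlargest(n, ranking, key=ranking.get)

def summarize(text, n):
    sents = sent_tokenize(text)
    assert n <= len(sents)
    list_sentences = [word_tokenize(s.lower()) for s in sents]
    frequency = calculate_freq(list_sentences)
    ranking = defaultdict(int)
    for i, sent in enumerate(list_sentences):
        for w in sent:
            if w in frequency:
                ranking[i] += frequency[w]
    sents_idx = rank(ranking, n)
    return [sents[j] for j in sents_idx]
```

With this sketch, summarize('The dog ran fast. The dog jumped. Cats sleep.', 1) returns the sentence whose words are most frequent overall.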

Input text: The recipes are easy and the dogs love them. I would buy this book again and again. The only problem is that the recipes don't tell you how many treats they make, but I figure that's because you can make them in all different sizes. Great buy!
Output text: I would buy this book again and again.

2 answers:

Answer 0: (score: 1)

Have you tried something like this?

import pandas as pd

# Test data
df = pd.DataFrame({'ASIN': [0,1], 'Summary': ['This is the first text', 'Second text']})

# Example function
def summarize(text, n=5):
    """A very basic summary"""
    return (text[:n] + '..') if len(text) > n else text

# Applying the function to the text
df['Result'] = df['Summary'].map(summarize)

#    ASIN                 Summary   Result
# 0     0  This is the first text  This ..
# 1     1             Second text  Secon..
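If your real summarize needs the integer argument n, Series.apply forwards extra keyword arguments to the function (a sketch reusing the toy summarize from this answer):

```python
import pandas as pd

df = pd.DataFrame({'ASIN': [0, 1],
                   'Summary': ['This is the first text', 'Second text']})

def summarize(text, n=5):
    """A very basic summary"""
    return (text[:n] + '..') if len(text) > n else text

# Series.apply passes extra keyword arguments through to the function
df['Result'] = df['Summary'].apply(summarize, n=10)
```

This keeps a single vectorized-style pass over the column instead of two explicit Python loops.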

Answer 1: (score: 0)

Long story short...

I'll assume you're doing text frequency analysis and that the order of reviewText doesn't matter. If that's the case:

Mega_String = ' '.join(data['reviewText'])

This should concatenate all the strings in the reviewText column into one big string, with each review separated by a space.

You can then pass this result to your function.
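A minimal sketch of this concatenation approach (the column name and toy data are assumptions):

```python
import pandas as pd

# Hypothetical data standing in for the real CSV
data = pd.DataFrame({'reviewText': ['Great book.', 'Dogs love it.']})

# Join every review into one space-separated string in a single pass
Mega_String = ' '.join(data['reviewText'])
```

Note this only makes sense for corpus-level statistics; per-review summaries still need one summarize call per row.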