How to avoid for loops and iterate over a pandas DataFrame correctly?

Asked: 2017-08-26 13:38:40

Tags: python pandas csv optimization machine-learning

I have this code that I have been trying to optimize.

My dataframe comes from a CSV file with two columns, where the second column contains text.

I have a function summarize(text, n) that takes a text and an integer as input.


To summarize() all the texts, I first loop over my dataframe to build a list of all the texts, and then I iterate again to send them one by one to the summarize() function so I can get the summarized texts. These for loops make my code very, very slow, but I haven't found a way to make it more efficient, and I would greatly appreciate any suggestions.
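For reference, the double-loop pattern described above can be sketched as follows (the column name 'reviewText', the toy data, and the stand-in for summarize are assumptions, since the original code is not shown):

```python
import pandas as pd

# Hypothetical data; the real dataframe is loaded from a CSV file
df = pd.DataFrame({'ASIN': [0, 1],
                   'reviewText': ['First review text.', 'Second review text.']})

# First loop: collect all the texts into a list
texts = []
for i in range(len(df)):
    texts.append(df.loc[i, 'reviewText'])

# Second loop: summarize each text one by one
summaries = []
for text in texts:
    summaries.append(text.upper())  # stand-in for summarize(text, n)
```

Both passes over the data are row-by-row Python loops, which is what makes this slow on large frames.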


Edit: the other two functions are:

from collections import defaultdict
from nltk.tokenize import sent_tokenize, word_tokenize

def summarize(text, n):
    sents = sent_tokenize(text)  # text into tokenized sentences
    # Check that the review has at least as many sentences as the required summary length
    assert n <= len(sents)
    list_sentences = [word_tokenize(s.lower()) for s in sents]  # word-tokenized sentences
    frequency = calculate_freq(list_sentences)  # word frequency across all the sentences
    ranking = defaultdict(int)
    for i, sent in enumerate(list_sentences):
        for w in sent:
            if w in frequency:
                ranking[i] += frequency[w]
    # Call the rank function to get the highest-ranking sentences
    sents_idx = rank(ranking, n)
    # Return the best choices
    return [sents[j] for j in sents_idx]
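The helper functions calculate_freq and rank are not shown; a self-contained sketch with simple stand-ins (the tokenizers and both helpers below are illustrative assumptions, not the originals) behaves like this:

```python
from collections import defaultdict, Counter
from heapq import nlargest

def sent_tokenize(text):
    # Stand-in for nltk.tokenize.sent_tokenize: naive split on periods
    return [s.strip() + '.' for s in text.split('.') if s.strip()]

def word_tokenize(s):
    # Stand-in for nltk.tokenize.word_tokenize: strip periods, split on whitespace
    return s.replace('.', '').split()

def calculate_freq(list_sentences):
    # Assumed helper: word frequency over all tokenized sentences
    return Counter(w for sent in list_sentences for w in sent)

def rank(ranking, n):
    # Assumed helper: indices of the n highest-ranking sentences
    return nlargest(n, ranking, key=ranking.get)

def summarize(text, n):
    sents = sent_tokenize(text)
    assert n <= len(sents)
    list_sentences = [word_tokenize(s.lower()) for s in sents]
    frequency = calculate_freq(list_sentences)
    ranking = defaultdict(int)
    for i, sent in enumerate(list_sentences):
        for w in sent:
            if w in frequency:
                ranking[i] += frequency[w]
    sents_idx = rank(ranking, n)
    return [sents[j] for j in sents_idx]
```

With this sketch, summarize('The dog ran fast. The dog jumped. Cats sleep.', 1) returns the sentence whose words are most frequent overall.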

Input text: The recipes are easy and the dogs love them. I would buy this book again and again. The only problem is that the recipes don't tell you how many treats they make, but I figure that's because you can make them in all different sizes. Great buy!
Output text: I would buy this book again and again.

2 answers:

Answer 0: (score: 1)

Have you tried something like this?

import pandas as pd

# Test data
df = pd.DataFrame({'ASIN': [0,1], 'Summary': ['This is the first text', 'Second text']})

# Example function
def summarize(text, n=5):
    """A very basic summary"""
    return (text[:n] + '..') if len(text) > n else text

# Applying the function to the text
df['Result'] = df['Summary'].map(summarize)

#    ASIN                 Summary   Result
# 0     0  This is the first text  This ..
# 1     1             Second text  Secon..
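If your real summarize needs the integer argument n, Series.apply forwards extra keyword arguments to the function (a sketch reusing the toy summarize from this answer):

```python
import pandas as pd

df = pd.DataFrame({'ASIN': [0, 1],
                   'Summary': ['This is the first text', 'Second text']})

def summarize(text, n=5):
    """A very basic summary"""
    return (text[:n] + '..') if len(text) > n else text

# Series.apply passes extra keyword arguments through to the function
df['Result'] = df['Summary'].apply(summarize, n=10)
```

This keeps a single vectorized-style pass over the column instead of two explicit Python loops.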

Answer 1: (score: 0)

Long story short...

I'll assume you're doing text frequency analysis and that the order of reviewText doesn't matter. If that's the case:

Mega_String = ' '.join(data['reviewText'])

This should concatenate all the strings in the reviewText column into one big string, with each review separated by a space.

You can then pass this result to your function.
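A minimal sketch of this concatenation approach (the column name and toy data are assumptions):

```python
import pandas as pd

# Hypothetical data standing in for the real CSV
data = pd.DataFrame({'reviewText': ['Great book.', 'Dogs love it.']})

# Join every review into one space-separated string in a single pass
Mega_String = ' '.join(data['reviewText'])
```

Note this only makes sense for corpus-level statistics; per-review summaries still need one summarize call per row.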