Removing a for loop for find and replace of text in python pandas

Date: 2017-05-21 15:04:16

Tags: python performance pandas for-loop vectorization

I have 2 pandas dataframes, and I want to do a find and replace between them. In each row of the current_title column of the df_find dataframe, I want to search for any of the values from the 'keyword' column of the df_replace dataframe and, if found, replace it with the corresponding value from the 'keywordLength' column.

I have already been able to get rid of the loop over the df_find dataframe: instead of iterating over each of its rows, I use the vectorized form of the str.replace function.

Performance matters in my case because both dataframes run into GBs. So I would like to get rid of the loop over df_replace as well and find some other efficient way of applying all of its rows:

import pandas as pd

df_find = pd.read_csv("input_find.csv")
df_replace = pd.read_csv("input_replace.csv")

# replace each keyword with its keywordLength token
for i, j in zip(df_replace.keyword, df_replace.keywordLength):
    df_find.current_title = df_find.current_title.str.replace(i, j, case=False)
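As an aside (this sketch is not from the original post): one common way to drop the loop over df_replace is to build a single alternation regex from every keyword and let str.replace make one case-insensitive pass over current_title. It assumes the same column and file names as above and a pandas version recent enough to accept a callable replacement with regex=True:

import re
import pandas as pd

df_find = pd.read_csv("input_find.csv")
df_replace = pd.read_csv("input_replace.csv")

# Map each lower-cased keyword to its replacement token.
mapping = dict(zip(df_replace.keyword.str.lower(), df_replace.keywordLength))

# One pattern that matches any of the keywords, case-insensitively.
pattern = re.compile("|".join(re.escape(k) for k in df_replace.keyword), flags=re.IGNORECASE)

# Single vectorized pass: the callable looks up the replacement for each match.
df_find.current_title = df_find.current_title.str.replace(
    pattern, lambda m: mapping[m.group(0).lower()], regex=True
)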

df_replace: this dataframe has the data needed for the find and replace

keyword       keywordLength
IT Manager    ##10##
Sales Manager ##13##
IT Analyst    ##12##
Store Manager ##13##

df_find is where we need to make the transformation.

Before executing the find and replace code:

current_title
I have been working here as a store manager since after I passed from college
I am sales manager and primarily work in the ASEAN region. My primary rolw is to bring new customers.
I initially joined as a IT analyst and because of my sheer drive and dedication, I was promoted to IT manager position within 3 years

After executing the find and replace with the code above:

current_title
I have been working here as a ##13## since after I passed from college
I am ##13## and primarily work in the ASEAN region. My primary rolw is to bring new customers.
I initially joined as a ##12## and because of my sheer drive and dedication, I was promoted to ##10## position within 3 years


I would be eternally grateful! Thanks.

1 answer:

Answer 0 (score: 1)

If I understand you correctly, you should be able to do a relatively simple merge of the datasets (plus a couple of extra lines) and get the result you want.

Not having your datasets, I just made my own. The code below could probably be a bit more elegant, but it gets you there in four lines and, most importantly, with no loops:

Setup:

import pandas as pd

# Stand-in data, since the original datasets were not posted.
df_find = pd.DataFrame({
            'current_title':['a','a','b','c','b','c','b','a'],
            'other':['this','is','just','a','bunch','of','random','words']
        })

df_replace = pd.DataFrame({'keyword':['a','c'], 'keywordlength':['x','z']})

Code:

# This line is to simply re-sort at the end of the code.  Someone with more experience can probably bypass this step.
df_find['idx'] = df_find.index

# Merge together the two data sets based on matching the "current_title" and the "keyword"
dfx = df_find.merge(df_replace, left_on = 'current_title', right_on = 'keyword', how = 'outer').drop('keyword', axis=1)

# Now, copy the non-null "keywordlength" values to "current_title"
dfx.loc[dfx['keywordlength'].notnull(), 'current_title'] = dfx.loc[dfx['keywordlength'].notnull(), 'keywordlength']

# Clean up by dropping the unnecessary columns and resort based on the first line above.
df_find = dfx.sort_values('idx').drop(['keywordlength','idx'], axis=1)

Output:

  current_title   other
0             x    this
1             x      is
3             b    just
6             z       a
4             b   bunch
7             z      of
5             b  random
2             x   words
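A possible simplification (not part of the original answer): because the merge matches current_title against keyword exactly, the same result can be obtained with a plain dictionary lookup via Series.map, which avoids both the helper idx column and the re-sort. A minimal sketch, assuming the same toy frames as in the Setup section:

# Build a keyword -> replacement lookup table.
lookup = dict(zip(df_replace['keyword'], df_replace['keywordlength']))

# Replace exact matches; titles without a match keep their original value.
df_find['current_title'] = df_find['current_title'].map(lookup).fillna(df_find['current_title'])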