Pandas inner join performance issue

Time: 2016-08-27 07:17:55

Tags: python performance python-2.7 pandas dataframe

I have two CSV files that I load into pandas DataFrames. One file is large, about 10M rows and 20 columns (all string type) and roughly 1 GB in size; the other is small, about 5k rows, 5 columns and roughly 1 MB. I want to do an inner join on a single common column between the two DataFrames.

Here is how I do the join:

mergedDataSet = pd.merge(smallDataFrame, largeDataFrame, on='uid', how='inner')
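For context, a minimal sketch of how the two files might be loaded before the merge (the file names here are hypothetical, since the real paths are not shown; dtype=str reflects the all-string columns described above):

import pandas as pd

# Hypothetical file names -- the real paths are not given in the question.
largeDataFrame = pd.read_csv('large_file.csv', dtype=str)   # ~10M rows, 20 string columns
smallDataFrame = pd.read_csv('small_file.csv', dtype=str)   # ~5k rows, 5 columns

mergedDataSet = pd.merge(smallDataFrame, largeDataFrame, on='uid', how='inner')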

If I sample 1% of the large dataset, the program runs smoothly without any problems and finishes within 5 seconds. I tried this, so I have verified that the join itself works with my code.
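The 1% sampling check was presumably done with something like DataFrame.sample; a rough sketch (the exact call used is an assumption):

# Take ~1% of the large dataset and run the same join on it.
sampledLarge = largeDataFrame.sample(frac=0.01, random_state=0)
sampledResult = pd.merge(smallDataFrame, sampledLarge, on='uid', how='inner')
print(len(sampledResult))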

But if I join the real large dataset, the program is killed after about 20-30 seconds with the error message Process finished with exit code 137 (interrupted by signal 9: SIGKILL). I am using Python 2.7 with miniconda on Mac OS X, running from PyCharm. My machine has 16 GB of RAM, far more than the 1 GB file size.
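One way to see how much RAM the DataFrames actually occupy (object/string columns usually take several times more memory than the CSV size on disk) is memory_usage(deep=True); a quick check, as a sketch:

# In-memory size can be much larger than the 1 GB CSV when all columns are strings.
print(largeDataFrame.memory_usage(deep=True).sum() / 1024 ** 3, "GB")
print(smallDataFrame.memory_usage(deep=True).sum() / 1024 ** 3, "GB")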

I would like to know if there are any ideas for tuning the performance of DataFrame joins in pandas, or any other fast solution for the inner join?

My other confusion is why the program was killed: by what, and for what reason?

Edit 1: errors captured in /var/log/system.log while performing the inner join:

Aug 27 11:00:18 foo-laptop com.apple.CDScheduler[702]: Thermal pressure state: 1 Memory pressure state: 0
Aug 27 11:00:18 foo-laptop com.apple.CDScheduler[47]: Thermal pressure state: 1 Memory pressure state: 0
Aug 27 11:00:33 foo-laptop iTerm2[43018]: Time to encode state for window <PseudoTerminal: 0x7fb3659d3960 tabs=1 window=<PTYWindow: 0x7fb3637c0c80 frame=NSRect: {{0, 0}, {1280, 800}} title=5. tail alpha=1.000000 isMain=1 isKey=1 isVisible=1 delegate=0x7fb3659d3960>>: 0.02136099338531494
Aug 27 11:00:41 foo-laptop iTerm2[43018]: Time to encode state for window <PseudoTerminal: 0x7fb3659d3960 tabs=1 window=<PTYWindow: 0x7fb3637c0c80 frame=NSRect: {{0, 0}, {1280, 800}} title=5. tail alpha=1.000000 isMain=0 isKey=0 isVisible=1 delegate=0x7fb3659d3960>>: 0.01138699054718018
Aug 27 11:00:46 foo-laptop kernel[0]: low swap: killing pid 92118 (python2.7)
Aug 27 11:00:46 foo-laptop kernel[0]: memorystatus_thread: idle exiting pid 789 [CallHistoryPlugi]
Aug 27 11:00:56 foo-laptop iTerm2[43018]: Time to encode state for window <PseudoTerminal: 0x7fb3659d3960 tabs=1 window=<PTYWindow: 0x7fb3637c0c80 frame=NSRect: {{0, 0}, {1280, 800}} title=5. tail alpha=1.000000 isMain=0 isKey=0 isVisible=1 delegate=0x7fb3659d3960>>: 0.01823097467422485
Aug 27 11:00:58 foo-laptop kernel[0]: process WeChat[85077] caught causing excessive wakeups. Observed wakeups rate (per sec): 184; Maximum permitted wakeups rate (per sec): 150; Observation period: 300 seconds; Task lifetime number of wakeups: 2193951
Aug 27 11:00:58 foo-laptop com.apple.xpc.launchd[1] (com.apple.ReportCrash[92123]): Endpoint has been activated through legacy launch(3) APIs. Please switch to XPC or bootstrap_check_in(): com.apple.ReportCrash
Aug 27 11:00:58 foo-laptop ReportCrash[92123]: Invoking spindump for pid=85077 wakeups_rate=184 duration=245 because of excessive wakeups
Aug 27 11:01:03 foo-laptop com.apple.CDScheduler[702]: Thermal pressure state: 0 Memory pressure state: 0
Aug 27 11:01:03 foo-laptop com.apple.CDScheduler[47]: Thermal pressure state: 0 Memory pressure state: 0

Regards, Lin

1 Answer:

Answer 0 (score: 3)

Check the cardinality of the 'uid' column on both sides. Your join is most likely multiplying your data many times over. For example, if a uid value appears in 100 records of dataframe1 and in 10 records of dataframe2, your join will produce 1000 records for that value.
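A small toy illustration of how duplicated keys multiply the join output (the data here is made up purely for illustration):

import pandas as pd

left = pd.DataFrame({'uid': [1, 1, 1], 'a': ['x', 'y', 'z']})   # uid=1 appears 3 times
right = pd.DataFrame({'uid': [1, 1], 'b': ['p', 'q']})          # uid=1 appears 2 times
merged = pd.merge(left, right, on='uid', how='inner')
print(len(merged))  # 3 * 2 = 6 rows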

To check the cardinality, I would do the following:

df1[df1.uid.isin(df2.uid.unique())]['uid'].value_counts()
df2[df2.uid.isin(df1.uid.unique())]['uid'].value_counts()

This code will check which 'uid' values exist in the other frame and which of them are duplicated.
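Building on those counts, one way to estimate how many rows the inner join would produce before actually running it (a sketch, assuming the per-uid counts simply multiply):

# For each shared uid, the inner join produces (count in df1) * (count in df2) rows.
counts1 = df1['uid'].value_counts()
counts2 = df2['uid'].value_counts()
common = counts1.index.intersection(counts2.index)
print((counts1[common] * counts2[common]).sum())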