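Speed issues with pandas and list comprehension

Time: 2019-01-10 02:41:38

Tags: python pandas list-comprehension

I have a dataset with 4m rows of data. I split it into chunks using pd.read_csv(chunk size ...) and then run some simple data-cleaning code to get it into the format I need.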

tqdm.pandas()
print("Merging addresses...")

df_adds = chunk.progress_apply(merge_addresses, axis = 1)

[(chunk.append(df_adds[idx][0], ignore_index=True),chunk.append(df_adds[idx][1], \
ignore_index=True)) for idx in tqdm(range(len(chunk))) \
if pd.notnull(df_adds[idx][0]['street_address'])]


def merge_addresses(row):
    row2 = pd.Series(
            {'Org_ID' : row.Org_ID,
            'org_name': row.org_name,
            'street_address': row.street_address2})
    row3 = pd.Series(
            {'Org_ID' : row.Org_ID,
            'org_name': row.org_name,
            'street_address': row.street_address3})
    return row2, row3

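I am using tqdm to profile the speed of the two operations. The first, the pandas apply function, runs fine at about 1.5k it/s. The second, the list comprehension, starts at about 2k it/s and then quickly drops to 200 it/s. Can anyone help explain how I can speed this up?

My goal is to take street_address 2 and 3 and, wherever they are not null, merge and copy them into the street_address1 column, duplicating org_id and org_name as needed.

Update

I tried catching all the NaNs in merge_addresses and replacing them with strings. My aim is to put address2 and address3 into their own rows, in the same column as address1, along with org_name and org_id (so those two fields will be duplicated). So there could be up to three rows for the same org_id, each with a different address.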

df_adds = chunk.progress_apply(merge_addresses, axis = 1)

[(chunk.append(x[0]), chunk.append(x[1])) for x in tqdm(df_adds) if (pd.notnull(x[0][3]),pd.notnull(x[0][3]))]

def merge_addresses(row):
    if pd.isnull(row.street_address2):
        row.street_address2 = ''
    if pd.isnull(row.street_address3):
        row.street_address3 = ''
    return ([row.Org_ID, row.pub_name_adj, row.org_name, row.street_address2], [row.Org_ID, row.pub_name_adj, row.org_name, row.street_address3])

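I get the error '<' not supported between instances of 'str' and 'int', sort order is undefined for incomparable objects result = result.union(other)

With tqdm, the list comprehension appears to work, but it is painfully slow (24 it/s).

Update

To clarify, the data is currently in this format: [image]

My goal is to get it into the following format:

[image]

I have played around with different chunk sizes:

20k rows = 70 it/s, 100k rows = 35 it/s, 200k rows = 31 it/s

It seems that the best size for the tradeoff is 200k rows.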

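2 answers:

Answer 0: (score: 2)

Calling DataFrame.append too often can be expensive (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html):

Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.

If you can, use pd.concat for a faster implementation.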
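The advice quoted above can be sketched roughly as follows. This is only an illustration, not the asker's actual pipeline: the sample data is made up, and the column names are borrowed from the question. New rows are gathered in a plain Python list and joined to the chunk with a single pd.concat call, rather than by calling DataFrame.append once per row (a method that was deprecated in pandas 1.4 and removed in pandas 2.0).

```python
import pandas as pd

# Hypothetical sample chunk; column names follow the question.
chunk = pd.DataFrame({
    "Org_ID": [1, 2, 3],
    "org_name": ["a", "b", "c"],
    "street_address": ["1 Main St", "2 Oak Ave", "3 Elm Rd"],
    "street_address2": ["Suite 5", None, "Floor 2"],
})

# Collect the extra rows in a plain Python list (list.append is cheap).
new_rows = []
for row in chunk.itertuples(index=False):
    if pd.notnull(row.street_address2):
        new_rows.append({
            "Org_ID": row.Org_ID,
            "org_name": row.org_name,
            "street_address": row.street_address2,
        })

# Build one DataFrame from the list and concatenate a single time.
cols = ["Org_ID", "org_name", "street_address"]
extra = pd.DataFrame(new_rows, columns=cols)
result = pd.concat([chunk[cols], extra], ignore_index=True)
```

Each DataFrame.append call copies the whole frame, so appending inside a loop gets slower as the frame grows, which would be consistent with the falling it/s rate reported in the question.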

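Answer 1: (score: 1)

As the comments confirmed, the bottleneck here is caused by creating and feeding too many objects, which take up too much memory. In addition, creating objects costs memory allocation time and slows things down.

Demonstrated on a 100k-row dataset: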

import pandas as pd

# create sample dataframe
s = []
for i in range(100000):
    s.append(tuple(['name%d' %i, 'a%d' %i, 'b%d' %i]))
labels = ['name', 'addr1', 'addr2']
df = pd.DataFrame(s, columns=labels)

# addr1, addr2 to addr
s = []
for k in ['addr1', 'addr2']:
    # filter() silently ignores labels that are missing, such as 'id' here
    s.append(df.filter(['id', 'name', k]).rename(columns={k:'addr'}))
result = pd.concat(s)

df.append is much slower than the built-in append of a list. This example finishes within a few seconds.
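Adapted to the column names in the question, the same idea might look like the sketch below. The helper name stack_addresses and the sample data are hypothetical, and it assumes the first address column is called street_address (as in the question's merge_addresses) and that missing addresses are NaN/None rather than empty strings.

```python
import pandas as pd

def stack_addresses(chunk,
                    id_cols=("Org_ID", "org_name"),
                    addr_cols=("street_address", "street_address2", "street_address3")):
    """Stack several address columns into one, repeating the id columns."""
    # One slice per address column, each renamed to the common target name.
    parts = [
        chunk[list(id_cols) + [col]].rename(columns={col: "street_address"})
        for col in addr_cols
    ]
    # A single concat, then drop the rows whose address is missing.
    stacked = pd.concat(parts, ignore_index=True)
    return stacked.dropna(subset=["street_address"]).reset_index(drop=True)

# Hypothetical two-row chunk for illustration.
chunk = pd.DataFrame({
    "Org_ID": [1, 2],
    "org_name": ["a", "b"],
    "street_address": ["1 Main St", "2 Oak Ave"],
    "street_address2": ["Suite 5", None],
    "street_address3": [None, None],
})
stacked = stack_addresses(chunk)
```

The function could be applied to each chunk yielded by pd.read_csv with chunksize set, collecting the per-chunk results in a list and concatenating them once at the end.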