I have a dataset with 4m rows of data. I use pd.read_csv(chunksize=...) to split it into chunks and then run some simple data-cleaning code to get it into the format I need.
    tqdm.pandas()
    print("Merging addresses...")
    df_adds = chunk.progress_apply(merge_addresses, axis=1)
    [(chunk.append(df_adds[idx][0], ignore_index=True),
      chunk.append(df_adds[idx][1], ignore_index=True))
     for idx in tqdm(range(len(chunk)))
     if pd.notnull(df_adds[idx][0]['street_address'])]

    def merge_addresses(row):
        row2 = pd.Series(
            {'Org_ID': row.Org_ID,
             'org_name': row.org_name,
             'street_address': row.street_address2})
        row3 = pd.Series(
            {'Org_ID': row.Org_ID,
             'org_name': row.org_name,
             'street_address': row.street_address3})
        return row2, row3
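I am using tqdm to measure the speed of the two operations. The first, the pandas apply function, runs fine at about 1.5k it/s. The second, the list comprehension, starts at about 2k it/s and then quickly drops to 200 it/s. Can anyone help explain how I can improve the speed?
My goal is to take street_address 2 and 3 and, wherever they are not null, merge and copy them into the street_address1 column, duplicating org_id and org_name as needed.
Update
I tried catching all NaNs in merge_addresses and replacing them with strings. My aim is to put address2 and address3 into their own rows in the same column as address1, together with org_name and org_id (so those two fields will be duplicated). The same org_id may therefore have three rows, each with a different address.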
    df_adds = chunk.progress_apply(merge_addresses, axis=1)
    [(chunk.append(x[0]), chunk.append(x[1])) for x in tqdm(df_adds) if (pd.notnull(x[0][3]), pd.notnull(x[0][3]))]

    def merge_addresses(row):
        if pd.isnull(row.street_address2):
            row.street_address2 = ''
        if pd.isnull(row.street_address3):
            row.street_address3 = ''
        return ([row.Org_ID, row.pub_name_adj, row.org_name, row.street_address2],
                [row.Org_ID, row.pub_name_adj, row.org_name, row.street_address3])
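I get the error '<' not supported between instances of 'str' and 'int', sort order is undefined for incomparable objects
    result = result.union(other)
With tqdm, the list comprehension appears to work, but it is slow (24 it/s).
Update
My goal is to get it to the following:
I experimented with different chunk sizes:
20k rows = 70 it/s, 100k rows = 35 it/s, 200k rows = 31 it/s
The best size for the trade-off seems to be 200k rows.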
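Answer 0 (score: 2)
Calling DataFrame.append too frequently can be expensive (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html):
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
If you can, use pd.concat for a faster implementation.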
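The advice quoted from the documentation can be sketched roughly as follows. This is a minimal illustration, not the asker's actual pipeline: the sample data is made up, and the column names are taken from the question. New rows are collected in a plain Python list and joined to the frame with a single pd.concat call.

```python
import pandas as pd

# Hypothetical sample chunk; column names follow the question.
chunk = pd.DataFrame({
    "Org_ID": [1, 2],
    "org_name": ["A", "B"],
    "street_address": ["1 Main St", "2 High St"],
    "street_address2": ["Suite 5", None],
    "street_address3": [None, "Unit 9"],
})

extra_rows = []  # plain list: appending to it is cheap
for row in chunk.itertuples(index=False):
    for addr in (row.street_address2, row.street_address3):
        if pd.notnull(addr):
            extra_rows.append({
                "Org_ID": row.Org_ID,
                "org_name": row.org_name,
                "street_address": addr,
            })

# One concatenation at the end instead of one DataFrame.append per row.
result = pd.concat(
    [chunk[["Org_ID", "org_name", "street_address"]], pd.DataFrame(extra_rows)],
    ignore_index=True,
)
print(result)
```

Because DataFrame.append was removed in pandas 2.0, a pattern like this also has the advantage of running on current pandas versions.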
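Answer 1 (score: 1)
As the comments indicate, the bottleneck here comes from creating and passing around too many objects, which consume too much memory. Creating the objects also costs memory-allocation time, which slows things down.
Demonstrated on a dataset of 100k rows: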
    import pandas as pd

    # create sample dataframe
    s = []
    for i in range(100000):
        s.append(('name%d' % i, 'a%d' % i, 'b%d' % i))
    labels = ['name', 'addr1', 'addr2']
    df = pd.DataFrame(s, columns=labels)

    # addr1, addr2 to addr
    # (filter() ignores labels that are not present, so 'id' is skipped here)
    s = []
    for k in ['addr1', 'addr2']:
        s.append(df.filter(['id', 'name', k]).rename(columns={k: 'addr'}))
    result = pd.concat(s)
df.append is much slower than a list's built-in append. This example finishes in a few seconds.
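Applied to the columns named in the question, the same column-wise idea might look like the sketch below. The sample data is invented, and dropping null addresses with dropna is an assumption based on the asker's stated goal of keeping only non-null addresses.

```python
import pandas as pd

# Hypothetical sample chunk; column names follow the question.
chunk = pd.DataFrame({
    "Org_ID": [1, 2, 3],
    "org_name": ["A", "B", "C"],
    "street_address": ["1 Main St", "2 High St", "3 Low Rd"],
    "street_address2": ["Suite 5", None, None],
    "street_address3": [None, "Unit 9", None],
})

keys = ["Org_ID", "org_name"]
parts = []
for col in ["street_address", "street_address2", "street_address3"]:
    # Select the key columns plus one address column, under a common name.
    part = chunk[keys + [col]].rename(columns={col: "street_address"})
    # Keep only the rows that actually have an address in this column.
    parts.append(part.dropna(subset=["street_address"]))

# One concat over three small frames; no per-row Python objects are created.
result = pd.concat(parts, ignore_index=True)
print(result)
```

Each chunk returned by pd.read_csv could be processed this way and the per-chunk results concatenated once at the end.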