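I am trying to scrape all the data from the table on this website (https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s), but I can't figure out how to scrape all of the subsequent pages. This is the code that scrapes the first page of results into a CSV file: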
apply plugin: "com.android.application"

import com.android.build.OutputFile

project.ext.react = [
    entryFile: "index.js"
]

apply from: "../../node_modules/react-native/react.gradle"
apply from: "../../node_modules/react-native-code-push/android/codepush.gradle"

/**
 * Set this to true to create two separate APKs instead of one:
 *   - An APK that only works on ARM devices
 *   - An APK that only works on x86 devices
 * The advantage is the size of the APK is reduced by about 4MB.
 * Upload all the APKs to the Play Store and people will download
 * the correct one based on the CPU architecture of their device.
 */
def enableSeparateBuildPerCPUArchitecture = false

/**
 * Run Proguard to shrink the Java bytecode in release builds.
 */
def enableProguardInReleaseBuilds = false

android {
    compileSdkVersion 26
    buildToolsVersion "26.0.2"

    defaultConfig {
        applicationId "com.gauge"
        minSdkVersion 23
        targetSdkVersion 26
        multiDexEnabled true
        versionCode 25
        versionName "1.2.10"
        ndk {
            abiFilters "armeabi-v7a", "x86"
        }
    }
    signingConfigs {
        release {
            if (project.hasProperty('MYAPP_RELEASE_STORE_FILE')) {
                storeFile file(MYAPP_RELEASE_STORE_FILE)
                storePassword MYAPP_RELEASE_STORE_PASSWORD
                keyAlias MYAPP_RELEASE_KEY_ALIAS
                keyPassword MYAPP_RELEASE_KEY_PASSWORD
            }
        }
    }
    splits {
        abi {
            reset()
            enable enableSeparateBuildPerCPUArchitecture
            universalApk false // If true, also generate a universal APK
            include "armeabi-v7a", "x86"
        }
    }
    buildTypes {
        debug {
            // Note: CodePush updates should not be tested in Debug mode as they are overridden by the RN packager. However, because CodePush checks for updates in all modes, we must supply a key.
            buildConfigField "String", "CODEPUSH_KEY", '""'
        }
        releaseStaging {
            minifyEnabled enableProguardInReleaseBuilds
            signingConfig signingConfigs.release
            buildConfigField "String", "CODEPUSH_KEY", '"1psOppiGxP0-cJpCePhMqgEjeO4l2533309f-9929-415c-8999-d7fda42c3857"'
        }
        release {
            minifyEnabled enableProguardInReleaseBuilds
            proguardFiles getDefaultProguardFile("proguard-android.txt"), "proguard-rules.pro"
            signingConfig signingConfigs.release
            buildConfigField "String", "CODEPUSH_KEY", '"0wPxPhihmtxxEdma3mU4zIGIFNdi2533309f-9929-415c-8999-d7fda42c3857"'
        }
    }
    // applicationVariants are e.g. debug, release
    applicationVariants.all { variant ->
        variant.outputs.each { output ->
            // For each separate APK per architecture, set a unique version code as described here:
            // http://tools.android.com/tech-docs/new-build-system/user-guide/apk-splits
            def versionCodes = ["armeabi-v7a":1, "x86":2]
            def abi = output.getFilter(OutputFile.ABI)
            if (abi != null) { // null for the universal-debug, universal-release variants
                output.versionCodeOverride =
                        versionCodes.get(abi) * 1048576 + defaultConfig.versionCode
            }
        }
    }
}

buildscript {
    repositories {
        maven { url 'https://maven.fabric.io/public' }
    }
    dependencies {
        classpath 'io.fabric.tools:gradle:1.22.1'
    }
}

apply plugin: 'io.fabric'

repositories {
    maven { url 'https://maven.fabric.io/public' }
}

dependencies {
    compile project(':react-native-intercom')
    compile project(':react-native-video')
    compile (project(':react-native-code-push')) {
        exclude(group: 'android.arch.core')
    }
    compile project(':react-native-config')
    compile project(':react-native-vector-icons')
    compile(project(':react-native-radar')) {
        exclude group: 'com.google.android.gms'
        exclude module: 'support-v4'
    }
    compile project(':react-native-push-notification')
    compile project(':react-native-photo-view')
    compile project(':react-native-linear-gradient')
    compile project(':react-native-image-picker')
    compile project(':react-native-fcm')
    compile fileTree(dir: "libs", include: ["*.jar"])
    compile(project(':react-native-fbsdk')){
        exclude(group: 'com.facebook.android', module: 'facebook-android-sdk')
    }
    compile "com.facebook.android:facebook-android-sdk:4.22.1"
    compile "com.android.support:appcompat-v7:26.0.2"
    compile "com.facebook.react:react-native:+" // From node_modules
    compile('com.crashlytics.sdk.android:crashlytics:2.6.7@aar') {
        transitive = true
    }
    compile 'com.android.support:multidex:1.0.2'
    compile "com.google.android.gms:play-services-location:12.0.0"
}

// Run this once to be able to run the application with BUCK
// puts all compile dependencies into folder libs for BUCK to use
task copyDownloadableDepsToLibs(type: Copy) {
    from configurations.compile
    into 'libs'
}

// ADD THIS AT THE BOTTOM
apply plugin: 'com.google.gms.google-services'
How do I get to the next page of results?

Answer 0 (score: 3)
While I wasn't able to run the code you posted, I did find that the original tutorial code you linked to works once the url = line is changed to:
url = 'https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s' \
+ '?max_rows=250'
Running python scrape.py then successfully writes inmates.csv with all available records.
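For readers without the tutorial at hand, here is a minimal sketch of the kind of script this change applies to. It is not the tutorial's scrape.py. It uses only the standard library, and the names TableParser, rows_from_html and scrape_to_csv are hypothetical. It also assumes that each record sits in a plain tr row of td cells and that the page can be decoded as UTF-8, neither of which has been verified against the live page.

```python
import csv
import urllib.request
from html.parser import HTMLParser

# URL from the answer: the max_rows query string asks for all rows at once.
URL = ('https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s'
       '?max_rows=250')


class TableParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None and self._row is not None:
            # Collapse runs of whitespace inside the cell to single spaces.
            self._row.append(' '.join(''.join(self._cell).split()))
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            if self._row:   # skip rows that had no <td> cells (e.g. headers)
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def rows_from_html(html):
    """Return the table rows found in an HTML string as lists of cell text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def scrape_to_csv(url=URL, path='inmates.csv'):
    """Download the page (assumed UTF-8) and write its rows to a CSV file."""
    with urllib.request.urlopen(url) as response:
        html = response.read().decode('utf-8', errors='replace')
    with open(path, 'w', newline='', encoding='utf-8') as f:
        csv.writer(f).writerows(rows_from_html(html))
```

Calling scrape_to_csv() would then fetch the single unpaginated page and write inmates.csv. Because no second page is ever requested, the script needs no pagination logic.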
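In short, this works by changing the question: instead of asking "How do I get to the next page?", ask "How do I remove pagination?".

url = 'https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s' uses the new URL. The old URL in the tutorial, http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp, redirects to this new URL but does not work with our solution, so we can't use the old URL.

\ is a line continuation that lets me continue the line of code on the next line, for readability.

+ concatenates the two strings so that we can append ?max_rows=250. The result is equivalent to url = 'https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s?max_rows=250'.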
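The equivalence described above can be checked directly. The sketch below builds the URL three ways; the variable names are illustrative only, and the parenthesised form is a common alternative to the backslash rather than something the tutorial uses.

```python
# Backslash continuation plus `+` concatenation, as in the answer.
url_split = 'https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s' \
    + '?max_rows=250'

# The same URL written on a single line.
url_single = 'https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s?max_rows=250'

# Alternative: adjacent string literals inside parentheses are joined
# automatically, so neither the backslash nor `+` is needed.
url_parens = ('https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s'
              '?max_rows=250')

print(url_split == url_single == url_parens)  # True
```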
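?max_rows=<number-of-records-to-display> is a query string that I found works for this particular Current Detainees page. You can find it by first noticing the Page Size text input field, which lets users set a custom number of rows per page and shows a default value of 50. Inspect its HTML: for example, in Firefox (52.7.3), press Ctrl + Shift + I to open Firefox's Web Developer Inspector tool window. Click the Select element button (the icon resembles a box outline with a mouse cursor arrow on it), then click the input field containing 50. The HTML pane below highlights: <input class="mrcinput" name="max_rows" size="3" title="max_rowsp" value="50" type="text">. This means the form submits a variable named max_rows, which is a number and defaults to 50.

Some web pages, depending on how they are coded, recognize such variables when they are appended to the URL as a query string, so you can try appending ?max_rows= plus a number of your choice. When I started, the page said 250 Total Items, so I chose to try the custom number 250 by changing my browser's address bar to load https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s?max_rows=250. It successfully displayed 250 records, making pagination unnecessary, so ?max_rows=250 is what we use to form the URL used by our script.

The page now says 242 Total Items, so it seems they are removing inmates, or at least the inmate records listed. You can use ?max_rows=242, but ?max_rows=250 will still work because 250 is larger than the total number of records, 242. As long as the number is larger, the page does not need to paginate, which lets you have all records on one page.

This solution works for this Current Detainees page and possibly for pages coded the same way, because it depends on ?max_rows=.... Another website, even one with an adjustable per-page limit, may use a different name for this max_rows variable, or ignore the query string entirely, so our solution may not work on other websites. Also, if in the future you need to download a large number of records, this download-all-at-once approach may run you into memory-related trouble, but for scraping this particular Current Detainees page it will do the job.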
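The manual steps above (find the input's name, append it as a query string, pick a number at least as large as the total) can be sketched in code. This is an illustration under assumptions, not a tested scraper for the live site: the helper names input_fields, build_url, total_items and fits_on_one_page are hypothetical, and the "N Total Items" label is taken from the description above without knowing its exact markup.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlencode

BASE_URL = 'https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s'


class InputFinder(HTMLParser):
    """Record the name and default value of every named <input> field."""

    def __init__(self):
        super().__init__()
        self.inputs = {}

    def handle_starttag(self, tag, attrs):
        if tag == 'input':
            attrs = dict(attrs)
            if 'name' in attrs:
                self.inputs[attrs['name']] = attrs.get('value')


def input_fields(html):
    """Return {name: default value} for the <input> fields in an HTML string."""
    finder = InputFinder()
    finder.feed(html)
    finder.close()
    return finder.inputs


def build_url(max_rows):
    """Append max_rows as a query string, e.g. ...I00290s?max_rows=250."""
    return BASE_URL + '?' + urlencode({'max_rows': max_rows})


def total_items(page_text):
    """Return N from the first 'N Total Items' label, or None if absent."""
    match = re.search(r'(\d+)\s+Total Items', page_text)
    return int(match.group(1)) if match else None


def fits_on_one_page(max_rows, page_text):
    """True when max_rows is at least the total, so no pagination is needed."""
    total = total_items(page_text)
    return total is not None and max_rows >= total
```

For example, input_fields applied to the highlighted input element above yields a max_rows entry with the default '50', and fits_on_one_page(250, '242 Total Items') is true while fits_on_one_page(50, '242 Total Items') is not. On a site that names the field differently or ignores the query string, the same check would simply show that the row count did not change.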