Question

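I don't want to pull the whole page, only its first 40 KB, the way the Facebook Debugger tool does.

My goal is to fetch the social media metadata, i.e. og:image and so on.

Any programming language is fine, PHP or Python.

I already have code that uses file_get_contents / cURL together with phpQuery, and I know how to parse the HTML I receive. My question is "how do I fetch only the first n KB of a page without downloading the whole page?"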

Answer 1

This isn't specific to Facebook or any other social media site, but in Python you can get the first 40 KB like this:

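import urllib2  # Python 2 only; Python 3 moved this to urllib.request
your_link = "https://example.com"  # placeholder URL, replace with the page to fetch
# read() with a size argument returns at most that many bytes
start = urllib2.urlopen(your_link).read(40000)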
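
The snippet above relies on urllib2, which exists only in Python 2. A rough Python 3 sketch of the same idea follows; the helper names read_at_most and fetch_prefix, the 8192-byte chunk size, and the 10-second timeout are assumptions made for illustration, not part of the original answer.

```python
import urllib.request


def read_at_most(stream, limit):
    """Read up to `limit` bytes from a file-like object, then stop."""
    chunks = []
    remaining = limit
    while remaining > 0:
        # Never ask for more than what is still allowed.
        chunk = stream.read(min(remaining, 8192))
        if not chunk:  # the body ended before the limit was reached
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def fetch_prefix(url, limit=40_000, timeout=10):
    """Open `url` and return at most the first `limit` bytes of the body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Leaving the `with` block closes the response, so the rest of
        # the body is never read by this code.
        return read_at_most(response, limit)
```

Calling fetch_prefix("https://example.com") would return a bytes object of at most 40000 bytes. The server may already have sent somewhat more than that over the wire before the connection is closed, so this limits what is read rather than guaranteeing an exact transfer size.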

Answer 2

You can use:

curl -r 0-40000 -o 40k.raw https://www.keycdn.com/support/byte-range-requests/

-r stands for range:

From the curl man page:

-r, --range <range>
          (HTTP FTP SFTP FILE) Retrieve a byte range (i.e a partial document) from a HTTP/1.1, FTP or SFTP server or a local  FILE.  Ranges  can  be
          specified in a number of ways.

          0-499     specifies the first 500 bytes

          500-999   specifies the second 500 bytes

          -500      specifies the last 500 bytes

          9500-     specifies the bytes from offset 9500 and forward

          0-0,-1    specifies the first and last byte only(*)(HTTP)

More information can be found in this article: https://www.keycdn.com/support/byte-range-requests/
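
For the PHP or Python side of the question, the same Range request can be sketched in Python 3. The function names below are hypothetical, and the sketch assumes the server may or may not support ranges. Because both ends of a byte range are inclusive, 0-40000 in the curl command actually asks for 40001 bytes; the sketch requests 0 to nbytes - 1 instead.

```python
import urllib.request


def range_header(first, last):
    """Build a Range header value; both offsets are inclusive."""
    if first < 0 or last < first:
        raise ValueError("expected 0 <= first <= last")
    return "bytes=%d-%d" % (first, last)


def build_range_request(url, nbytes=40_000):
    """Prepare a GET request for the first `nbytes` bytes of `url`."""
    request = urllib.request.Request(url)
    request.add_header("Range", range_header(0, nbytes - 1))
    return request


def fetch_range(url, nbytes=40_000, timeout=10):
    """Return (status, body) where body holds at most `nbytes` bytes."""
    request = build_range_request(url, nbytes)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # 206 means the server honoured the range; 200 means it ignored
        # the header and is sending the whole document, so the read is
        # capped either way.
        return response.status, response.read(nbytes)
```

Checking the returned status is worthwhile because servers are free to ignore the Range header, and dynamically generated pages often do.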

Just in case, here is a basic example of how to handle this in Go:

package main

import (
    "fmt"
    "io"
    "io/ioutil"
    "log"
    "net/http"
)

func main() {
    response, err := http.Get("https://google.com")
    if err != nil {
        log.Fatal(err)
    }
    defer response.Body.Close()
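    // LimitReader stops after 40000 bytes, so ReadAll never sees the rest of the body.
    data, err := ioutil.ReadAll(io.LimitReader(response.Body, 40000))
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("data = %s\n", data)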
}
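
Since the stated goal is og:image and similar tags, here is a minimal Python 3 sketch of pulling them out of a truncated chunk of HTML using only the standard library. The class and function names are hypothetical, UTF-8 is an assumed default encoding, and the sketch assumes the tags use the property attribute and appear within the bytes that were fetched.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("property") or ""
        content = attributes.get("content")
        if name.startswith("og:") and content is not None:
            # Keep the first value seen for each property.
            self.properties.setdefault(name, content)


def extract_open_graph(html_bytes, encoding="utf-8"):
    """Return a dict of og: properties found in a (possibly cut off) page."""
    parser = OpenGraphParser()
    # feed() handles every complete tag; a tag cut off at the end of the
    # chunk is left in the parser's buffer and simply ignored here.
    parser.feed(html_bytes.decode(encoding, errors="replace"))
    return parser.properties
```

Passing the bytes from any of the approaches above to extract_open_graph gives a dictionary keyed by property name, for example "og:image". A tolerant parser matters here because the chunk will almost always end in the middle of the document.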

How to fetch only the first 40 KB of a page with cURL
