Question

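tl;dr: I'm looking for a way to find entries in a database that are missing information, fetch that information from a website, and add it to the database entry.

We have a media management program that uses a mySQL table to store information. When employees download media (video files, images, audio files) and import it into the media manager, they are supposed to also copy the media's description (from the source website) and add it to the description in the media manager. However, this has not been done for thousands of files.

The file name (e.g. file123.mov) is unique, and the detail page for that file can be reached by going to a URL on the source website:

website.com/content/file123

The information we want to scrape from that page has an element ID that is always the same.

In my mind the process would be: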

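Connect to the database and load the table

Filter: "format" is "Still Image (JPEG)"

Filter: "description" is "NULL"

Get the first result

Get "FILENAME" (without the extension)

Load the URL: website.com/content/FILENAME

Copy the contents of the element "description" (on the website)

Paste the contents into "description" (SQL entry)

Get the second result

Rinse and repeat until the last result is reached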
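The two filter steps and the URL step above can be sketched in Python as follows. This is only an illustration under stated assumptions: an in-memory SQLite table stands in for the real mySQL table, and the table name `media`, the sample rows, the `http://` scheme and the helper `detail_url` are all hypothetical. The point it shows is that a missing description has to be tested with `IS NULL`, and that the extension can be dropped with `os.path.splitext`.

```python
import os.path
import sqlite3

# Assumed scheme; the question only gives "website.com/content/".
BASE_URL = "http://website.com/content/"


def detail_url(filename):
    """Drop the extension and append what is left to the base URL."""
    return BASE_URL + os.path.splitext(filename)[0]


# In-memory stand-in for the real table (hypothetical name and sample rows).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE media (filename TEXT, format TEXT, description TEXT)")
con.executemany(
    "INSERT INTO media VALUES (?, ?, ?)",
    [
        ("file123.jpg", "Still Image (JPEG)", None),
        ("file456.jpg", "Still Image (JPEG)", "already described"),
        ("file789.mov", "Video", None),
    ],
)

# NULL has to be tested with IS NULL; "description = NULL" never matches a row.
rows = con.execute(
    "SELECT filename FROM media WHERE format = ? AND description IS NULL",
    ("Still Image (JPEG)",),
).fetchall()

# Each pending entry pairs a file name with the page to fetch for it.
pending = [(name, detail_url(name)) for (name,) in rows]
```

Only the JPEG row without a description ends up in `pending`; the remaining steps (fetching the page and writing the description back) would then loop over that list.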

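My questions are:

Is there software that can perform such a task, or is this something that would need to be scripted?
If scripting, what would be the best type of script (e.g. could I achieve this with AppleScript, or would it need to be done in Java, PHP, etc.)?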

Answer 1

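Is there software that can perform such a task, or is this something that would need to be scripted?

I'm not aware of anything that will do this out of the box (and even if there were, the configuration required would not be much less work than the scripting involved in rolling your own solution).

If scripting, what would be the best type of script (e.g. could I achieve this with AppleScript, or would it need to be done in Java, PHP, etc.)?

AppleScript can't connect to databases, so you will definitely need to throw something else into the mix. If the choice is between Java and PHP (and you are equally familiar with both), I'd definitely recommend PHP for this purpose, as there will be considerably less code involved.

Your PHP script would look something like this: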

$BASEURL  = 'http://website.com/content/';

// connect to the database
$dbh = new PDO($DSN, $USERNAME, $PASSWORD);

// query for files without descriptions
$qry = $dbh->query("
  SELECT FILENAME FROM mytable
  WHERE  format = 'Still Image (JPEG)' AND description IS NULL
");

// prepare an update statement
$update = $dbh->prepare('
  UPDATE mytable SET description = :d WHERE FILENAME = :f
');

$update->bindParam(':d', $DESCRIPTION);
$update->bindParam(':f', $FILENAME);

// loop over the files
while ($FILENAME = $qry->fetchColumn()) {
  // construct URL
  $i = strrpos($FILENAME, '.');
  $url = $BASEURL . (($i === false) ? $FILENAME : substr($FILENAME, 0, $i));

  // fetch the document
  $doc = new DOMDocument();
  $doc->loadHTMLFile($url);

  // get the description (skip this file if the page has no such element)
  $node = $doc->getElementById('description');
  if ($node === null) continue;
  $DESCRIPTION = $node->nodeValue;

  // update the database
  $update->execute();
}

Answer 2

PHP is a good scraper. I have made a class that wraps PHP's cURL port here:

http://semlabs.co.uk/journal/object-oriented-curl-class-with-multi-threading

You'll probably need to use some of the options:

http://www.php.net/manual/en/function.curl-setopt.php

To scrape the HTML I usually use regular expressions, but here is a class I made that should be able to query the HTML without issues:

http://pastebin.com/Jm9jKjAU

Usage is:

$h = new HTMLQuery();
$h->load( $string_containing_html );
$h->getElements( 'p', 'id' ); // Returns all p tags with an id attribute

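The best option for scraping is XPath, but it can't handle dirty HTML. You can use it to do things like:

//div[@class='itm']/p[last()][text()='Hello World'] <- selects the last p in each div of class 'itm', provided its text is 'Hello World'

You can use it in PHP with the DOM classes (built in).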
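To make the XPath idea concrete, here is a small sketch in Python rather than PHP, using the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath. The sample markup is made up for illustration, and the `[.='Hello World']` text predicate assumes Python 3.7 or later. The second half shows the "dirty HTML" limitation: a strict XML parser rejects markup with a mismatched tag.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample page.
PAGE = """
<html><body>
  <div class="itm"><p>First</p><p>Hello World</p></div>
  <div class="itm"><p>Hello World</p><p>Last</p></div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# Last <p> of each div with class 'itm', kept only if its text is 'Hello World'.
matches = root.findall(".//div[@class='itm']/p[last()][.='Hello World']")
texts = [p.text for p in matches]

# Markup with a mismatched tag is rejected outright by this parser.
try:
    ET.fromstring("<div><p>unclosed</div>")
    dirty_parsed = True
except ET.ParseError:
    dirty_parsed = False
```

Only the first div contributes a match, because the last p of the second div has different text. For real-world pages a tolerant HTML parser is usually needed before any XPath-style query can be run.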

Answer 3

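I'm also not aware of any existing software package that will do everything you're looking for. However, Python can connect to your database, make web requests easily, and handle dirty HTML. Assuming you already have Python installed, you'll need three packages:

MySQLdb for connecting to the database.
Requests for easily making HTTP web requests.
BeautifulSoup for robust HTML parsing.

You can install these packages with pip commands or Windows installers. Appropriate instructions are on each site. The whole process won't take more than 10 minutes.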

import MySQLdb as db
import os.path
import requests
from bs4 import BeautifulSoup

# Connect to the database. Fill in these fields as necessary.

con = db.connect(host='hostname', user='username', passwd='password',
                 db='dbname')

# Create and execute our SELECT sql statement.

select = con.cursor()
# MySQLdb uses %s placeholders, and NULL must be tested with IS NULL.
select.execute('SELECT filename FROM table_name \
                WHERE format = %s AND description IS NULL',
               ('Still Image (JPEG)',))

while True:
    # Fetch a row from the result of the SELECT statement.

    row = select.fetchone()
    if row is None: break

    # Use Python's built-in os.path.splitext to split the extension
    # and get the url_name.

    filename = row[0]
    url_name = os.path.splitext(filename)[0]
    url = 'http://www.website.com/content/' + url_name

    # Make the web request. You may want to rate-limit your requests
    # so that the website doesn't get angry. You can slow down the
    # rate by inserting a pause with:
    #               
    # import time   # You can put this at the top with other imports
    # time.sleep(1) # This will wait 1 second.

    response = requests.get(url)
    if response.status_code != 200:

        # Don't worry about skipped urls. Just re-run this script
        # on spurious or network-related errors.

        print('Error accessing:', url, 'SKIPPING')
        continue

    # Parse the result. BeautifulSoup does a great job handling
    # mal-formed input.

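    soup = BeautifulSoup(response.content, 'html.parser')
    element = soup.find('div', {'id': 'description'})
    if element is None:
        print('No description found at:', url, 'SKIPPING')
        continue
    description = element.get_text().strip()

    # And finally, update the database with another query, and commit
    # so that the change is actually saved.

    update = con.cursor()
    update.execute('UPDATE table_name SET description = %s \
                    WHERE filename = %s',
                   (description, filename))
    con.commit()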

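I'll caveat this by saying that I've made an effort to make the code "look right", but I haven't actually tested it. You'll need to fill in the private information.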
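Since the script above is untested, it may help to try the parsing step on its own, without the database or the network. The sketch below swaps in Python's standard-library `html.parser` instead of BeautifulSoup, so it runs with no extra packages. The class `ElementTextExtractor`, the helper `extract_text` and the sample markup are hypothetical names invented for this example, and the element id `description` is taken from the question.

```python
from html.parser import HTMLParser


class ElementTextExtractor(HTMLParser):
    """Collect the text of the first element that has a given id attribute."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.target_tag = None  # tag name of the matched element, once found
        self.depth = 0          # currently open elements with that tag name
        self.finished = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.finished:
            return
        if self.target_tag is None:
            if dict(attrs).get("id") == self.element_id:
                self.target_tag, self.depth = tag, 1
        elif tag == self.target_tag:
            self.depth += 1

    def handle_endtag(self, tag):
        # Only tags with the target's name are counted, so an unclosed <p>
        # inside the element does not throw the bookkeeping off.
        if self.finished or tag != self.target_tag:
            return
        self.depth -= 1
        if self.depth == 0:
            self.finished = True

    def handle_data(self, data):
        if self.target_tag is not None and not self.finished:
            self.parts.append(data)


def extract_text(html, element_id="description"):
    """Return the element's text with whitespace collapsed, or None if absent."""
    parser = ElementTextExtractor(element_id)
    parser.feed(html)
    parser.close()
    if parser.target_tag is None:
        return None
    return " ".join("".join(parser.parts).split())


# Made-up sample with an unclosed <p>, as often found on real pages.
SAMPLE = '<html><body><div id="description"><p>A sunset   over water</div><p>Footer</body>'
text = extract_text(SAMPLE)
```

Here `text` holds the description without the footer, and a page without the id yields `None`, which is the case where the entry should be skipped rather than updated.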

Pairing content on an external website with entries in a mySQL database
