Extract URLs from an Emacs buffer?

Date: 2009-10-29 07:58:02

Tags: elisp

How can I write an Emacs Lisp function that finds all the hrefs in an HTML file and extracts all of the links?

Input:

<html>
 <a href="http://www.stackoverflow.com" _target="_blank">StackOverFlow</a>
 <h1>Emacs Lisp</h1>
 <a href="http://news.ycombinator.com" _target="_blank">Hacker News</a>
</html>

Output:

http://www.stackoverflow.com|StackOverFlow
http://news.ycombinator.com|Hacker News

I have seen re-search-forward mentioned several times in my searching. Based on what I have read so far, I think I need to do something like this:

(defun extract-urls (file)
 ...
 (setq buffer (...
 (while
        (re-search-forward "http://" nil t)
        (when (match-string 0)
...
))
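
For reference, a minimal self-contained version of that skeleton might look like the following; the use of a temporary buffer and the exact regular expression are assumptions, not part of the question:

(defun extract-urls-minimal (file)
  "Collect every http:// URL in FILE into a list (illustrative sketch)."
  (with-temp-buffer
    (insert-file-contents file)
    (goto-char (point-min))
    (let (urls)
      ;; Each match grabs from http:// up to the next double quote
      (while (re-search-forward "http://[^\"]+" nil t)
        (push (match-string 0) urls))
      (nreverse urls))))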

3 Answers:

Answer 0: (score: 5)

I took Heinzi's solution and came up with the final solution I needed. I can now take a list of files, extract all URLs and titles, and put the results in one output buffer.

(defun extract-urls (fname)
  "Extract HTML href URLs and titles from FNAME into buffer
`new-urls.csv' in |-separated format."
  (let ((in-buf (set-buffer (find-file fname))) ; save buffer for clean-up
        (u1 '()))
    (goto-char (point-min)) ; needed in case the buffer is already open
    (while (re-search-forward
            "^.*<a href=\"\\([^\"]+\\)\"[^>]+>\\([^<]+\\)</a>" nil t)
      (when (match-string 0)              ; got a match
        (let ((url (match-string 1))      ; URL
              (title (match-string 2)))   ; title
          (setq u1 (cons (concat url "|" title "\n") u1))))) ; build the list of URLs
    (kill-buffer in-buf)                  ; don't leave a mess of buffers
    (with-current-buffer (get-buffer-create "new-urls.csv") ; send results to new buffer
      (mapc #'insert u1))
    (switch-to-buffer "new-urls.csv")))   ; finally, show the new buffer

;; Create a list of files to process
;;
(mapc 'extract-urls '("/tmp/foo.html"
                      "/tmp/bar.html"))
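
If the list of files should not be hard-coded, directory-files can build it instead; a small sketch, assuming the HTML files live under /tmp:

;; Process every .html file under /tmp
(mapc 'extract-urls
      (directory-files "/tmp" t "\\.html\\'"))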

Answer 1: (score: 2)

If there is at most one link per line and you don't mind some very ugly regular expression hacking, run the following code on your buffer:

(defun getlinks ()
  (beginning-of-buffer)
  (replace-regexp "^.*<a href=\"\\([^\"]+\\)\"[^>]+>\\([^<]+\\)</a>.*$" "LINK:\\1|\\2")
  (beginning-of-buffer)
  (replace-regexp "^\\([^L]\\|\\(L[^I]\\)\\|\\(LI[^N]\\)\\|\\(LIN[^K]\\)\\).*$" "")
  (beginning-of-buffer)
  (replace-regexp "
+" "
")
  (beginning-of-buffer)
  (replace-regexp "^LINK:\\(.*\\)$" "\\1")
)

It replaces all links with LINK:url|description, deletes every line containing anything else, deletes the empty lines, and finally removes the "LINK:" prefixes.

Detailed HOWTO: (1) correct the error in your example HTML file by replacing <href with <a href, (2) copy the above function into the Emacs *scratch* buffer, (3) hit C-x C-e after the final ")" to evaluate the function, (4) load your example HTML file, (5) execute the function with M-: (getlinks).

Note that the newlines in the third replace-regexp are significant; do not indent those two lines.
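
To illustrate, on the example input above the first replacement turns the buffer into roughly this (the non-link lines are still present):

<html>
LINK:http://www.stackoverflow.com|StackOverFlow
 <h1>Emacs Lisp</h1>
LINK:http://news.ycombinator.com|Hacker News
</html>

The second replacement then blanks every line not starting with LINK, the third collapses the resulting runs of newlines, and the fourth strips the LINK: prefix, leaving the desired url|description output.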

Answer 2: (score: 1)

You can use the xml library; an example of using the parser is found here. To parse your particular file, the following does what you want:

(defun my-grab-html (file)
  (interactive "fHtml file: ")
  (let ((res (car (xml-parse-file file)))) ; 'car because xml-parse-file returns a list of nodes
    (mapc (lambda (n)
            (when (consp n) ; don't operate on the whitespace, xml preserves whitespace
              (let ((link (cdr (assq 'href (xml-node-attributes n)))))
                (when link
                  (insert link)
                  (insert "|")
                  (insert (car (xml-node-children n))) ;; grab the text for the link
                  (insert "\n")))))
          (xml-node-children res))))
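
For example, assuming the sample input is saved as /tmp/foo.html, running the command interactively from any writable buffer:

;; M-x my-grab-html RET /tmp/foo.html RET
;;
;; inserts at point:
;;   http://www.stackoverflow.com|StackOverFlow
;;   http://news.ycombinator.com|Hacker News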

This does not recursively parse the HTML to find all the links, but it should get you started in the direction of a general solution.
