Python UnicodeEncodeError when scraping a website with Selenium

Asked: 2015-02-24 16:37:08

Tags: python selenium selenium-webdriver phantomjs

I am trying to use Selenium to scrape the paper titles from this site: http://www.ncbi.nlm.nih.gov/pubmed?term=(%222013%22%5BDate%20-%20Publication%5D%20%3A%20%222013%22%5BDate%20-%20Publication%5D)

#coding="utf-8"

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

domain = "http://www.ncbi.nlm.nih.gov/"
url_tail = "pubmed?term=(%222013%22%5BDate%20-%20Publication%5D%20%3A%20%222013%22%5BDate%20-%20Publication%5D)"
url = domain + url_tail

browser = webdriver.Firefox()
browser.get(url)
time.sleep(5)

def extract_data(browser):
    titles = browser.find_elements_by_css_selector("div.rprt div.rslt p.title a")
    return [title.text for title in titles]

page_start = 1
page_end = 10

f = open('titles.txt', 'a')
for page in range(page_start, page_end):
    print "page %d" % page
    page_jump_box = browser.find_element_by_class_name("num").clear()
    page_jump_box_cleared = browser.find_element_by_class_name("num")
    page_jump_box_cleared.send_keys(str(page) + Keys.RETURN)

    time.sleep(15)

    f = open('titles.txt', 'a')
    for line in extract_data(browser):
        f.write(line + '\n')

f.close()

When I run it, I get this:

gao@gao:~/crawler$ python crawler3.0.py 
page 1
page 2
page 3
page 4
Traceback (most recent call last):
  File "crawler3.0.py", line 33, in <module>
    f.write(line + '\n')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u03b1' in position 36: ordinal not in range(128)

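When I searched Stack Overflow, I found a similar question: UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128). I learned from it that using str() can cause this Unicode problem. But in my code, I only use str() to turn the page number into a string. So how do I fix the code?

I also have a second question. I have learned that to use PhantomJS with Selenium, I only need to change browser = webdriver.Firefox() to browser = webdriver.PhantomJS(). But when I do that, the output just repeats the same content (only the titles from page 1 are scraped).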

I am not a native English speaker, so please let me know about any grammar mistakes or other errors.

Thanks in advance.

1 Answer:

Answer 0 (score: 2)

You need to encode each line before writing it to the file:

for line in extract_data(browser):
    f.write(line.encode('utf-8') + '\n')
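
To see why the original write fails: in Python 2, writing a unicode string to a file opened with plain open() triggers an implicit conversion with the 'ascii' codec, which cannot represent characters such as the Greek alpha (u'\u03b1') named in the traceback. The minimal sketch below performs that conversion explicitly, so it behaves the same on Python 2 and Python 3. The sample title is made up for illustration.

```python
# -*- coding: utf-8 -*-
# Hypothetical title containing a Greek alpha, like the character in the traceback.
title = u'TNF-\u03b1 signalling in 2013'

# Encoding with the 'ascii' codec fails; this is what Python 2 attempts
# implicitly when a unicode string is written to a byte-oriented file.
try:
    title.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding explicitly as UTF-8 succeeds and yields bytes that are safe to write.
encoded = title.encode('utf-8')

print(ascii_failed)
print(repr(encoded))
```

This is why the str() call on the page number is not the culprit: the failure comes from the titles themselves, which arrive as unicode and contain non-ASCII characters.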

As for your second question, I would suggest the following improvements (which would also make it work):

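  • Use Explicit Waits instead of time.sleep() calls - this would also dramatically improve performance
  • Instead of typing the page number, click the "Next" button
  • Open the file in "append" mode before the loop and use the with context manager
  • close() the browser when you are done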
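
The idea behind an explicit wait can be sketched without a browser: instead of sleeping for a fixed time, poll a condition and continue as soon as it holds. The helper below is a simplified illustration with hypothetical names (wait_until, WaitTimeout, page_ready); it is not Selenium's implementation. In real code, WebDriverWait(...).until(...) plays this role.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition does not become truthy in time (hypothetical name)."""


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds have passed.
    """
    deadline = time.time() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.time() >= deadline:
            raise WaitTimeout("condition not met within %.1f seconds" % timeout)
        time.sleep(poll_interval)


# Stand-in for "the result page has loaded": becomes true on the third check.
checks = {"count": 0}


def page_ready():
    checks["count"] += 1
    return checks["count"] >= 3


ready = wait_until(page_ready, timeout=2.0, poll_interval=0.01)
print(ready, checks["count"])
```

Compared with time.sleep(15), the loop returns as soon as the condition is met and fails loudly if it never is, which is where the speed-up comes from.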

Code:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

domain = "http://www.ncbi.nlm.nih.gov/"
url_tail = "pubmed?term=(%222013%22%5BDate%20-%20Publication%5D%20%3A%20%222013%22%5BDate%20-%20Publication%5D)"
url = domain + url_tail

browser = webdriver.PhantomJS()
browser.get(url)


def extract_data(browser):
    titles = browser.find_elements_by_css_selector("div.rprt div.rslt p.title a")
    return [title.text for title in titles]


page_start, page_end = 1, 10

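# Open the output file once, in append mode; "with" closes it automatically.
with open('titles.txt', 'a') as f:
    for page in range(page_start, page_end):
        # Wait (up to 10 seconds) until at least one result title is present.
        WebDriverWait(browser, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div.rprt p.title"))
        )

        # Encode each unicode title as UTF-8 before writing it.
        for line in extract_data(browser):
            f.write(line.encode('utf-8') + '\n')

        print "page %d" % page

        # Move to the next result page by clicking the "Next" link.
        browser.find_element_by_css_selector("div.pagination a.next").click()

browser.close()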

This produces titles.txt with the titles from result pages 1-9:

Robotic-assisted tubal anastomosis with one-stitch technique.
Effectiveness of pictorial health warnings on cigarette packs among Lebanese school and university students.
...
Importance and globalization status of good manufacturing practice (GMP) requirements for pharmaceutical excipients.
Systemic review on drug related hospital admissions - A pubmed based search.
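
As a variation on the encoding fix, the file can be opened with an explicit encoding so that the file object encodes each write itself and no per-line .encode('utf-8') call is needed. This is a sketch, not part of the original answer: write_titles, the sample titles, and the temporary path are made up for illustration, and it assumes the titles are unicode strings, as the u'\u03b1' in the traceback suggests.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_titles(path, titles):
    """Append each title on its own line; the file object encodes to UTF-8."""
    with io.open(path, 'a', encoding='utf-8') as f:
        for title in titles:
            f.write(title + u'\n')


# Made-up sample data standing in for extract_data(browser).
sample_titles = [u'TNF-\u03b1 signalling in 2013', u'A plain ASCII title.']

out_path = os.path.join(tempfile.mkdtemp(), 'titles.txt')
write_titles(out_path, sample_titles)

# Read the file back to confirm the non-ASCII character survived the round trip.
with io.open(out_path, encoding='utf-8') as f:
    written = f.read()
```

io.open is available on both Python 2.6+ and Python 3, where it is the same function as the built-in open().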