I need to scrape this website. It is a brain-teaser site: when you click a button, it runs a piece of JavaScript that shows the answer in a popup window.
<tr>
<td width="60" bgcolor="#ECF5FF"> <p align="center"><font color="#800000">1</font></p></td>
<td width="539" bgcolor="#ECF5FF"> <font color="#008080">一种东西,东方人的短,西方人的长,结婚后女的就可以用男的这东西,和尚有但是不用它 </font>
</td>
<td width="95" bgcolor="#ECF5FF"> <p align="center">
<INPUT onClick="MM_popupMsg('答案:名字 ')" type=button value=答案 name=button8639 style='font-size:12px;height:18px;border:1px solid black;'>
</p></td>
</tr>
Here is the code I wrote to scrape the questions and answers. I can get the questions successfully, but not the answers (when I print the answers, the result is an empty list []).
questions = hxs.select('//td[@width="539"]/font/text()').extract()
answers = hxs.select('//td[@width="95"]/INPUT/@onClick').extract()
The answer is the content of the onclick script; that is, I want to get this string:
MM_popupMsg('答案:名字 ')
Here is my spider:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import re

class ReviewSpider(BaseSpider):
    name = "2345jzw"
    allowed_domains = ['2345.com/jzw']
    start_urls = ['http://www.2345.com/jzw/index.htm']

    page = 1
    while page <= 1:
        url = 'http://www.2345.com/jzw/%d.htm' % page
        start_urls.append(url)
        page = page + 1

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        questions = hxs.select('//td[@width="539"]/font/text()').extract()
        answers = hxs.select('//td[3]/p/INPUT/@onClick').extract()
        print questions
        print answers

        id = 1
        while id <= 50:
            question = questions[id - 1]
            question = re.sub(r'<[^>]*?>', '', str(question.encode('utf8')))
            question = ' '.join(question.split())
            question = question.replace('&', ' ')
            question = question.replace('\'', ' ')
            question = question.replace(',', ';')

            answer = answers[id - 1]
            answer = re.sub(r'<[^>]*?>', '', str(answer.encode('utf8')))
            answer = ' '.join(answer.split())
            answer = answer.replace('&', ' ')
            answer = answer.replace('\'', ' ')
            answer = answer.replace(',', ';')

            file = open('crawled.xml', 'a')
            file.write(question)
            file.write(",")
            file.write(answer)
            file.write("\n")
            file.close()

            id = id + 1
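As a side note, replacing commas and quotes by hand, as the loop above does, destroys characters in the text; Python's `csv` module quotes fields itself, so nothing needs to be stripped. A sketch of the same write step (the sample data here is made up for illustration):

```python
# The csv module quotes fields automatically, so commas and quotes
# inside a question survive intact (sample data is illustrative).
import csv

questions = ["one thing, short in the east and long in the west"]
answers = ["MM_popupMsg('answer: name ')"]

with open('crawled.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for q, a in zip(questions, answers):
        writer.writerow([q, a])
```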
I tried
hxs.select('//INPUT/@onClick').extract()
but it still does not work. What is wrong with this path?
Note that the questions are extracted successfully, and the question and answer markup are very similar. Why is the answers list empty?
Answer 0 (score: 1)
First, the XPath expression for getting the answers is incorrect. Instead of
//td[3]/p/INPUT/@onClick
use
//td[3]/p/input/@onclick
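The case matters because lxml (the parser behind Scrapy's selectors) lowercases tag and attribute names when it parses HTML, while XPath itself is case-sensitive, so the uppercase path matches nothing. A minimal standalone check, using lxml directly rather than Scrapy:

```python
# lxml's HTML parser lowercases tag and attribute names, and XPath
# is case-sensitive, so the XPath steps must be lowercase too.
from lxml import html

snippet = ('<table><tr><td width="95"><p>'
           '<INPUT onClick="MM_popupMsg(\'answer \')" type="button">'
           '</p></td></tr></table>')
doc = html.fromstring(snippet)

print(doc.xpath('//INPUT/@onClick'))  # [] -- uppercase matches nothing
print(doc.xpath('//input/@onclick'))  # the onclick string is found
```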
Also, here is my version of the spider:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class DmozItem(Item):
    number = Field()
    question = Field()
    answer = Field()

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["2345.com"]
    start_urls = ["http://www.2345.com/jzw/1.htm"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//body/center/table[2]/tr')
        for row in rows:
            item = DmozItem()
            try:
                item['number'] = row.select(".//td[1]/p/font/text()").extract()[0]
                item['question'] = row.select(".//td[2]/font/text()").extract()[0]
                # [13:-2] strips the surrounding "MM_popupMsg('" and "')"
                item['answer'] = row.select(".//td[3]/p/input/@onclick").extract()[0][13:-2]
            except:
                continue
            yield item
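The fixed slice `[13:-2]` works only while the onclick value is exactly `MM_popupMsg('...')`; a regex that captures whatever sits between the quotes is a more tolerant alternative (a sketch, not part of the original answer):

```python
# Capture the quoted argument of MM_popupMsg instead of relying on
# fixed character offsets.
import re

onclick = "MM_popupMsg('答案:名字 ')"
m = re.search(r"MM_popupMsg\('(.*?)'\)", onclick)
answer = m.group(1).strip() if m else onclick
print(answer)  # 答案:名字
```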
Run it via scrapy runspider <spider_name.py> --output-format csv --output output.csv
and look at the results, in CSV format, in the output.csv
file.
Hope this helps.