I am writing a Scrapy spider to scrape this page, and I only want the text of the element with class jam_content and all of its descendants. So ideally, I should get
CYBERPUNK GAME JAM 2014
CLICK HERE!
This is my selector:
response.css(".jam_content *::text").extract()
It also returns the raw content of child elements (here, the CSS inside a style element):
['\r\n\r\n\r\n',
'\r\n',
'CYBERPUNK GAME JAM 2014',
'\r\n',
'\r\n \r\n .game_grid .game_cell .game_title a {\r\n color: #029671;\r\n }\r\n \r\n .game_grid .game_cell .game_author a {\r\n color: #00aa99;\r\n }\r\n \r\n .game_grid .game_cell .game_genre {\r\n color: #c5007d;\r\n }\r\n \r\n .game_grid .game_cell .game_platform {\r\n color: #990088;\r\n }\r\n \r\n \r\n .add_game_btn {\r\n background-color: #029671;\r\n border: 4px solid #c5007d;\r\n box-shadow: 0 0 0 4px #380024;\r\n padding: 10px 15px;\r\n font-size: 18px;\r\n font-family: \'Lucida Console\';\r\n color: #00ffcc;\r\n cursor: pointer;\r\n} \r\n \r\n
.view_jam .grid_outer {\r\n border-top:0;\r\n border-bottom:0;\r\n background:#000; }\r\n \r\nbody {\r\n\tbackground-image: url(http://i.imgur.com/ReRqo6t.png);\r\n\tbackground-repeat: repeat-x;\r\n\tbackground-color: #000;\r\n}\r\nbody,td,th {\r\n\tcolor: #0FF;\r\n\tfont-family: "Lucida Console", Monaco, monospace;\r\n}\r\na:link {\r\n\tcolor: #C5007D;\r\n}\r\na:visited {\r\n\tcolor: #C5007D;\r\n}\r\na:hover {\r\n\tcolor: #C5007D;\r\n}\r\na:active {\r\n\tcolor: #C5007D;\r\n}\r\n.mag_not_link {\r\n\tcolor: #C5007D;\r\n\tfont-weight: bold;\r\n}\r\n',
'\r\n\r\n\r\n\r\n',
'\r\n ',
'\r\n ',
'CLICK HERE!',
'\r\n',
'\r\n\r\n']
I also tried another selector, response.xpath("./*[@class='jam_content']//text()"), which returned nothing.
What should I do?
Answer 0 (score: 1)
Update the selector so that it does not pick up the contents of style elements:
response.css(".jam_content *:not(style)::text").extract()
You can then use a list comprehension with .strip() to filter out the whitespace-only text items:
my_text = [text for text in response.css(".jam_content *:not(style)::text").extract() if text.strip()]
This returns:
['CYBERPUNK GAME JAM 2014', 'CLICK HERE!']
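One detail worth noting: the comprehension tests text.strip() but keeps the original string, so a fragment such as '  CLICK HERE!\r\n' would keep its surrounding whitespace. That does not matter for this page, but a minimal sketch that also strips the kept items could look like the following. The helper name clean_fragments and the sample list are made up for illustration; the sample is a shortened, slightly altered version of the output in the question.

```python
def clean_fragments(fragments):
    """Strip each text fragment and drop the ones that are empty afterwards."""
    stripped = (fragment.strip() for fragment in fragments)
    return [fragment for fragment in stripped if fragment]


# Hypothetical input: similar to what the selector returns, with extra
# whitespace added around one item to show the difference.
raw_fragments = [
    '\r\n\r\n\r\n',
    '\r\n',
    'CYBERPUNK GAME JAM 2014',
    '\r\n    ',
    '  CLICK HERE!\r\n',
    '\r\n\r\n',
]

cleaned = clean_fragments(raw_fragments)
print(cleaned)
```

In a spider, the same helper could be applied to the list returned by response.css(...).extract().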
You can then simply join the items together:
print('\n'.join(my_text))