Scrapy CrawlSpider: denying subdomains does not work, why?

Date: 2018-06-11 10:23:34

Tags: python-3.x scrapy

I am learning Scrapy and trying to crawl www.google.com/.*. I wrote the following spider, but it still visits subdomains such as support.google.com. What am I missing?

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor


    class GoogleSpider(CrawlSpider):
        name = 'google'
        allowed_domains = ['www.google.com']
        start_urls = ['http://www.google.com']

        rules = [
            Rule(LinkExtractor(allow=[r"^http[s]?://www.google.com/.*"]),
                 callback='parse_item', follow=True)
        ]

        def parse_item(self, response):
            # Print every page the spider actually processes.
            print('Processing {}'.format(response.url))

Note: there was too much debug output, so I added the line LOG_LEVEL = 'ERROR' to settings.py, and I am using print to see which pages are visited.

This script prints subdomain URLs, for example support.google.com. Why?
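As a quick sanity check, the allow pattern from the spider can be tested on its own with Python's re module. This is only a sketch: it assumes the link extractor applies the pattern as a regular-expression search against the absolute URL, and the helper name is_allowed is hypothetical.

```python
import re

# The allow pattern copied from the spider's Rule.
ALLOW = re.compile(r"^http[s]?://www.google.com/.*")


def is_allowed(url):
    """Return True if the allow pattern matches the URL (hypothetical helper)."""
    return ALLOW.search(url) is not None


if __name__ == "__main__":
    for url in ("https://www.google.com/search",
                "https://support.google.com/",
                "http://wwwXgoogle.com/"):
        print(url, is_allowed(url))
```

Under that assumption the pattern itself does not match support.google.com, so the extra pages are unlikely to come from the rule alone. The dots are unescaped, though, so the pattern also matches hosts such as wwwXgoogle.com; writing www\.google\.com would be stricter.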

1 answer:

Answer 0 (score: 0)

Try

allowed_domains = ['google.com']

instead of just allowed_domains = ['www.google.com'].
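To see what each allowed_domains value would admit, here is a minimal sketch of the usual matching rule: a host passes if it equals a listed domain or is a subdomain of one. It approximates Scrapy's offsite filtering and is not its actual code; the function name host_is_allowed is hypothetical.

```python
from urllib.parse import urlparse


def host_is_allowed(url, allowed_domains):
    """Approximate offsite check: exact host match or subdomain of a listed domain.

    Hypothetical helper, written as an approximation of Scrapy's behaviour.
    """
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


if __name__ == "__main__":
    url = "https://support.google.com/answer"
    print(host_is_allowed(url, ["www.google.com"]))
    print(host_is_allowed(url, ["google.com"]))
```

Under this rule, ['google.com'] also admits support.google.com, while ['www.google.com'] does not, so it is worth verifying the suggestion against your Scrapy version. If the subdomain pages arrive through redirects (an assumption; the question does not say), calling a check like this on response.url at the top of parse_item and returning early would skip them.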