I'm learning Scrapy and trying to crawl www.google.com.
I programmed the following spider, but it still visits subdomains such as support.google.com, even though the allow rule is restricted to www.google.com/.* URLs. What am I missing?
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class GoogleSpider(CrawlSpider):
    name = 'google'
    allowed_domains = ['www.google.com']
    start_urls = ['http://www.google.com']

    rules = [
        Rule(LinkExtractor(allow=[r"^http[s]?://www.google.com/.*"]),
             callback='parse_item',
             follow=True),
    ]

    def parse_item(self, response):
        print('Processing {}'.format(response.url))
Note: there is too much debug output, so I added the line LOG_LEVEL = 'ERROR' to settings.py, and I am using print to see which pages get visited. This script prints subdomain URLs such as support.google.com. Why?
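One way to narrow this down is to test the allow pattern outside Scrapy. A minimal sketch with plain re (the sample URLs are made up for illustration):

import re

# The allow pattern from the spider above.
pattern = r"^http[s]?://www.google.com/.*"

# Hypothetical sample URLs for illustration.
urls = [
    'http://www.google.com/search?q=scrapy',  # matches, as intended
    'https://support.google.com/webmasters',  # no match: the host must start with "www"
    'http://wwwXgoogleYcom/whatever',         # matches: the unescaped dots accept any character
]

for url in urls:
    print(url, '->', bool(re.match(pattern, url)))

If that holds, the pattern itself does not admit support.google.com URLs (though escaping the dots, as in r"^https?://www\.google\.com/.*", would still be safer), so the subdomain requests are presumably getting past some other check.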
Answer 0 (score: 0)
Try
allowed_domains = ['google.com']
instead of
allowed_domains = ['www.google.com']
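As a quick way to compare what the two values permit before re-running the crawl, a small sketch using Scrapy's domain-matching helper, which has the same subdomain semantics as the offsite filter (the sample URL is just for illustration):

from scrapy.utils.url import url_is_from_any_domain

url = 'https://support.google.com/webmasters'  # illustrative subdomain URL

# The host must equal a listed domain or be a subdomain of it.
print(url_is_from_any_domain(url, ['www.google.com']))  # False: treated as offsite
print(url_is_from_any_domain(url, ['google.com']))      # True: subdomains of google.com are in scope

In other words, allowed_domains treats subdomains of each listed entry as in scope, so it is worth confirming with a check like this that the chosen value matches the crawl you actually want.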