How to use Scrapy to scrape and parse nested divs

Time: 2016-11-18 23:12:08

Tags: html parsing scrapy web-crawler scrapy-spider

I am trying to follow this GitHub page to learn how to scrape nested divs on Facebook: https://github.com/talhashraf/major-scrapy-spiders/blob/master/mss/spiders/facebook_profile.py

In that file, parse_info_text_only works well for getting the span information.

I have a similar page where I am trying to use parse_info_has_image on a nested div, but the result_id is on the div itself.


As I understand it, the div I am trying to scrape is on the second line, so I tried something like:

result_id

How can I get data-xt from the nested div?

2 answers:

Answer 0 (score: 1)

Using CSS:


Answer 1 (score: 0)

I think that if you want all of the data-xt values, then:

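    import re

    def parse_info_has_id(self, css_path):
        # './/' keeps the search under css_path; a leading '//' would scan the whole page.
        # '/@data-xt' returns the attribute values rather than each div's full HTML.
        text = css_path.xpath('.//div[@data-xt != ""]/@data-xt').extract()
        text = [t.strip() for t in text]
        # Keep only the values that mention result_id.
        text = [t for t in text if re.search('result_id', t)]
        return "\n".join(text)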