Scraping data from a dynamic website with Scrapy

Date: 2018-09-19 13:24:00

Tags: javascript python parsing web-scraping scrapy

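I started writing a scraper for this site to collect car data. It turned out that the data structure varies from ad to ad: sellers do not fill in every field, and some fields can change, so when the scraper writes to a CSV file, values end up in the wrong columns.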

Example pages:

https://www.olx.ua/obyavlenie/prodam-voikswagen-touran-2011-goda-IDBzxYq.html#87fcf09cbd

https://www.olx.ua/obyavlenie/fiat-500-1-4-IDBjdOc.html#87fcf09cbd

Data example: [image]

One approach is to check the field name with text() = "Category name", but I am not sure how to write each result into the correct cell.
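One way to keep values in the right cells is to collect each ad as a dictionary keyed by field name and let csv.DictWriter place the values. The sketch below is a minimal illustration using only the standard library; the column names and the two ads are made up for the example and are not taken from the site.

```python
import csv
import io

# Hypothetical column set; in practice, list every field name the site can show.
FIELDNAMES = ["title", "model", "year", "fuel_type", "mileage"]


def write_rows(rows, out):
    """Write dicts with differing keys so each value lands in its own column."""
    writer = csv.DictWriter(
        out,
        fieldnames=FIELDNAMES,
        restval="",             # missing fields become empty cells
        extrasaction="ignore",  # unknown fields are skipped instead of raising
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Two made-up ads: the second one has no "model" or "mileage".
ads = [
    {"title": "Volkswagen Touran", "model": "touran", "year": "2011",
     "fuel_type": "diesel", "mileage": "150000"},
    {"title": "Fiat 500", "year": "2008", "fuel_type": "petrol"},
]

buffer = io.StringIO()
write_rows(ads, buffer)
csv_text = buffer.getvalue()
```

Because the writer looks values up by key, a missing field leaves an empty cell rather than shifting the remaining values to the left.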

I also used the built-in Google Chrome developer tools: the command document.getElementsByClassName('margintop5')[0].innerText printed the entire contents of the table, but the result is not structured.

So, if the output could be in JSON format, would that solve my problem?
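To illustrate why JSON can help here: each record carries its own field names, so a missing field is simply absent instead of shifting other values. The sketch below writes records as JSON Lines (one object per line) with the standard library; the records are invented for the example. Scrapy's feed exports can also produce JSON or JSON Lines output directly.

```python
import json


def to_json_lines(records):
    """Serialize each record as one JSON object per line (JSON Lines)."""
    return "\n".join(
        json.dumps(record, ensure_ascii=False, sort_keys=True)
        for record in records
    )


# Made-up records: the second seller left most fields empty.
records = [
    {"title": "Volkswagen Touran", "year": 2011, "model": "touran"},
    {"title": "Fiat 500"},
]

jsonl = to_json_lines(records)
# Reading the lines back recovers the same per-record structure.
parsed = [json.loads(line) for line in jsonl.splitlines()]
```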

innerText result

Also, while studying the page source, I came across a JavaScript block in which all the required data is already structured, but I do not know how to get it.

                 <script type="text/javascript">
                var GPT = GPT || {};
                GPT.targeting = {"cat_l0":"transport","cat_l1":"legkovye-avtomobili","cat_l2":"volkswagen","cat_l0_id":"1532","cat_l1_id":"108","cat_l2_id":"1109","ad_title":"volkswagen-jetta","ad_img":"https:\/\/img01-olxua.akamaized.net\/img-olxua\/676103437_1_644x461_volkswagen-jetta-kiev.jpg","offer_seek":"offer","private_business":"private","region":"ko","subregion":"kiev","city":"kiev","model":["jetta"],"modification":[],"motor_year":[2006],"car_body":["sedan"],"color":["6"],"fuel_type":["543"],"motor_engine_size":["1751-2000"],"transmission_type":["546"],"motor_mileage":["175001-200000"],"condition":["first-owner"],"car_option":["air_con","climate-control","cruise-control","electric_windows","heated-seats","leather-interior","light-sensor","luke","on-board-computer","park_assist","power-steering","rain-sensor"],"multimedia":["acoustics","aux","cd"],"safety":["abs","airbag","central-locking","esp","immobilizer","servorul"],"other":["glass-tinting"],"cleared_customs":["no"],"price":["3001-5000"],"ad_price":"4500","currency":"USD","safedealads":"","premium_ad":"0","imported":"0","importer_code":"","ad_type_view":"normal","dfp_user_id":"e3db0bed-c3c9-98e5-2476-1492de8f5969-ver2","segment":[],"dfp_segment_test":"76","dfp_segment_test_v2":"46","dfp_segment_test_v3":"46","dfp_segment_test_v4":"32","adx":["bda2p24","bda1p24","bdl2p24","bdl1p24"],"comp":["o12"],"lister_lifecycle":"0","last_pv_imps":"2","user-ad-fq":"2","ses_pv_seq":"1","user-ad-dens":"2","listingview_test":"1","env":"production","url_action":"ad","lang":"ru","con_inf":"transportxxlegkovye-avtomobilixx46"};

data in json dict

How can I get the data from the page using Python and Scrapy?

2 answers:

Answer 0: (score: 2)

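You can do the following: extract the JS code from the <script> block, use a regular expression to get only the JS object containing the data, and then load it with the json module: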

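import json

# Inside a spider callback: take the <script> text, then capture only the object literal
query = 'script:contains("GPT.targeting = ")::text'
js_code = response.css(query).re_first('targeting = ({.*});')
data = json.loads(js_code)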

This way, data is a Python dictionary containing the data from the JS object.
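Many values in that dictionary are lists (for example model or multimedia), which do not map directly onto single CSV cells. A possible way to handle this, sketched below, is to join list values into one string per key. The helper name flatten_record and the ";" separator are assumptions made for illustration; the sample dictionary is a small subset of the object shown in the question.

```python
def flatten_record(data, sep=";"):
    """Turn list values into joined strings so every key maps to one cell."""
    flat = {}
    for key, value in data.items():
        if isinstance(value, list):
            # An empty list becomes an empty string.
            flat[key] = sep.join(str(item) for item in value)
        else:
            flat[key] = value
    return flat


# A small subset of the GPT.targeting object shown in the question.
data = {
    "ad_title": "volkswagen-jetta",
    "model": ["jetta"],
    "modification": [],
    "motor_year": [2006],
    "multimedia": ["acoustics", "aux", "cd"],
    "ad_price": "4500",
    "currency": "USD",
}

row = flatten_record(data)
```

The flattened dictionary can then be yielded from the spider or passed to a CSV writer that looks columns up by key.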

More about the re_first method here: https://doc.scrapy.org/en/latest/topics/selectors.html#using-selectors-with-regular-expressions
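The same extraction can also be sketched without Scrapy selectors, which is convenient for testing against saved HTML. The version below is an alternative to the regex, not the answer's method: it locates the "GPT.targeting = " marker and lets json.JSONDecoder.raw_decode parse exactly one JSON value from that point, so trailing JavaScript on the same line cannot be captured by a greedy pattern. The function name extract_targeting and the shortened sample page are assumptions for illustration.

```python
import json

MARKER = "GPT.targeting = "


def extract_targeting(html):
    """Return the GPT.targeting object from page HTML as a dict, or None."""
    start = html.find(MARKER)
    if start == -1:
        return None
    brace = html.find("{", start + len(MARKER))
    if brace == -1:
        return None
    # raw_decode parses one JSON value starting at `brace` and stops there,
    # ignoring whatever JavaScript follows the object.
    data, _end = json.JSONDecoder().raw_decode(html, brace)
    return data


# Shortened stand-in for the real page; only a few keys are kept.
sample_html = """
<script type="text/javascript">
var GPT = GPT || {};
GPT.targeting = {"ad_title":"volkswagen-jetta","model":["jetta"],"motor_year":[2006],"ad_price":"4500","currency":"USD"};
var other = {};
</script>
"""

targeting = extract_targeting(sample_html)
```

In a spider callback, the page source is available as response.text and could be passed to this function.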

Answer 1: (score: 0)

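I would say you need to:

1) Convert the C# class below into a Python class. (I created it using this post: https://stackoverflow.com/a/48023576/4180382)

2) Make a web request from Python and use a regex (matching the text after "GPT.targeting") to extract the JSON string from the JavaScript.

3) Deserialize the JSON string into the newly created Python class.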

public class Rootobject
{
    public string cat_l0 { get; set; }
    public string cat_l1 { get; set; }
    public string cat_l2 { get; set; }
    public string cat_l0_id { get; set; }
    public string cat_l1_id { get; set; }
    public string cat_l2_id { get; set; }
    public string ad_title { get; set; }
    public string ad_img { get; set; }
    public string offer_seek { get; set; }
    public string private_business { get; set; }
    public string region { get; set; }
    public string subregion { get; set; }
    public string city { get; set; }
    public string[] model { get; set; }
    public object[] modification { get; set; }
    public int[] motor_year { get; set; }
    public string[] car_body { get; set; }
    public string[] color { get; set; }
    public string[] fuel_type { get; set; }
    public string[] motor_engine_size { get; set; }
    public string[] transmission_type { get; set; }
    public string[] motor_mileage { get; set; }
    public string[] condition { get; set; }
    public string[] car_option { get; set; }
    public string[] multimedia { get; set; }
    public string[] safety { get; set; }
    public string[] other { get; set; }
    public string[] cleared_customs { get; set; }
    public string[] price { get; set; }
    public string ad_price { get; set; }
    public string currency { get; set; }
    public string safedealads { get; set; }
    public string premium_ad { get; set; }
    public string imported { get; set; }
    public string importer_code { get; set; }
    public string ad_type_view { get; set; }
    public string dfp_user_id { get; set; }
    public object[] segment { get; set; }
    public string dfp_segment_test { get; set; }
    public string dfp_segment_test_v2 { get; set; }
    public string dfp_segment_test_v3 { get; set; }
    public string dfp_segment_test_v4 { get; set; }
    public string[] adx { get; set; }
    public string[] comp { get; set; }
    public string lister_lifecycle { get; set; }
    public string last_pv_imps { get; set; }
    public string useradfq { get; set; }
    public string ses_pv_seq { get; set; }
    public string useraddens { get; set; }
    public string listingview_test { get; set; }
    public string env { get; set; }
    public string url_action { get; set; }
    public string lang { get; set; }
    public string con_inf { get; set; }
}
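A Python counterpart of the class above could be sketched as a dataclass. This is only an illustration under stated assumptions: the class name Targeting and the from_json helper are hypothetical, only a handful of fields are included, and JSON keys containing hyphens (such as "user-ad-fq") are mapped to underscores because hyphens are not valid in Python identifiers.

```python
import json
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class Targeting:
    # Only a few of the fields from the class above; extend as needed.
    ad_title: str = ""
    ad_price: str = ""
    currency: str = ""
    model: List[str] = field(default_factory=list)
    motor_year: List[int] = field(default_factory=list)
    user_ad_fq: str = ""  # the JSON key is "user-ad-fq"

    @classmethod
    def from_json(cls, text):
        """Build an instance from a JSON string, ignoring unknown keys."""
        raw = json.loads(text)
        # Map hyphenated keys to valid Python attribute names.
        normalized = {key.replace("-", "_"): value for key, value in raw.items()}
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in normalized.items() if k in known})


# Shortened sample with keys taken from the object in the question.
sample = (
    '{"ad_title":"volkswagen-jetta","ad_price":"4500","currency":"USD",'
    '"model":["jetta"],"motor_year":[2006],"user-ad-fq":"2","env":"production"}'
)

ad = Targeting.from_json(sample)
```

Fields missing from the JSON fall back to their defaults, which suits ads where the seller did not fill in everything.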