How to assign an ID to each start_url in scrapy from a dataframe

Asked: 2019-03-06 11:38:34

Tags: python pandas web-scraping scrapy id

Let's say I have a dataframe like this:

 id     url
 1      www.google.com
 2      www.youtube.com
 3      www.google.com
 4      www.facebook.com
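
For concreteness, a minimal sketch of building that dataframe in pandas (in practice it is presumably loaded from elsewhere):

    import pandas as pd

    df = pd.DataFrame({
        'id': [1, 2, 3, 4],
        'url': ['www.google.com', 'www.youtube.com',
                'www.google.com', 'www.facebook.com'],
    })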

I want to iterate over each url in the dataframe, so what I do is:

start_urls = list(df['url'])

def parse(self,response):
    thing = response.css("*").extract()
    item = scrapyItem()
    item['content'] = thing
    yield item

and that iterates over my urls and yields an item for each of them. The problem is that, in the yielded output, I have no way of telling the different ids apart.

The urls aren't unique, so I can't use the URL as an "id"; I need the "id" column from my dataframe combined with the URL to generate a unique id. How can I access the id column while iterating over my urls? Or alternatively, what other approaches could I take to achieve this?

EDIT: I have tried saving the url as an "id", but that doesn't work because the urls are not unique; scrapy also works asynchronously, so the order of the items will not stay constant.

2 Answers:

Answer 0 (score: 3):

You could try iterrows:

for index, row in df.iterrows():
    print(index, row['url'])
    # 'parse(response)' is illustrative: 'response' stands in for the page
    # fetched for row['url'] (in Scrapy, fetching/parsing happens in the spider)
    parsed_response = parse(response)
    df.loc[index, 'scrapy_content'] = parsed_response

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iterrows.html
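
To tie the row-wise iteration into Scrapy itself (rather than calling parse from a plain loop), one possible wiring is to move it into start_requests and carry the row id along in the request's meta. This is only a sketch, not part of the answer: the spider name, the 'row_id' meta key, and the output field names are made up, and df is assumed to be the question's dataframe, visible to the spider:

    import scrapy

    class DfSpider(scrapy.Spider):
        name = "df_spider"

        def start_requests(self):
            # iterate the dataframe so the id and the url stay paired
            for _, row in df.iterrows():
                # note: Scrapy needs absolute urls, e.g. 'http://www.google.com'
                yield scrapy.Request(
                    url=row['url'],
                    callback=self.parse,
                    meta={'row_id': row['id']},
                )

        def parse(self, response):
            # the id travels with the request/response, so the asynchronous
            # order in which items come back no longer matters
            yield {
                'id': response.meta['row_id'],
                'content': response.css('*').extract(),
            }

Each yielded item then carries both the dataframe id and the scraped content, which makes joining the results back onto df straightforward.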

Answer 1 (score: 0):

Even though the URLs are duplicated, and therefore the records are duplicated, I can still use "response.url" as the ID. Duplicate records return the same response anyway, so I can go back to the dataframe and attach the same information to every row where that ID (the url) appears.
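
A rough sketch of that map-back step, assuming each yielded item also records response.url in a 'url' field and the feed export file is named 'items.json' (both assumptions, not stated in the answer):

    import pandas as pd

    # load the spider's feed export, e.g. from `scrapy crawl myspider -o items.json`
    scraped_df = pd.read_json('items.json')

    # keep one content value per url (duplicate urls return the same response anyway)
    url_to_content = (
        scraped_df.drop_duplicates(subset='url')
                  .set_index('url')['content']
    )

    # attach the scraped content to every row of the original dataframe that
    # shares that url, regardless of the order in which the items came back
    df['content'] = df['url'].map(url_to_content)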