Question

I am just scraping data and want to put the title and date into two columns, but a TypeError occurs:

TypeError: from_dict() got an unexpected keyword argument 'columns'

Code:

import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://timesofindia.indiatimes.com/topic/Hiv'

while True:
    response=requests.get(url)
    soup = BeautifulSoup(response.content,'html.parser')
    content = soup.find_all('div',{'class': 'content'})

    for contents in content:
        title_tag = contents.find('span',{'class':'title'})
        title= title_tag.text[1:-1] if title_tag else 'N/A'
        date_tag = contents.find('span',{'class':'meta'})
        date = date_tag.text if date_tag else 'N/A'

        hiv={title : date}
        print(' title : ', title ,' \n date : ' ,date )

    url_tag = soup.find('div',{'class':'pagination'})
    if url_tag.get('href'):
        url = 'https://timesofindia.indiatimes.com/' + url_tag.get('href')
        print(url)
    else:
        break
hiv1 = pd.DataFrame.from_dict(hiv , orient = 'index' , columns = ['title' ,'date'])

I have updated pandas to version 0.23.4, but the error still occurs.

Answer 1

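The first thing I noticed is that the structure of your dictionary is off. I assume you want one dictionary holding every title: date pair. The way you have it now, the dictionary is recreated on each iteration, so only the last pair is kept.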
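To make the difference concrete, here is a minimal sketch that runs without any scraping. The `pairs` list and both variable names are made up for illustration; they stand in for the titles and dates found on the page.

```python
# Hypothetical sample pairs standing in for the scraped titles and dates.
pairs = [("first headline", "01 Dec 2018"), ("second headline", "30 Nov 2018")]

# Rebinding the name on every pass throws away the previous dictionary,
# which is what `hiv = {title: date}` inside the loop does.
for title, date in pairs:
    overwritten = {title: date}

# Creating the dictionary once, before the loop, keeps every pair.
collected = {}
for title, date in pairs:
    collected[title] = date

print(len(overwritten))  # 1
print(len(collected))    # 2
```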

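Then, when you call from_dict with orient = 'index', you get a dataframe with the dictionary keys as the index and the values as a single series/column. So technically there is only 1 column. I create the two columns by resetting the index, which moves the index into a column that I rename to 'title':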

import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://timesofindia.indiatimes.com/topic/Hiv'


response=requests.get(url)
soup = BeautifulSoup(response.content,'html.parser')
content = soup.find_all('div',{'class': 'content'})

hiv = {}
for contents in content:
    title_tag = contents.find('span',{'class':'title'})
    title= title_tag.text[1:-1] if title_tag else 'N/A'
    date_tag = contents.find('span',{'class':'meta'})
    date = date_tag.text if date_tag else 'N/A'

    hiv.update({title : date})
    print(' title : ', title ,' \n date : ' ,date )

hiv1 = pd.DataFrame.from_dict(hiv , orient = 'index' , columns = ['date'])  
hiv1 = hiv1.rename_axis('title').reset_index()

Output:

print (hiv1)
                                                title                  date
0   I told my boyfriend I was HIV positive and thi...           01 Dec 2018
1   Pay attention to these 7 very common HIV sympt...           30 Nov 2018
2   Transfusion of HIV blood: Panel seeks time til...  2019-01-06T03:54:33Z
3   No. of pregnant women testing HIV+ dips; still...           01 Dec 2018
4                             Busted:5 HIV AIDS myths           30 Nov 2018
5                    Myths and taboos related to AIDS           01 Dec 2018
6                                                 N/A                   N/A
7   Mumbai: Free HIV tests at six railway stations...           23 Nov 2018
8   HIV blood tranfusion: Tamil Nadu govt assures ...  2019-01-05T09:05:27Z
9     Autopsy performed on HIV+ve donor’s body at GRH  2019-01-03T07:45:03Z
10  Madras HC directs to videograph HIV+ve donor’s...  2019-01-01T01:23:34Z
11  HIV +ve Tamil Nadu teen who attempted suicide ...  2018-12-31T03:37:56Z
12    Another woman claims she got HIV-infected blood  2018-12-31T06:34:32Z
13    Another woman says she got HIV from donor blood           29 Dec 2018
14  HIV case: Five-member panel begins inquiry in ...           29 Dec 2018
15  Pregnant woman turns HIV positive after blood ...           26 Dec 2018
16  Pregnant woman contracts HIV after blood trans...           26 Dec 2018
17  Man attacks niece born with HIV for sleeping i...           16 Dec 2018
18  Health ministry implements HIV AIDS Act 2017: ...           11 Sep 2018
19  When meds don’t heal: HIV+ kids fight daily wa...           03 Sep 2018
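If you want to try the from_dict step without hitting the website, the following sketch uses a small hand-written dictionary in place of the scraped one. The `sample` data and the `frame` name are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample data in place of the scraped results.
sample = {"headline A": "01 Dec 2018", "headline B": "30 Nov 2018"}

# orient='index' turns each key into a row label; the values fill one column.
# The `columns` argument needs pandas 0.23.0 or newer.
frame = pd.DataFrame.from_dict(sample, orient="index", columns=["date"])
print(frame.shape)  # two rows, a single 'date' column

# rename_axis names the index 'title'; reset_index moves it into a real column.
frame = frame.rename_axis("title").reset_index()
print(list(frame.columns))
```

After the last step the frame has two ordinary columns, 'title' and 'date', and a default integer index.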

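I'm not quite sure why you are getting the error. It doesn't make sense, since you are using an updated pandas. Maybe uninstall pandas and then install it again with pip?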
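One thing worth checking first: the columns keyword of from_dict was added in pandas 0.23.0, so this TypeError usually suggests that the interpreter running the script is importing an older pandas than the one you upgraded. That is only a guess about your setup, but a quick check along these lines, run from the same environment as the scraper, would show it:

```python
import sys
import pandas as pd

# Which interpreter and which pandas installation does this script really use?
print(sys.executable)
print(pd.__version__)
print(pd.__file__)

# Compare only the major and minor parts, e.g. "0.23.4" -> (0, 23).
major, minor = (int(part) for part in pd.__version__.split(".")[:2])
supports_columns = (major, minor) >= (0, 23)
print("from_dict accepts columns:", supports_columns)
```

If the printed version is lower than 0.23, the upgrade went into a different Python environment than the one running the script.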

Alternatively, I suppose you could do it in just two lines and name the columns after converting to a dataframe:

hiv1 = pd.DataFrame.from_dict(hiv, orient = 'index').reset_index()
hiv1.columns = ['title','date']
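A side note on the dictionary approach in general: dictionary keys are unique, so two articles with the same title (for example two 'N/A' entries) end up as a single row. If that matters, one option is to collect the pairs in a list and build the dataframe from that. This is only a sketch with made-up rows, not something taken from your page:

```python
import pandas as pd

# Hypothetical scraped pairs; note the repeated 'N/A' title.
rows = [
    ("headline A", "01 Dec 2018"),
    ("N/A", "N/A"),
    ("N/A", "30 Nov 2018"),
]

# A dictionary keyed by title silently collapses the duplicate key.
as_dict = dict(rows)
print(len(as_dict))  # 2

# A list of tuples keeps every row, and the columns are named directly.
frame = pd.DataFrame(rows, columns=["title", "date"])
print(frame.shape)
```

In the scraping loop this would mean appending (title, date) to a list instead of updating the dictionary.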
