After scraping data in Python, I get the wrong data in the saved file

Time: 2019-06-20 09:07:09

Tags: python

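I am a beginner in Python programming. I want to extract all the h1, h2, h3, h4, h5, and h6 headings from a container, which I did as shown below, but I ran into a problem when saving that data in CSV format.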

When I try to save it with pandas, the saved CSV file does not contain what I expect.

Here is my code:

import requests
import bs4
import pandas as pd

url = 'https://www.nidm.net/home/weather/best-air-purifiers/'
target = requests.get(url) #sends the requests to the website
status = target.status_code #checks the status of website
print(status) #prints the status code,, it should be 200

text_response = target.text #basically downloads the website into our machine
#print(text_response) #prints the website in console

#now beautiful soup will come in handy to print the results more effectively
soup = bs4.BeautifulSoup(text_response, 'lxml')
#print(soup.prettify()) #it will make data more understandable

h1 = ""
h2 = ""
h3 = ""
h4 = ""
h5 = ""
h6 = ""

all_div = soup.find('div', attrs={'class': 'jeg_inner_content'})

for _h1 in all_div.find_all('h1'):
   h1 = _h1.text
   print(h1, sep='\n')

for _h2 in all_div.find_all('h2'):
   h2 = _h2.text
   print(h2, sep='\n')

for _h3 in all_div.find_all('h3'):
   h3 = _h3.text
   print(h3, sep='\n')

for _h4 in all_div.find_all('h4'):
   h4 = _h4.text
   print(h4, sep='\n')

for _h5 in all_div.find_all('h5'):
   h5 = _h5.text
   print(h5, sep='\n')

for _h6 in all_div.find_all('h6'):
   h6 = _h6.text
   print(h6, sep='\n')

headings = h1 + '\n' + h2 + '\n' + h3 + '\n' + h4 + '\n' + h5 + '\n' + h6
print(headings)

df = pd.DataFrame({h1, h2, h3, h4, h5, h6})
df.to_csv('Data1.csv', index=True)

#file = open('data1.txt', 'w+')
#file.write(headings)
#file.close()

The problem is that the CSV file does not show the results I want to see. Instead, it only shows the h1 heading. Please help.

1 answer:

Answer 0 (score: 0)

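The main problem here seems to be that you overwrite the heading variable (h1, h2, and so on) on every iteration of each find_all loop, so only the last match of each level survives. You can fix it like this (assuming you want the lines separated by newlines):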

for _h1 in all_div.find_all('h1'):
   h1 = h1 + ('\n' + _h1.text if h1 else _h1.text)   # <-- 
   print(h1, sep='\n')

# do the same for the remaining loops
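
A more compact way to express the same accumulation is to join all matches in one step instead of appending inside the loop. The following is only a minimal sketch: join_texts is a hypothetical helper name, the strip() call is an extra assumption that is not in the code above, and the SimpleNamespace objects are stand-ins for the Tag objects that all_div.find_all('h2') would return, so the snippet runs without a network request.

```python
from types import SimpleNamespace


def join_texts(tags):
    """Join the .text of every tag with newlines (empty input gives '')."""
    return "\n".join(tag.text.strip() for tag in tags)


# Stand-ins for the result of all_div.find_all('h2') (illustrative data only).
fake_h2_tags = [
    SimpleNamespace(text="First heading"),
    SimpleNamespace(text=" Second heading "),
]

h2 = join_texts(fake_h2_tags)
print(h2)
```

In the original script the call would presumably look like h2 = join_texts(all_div.find_all('h2')), repeated for each heading level.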

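Also note that in the DataFrame constructor you are using a set to define the data. This means the order of the rows is not guaranteed. It is better to use a tuple: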

df = pd.DataFrame((h1, h2, h3, h4, h5, h6))  # changed {...} to (...)
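
The difference between the two literals can be checked in plain Python, without pandas. The sketch below uses made-up heading strings as an assumption. Besides having no defined order, a set also drops duplicates, so several empty heading levels collapse into a single entry.

```python
# Illustrative values only; in the script these come from the scraping loops.
h1, h2, h3 = "Title", "Section A\nSection B", "Detail"
h4 = h5 = h6 = ""  # levels with no matches stay empty strings

as_set = {h1, h2, h3, h4, h5, h6}    # unordered, duplicates removed
as_tuple = (h1, h2, h3, h4, h5, h6)  # ordered, duplicates kept

print(len(as_set))    # 4: the three empty strings collapse into one
print(len(as_tuple))  # 6: one slot per heading level
print(as_tuple[0])    # always h1, because tuples preserve position
```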

As for the missing results, note that h4, h5, and h6 do not have any matches within the searched container.
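
If one heading per row is the goal, a possible alternative is to keep a list of texts per level and write (level, heading) rows, so that levels without matches stay visible in the file. This is a sketch under assumptions: headings_to_rows and write_rows are hypothetical helpers, the sample data is invented, and the standard csv module is swapped in for pandas' to_csv to keep the example dependency-free.

```python
import csv
import io


def headings_to_rows(headings_by_level):
    """Flatten {level: [texts]} into (level, text) rows."""
    rows = []
    for level, texts in headings_by_level.items():
        if not texts:
            rows.append((level, ""))  # keep a visible row for empty levels
        for text in texts:
            rows.append((level, text))
    return rows


def write_rows(rows, fileobj):
    """Write a header line followed by one CSV line per row."""
    writer = csv.writer(fileobj)
    writer.writerow(["level", "heading"])
    writer.writerows(rows)


# Invented sample; in the script each list would hold the .text of the matches.
sample = {"h1": ["Title"], "h2": ["A", "B"], "h4": []}
rows = headings_to_rows(sample)

buffer = io.StringIO()  # in-memory stand-in for a real file
write_rows(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a real file instead of the in-memory buffer, the csv documentation recommends opening it with newline="", for example open('Data1.csv', 'w', newline='').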