I am a beginner in Python programming. I want to extract all the h1, h2, h3, h4, h5, h6 headings from a container, so I did the following, but I am having trouble saving that data in CSV format.
When I try to save it with pandas, the CSV file it produces is not what I expect.
Here is my code:
import requests
import bs4
import pandas as pd
url = 'https://www.nidm.net/home/weather/best-air-purifiers/'
target = requests.get(url) #sends the requests to the website
status = target.status_code #checks the status of website
print(status) #prints the status code, it should be 200
text_response = target.text #basically downloads the website into our machine
#print(text_response) #prints the website in console
#now beautiful soup will come in handy to print the results more effectively
soup = bs4.BeautifulSoup(text_response, 'lxml')
#print(soup.prettify()) #it will make data more understandable
h1 = ""
h2 = ""
h3 = ""
h4 = ""
h5 = ""
h6 = ""
all_div = soup.find('div', attrs={'class': 'jeg_inner_content'})
for _h1 in all_div.find_all('h1'):
    h1 = _h1.text
    print(h1, sep='\n')
for _h2 in all_div.find_all('h2'):
    h2 = _h2.text
    print(h2, sep='\n')
for _h3 in all_div.find_all('h3'):
    h3 = _h3.text
    print(h3, sep='\n')
for _h4 in all_div.find_all('h4'):
    h4 = _h4.text
    print(h4, sep='\n')
for _h5 in all_div.find_all('h5'):
    h5 = _h5.text
    print(h5, sep='\n')
for _h6 in all_div.find_all('h6'):
    h6 = _h6.text
    print(h6, sep='\n')
headings = h1 + '\n' + h2 + '\n' + h3 + '\n' + h4 + '\n' + h5 + '\n' + h6
print(headings)
df = pd.DataFrame({h1, h2, h3, h4, h5, h6})
df.to_csv('Data1.csv', index=True)
#file = open('data1.txt', 'w+')
#file.write(headings)
#file.close()
The problem now is that the CSV file does not show me the correct results. Instead, it only shows the h1 heading. Please help me.
Answer 0: (score: 0)
The main problem here seems to be that you overwrite the heading variables on every iteration of the find_all loops. It can be fixed like this (assuming you want the lines separated by newlines):
for _h1 in all_div.find_all('h1'):
    h1 = h1 + ('\n' + _h1.text if h1 else _h1.text)  # <--
    print(h1, sep='\n')
# do the same for the remaining loops
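As a side note (not part of the original fix), the same accumulation can be applied to all six levels in a single loop by keeping the per-level strings in a dict instead of six separate variables; a minimal sketch reusing all_div from the question:

headings_by_level = {}
for level in ('h1', 'h2', 'h3', 'h4', 'h5', 'h6'):
    # join every heading of this level with newlines, same as the fix above
    texts = [tag.text for tag in all_div.find_all(level)]
    headings_by_level[level] = '\n'.join(texts)
print(headings_by_level['h1'])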
Also, note that in the DataFrame declaration you are using a set to define the data, which means the row order is not guaranteed. It is better to use a tuple:
df = pd.DataFrame((h1, h2, h3, h4, h5, h6)) # changed {...} to (...)
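To see why the set matters here: in the question's code h4, h5 and h6 are all empty strings, and a set collapses duplicates and has no guaranteed order, so they end up as a single (possibly reordered) entry. A quick illustration with dummy values:

sample_set = {'first', '', '', ''}
sample_tuple = ('first', '', '', '')
print(sample_set)    # {'first', ''} -- duplicates collapsed, order not guaranteed
print(sample_tuple)  # ('first', '', '', '') -- order and duplicates preserved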
Regarding the missing results, note that h4, h5 and h6 have no matches within the search scope.
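Putting it all together, a minimal end-to-end sketch (assuming the same URL and div class as in the question; writing one row per heading is a variation on the asker's format, not the original layout):

import requests
import bs4
import pandas as pd

url = 'https://www.nidm.net/home/weather/best-air-purifiers/'
soup = bs4.BeautifulSoup(requests.get(url).text, 'lxml')
all_div = soup.find('div', attrs={'class': 'jeg_inner_content'})

# find_all also accepts a list of tag names, returning matches in document order
rows = [{'level': tag.name, 'heading': tag.text.strip()}
        for tag in all_div.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])]

pd.DataFrame(rows).to_csv('Data1.csv', index=False)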