I want to store HTML table data in a CSV file.
I wrote the following code using Python, selenium, BeautifulSoup, pandas, tabulate, and numpy.
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
from tabulate import tabulate
import numpy as np
# --- some code omitted here ---
datalist2 = []
for i in range(1, total+1):
    xpath="/html/body/div[3]/table/tbody/tr/td[2]/div[2]/table/tbody/tr["+str(i)+"]/td[1]/a/img"
    driver.find_element_by_xpath(xpath).click()
    print("Open button " + str(i) + " Clicked")
    soup_level2=BeautifulSoup(driver.page_source, 'lxml')
    table2=soup_level2.find_all('table')[0]
    df2=pd.read_html(str(table2),header=0)
    datalist2.append(df2[0])
    driver.execute_script("window.history.go(-1)")
    print("moving_back_to_previous_page")
for i in range(len(datalist2)):
    print(tabulate(datalist2[i]))
#text_file=open("output.csv","w")
#text_file.write(str(datalist2))
#text_file.close()
#print("report generated and saved")
#np.savetxt("output.csv", datalist2, delimiter=",", fmt='%s')
The line print(tabulate(datalist2[i])) displays the table data in the console.
Sample output of print(tabulate(datalist2[i])):
0 Date Crashed nan 2018-10-09 07:56:49 UTC
1 Date Reported nan 2018-10-09 07:56:57 UTC
2 Date Built nan 2018-06-06 01:26:35 UTC
3 Crash Reason nan SIGSEGV
4 Crash Addr nan 0x0
5 Dump file name nan 9556393da77a562fa086b0147a37106c6ff4bb76_mac14B7F66_dat2018-10-09-07-56-49_boxXB6_modC40COM_54dc2dd1-9abe-a568-1e3119e4-1908ccb0.dmp.tgz
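The line text_file.write(str(datalist2)) writes datalist2 to the CSV file, but it has a problem: long text is not written in full. For example, the dump file name at index 5 is truncated.
Output of text_file.write(str(datalist2)):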
0 Date Crashed NaN 2018-10-09 07:56:49 UTC
1 Date Reported NaN 2018-10-09 07:56:57 UTC
2 Date Built NaN 2018-06-06 01:26:35 UTC
3 Crash Reason NaN SIGSEGV
4 Crash Addr NaN 0x0
5 Dump file name NaN 9556393da77a562fa086b0147a37106c6ff4bb76_mac14...
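I also want to remove the index column and the second column, which contains only 'nan' values, and then store this data in a CSV file. How can I do that?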
Answer 0: (score: -1)
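This happens because str(datalist2) goes through pandas' display formatting, and the default maximum column width (display.max_colwidth) is 50 characters. You can set it to the maximum length you need, or set it to None to remove the limit entirely (older pandas versions used -1 for this, which has since been deprecated).
Add the following line before writing:
pd.set_option('display.max_colwidth', None)
For details, see: https://pandas.pydata.org/pandas-docs/stable/options.html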
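As a minimal sketch of why the text gets cut off, the snippet below compares the string representation of a DataFrame with what to_csv produces. The DataFrame and the long file name are made-up stand-ins, not the data from the question.

```python
import pandas as pd

# Hypothetical stand-in for one scraped table; the long value is invented.
long_name = "dump_" + "x" * 80 + ".dmp.tgz"
df = pd.DataFrame([["Crash Reason", "SIGSEGV"], ["Dump file name", long_name]])

# str(df) uses the display options, so cells longer than the limit are cut with "...".
truncated = str(df)

# option_context lifts the limit only inside this block and restores it afterwards.
with pd.option_context("display.max_colwidth", None):
    full = str(df)

# to_csv does not consult the display options; it writes the complete value.
csv_text = df.to_csv(index=False, header=False)
```

In other words, the display option only matters if you write the printed representation to the file. Writing with to_csv sidesteps the issue altogether.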
Answer 1: (score: -1)
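To remove the index column and the second column (the one whose empty values are displayed as "NaN"), collect the cell text directly and skip empty cells: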
soup_level2=BeautifulSoup(driver.page_source, 'lxml')
table2=soup_level2.find_all('table')[0]
table_body = table2.find_all('tbody')[0]
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    datalist2.append([ele for ele in cols if ele]) # Get rid of empty values
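The following code exports the tabular data to a file (note that tablefmt="tsv" produces tab-separated values rather than comma-separated ones):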
content2=tabulate(datalist2, tablefmt="tsv")
text_file=open("output.csv","w")
text_file.write(content2)
text_file.close()
Long text is now written in full as well.
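If you want actual comma-separated output without going through tabulate, the standard library csv module is another option. This is only a sketch: the rows and the file name output_rows.csv are hypothetical, chosen to mimic the [label, value] lists built above.

```python
import csv

# Hypothetical rows in the shape produced by the loop above: [label, value].
rows = [
    ["Crash Reason", "SIGSEGV"],
    ["Crash Addr", "0x0"],
    ["Note", "value, with a comma"],
]

# newline="" lets the csv module control line endings itself.
with open("output_rows.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

# Reading the file back yields the same rows, embedded comma included.
with open("output_rows.csv", newline="", encoding="utf-8") as f:
    round_trip = list(csv.reader(f))
```

The writer quotes any field that contains the delimiter, so values with commas survive a round trip, which plain string concatenation would not guarantee.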
The following code uses numpy to export datalist2 to CSV:
np.savetxt("output_np.csv", datalist2, delimiter=",", fmt='%s')
The following code uses pandas to export datalist2 to CSV:
my_df=pd.DataFrame(datalist2)
my_df.to_csv('output.csv', index=False, header=False)
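If you would rather keep the DataFrames produced by pd.read_html, as in the question, a possible alternative is to concatenate them, drop the columns that are entirely NaN, and write the result without the index. The frames below are invented stand-ins for the contents of datalist2, so treat this as a sketch under that assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for two frames collected in datalist2.
frames = [
    pd.DataFrame([["Date Crashed", np.nan, "2018-10-09 07:56:49 UTC"],
                  ["Crash Reason", np.nan, "SIGSEGV"]]),
    pd.DataFrame([["Date Crashed", np.nan, "2018-10-10 11:02:13 UTC"],
                  ["Crash Reason", np.nan, "SIGABRT"]]),
]

# Stack all tables into one frame with a fresh 0..n-1 index.
combined = pd.concat(frames, ignore_index=True)

# Drop only the columns in which every value is NaN.
cleaned = combined.dropna(axis=1, how="all")

# index=False omits the index column; header=False omits the column labels.
csv_text = cleaned.to_csv(index=False, header=False)
```

Passing a file name to to_csv instead of calling it without a path writes the same text to disk.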