I'm trying to parse a web page that is a plain-text document served as HTML, so I tried using BeautifulSoup to extract the text and build a list of its lines, but I can't get it to work. The data looks like this:
Piche;Temp Comp Media;Umidade Relativa Media;Velocidade do Vento Media;
83702;01/01/2015;0000;;;;;;;73.5;3.333333;
83702;06/01/2016;1200;5;;;;;;;;
83702;07/01/2016;0000;;;;;;;76.25;2.40072;
83702;01/02/2016;1200;15.2;;;;;;;;
The part I'm interested in:
soup = BeautifulSoup(a.content, 'html.parser')
soup = soup.find_all('pre')
text = []
for i in soup:
    print(i)
    text.append(i)
Ideally, I'd build a DataFrame and save it as a CSV.
But it didn't work: it put everything into a single list entry.
Answer 0 (score: 2)
BS is very useful for HTML tags, but here you are mostly working with plain text, so use string functions like split('\n') and slicing [start_row:end_row].

Your HTML text:
content = '''<body>
<pre>
--------------------
BDMEP - INMET
--------------------
Estação : PONTA PORA - MS (OMM: 83702)
Latitude (graus) : -22.55
Longitude (graus) : -55.71
Altitude (metros): 650.00
Estação Operante
Inicio de operação: 24/11/1941
Periodo solicitado dos dados: 01/01/2015 a 17/11/2016
Os dados listados abaixo são os que encontram-se digitados no BDMEP
Hora em UTC
--------------------
Obs.: Os dados aparecem separados por ; (ponto e vírgula) no formato txt.
Para o formato planilha XLS,
<a href="instrucao.html" target="_top" rel="facebox">siga as instruções</a>
--------------------
Estacao;Data;Hora;Precipitacao;TempMaxima;TempMinima;Insolacao;Evaporacao Piche;Temp Comp Media;Umidade Relativa Media;Velocidade do Vento Media;
83702;01/01/2015;0000;;;;;;;73.5;3.333333;
83702;06/01/2016;1200;5;;;;;;;;
83702;07/01/2016;0000;;;;;;;76.25;2.40072;
83702;01/02/2016;1200;15.2;;;;;;;;
</pre>
</body>'''
and
from bs4 import BeautifulSoup
soup = BeautifulSoup(content, 'html.parser')
text = soup.find('pre').text
lines = text.split('\n')
print(lines[-6:-1])
or in one line:
print(content.split('\n')[-7:-2])
If the table had more rows, you could search for the last -------------------- to find where the table starts:
last = content.rfind('--------------------')
lines = content[last:].split('\n')
print(lines[1:-2])
Now you can split the rows into columns with split(';') to create data for pandas :)
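A minimal sketch of that split(';') approach, using rows copied from the data above (note that the trailing ';' on every line yields one extra, empty column after splitting):

```python
import pandas as pd

# sample lines copied from the table above (header + two data rows)
lines = [
    "Estacao;Data;Hora;Precipitacao;TempMaxima;TempMinima;Insolacao;Evaporacao Piche;Temp Comp Media;Umidade Relativa Media;Velocidade do Vento Media;",
    "83702;01/01/2015;0000;;;;;;;73.5;3.333333;",
    "83702;06/01/2016;1200;5;;;;;;;;",
]

# split every line on ';' into a list of columns
rows = [line.split(';') for line in lines]

# first row is the header, the rest are data
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df[['Estacao', 'Data', 'Umidade Relativa Media']])
```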
Or use io.StringIO to create a file-like object in memory and use pd.read_csv():
import pandas as pd
import io
last = content.rfind('--------------------')
lines = content[last:].split('\n')[1:-2]
# create one string with table
text = '\n'.join(lines)
# create file-like object with text
fileobject = io.StringIO(text)
# use file-like object with read_csv()
df = pd.read_csv(fileobject, delimiter=';')
print(df)
or
import pandas as pd
import io
start = content.rfind('--------------------')
start += len('--------------------')
end = content.rfind('</pre>')
text = content[start:end]
fileobject = io.StringIO(text)
df = pd.read_csv(fileobject, delimiter=';')
print(df)
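Once the DataFrame is built, saving it as a CSV (the asker's stated goal) is a single call; a sketch, with output.csv as an example filename:

```python
import io
import pandas as pd

# a small semicolon-separated sample shaped like the station data above
text = (
    "Estacao;Data;Hora;Precipitacao\n"
    "83702;01/01/2015;0000;\n"
    "83702;06/01/2016;1200;5\n"
)

# parse the ';'-separated text, then write it back out comma-separated
df = pd.read_csv(io.StringIO(text), delimiter=';')
df.to_csv('output.csv', index=False)
```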
Answer 1 (score: 0)
You need the re module for this job.

In:
import re
re.findall(r'\w+;.+\n', string=html)
Out:
['Estacao;Data;Hora;Precipitacao;TempMaxima;TempMinima;Insolacao;Evaporacao Piche;Temp Comp Media;Umidade Relativa Media;Velocidade do Vento Media;\n',
'83702;01/01/2015;0000;;;;;;;73.5;3.333333;\n',
'83702;06/01/2016;1200;5;;;;;;;;\n',
'83702;07/01/2016;0000;;;;;;;76.25;2.40072;\n',
'83702;01/02/2016;1200;15.2;;;;;;;;\n']
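Those matched lines can be joined back into one string and handed to pandas; a sketch, assuming the page source is in a variable named html:

```python
import io
import re

import pandas as pd

# a trimmed stand-in for the real page source
html = '''<pre>
--------------------
Estacao;Data;Hora;Precipitacao;TempMaxima
83702;01/01/2015;0000;;
83702;06/01/2016;1200;5;
</pre>'''

# grab the header and data lines; the dashed separator contains no word
# characters, so the pattern skips it
lines = re.findall(r'\w+;.+\n', html)

# join the lines back together and parse them as a ';'-separated table
df = pd.read_csv(io.StringIO(''.join(lines)), delimiter=';')
print(df)
```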