Parsing a webpage that is all text

Date: 2016-11-17 22:00:46

Tags: python text beautifulsoup

I am trying to parse a webpage that is a plain-text document wrapped in HTML, so I tried using BeautifulSoup to extract the text into a list, but I could not.

    Piche;Temp Comp Media;Umidade Relativa Media;Velocidade do Vento Media;
    83702;01/01/2015;0000;;;;;;;73.5;3.333333;
    83702;06/01/2016;1200;5;;;;;;;;
    83702;07/01/2016;0000;;;;;;;76.25;2.40072;
    83702;01/02/2016;1200;15.2;;;;;;;;

This is the part I am interested in. My attempt:

    soup = BeautifulSoup(a.content, 'html.parser')
    soup = soup.find_all('pre')
    text = []
    for i in soup:
        print(i)
        text.append(i)

Ideally, I would build a DataFrame and save it as a CSV.


But it did not work: it turns all the entries in the list into a single entry.

2 Answers:

Answer 0 (score: 2)

BeautifulSoup is very useful for HTML tags, but here you are working mostly with plain text, so use string functions such as split('\n') and slicing [start_row:end_row].
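A minimal illustration of the idea on a made-up string (no parser involved):

```python
# Split a multi-line string into lines, then slice out the rows you want
text = 'header\n-----\nrow1\nrow2\ntrailer'
lines = text.split('\n')
print(lines[2:4])  # -> ['row1', 'row2']
```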

Your HTML text:

content = '''<body>
    <pre>
    --------------------
    BDMEP - INMET
    --------------------
    Estação           : PONTA PORA - MS (OMM: 83702)
    Latitude  (graus) : -22.55
    Longitude (graus) : -55.71
    Altitude  (metros): 650.00
    Estação Operante
    Inicio de operação: 24/11/1941
    Periodo solicitado dos dados: 01/01/2015 a 17/11/2016
    Os dados listados abaixo são os que encontram-se digitados no BDMEP
    Hora em UTC
    --------------------
    Obs.: Os dados aparecem separados por ; (ponto e vírgula) no formato txt.
     Para o formato planilha XLS, 
    <a href="instrucao.html" target="_top" rel="facebox">siga as instruções</a>
    --------------------
Estacao;Data;Hora;Precipitacao;TempMaxima;TempMinima;Insolacao;Evaporacao Piche;Temp Comp Media;Umidade Relativa Media;Velocidade do Vento Media;
83702;01/01/2015;0000;;;;;;;73.5;3.333333;
83702;06/01/2016;1200;5;;;;;;;;
83702;07/01/2016;0000;;;;;;;76.25;2.40072;
83702;01/02/2016;1200;15.2;;;;;;;;
    </pre>    
</body>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(content, 'html.parser')
text = soup.find('pre').text
lines = text.split('\n')
print(lines[-6:-1])

Or in one line:

print(content.split('\n')[-7:-2])

If the table has more rows, you can search for the last -------------------- to find the start of the table:

last = content.rfind('    --------------------')
lines = content[last:].split('\n')
print(lines[1:-2])

Now you can split the rows into columns with split(';') to build the data for pandas :)
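For example, a minimal sketch with a few hand-copied rows from the table above (the list is hardcoded here just for illustration):

```python
import pandas as pd

# Example rows in the same ';'-separated layout as the BDMEP table
lines = [
    'Estacao;Data;Hora;Precipitacao',
    '83702;01/01/2015;0000;',
    '83702;06/01/2016;1200;5',
]

# Split every line into columns; the first line is the header
rows = [line.split(';') for line in lines]
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)
```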

Or use io.StringIO to create a file-like object in memory and read it with pd.read_csv():

import pandas as pd
import io

last = content.rfind('    --------------------')

lines = content[last:].split('\n')[1:-2]

# create one string with table
text = '\n'.join(lines)

# create file-like object with text
fileobject = io.StringIO(text)

# use file-like object with read_csv()
df = pd.read_csv(fileobject, delimiter=';')

print(df)

The same approach, but slicing between the last separator line and the closing </pre> tag:

import pandas as pd
import io

start = content.rfind('    --------------------')
start += len('    --------------------')
end   = content.rfind('    </pre>')

text = content[start:end]

fileobject = io.StringIO(text)

df = pd.read_csv(fileobject, delimiter=';')

print(df)
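Since the stated goal was to save the result as a CSV, the DataFrame can then be written out with to_csv (a sketch with a shortened sample table; pass a filename instead of capturing the string to write a file):

```python
import io
import pandas as pd

# Shortened sample of the ';'-separated table
text = 'Estacao;Data;Hora\n83702;01/01/2015;0000\n83702;06/01/2016;1200\n'
df = pd.read_csv(io.StringIO(text), delimiter=';')

# to_csv(index=False) drops the row index; with no path it returns a string
csv_text = df.to_csv(index=False)
print(csv_text)
```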

Answer 1 (score: 0)

You need re to do this job.

In:

import re

re.findall(r'\w+;.+\n', string=html)

Out:

['Estacao;Data;Hora;Precipitacao;TempMaxima;TempMinima;Insolacao;Evaporacao Piche;Temp Comp Media;Umidade Relativa Media;Velocidade do Vento Media;\n',
 '83702;01/01/2015;0000;;;;;;;73.5;3.333333;\n',
 '83702;06/01/2016;1200;5;;;;;;;;\n',
 '83702;07/01/2016;0000;;;;;;;76.25;2.40072;\n',
 '83702;01/02/2016;1200;15.2;;;;;;;;\n']
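The lines returned by re.findall can then be joined and fed straight to pandas, the same way as in the other answer (a sketch; the html string here is a shortened stand-in for the real page source):

```python
import io
import re
import pandas as pd

# Shortened stand-in for the page source
html = '''<pre>
Estacao;Data;Hora;Precipitacao
83702;01/01/2015;0000;
83702;06/01/2016;1200;5
</pre>'''

# Same pattern as above: lines of ';'-separated fields
lines = re.findall(r'\w+;.+\n', html)
df = pd.read_csv(io.StringIO(''.join(lines)), delimiter=';')
print(df)
```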