How do I create headers and columns in pandas from a dataset?

Time: 2018-07-03 20:56:25

Tags: python pandas dataframe

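I know how to hard-code the header names, but I need to generate them from an array. Is that possible?

My data is scraped dynamically, so I can't hard-code the headers or the columns.

results_headings contains strings such as Animal, Mineral, Vegetable

results_columns contains strings such as Bear, Quartz and Brocolli

My code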

#Imports
from bs4 import BeautifulSoup
import requests
import pandas as pd 

#Specify URL & Assign to page object
url = 'http://www.example.com'
page = requests.get(url)

#Grab our page as text and parse it
soup = BeautifulSoup(page.text, 'html.parser')   #Use the HTML Parser

#Find our information
boxinfo = soup.find("div", {"class": "box1"})   #The div in the markup below carries class="box1", not an id
headings = boxinfo.find_all("td", {"class": "label"})
columns = boxinfo.find_all("td")

#Get the headings
results_headings = []
for result in headings:
    result_NoHTML = result.getText()
    results_headings.append(result_NoHTML)

#Get the columns
results_columns = []
for result2 in columns:
    result2_NoHTML = result2.getText()
    results_columns.append(result2_NoHTML)

df = pd.DataFrame(results_headings, results_columns)   
df.to_csv('index.csv', index=False, encoding='utf-8')

The table structure I want to scrape

<div class="box1">

<table class="table1">

<tr><td class="label">Item1</td><td>Value1</td></tr>

<tr><td class="label">Item2</td><td>Value2</td></tr>

<tr><td class="label">Item3</td><td>Value3</td></tr>

<tr><td class="label">Item4</td><td>Value4</td></tr>

</table>

</div>

3 answers:

Answer 0: (score: 2)

So you have scraped the data and ended up with a data frame like the one below. Note that the columns are still unnamed; the column names appear in the first row, with nothing separating them from the data:

df = pd.DataFrame([['Animal', 'Mineral', 'Vegetable'],
                   ['Bear', 'Quartz', 'Brocolli'],
                   ['Turtle', 'Amethyst', 'Asparagus']])

print(df)

        0         1          2
0  Animal   Mineral  Vegetable
1    Bear    Quartz   Brocolli
2  Turtle  Amethyst  Asparagus

You can build a new data frame starting from the second row and assign the first row as the column names.
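
Promoting the first row to column names, as described above, could be sketched like this. It is a minimal sketch and not necessarily the answerer's original code; the name `df2` is introduced here purely for illustration.

```python
import pandas as pd

# Same unnamed data frame as above: the header strings sit in row 0.
df = pd.DataFrame([['Animal', 'Mineral', 'Vegetable'],
                   ['Bear', 'Quartz', 'Brocolli'],
                   ['Turtle', 'Amethyst', 'Asparagus']])

# Rows from the second one onward become the data,
# and the first row supplies the column names.
df2 = pd.DataFrame(df.iloc[1:].values, columns=df.iloc[0].tolist())

print(df2)
```

Building a fresh data frame from `.values` also resets the row index, so the data rows start at 0 again.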

Answer 1: (score: 0)

You can create the data frame from a dictionary built from results_headings and results_columns

import pandas as pd
results_headings = ['col 1', 'col 2']
results_columns = [('a','bb'), ('ccc','dddd')]
data_dict = {h: c for h, c in zip(results_headings, results_columns)}
df = pd.DataFrame(data_dict)   
df.to_csv('index.csv', index=False, encoding='utf-8')
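
If each heading has exactly one scraped value, as in the label/value table from the question, the values form a single row rather than a set of columns. A minimal sketch under that assumption follows; `results_values` is a hypothetical name for the list of non-label cell texts, and the literal lists stand in for scraped data.

```python
import pandas as pd

# Stand-ins for scraped data: one value per heading, in the same order.
results_headings = ['Item1', 'Item2', 'Item3', 'Item4']
results_values = ['Value1', 'Value2', 'Value3', 'Value4']  # hypothetical name

# Wrap the values in an outer list so they become one row,
# and pass the headings as the column names.
df = pd.DataFrame([results_values], columns=results_headings)

print(df)
```

The headings are taken from the list at run time, so nothing has to be hard-coded.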

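Answer 2: (score: 0)

You can also just use the pandas read_html function and pass it the table ID. I have combined it with bs4: I isolated just the table itself and then passed that HTML to the function.

The documentation describes it well: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html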
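
A rough sketch of the read_html route, assuming the table markup from the question and that an HTML parser backend for pandas (lxml, or bs4 with html5lib) is installed. The table in the question has a class rather than an id, so the sketch selects it with `attrs={"class": "table1"}`; the variable names are illustrative.

```python
from io import StringIO

import pandas as pd

# Markup copied from the question; in practice this would be page.text.
html = """
<div class="box1">
<table class="table1">
<tr><td class="label">Item1</td><td>Value1</td></tr>
<tr><td class="label">Item2</td><td>Value2</td></tr>
<tr><td class="label">Item3</td><td>Value3</td></tr>
<tr><td class="label">Item4</td><td>Value4</td></tr>
</table>
</div>
"""

# read_html returns a list of data frames, one per matching table.
tables = pd.read_html(StringIO(html), attrs={"class": "table1"})
raw = tables[0]

# First column holds the labels, second column holds the values:
# turn the labels into column names and the values into a single row.
df = pd.DataFrame([raw.iloc[:, 1].tolist()], columns=raw.iloc[:, 0].tolist())

print(df)
```

Wrapping the markup in StringIO avoids passing a literal HTML string, which newer pandas versions deprecate.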