Using BeautifulSoup to extract content from a pandas DataFrame column

Time: 2019-06-19 05:19:24

Tags: dataframe beautifulsoup

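I am regularly sent a text file in which one column contains HTML content. I would like to run BeautifulSoup on this column, but resources on how to do that seem limited.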

sample.csv:

id web-scraper-order html_content
0  15636             <div class="product-details detail-row"><div c...
1  15619             <div class="product-details detail-row"><div c...
2  15656             <div class="product-details detail-column"><di...

Desired output:

id web-scraper-order html_content                                       html_content2
0  15636             <div class="product-details detail-row"><div c...  ['EF1744','Grey Three/Off White/Gold Metallic','$120','2019-06-22']
1  15619             <div class="product-details detail-row"><div c...  ['...','...','...','...']
2  15656             <div class="product-details detail-column"><di...  ['...','...','...','...']

Contents of html_content:

<div class="product-details detail-row"><div class="detail"><div class="pinfo-container"><span class="icon"></span><span class="title">Style</span></div><span> EF1744 </span></div><div class="detail"><div class="pinfo-container"><span class="icon"></span><span class="title">Colorway</span></div><span> Grey Three/Off White/Gold Metallic </span></div><div class="detail"><div class="pinfo-container"><span class="icon"></span><span class="title">Retail Price</span></div><span> $120 </span></div><div class="detail"><div class="pinfo-container"><span class="icon"></span><span class="title">Release Date</span></div><span> 2019-06-22 </span></div></div>

The text I want from each row looks like this:

['EF1744','Grey Three/Off White/Gold Metallic','$120','2019-06-22']

I found a similar post here, but since I need to target a specific column, it does not seem to apply to my case.

I tried the following, but with no luck...

import numpy as np
import pandas as pd
import re
from bs4 import BeautifulSoup 

d = pd.read_csv("sample.csv") 
df = pd.DataFrame(d,columns=['web-scraper-order','html_content'])

soup = BeautifulSoup(df['html_content'],'xml')
style = [item.text.strip() for item in soup.find_all('div', class_='detail')]

1 answer:

Answer 0 (score: 0)

You can use the CSS selector .detail > span, which selects every span tag that is a direct child of any tag with class detail:

.detail > span

Prints:

<span>
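
To apply that selector to every row of the DataFrame, one option is to wrap the parsing in a small function and pass it to Series.apply. The following is a minimal sketch, not a definitive implementation: the helper name extract_details, the choice of the built-in "html.parser" backend, and the one-row frame standing in for sample.csv are all assumptions made for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup


def extract_details(html):
    """Return the stripped text of each <span> directly under a .detail element.

    extract_details is a hypothetical helper name, not part of any library.
    """
    # Missing cells arrive as NaN (a float), which BeautifulSoup cannot parse.
    if not isinstance(html, str):
        return []
    soup = BeautifulSoup(html, "html.parser")
    return [span.get_text(strip=True) for span in soup.select(".detail > span")]


# Hypothetical stand-in for one row of sample.csv, copied from the question.
sample_html = (
    '<div class="product-details detail-row">'
    '<div class="detail"><div class="pinfo-container"><span class="icon"></span>'
    '<span class="title">Style</span></div><span> EF1744 </span></div>'
    '<div class="detail"><div class="pinfo-container"><span class="icon"></span>'
    '<span class="title">Colorway</span></div>'
    '<span> Grey Three/Off White/Gold Metallic </span></div>'
    '<div class="detail"><div class="pinfo-container"><span class="icon"></span>'
    '<span class="title">Retail Price</span></div><span> $120 </span></div>'
    '<div class="detail"><div class="pinfo-container"><span class="icon"></span>'
    '<span class="title">Release Date</span></div><span> 2019-06-22 </span></div>'
    '</div>'
)
df = pd.DataFrame({"html_content": [sample_html]})

# Parse each cell on its own; BeautifulSoup expects a string, not a whole Series.
df["html_content2"] = df["html_content"].apply(extract_details)
print(df["html_content2"].iloc[0])
```

With a real file, the same apply call should work on the frame returned by pd.read_csv("sample.csv"), provided the html_content column is read in as text. The child combinator (>) matters here: the Style, Colorway and other title spans sit inside pinfo-container, so they are skipped and only the value spans are returned.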