Web scraping with Python / Beautiful Soup / pandas

Time: 2017-02-08 03:26:03

Tags: python pandas web-scraping beautifulsoup

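I'm new to Python, and I'm using Beautiful Soup to scrape web pages for a project.

I want to extract only part of the text into a list/dictionary. I started with the following code: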

import requests
from bs4 import BeautifulSoup

url = "http://eng.mizon.co.kr/productlist.asp"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
tables = soup.find_all('table')

This lets me parse the data into tables, and one item in those tables looks like this:

<table border="0" cellpadding="0" cellspacing="0" width="235">
<tr>
<td align="center" height="238"><a href="javascript:fnMoveDetail(7499)" onfocus="this.blur()"><img alt="LL IN ONE SNAIL REPAIR CREAM, SNAIL REPAIR BLEMISH BALM, WATERMAX MOISTURE B.B CREAM, WATERMAX AQUA GEL CREAM, CORRECT COMBO CREAM, GOLD STARFISH ALL IN ONE CREAM, S-VENOM WRINKLE TOX CREAM, BLACK SNAIL ALL IN ONE CREAM, APPLE SMOOTHIE PEELING GEL, REAL SOYBEAN DEEP CLEANSING OIL, COLLAGEN POWER LIFTING CREAM, SNAIL RECOVERY GEL CREAM" border="0" src="http://www.mizon.co.kr/images/upload/product/20150428113514_3.jpg" width="240"/></a></td>
</tr>
<tr>
<td align="center" height="43" valign="middle"><a href="javascript:fnMoveDetail(7499)" onfocus="this.blur()"><span class="style3">ENJOY VITAL-UP TIME Lift Up Mask <br/>
                         Volume:25ml</span></a></td>
</tr>
</table>

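For each such item, I want to extract only the following from the last data cell of the table above:

1) The four-digit number in the href, e.g. 7499 in javascript:fnMoveDetail(7499)

2) The name inside the span with class style3

3) The volume inside the span with class style3

The next lines of my code are as follows: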

df = pd.read_html(str(tables), skiprows={0}, flavor="bs4")[0]
a_links = soup.find_all('a', attrs={'class':'style3'})
stnid_dict = {}
for a_link in a_links:
    cid = ((a_link['href'].split("javascript:fnMoveDetail("))[1].split(")")[0])
    stnid_dict[a_link.text] = cid

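My goal is to use these numbers to visit the individual links, and then match the information scraped from this page with each link. What is the best way to approach this?

1 Answer:

Answer 0 (score: 1)

Use the a tag whose href contains the javascript call as an anchor: select every span directly inside such a tag, then get the parent a tag from each span.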

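import requests
from bs4 import BeautifulSoup

url = "http://eng.mizon.co.kr/productlist.asp"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Select each span sitting directly inside a product link within a table cell.
spans = soup.select('td > a[href*="javascript:fnMoveDetail"] > span')
for span in spans:
    # strip() removes a set of characters from both ends, not a prefix/suffix;
    # the ID survives only because it consists of digits.
    href = span.find_parent('a').get('href').strip('javascript:fnMoveDetail()')
    name, volume = span.get_text(strip=True).split('Volume:')
    print(name, volume, href)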

Output:

Dust Clean up Peeling Toner 150ml 8235
Collagen Power Lifting EX Toner 150ml 8067
Collagen Power Lifting EX Emulsion 150ml 8068
Barrier Oil Toner 150ml 8059
Barrier Oil Emulsion 150ml 8060
BLACK CLEAN UP PORE WATER FINISHER 150ml 7650
Vita Lemon Sparkling Toner 150ml 7356
INTENSIVE SKIN BARRIER TONER 150ml 7110
INTENSIVE SKIN BARRIER EMULSION 150ml 7111
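
Since the question mentions matching these rows with the individual product pages later, it may help to keep each row as a record keyed by its ID rather than printing it. The following is a minimal sketch, not part of the original answer: `parse_item` and `DETAIL_ID` are hypothetical names, and the function assumes it is given the anchor's `href` and the text returned by `span.get_text(strip=True)`. It uses a regular expression instead of `strip()` and tolerates cells that have no "Volume:" part. The sample values come from the table shown in the question.

```python
import re

# Matches the numeric argument of the site's javascript:fnMoveDetail(...) href.
DETAIL_ID = re.compile(r"fnMoveDetail\((\d+)\)")


def parse_item(href, text):
    """Return a dict with id, name and volume for one product cell.

    href is the anchor's href attribute; text is the span text as returned
    by span.get_text(strip=True). Missing parts come back as None.
    """
    match = DETAIL_ID.search(href or "")
    # partition never raises, unlike unpacking the result of split().
    name, _, volume = text.partition("Volume:")
    return {
        "id": match.group(1) if match else None,
        "name": name.strip(),
        "volume": volume.strip() or None,
    }


# Values taken from the sample table in the question.
item = parse_item(
    "javascript:fnMoveDetail(7499)",
    "ENJOY VITAL-UP TIME Lift Up MaskVolume:25ml",
)

# Keying the records by id makes it easy to attach detail-page data later.
products = {item["id"]: item}
print(products)
```

Inside the loop above, the call would be `parse_item(span.find_parent('a').get('href'), span.get_text(strip=True))`. How the ID maps to a detail-page URL depends on what the site's `fnMoveDetail` JavaScript function does, which is not shown here.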