Trying to extract only the first post from a page

Date: 2017-02-05 21:51:01

Tags: python parsing beautifulsoup extract

I want to extract only the username of the original post from a forum page, but I can't figure out how to do it. Can someone help me? My code currently extracts the usernames from both the original post and all of the replies.

import bs4 as bs
import urllib.request
import pandas as pd
import urllib.parse
import re


source = urllib.request.urlopen('https://messageboards.webmd.com').read()
soup = bs.BeautifulSoup(source,'lxml')


# collect the board links from the front page
df = pd.DataFrame(columns=['link'], data=[url.a.get('href') for url in soup.find_all('div', class_="link")])
lists =[]
page_links = []
for i in range(0, 1):
    link = df.link.iloc[i]
    req = urllib.request.Request(link)
    resp = urllib.request.urlopen(req)
    respData = resp.read()
    # pull the total-item count out of the raw HTML, then estimate the
    # number of listing pages (10 threads per page)
    temp1 = re.findall(r'Filter by</span>(.*?)data-pagedcontenturl', str(respData))
    temp1 = re.findall(r'data-totalitems=(.*?)data-pagekey', str(temp1))[0]
    pagenum = round(int(re.sub("[^0-9]", "", temp1)) / 10)
    lists.append(pagenum)


    # build a URL for every listing page of this board
    for number in range(1, pagenum + 1):
        url_pages = link + '?pi157388622=' + str(number)
        page_links.append(url_pages)

lists2 = []
df1 = pd.DataFrame(columns=['page'], data=page_links)
# collect the individual thread links from each listing page
for j in range(0, 9):
    page = df1.page.iloc[j]
    url = urllib.request.urlopen(page).read()
    soup1 = bs.BeautifulSoup(url, 'lxml')
    for body_links in soup1.find_all('div', class_="thread-detail"):
        body = body_links.a.get('href')
        lists2.append(body)

usernames=[]

df2 = pd.DataFrame(columns=['post'], data=lists2)
# visit each thread and grab every username on the page -- this is the
# part that picks up repliers as well as the original poster
for y in range(0, 26):
    post = df2.post.iloc[y]
    url_post = urllib.request.urlopen(post).read()
    soup2 = bs.BeautifulSoup(url_post, 'lxml')
    for username in soup2.find_all('div', class_="user-name"):
        usernames.append([username.get_text().strip()])

1 answer:

Answer 0 (score: 0)

In the last few lines, instead of iterating over all the usernames, just take the first one, like this:

username = soup2.find_all('div', class_="user-name")[0].get_text().strip()
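Equivalently, BeautifulSoup's `find` returns only the first matching tag (or `None` if there is no match), which avoids the `[0]` index and an `IndexError` on threads with no match. A minimal sketch with a made-up HTML snippet standing in for one thread page (the original code parses with `'lxml'`; `'html.parser'` is used here only so the example needs no extra package):

```python
import bs4 as bs

# Hypothetical markup standing in for one WebMD thread page:
# the first "user-name" div belongs to the original poster.
html = '''
<div class="user-name"> original_poster </div>
<div class="user-name"> replier_one </div>
'''

soup2 = bs.BeautifulSoup(html, 'html.parser')

# find() returns just the first match, or None when nothing matches
tag = soup2.find('div', class_='user-name')
username = tag.get_text().strip() if tag else None
print(username)  # -> original_poster
```

The `if tag else None` guard is the main difference from `find_all(...)[0]`: a page without a `user-name` div yields `None` instead of raising.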