BeautifulSoup: get text from a script inside an anchor tag

Asked: 2018-12-26 08:13:49

Tags: beautifulsoup

So I have a <tr> tag that has several <td> elements as its children.

<tr>
    <td align='center' class="row2">
        <a href="javascript:who_posted(4713426);">10</a>    
    </td>
    <td align="center" class="row2">
        <a href='https://forum.net/index.php?;showuser=17311'>xxboxx</a>
    </td>
    <td align="center" class="row2"> 
            <!--script type="text/javascript">
            s = "236".replace(/,/g,'');
            document.write(abbrNum(s,1));
            </script-->
            236
    </td>
</tr>

Below is my current code. It works fine as far as it goes, but I want to get the last value while skipping the script. I've tried the approaches suggested in similar Stack Overflow questions, but without success.

def extractDataFromRow2(_url):
    try:
        for container in _url.find_all('td', {'class': 'row2', 'align': 'center'}):
            # get data from topic title in table cell
            replies_numb = container.select_one(
                'a[href^="javascript:"]').text
            print('there are ' + replies_numb + ' replies')
            topic_starter = container.next_sibling.text
            print('the owner of this topic is ' + topic_starter)
            for total_view in container.find('a', href=True, style=True):
                #total_view = container.select_one(style="background-color:").text
                #total_view = container.find(("td")["style"])
                #total_view = container.next_sibling.find_next_sibling/next_sibling
                #but they're not able to access the last one within <tr> tag
                print(total_view)
            if replies_numb and topic_starter is not None:
                dict_replies = {'Replies' : replies_numb}
                dict_topic_S = {'Topic_Starter' : topic_starter}
                list_1.append(dict_replies)
                list_2.append(dict_topic_S)
            else:
                print('no data')
    except Exception as e:
        print('Error.extractDataFromRow2:', e)
        return None
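
For reference, a minimal sketch of one way to read that last cell, assuming the script really is wrapped in an HTML comment as shown in the snippet above: BeautifulSoup parses <!-- ... --> as a Comment node, and get_text() skips comment nodes by default, so only the plain "236" is left once whitespace is stripped.

from bs4 import BeautifulSoup

# Standalone example using only the <tr> snippet from the question.
html = """
<tr>
    <td align='center' class="row2"><a href="javascript:who_posted(4713426);">10</a></td>
    <td align="center" class="row2"><a href='https://forum.net/index.php?;showuser=17311'>xxboxx</a></td>
    <td align="center" class="row2">
        <!--script type="text/javascript">
        s = "236".replace(/,/g,'');
        document.write(abbrNum(s,1));
        </script-->
        236
    </td>
</tr>
"""

soup = BeautifulSoup(html, 'html.parser')
cells = soup.find_all('td', {'class': 'row2', 'align': 'center'})

# get_text() ignores Comment nodes, so the commented-out script disappears
# and only the visible "236" remains once whitespace is stripped.
total_views = cells[-1].get_text(strip=True)
print(total_views)  # 236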

Link of the page I'm trying to get data from.

Is there a cleaner way of doing this? I'd be glad to learn from any feedback given.

2 Answers:

Answer 0 (score: 1)

The HTML you shared is probably not enough to answer the question, so I checked out the URL you shared. Here is a way to scrape that table.

from bs4 import BeautifulSoup
import requests

r = requests.get("https://forum.lowyat.net/ReviewsandGuides")

soup = BeautifulSoup(r.text, 'lxml')

index = 0
# The first two rows of the table are not data, so we skip them; the last row is the
# search row, which we also skip. The table contains 30 rows of data, hence the slice.
for row in soup.select('table[cellspacing="1"] > tr')[2:32]:
    replies = row.select_one('td:nth-of-type(4)').text.strip()
    topic_started = row.select_one('td:nth-of-type(5)').text.strip()
    total_views = row.select_one('td:nth-of-type(6)').text.strip()
    index += 1

    print(index, replies, topic_started, total_views)
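
If you want the same kind of list of dicts the question was building, the printed values can be collected instead. A small sketch reusing the same request and selectors as above (the key names are just illustrative, not from the original code):

from bs4 import BeautifulSoup
import requests

r = requests.get("https://forum.lowyat.net/ReviewsandGuides")
soup = BeautifulSoup(r.text, 'lxml')

results = []
# same row selection and slicing as in the loop above
for row in soup.select('table[cellspacing="1"] > tr')[2:32]:
    results.append({
        'Replies': row.select_one('td:nth-of-type(4)').text.strip(),
        'Topic_Starter': row.select_one('td:nth-of-type(5)').text.strip(),
        'Total_Views': row.select_one('td:nth-of-type(6)').text.strip(),
    })

print(results[:3])  # show the first few rows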

Answer 1 (score: 1)

Note that you have to use the lxml parser here, otherwise you will get an error.

import requests
from bs4 import BeautifulSoup

def extractDataFromRow2(url):
    results = []
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'lxml')
    for row in soup.select('#forum_topic_list tr'):
        cols = row.select('td')
        if len(cols) != 7:
            continue
        cols[2] = cols[2].find('a')  # the title cell wraps its text in an <a>; use that tag
        values = [c.text.strip() for c in cols]
        results.append({
          'Title' : values[2],
          'Replies' : values[3],
          'Topic_Starter' : values[4],
          'total_view: ' : values[5]
        })
    return results

threadlists = extractDataFromRow2('https://forum.....')
print(threadlists)
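
The formatted result below looks like pretty-printed JSON rather than the raw print() output; something along these lines would produce it (an assumption, using the standard json module):

import json

# pretty-print the list of dicts returned by extractDataFromRow2
print(json.dumps(threadlists, indent=2, ensure_ascii=False))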

Result:

[
  {
    "Title": "Xiaomi 70Mai Pro",
    "Replies": "148",
    "Topic_Starter": "blurjoey",
    "total_view: ": "9,996"
  },
  {
    "Title": "Adata XPG SX8200 Pro 512GB NVME SSD",
    "Replies": "10",
    "Topic_Starter": "xxboxx",
    "total_view: ": "265"
  },
  ....
]