Extract the data between &quot; inside the title tag using BeautifulSoup?

Time: 2016-09-21 18:36:01

Tags: python css-selectors beautifulsoup html-parser

I want to extract the title of a link in Python after fetching its HTML with the BeautifulSoup library. Basically, the whole title tag is

<title>Imaan Z Hazir on Twitter: &quot;Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)&quot;</title>

I want to extract only the data between the &quot; entities, that is:

Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)

I tried

import urllib
import urllib.request

from bs4 import BeautifulSoup

link = "https://twitter.com/ImaanZHazir/status/778560899061780481"
try:
    List=list()
    r = urllib.request.Request(link, headers={'User-Agent': 'Chrome/51.0.2704.103'})
    h = urllib.request.urlopen(r).read()
    data = BeautifulSoup(h,"html.parser")
    for i in data.find_all("title"):
        List.append(i.text)
        print(List[0])
except urllib.error.HTTPError as err:
    pass

I also tried

for i in data.find_all("title.&quot"):

for i in data.find_all("title>&quot"):

for i in data.find_all("&quot"):

But none of them worked.

3 Answers:

Answer 0: (score: 0)

Once you have parsed the HTML:

data = BeautifulSoup(h,"html.parser")

find the title like this:

title = data.find("title").string  # this is without <title> tag

Now find the two quote characters (") in the string. There are many ways to do that. I would use a regex:

import re
match = re.search(r'".*"', title)
if match:
    print(match.group(0))

You never need to search for &quot; or any other &NAME; sequence, because BeautifulSoup converts them into the actual characters they represent.
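To illustrate the point about entities, here is a minimal sketch that parses the title markup from the question as a literal string (no network request) and checks what the parsed text contains. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Title markup copied from the question, used here as a literal string.
html_doc = (
    "<title>Imaan Z Hazir on Twitter: &quot;Guantanamo and Abu Ghraib, "
    "financial and military support to dictators in Latin America during "
    "the cold war. REALLY, AMERICA? (3)&quot;</title>"
)

soup = BeautifulSoup(html_doc, "html.parser")
title = soup.title.string

# After parsing, the entity text is gone and real quote characters remain.
has_entity = "&quot;" in title
quote_count = title.count('"')
print(has_entity, quote_count)
```

This is why selectors such as "title.&quot" cannot match anything: &quot; is not a tag, only an encoded character inside the title text.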

Edit

A regex that does not capture the quotes would be:

re.search(r'(?<=").*(?=")', title)
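As a quick comparison of the two patterns from this answer, the sketch below runs both against the already-decoded title text from the question. It assumes the title has been obtained as a plain string; nothing is fetched.

```python
import re

# Decoded title text from the question (entities already converted).
title = (
    'Imaan Z Hazir on Twitter: "Guantanamo and Abu Ghraib, financial and '
    'military support to dictators in Latin America during the cold war. '
    'REALLY, AMERICA? (3)"'
)

# Pattern 1: the match includes the surrounding quote characters.
with_quotes = re.search(r'".*"', title)

# Pattern 2: lookbehind and lookahead require the quotes to be present
# but leave them out of the matched text.
without_quotes = re.search(r'(?<=").*(?=")', title)

if with_quotes and without_quotes:
    print(with_quotes.group(0))
    print(without_quotes.group(0))
```

Both patterns are greedy, so if the text contained more than two quote characters they would span from the first quote to the last one.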

Answer 1: (score: 0)

Here is a simple, complete example that uses a regex to extract the text inside the quotes:

import urllib.request
import re

from bs4 import BeautifulSoup

link = "https://twitter.com/ImaanZHazir/status/778560899061780481"
r = urllib.request.urlopen(link)
soup = BeautifulSoup(r, "html.parser")
title = soup.title.string
quote = re.match(r'^.*\"(.*)\"', title)
print(quote.group(1))

What happens here is that, after fetching the page source and finding the title, we apply a regex to the title to extract the text inside the quotes.

We tell the regex to look for any number of characters at the start of the string (^.*) before the opening quote (\"), and then to capture the text between it and the closing quote (the second \").

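Then we print the captured text by telling Python to print the first captured group (the part of the regex between the parentheses).

There is more information about match objects in Python regexes here - https://docs.python.org/3/library/re.html#match-objects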
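One thing the example above does not handle is a title with no quotes, in which case re.match returns None and calling .group(1) raises an AttributeError. The following is a rough sketch of a guard around the capture-group idea. The helper name extract_quoted is hypothetical, and it uses re.search with a simpler pattern than the answer's, so it captures everything between the first and the last quote.

```python
import re
from typing import Optional


def extract_quoted(title: str) -> Optional[str]:
    """Return the text between the first and last double quote, or None."""
    match = re.search(r'"(.*)"', title)
    # group(1) is the part of the pattern inside the parentheses.
    return match.group(1) if match else None


# Shortened sample title, for illustration only.
sample = 'Imaan Z Hazir on Twitter: "REALLY, AMERICA? (3)"'
print(extract_quoted(sample))
print(extract_quoted("A title with no quotes"))
```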

Answer 2: (score: 0)

Just split the text on the colon:

In [1]:  h = """<title>Imaan Z Hazir on Twitter: &quot;Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)&quot;</title>"""

In [2]: from bs4 import BeautifulSoup

In [3]: soup  = BeautifulSoup(h, "lxml")

In [4]: print(soup.title.text.split(": ", 1)[1])
 "Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)"

Actually, looking at the page, you don't need to split at all: the text is in a p tag inside div.js-tweet-text-container:

In [8]: import requests

In [9]: from bs4 import BeautifulSoup


In [10]: soup  = BeautifulSoup(requests.get("https://twitter.com/ImaanZHazir/status/778560899061780481").content, "lxml")


In [11]: print(soup.select_one("div.js-tweet-text-container p").text)
Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)

In [12]: print(soup.title.text.split(": ", 1)[1])
"Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)"

So you can do it either way.
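For reference, both approaches can be tried offline with a small sketch like the one below. The markup is a hypothetical, simplified stand-in for the page (with a shortened tweet text), and the live page's structure may differ, so the selector is an assumption carried over from this answer. The None checks are an addition for pages where the title has no ": " or the selector matches nothing.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the fetched page.
html_doc = """
<html>
<head><title>Imaan Z Hazir on Twitter: &quot;REALLY, AMERICA? (3)&quot;</title></head>
<body>
<div class="js-tweet-text-container"><p>REALLY, AMERICA? (3)</p></div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Approach 1: split the title once on ": " and drop the surrounding quotes.
parts = soup.title.get_text().split(": ", 1)
from_title = parts[1].strip('"') if len(parts) == 2 else None

# Approach 2: select the p tag inside the tweet container directly.
node = soup.select_one("div.js-tweet-text-container p")
from_body = node.get_text() if node is not None else None

print(from_title)
print(from_body)
```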