I want to extract the title of a link after fetching its HTML with the BeautifulSoup library in Python.
Basically, the whole title tag is
<title>Imaan Z Hazir on Twitter: "Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)"</title>
I want to extract only the data between the &quot; marks, that is
Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)
I tried:
import urllib
import urllib.request
from bs4 import BeautifulSoup
link = "https://twitter.com/ImaanZHazir/status/778560899061780481"
try:
    List = list()
    r = urllib.request.Request(link, headers={'User-Agent': 'Chrome/51.0.2704.103'})
    h = urllib.request.urlopen(r).read()
    data = BeautifulSoup(h, "html.parser")
    for i in data.find_all("title"):
        List.append(i.text)
        print(List[0])
except urllib.error.HTTPError as err:
    pass
I also tried replacing the find_all line with each of these:
for i in data.find_all("title.&quot"):
for i in data.find_all("title>&quot"):
for i in data.find_all("&quot"):
But none of them work.
Answer 0 (score: 0)
Once you have parsed the HTML:
data = BeautifulSoup(h,"html.parser")
find the title like this:
title = data.find("title").string # this is without <title> tag
Now find the two quotation marks (") in that string. There are many ways to do this; I would use a regular expression:
import re
match = re.search(r'".*"', title)
if match:
    print(match.group(0))  # the match includes the surrounding quotes
You will never need to search for &quot; or any other &NAME; sequence, because BeautifulSoup converts them to the actual characters they represent.
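To see what that conversion amounts to, here is a minimal sketch using the standard library's html.unescape, which decodes the same named character references. It stands in for what the parser does and is not BeautifulSoup's own code path. The raw_title value is a hypothetical, shortened version of the title as it might appear in the page source.

```python
import html

# Hypothetical raw <title> contents as they might appear in the page source.
raw_title = 'Imaan Z Hazir on Twitter: &quot;REALLY, AMERICA? (3)&quot;'

# Decode named character references such as &quot; into real characters.
decoded = html.unescape(raw_title)
print(decoded)

# After decoding, only the literal quote character is left to search for.
print("&quot;" in decoded)
print(decoded.count('"'))
```

This is why searching the parsed title for the entity text finds nothing, while searching for a plain double quote does.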
Edit:
A regex that does not capture the quotes would be:
re.search(r'(?<=").*(?=")', title)
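The two patterns from this answer can be compared side by side on a plain string, with no network access. This is only a sketch: the title below is a shortened stand-in for the real one, and the variable names are made up for illustration.

```python
import re

# Shortened stand-in for the parsed title string (assumption, not the full tweet).
title = 'Imaan Z Hazir on Twitter: "Guantanamo and Abu Ghraib. REALLY, AMERICA? (3)"'

# Pattern 1: the match itself contains the opening and closing quotes.
with_quotes = re.search(r'".*"', title)

# Pattern 2: lookbehind/lookahead require the quotes but leave them out of the match.
without_quotes = re.search(r'(?<=").*(?=")', title)

if with_quotes and without_quotes:
    print(with_quotes.group(0))
    print(without_quotes.group(0))
```

Both patterns are greedy, so each one spans from the first quote to the last quote in the string.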
Answer 1 (score: 0)
Here is a simple, complete example that uses a regular expression to extract the text inside the quotes:
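import urllib.request
import re
from bs4 import BeautifulSoup
link = "https://twitter.com/ImaanZHazir/status/778560899061780481"
r = urllib.request.urlopen(link)
soup = BeautifulSoup(r, "html.parser")
title = soup.title.string
quote = re.match(r'^.*\"(.*)\"', title)
print(quote.group(1))
What happens here is that, after fetching the page source and finding the title, we run a regular expression on the title to extract the text inside the quotes.
We tell the regular expression to match any number of characters from the start of the string (^.*) up to the opening quote (\"), then capture the text between it and the closing quote (the second \").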
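We then print the captured text by telling Python to print the first captured group (the part between the parentheses in the regex).
There is more information on matching regular expressions in Python here: https://docs.python.org/3/library/re.html#match-objects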
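One thing to watch for with this approach is a title that contains no quotes, or a tweet that itself contains quotes. The sketch below is a variation, not the code from this answer: extract_quoted is a hypothetical helper that returns None when nothing matches instead of failing on .group(1), and it drops the leading ^.* so that the capture runs from the first quote to the last one. With the leading greedy ^.*, a title holding more than two quotes would yield only the last quoted segment.

```python
import re
from typing import Optional


def extract_quoted(title: str) -> Optional[str]:
    """Return the text between the first and last double quote, or None."""
    match = re.search(r'"(.*)"', title)
    # re.search returns None when there is no match, so guard before .group().
    return match.group(1) if match else None


print(extract_quoted('Imaan Z Hazir on Twitter: "REALLY, AMERICA? (3)"'))
print(extract_quoted('A title without any quotes'))
```

The sample titles are shortened or invented for the illustration.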
Answer 2 (score: 0)
Just split the text on the colon:
In [1]: h = """<title>Imaan Z Hazir on Twitter: "Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)"</title>"""
In [2]: from bs4 import BeautifulSoup
In [3]: soup = BeautifulSoup(h, "lxml")
In [4]: print(soup.title.text.split(": ", 1)[1])
"Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)"
Actually, looking at the page, you don't need to split at all: the text is in a p tag inside div.js-tweet-text-container:
In [8]: import requests
In [9]: from bs4 import BeautifulSoup
In [10]: soup = BeautifulSoup(requests.get("https://twitter.com/ImaanZHazir/status/778560899061780481").content, "lxml")
In [11]: print(soup.select_one("div.js-tweet-text-container p").text)
Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)
In [12]: print(soup.title.text.split(": ", 1)[1])
"Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)"
So you can do it either way.
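If you go with the split on the colon and also want the surrounding quotes removed, something along these lines might do. It is a sketch that works on the title string only; tweet_text_from_title is a hypothetical helper name, and the sample titles are shortened or invented.

```python
def tweet_text_from_title(title: str) -> str:
    """Return the part after the first ': ', without one pair of outer quotes."""
    # partition splits on the first ": " only, so colons inside the tweet survive.
    _, sep, tail = title.partition(": ")
    text = tail if sep else title  # no separator found: keep the whole title
    # Remove one pair of surrounding double quotes if both are present.
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        text = text[1:-1]
    return text


print(tweet_text_from_title('Imaan Z Hazir on Twitter: "REALLY, AMERICA? (3)"'))
print(tweet_text_from_title('Someone on Twitter: "Note: colons are kept"'))
```

Slicing off exactly one character at each end avoids str.strip('"'), which would also eat quotes that belong to the tweet itself.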