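I'm trying to get the lines that contain the word 'billion' from an HTML page parsed with BeautifulSoup, but I get an empty list. The lines all sit inside <li> tags, so I also tried soup.findAll("<li>", {"class": "tabcontent"}), but that returned an empty list as well.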
import requests
from bs4 import BeautifulSoup
import re
url = 'http://www.worldstopexports.com/united-states-top-10-exports/'
header = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
"X-Requested-With": "XMLHttpRequest"
}
page = requests.get(url, headers=header)
soup = BeautifulSoup(page.text, 'lxml')
table = soup.find_all(class_='tabcontent')[0].text
print(re.findall(r'^.*? billion', table))
print(table)
Machinery including computers: US$201.7 billion (13% of total exports)
Electrical machinery, equipment: $174.2 billion (11.3%)
Mineral fuels including oil: $138 billion (8.9%)
Aircraft, spacecraft: $131.2 billion (8.5%)
Vehicles: $130.1 billion (8.4%)
Optical, technical, medical apparatus: $83.6 billion (5.4%)
Plastics, plastic articles: $61.5 billion (4%)
Gems, precious metals: $60.4 billion (3.9%)
Pharmaceuticals: $45.1 billion (2.9%)
Organic chemicals: $36.2 billion (2.3%)
Answer 0 (score: 3)
You can use select() to get the tab first, then take its li children and their text:
# ... right under soup = BeautifulSoup (page.text, 'lxml') ...
# select the first tab
tab = soup.select('div.tabcontent')[0]
# select its items
items = [text
         for item in tab.select('li')
         for text in [item.text]
         if "billion" in text]
print(items)
This yields:
['Machinery including computers: US$201.7 billion (13% of total exports)', 'Electrical machinery, equipment: $174.2 billion (11.3%)', 'Mineral fuels including oil: $138 billion (8.9%)', 'Aircraft, spacecraft: $131.2 billion (8.5%)', 'Vehicles: $130.1 billion (8.4%)', 'Optical, technical, medical apparatus: $83.6 billion (5.4%)', 'Plastics, plastic articles: $61.5 billion (4%)', 'Gems, precious metals: $60.4 billion (3.9%)', 'Pharmaceuticals: $45.1 billion (2.9%)', 'Organic chemicals: $36.2 billion (2.3%)']
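As a rough illustration of why the original findAll call came back empty, here is a minimal sketch on a small inline HTML snippet. The snippet is a hypothetical stand-in for the real page (it assumes, as the selector above suggests, that the tabcontent class sits on a div and not on the li elements), and it uses the built-in 'html.parser' so that lxml is not needed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: the class is on the <div>, not on the <li> items.
html = """
<div class="tabcontent">
  <ul>
    <li>Vehicles: $130.1 billion (8.4%)</li>
    <li>Pharmaceuticals: $45.1 billion (2.9%)</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# The first argument is a tag name, so it must be "li" and not "<li>";
# in this snippet the <li> elements also have no "tabcontent" class.
wrong = soup.find_all("<li>", {"class": "tabcontent"})

# Select the <li> descendants of the div that carries the class instead.
right = [li.text for li in soup.select('div.tabcontent li') if "billion" in li.text]

print(wrong)
print(right)
```

The descendant selector 'div.tabcontent li' is just a compact variant of the two-step select() shown above.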
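Answer 1 (score: 2)
Your mistake is the combination of ^ and .*: without flags, ^ matches only at the very start of the string, and the dot does not match newlines by default, while the table string contains newlines between its start and the word billion. If you do want to use a regular expression, at least pass the re.MULTILINE flag so that ^ also matches right after each newline: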
>>> re.findall(r'^.*billion', table, flags=re.MULTILINE)
['Machinery including computers: US$201.7 billion',
'Electrical machinery, equipment: $174.2 billion',
'Mineral fuels including oil: $138 billion',
'Aircraft, spacecraft: $131.2 billion',
'Vehicles: $130.1 billion',
'Optical, technical, medical apparatus: $83.6 billion',
'Plastics, plastic articles: $61.5 billion',
'Gems, precious metals: $60.4 billion',
'Pharmaceuticals: $45.1 billion',
'Organic chemicals: $36.2 billion']
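The effect of the flag can be reproduced without touching the network. The sketch below uses a short hypothetical string (sample) that, like the scraped text described above, has a newline before the first item; only the standard re module is involved.

```python
import re

# Hypothetical stand-in for the scraped text: note the newline before the first item.
sample = (
    "\n"
    "Machinery including computers: US$201.7 billion (13% of total exports)\n"
    "Vehicles: $130.1 billion (8.4%)\n"
)

# Without re.MULTILINE, ^ matches only at position 0, and . cannot cross the newline there.
without_flag = re.findall(r'^.*billion', sample)

# With re.MULTILINE, ^ also matches immediately after each newline.
with_flag = re.findall(r'^.*billion', sample, flags=re.MULTILINE)

print(without_flag)  # []
print(with_flag)     # ['Machinery including computers: US$201.7 billion', 'Vehicles: $130.1 billion']
```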
However, since you want the text inside the li elements, why not select those elements directly?
soup.find(class_='tabcontent').find_all('li', string=re.compile(r'billion'))
Passing a regular expression pattern as the string argument filters on the elements' content. This gives you the matching elements:
>>> soup.find(class_='tabcontent').find_all('li', string=re.compile(r'billion'))
[<li>Machinery including computers: US$201.7 billion (13% of total exports)</li>,
<li>Electrical machinery, equipment: $174.2 billion (11.3%)</li>,
<li>Mineral fuels including oil: $138 billion (8.9%)</li>,
<li>Aircraft, spacecraft: $131.2 billion (8.5%)</li>,
<li>Vehicles: $130.1 billion (8.4%)</li>,
<li>Optical, technical, medical apparatus: $83.6 billion (5.4%)</li>,
<li>Plastics, plastic articles: $61.5 billion (4%)</li>,
<li>Gems, precious metals: $60.4 billion (3.9%)</li>,
<li>Pharmaceuticals: $45.1 billion (2.9%)</li>,
<li>Organic chemicals: $36.2 billion (2.3%)</li>]
If you only want their contents, you can always call .get_text() on those elements.
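Putting the filter and .get_text() together might look like the sketch below. It runs on a small hypothetical HTML snippet (not the real page) with the built-in 'html.parser', and it assumes a Beautiful Soup version that accepts the string keyword used above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page; the last item has no 'billion' in it.
html = """
<div class="tabcontent">
  <ul>
    <li>Vehicles: $130.1 billion (8.4%)</li>
    <li>Pharmaceuticals: $45.1 billion (2.9%)</li>
    <li>Not a dollar figure</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Keep only the <li> elements whose text matches the pattern.
matches = soup.find(class_='tabcontent').find_all('li', string=re.compile(r'billion'))

# Reduce the matching elements to plain strings.
texts = [li.get_text() for li in matches]

print(texts)
```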
Answer 2 (score: 1)
Another approach might look like this:
import requests
from bs4 import BeautifulSoup
URL = 'http://www.worldstopexports.com/united-states-top-10-exports/'
soup = BeautifulSoup(requests.get(URL,headers={"User-Agent":"Mozilla/5.0"}).text, 'lxml')
table = soup.find(class_='tabcontent')
data = '\n'.join([item.text for item in table.find_all("li")])
print(data)
Output:
Machinery including computers: US$201.7 billion (13% of total exports)
Electrical machinery, equipment: $174.2 billion (11.3%)
Mineral fuels including oil: $138 billion (8.9%)
Aircraft, spacecraft: $131.2 billion (8.5%)
Vehicles: $130.1 billion (8.4%)
Optical, technical, medical apparatus: $83.6 billion (5.4%)
Plastics, plastic articles: $61.5 billion (4%)
Gems, precious metals: $60.4 billion (3.9%)
Pharmaceuticals: $45.1 billion (2.9%)
Organic chemicals: $36.2 billion (2.3%)