I am trying to extract the exterior color, interior color, and transmission information separately from cars.com.
HTML:
<ul class="listing-row__meta">
<li>
<strong>
Ext. Color:
</strong>
Gray
</li>
<li>
<strong>
Int. Color:
</strong>
White
</li>
<li>
<strong>
Transmission:
</strong>
Automatic
</li>
I tried the following code, but it fails with "expected string or bytes-like object". Any suggestions or solutions would be appreciated.
from bs4 import BeautifulSoup
import urllib
import re
import requests
url ='https://www.cars.com/for-sale/searchresults.action/?zc=92617&rd=10&stkTypId=28881&mkId=28263&searchSource=RESEARCH_SHOP_INDEX'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'lxml')
all_matches = soup.find_all('div',{'class':'shop-srp-listings__listing-container'})
for each in all_matches:
    info=each.findAll('ul',class_='listing-row__meta')
    pattern=re.compile(r'Ext. Color:')
    matches=pattern.finditer(info)
    for match in matches:
        print(match.text)
Answer 0 (score: 2)
Perhaps an expression like one of the following is closer to what you are trying to extract:
(?is)<strong>\s*([^<]*?)\s*<\/strong>
or
(?is)(?<=<strong>)\s*[^<]*?\s*(?=<\/strong>)
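As a quick offline check, the two expressions can be tried against the snippet from the question. This is only a sketch: the closing </ul> tag in the sample is assumed, and the variable names are illustrative.

```python
import re

# Sample markup copied from the question; the closing </ul> is assumed.
html = """
<ul class="listing-row__meta">
<li>
<strong>
Ext. Color:
</strong>
Gray
</li>
<li>
<strong>
Int. Color:
</strong>
White
</li>
<li>
<strong>
Transmission:
</strong>
Automatic
</li>
</ul>
"""

# Variant 1: the label is returned through a capture group.
capture_pattern = re.compile(r'(?is)<strong>\s*([^<]*?)\s*<\/strong>')
# Variant 2: lookarounds keep the tags out of the match itself.
lookaround_pattern = re.compile(r'(?is)(?<=<strong>)\s*[^<]*?\s*(?=<\/strong>)')

labels_from_group = capture_pattern.findall(html)
# The lookaround match still contains the surrounding whitespace, so strip it.
labels_from_lookaround = [m.strip() for m in lookaround_pattern.findall(html)]

print(labels_from_group)
print(labels_from_lookaround)
```

Both variants yield the labels only; the values that follow each label need the longer expression used in the code further down.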
Of course, you can also do this with bs4's built-in functions.
from bs4 import BeautifulSoup
import re
import requests

url = 'https://www.cars.com/for-sale/searchresults.action/?zc=92617&rd=10&stkTypId=28881&mkId=28263&searchSource=RESEARCH_SHOP_INDEX'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'lxml')
all_matches = soup.find_all(
    'div', {'class': 'shop-srp-listings__listing-container'})

for each in all_matches:
    info = each.findAll('ul', class_='listing-row__meta')
    # Capture the text that follows each <strong>...</strong> label.
    matches = re.findall(
        r'(?is)<strong>\s*[^<]*?\s*<\/strong>\s*([^<]*?)\s*<', str(info[0]))
    for match in matches:
        print(match)

Output:
Gray
Beige
Automatic
AWD
Gray
White
Automatic
AWD
Black
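For comparison, here is a minimal sketch of the bs4-only route mentioned above, with no regular expression at all. It runs on the snippet from the question (closing </ul> assumed) and uses the standard-library html.parser rather than lxml so that it has one dependency less; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; the closing </ul> is assumed.
html = """
<ul class="listing-row__meta">
<li>
<strong>
Ext. Color:
</strong>
Gray
</li>
<li>
<strong>
Int. Color:
</strong>
White
</li>
<li>
<strong>
Transmission:
</strong>
Automatic
</li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')

pairs = []
for meta in soup.find_all('ul', class_='listing-row__meta'):
    for label in meta.find_all('strong'):
        # The value is the text node that directly follows </strong>.
        value = label.next_sibling
        pairs.append((label.get_text(strip=True),
                      value.strip() if value else ''))

for name, value in pairs:
    print(name, value)
```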
If you prefer, you can also build a dictionary with a few modifications:
from bs4 import BeautifulSoup
import re
import requests

url = 'https://www.cars.com/for-sale/searchresults.action/?zc=92617&rd=10&stkTypId=28881&mkId=28263&searchSource=RESEARCH_SHOP_INDEX'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'lxml')
all_matches = soup.find_all(
    'div', {'class': 'shop-srp-listings__listing-container'})

for each in all_matches:
    info = each.findAll('ul', class_='listing-row__meta')
    # Two capture groups (label, value) give tuples that dict() accepts directly.
    matches = dict(re.findall(
        r'(?is)<strong>\s*([^<]*?)\s*<\/strong>\s*([^<]*?)\s*<', str(info[0])))
    for k, v in matches.items():
        print(f'{k} {v}')

Output:
Ext. Color: Gray
Int. Color: Beige
Transmission: Automatic
Drivetrain: AWD
Ext. Color: Gray
Int. Color: White
Transmission: Automatic
Drivetrain: AWD
Ext. Color: Black
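The dictionary step can be checked offline as well. The sketch below applies the same two-group expression to the snippet from the question (closing </ul> assumed): re.findall returns one (label, value) tuple per match, and dict() turns that list of tuples into a mapping.

```python
import re

# Sample markup copied from the question; the closing </ul> is assumed.
html = """
<ul class="listing-row__meta">
<li>
<strong>
Ext. Color:
</strong>
Gray
</li>
<li>
<strong>
Int. Color:
</strong>
White
</li>
<li>
<strong>
Transmission:
</strong>
Automatic
</li>
</ul>
"""

pair_pattern = re.compile(
    r'(?is)<strong>\s*([^<]*?)\s*<\/strong>\s*([^<]*?)\s*<')

# With two capture groups, findall yields a list of 2-tuples.
pairs = pair_pattern.findall(html)
specs = dict(pairs)

print(pairs)
print(specs)
```

Because a dict keeps only the last value for a repeated key, this conversion is best done once per listing, as in the loop above, rather than on the whole page at once.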
If you prefer lists:
from bs4 import BeautifulSoup
import re
import requests

url = 'https://www.cars.com/for-sale/searchresults.action/?zc=92617&rd=10&stkTypId=28881&mkId=28263&searchSource=RESEARCH_SHOP_INDEX'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'lxml')
all_matches = soup.find_all(
    'div', {'class': 'shop-srp-listings__listing-container'})

for each in all_matches:
    info = each.findAll('ul', class_='listing-row__meta')
    matches = re.findall(
        r'(?is)<strong>\s*([^<]*?)\s*<\/strong>\s*([^<]*?)\s*<', str(info[0]))
    for match in matches:
        # Each match is a (label, value) tuple; print it as a list.
        print(list(match))

Output:
['Transmission:', 'Automatic']
['Drivetrain:', 'RWD']
['Ext. Color:', 'Gray']
['Int. Color:', 'Gray']
['Transmission:', 'Automatic']
['Drivetrain:', 'RWD']
['Ext. Color:', 'White']
['Int. Color:', 'Black']
['Transmission:', 'Automatic']
['Drivetrain:', 'RWD']
['Ext. Color:', 'White']
['Int. Color:', 'Beige']
['Transmission:', 'Automatic']
['Drivetrain:', 'AWD']
['Ext. Color:', 'Gray']
['Int. Color:', 'Beige']
['Transmission:', 'Automatic']
['Drivetrain:', 'AWD']
['Ext. Color:', 'White']
from bs4 import BeautifulSoup
import re
import requests

url = 'https://www.cars.com/for-sale/searchresults.action/?zc=92617&rd=10&stkTypId=28881&mkId=28263&searchSource=RESEARCH_SHOP_INDEX'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'lxml')
all_matches = soup.find_all(
    'div', {'class': 'shop-srp-listings__listing-container'})
keys = ['Ext. Color', 'Int. Color', 'Transmission', 'Drivetrain']
outputs = dict()

for each in all_matches:
    info = each.findAll('ul', class_='listing-row__meta')
    # The colon is matched outside the first group, so keys come out without it.
    matches = dict(re.findall(
        r'(?is)<strong>\s*([^<:]*?)\s*:\s*<\/strong>\s*([^<]*?)\s*<', str(info[0])))
    for item in matches.items():
        # Note: the first value seen for a key is stored here and appended
        # again below, so it appears twice in the printed lists.
        if item[0] not in outputs:
            outputs[item[0]] = [item[1]]
        if item[0] in keys:
            outputs[item[0]].append(item[1])

print(outputs)

Output:
{'Ext. Color': ['Silver', 'Silver', 'White', 'White', 'Black', 'Gray', 'Gray', 'Black', 'Black', 'White', 'Blue', 'Red', 'Silver', 'Gray', 'Black', 'White', 'Black', 'Gray', 'White', 'Black', 'Black'], 'Int. Color': ['Beige', 'Beige', 'Black', 'White', 'Black', 'Black', 'Gray', 'Beige', 'Black', 'Black', 'Beige', 'Beige', 'Black', 'Black', 'Black', 'Black', 'Black', 'Black', 'White', 'White', 'Black'], 'Transmission': ['Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic'], 'Drivetrain': ['AWD', 'AWD', 'AWD', 'AWD', 'RWD', 'RWD', 'RWD', 'RWD', 'AWD', 'RWD', 'RWD', 'RWD', 'AWD', 'RWD', 'RWD', 'AWD', 'RWD', 'AWD', 'AWD', 'AWD', 'AWD']}
from bs4 import BeautifulSoup
import re
import requests

url = 'https://www.cars.com/for-sale/searchresults.action/?zc=92617&rd=10&stkTypId=28881&mkId=28263&searchSource=RESEARCH_SHOP_INDEX'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'lxml')
all_matches = soup.find_all(
    'div', {'class': 'shop-srp-listings__listing-container'})
keys = ['Ext. Color', 'Int. Color', 'Transmission', 'Drivetrain']
outputs = dict()

for each in all_matches:
    info = each.findAll('ul', class_='listing-row__meta')
    matches = dict(re.findall(
        r'(?is)<strong>\s*([^<:]*?)\s*:\s*<\/strong>\s*([^<]*?)\s*<', str(info[0])))
    for item in matches.items():
        # Note: the first value seen for a key is stored here and appended
        # again below, so it appears twice in the printed lists.
        if item[0] not in outputs:
            outputs[item[0]] = [item[1]]
        if item[0] in keys:
            outputs[item[0]].append(item[1])

print(outputs)
print('*' * 50)

# Collapse each list to its unique values (set() does not preserve order).
no_duplicate_outputs = dict()
for item in outputs.items():
    if item[0] not in no_duplicate_outputs:
        no_duplicate_outputs[item[0]] = list(set(item[1]))
print(no_duplicate_outputs)

Output:
{'Ext. Color': ['Black', 'Black', 'White', 'Black', 'Other', 'Gray', 'White', 'White', 'Gray', 'White', 'Gray', 'Silver', 'Blue', 'Black', 'Silver', 'Silver', 'Black', 'Blue', 'Blue', 'Black', 'White'], 'Int. Color': ['Black', 'Black', 'Beige', 'Beige', 'Black', 'Gray', 'Black', 'Beige', 'Beige', 'White', 'Black', 'Black', 'Gray', 'Black', 'Black', 'Gray', 'Black', 'Black', 'Black', 'White', 'Black'], 'Transmission': ['Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic'], 'Drivetrain': ['AWD', 'AWD', 'RWD', 'RWD', 'RWD', 'RWD', 'RWD', 'AWD', 'AWD', 'AWD', 'RWD', 'AWD', 'AWD', 'AWD', 'AWD', 'AWD', 'RWD', 'AWD', 'AWD', 'AWD', 'AWD']}
**************************************************
{'Ext. Color': ['Silver', 'White', 'Blue', 'Other', 'Black', 'Gray'], 'Int. Color': ['Beige', 'White', 'Black', 'Gray'], 'Transmission': ['Automatic'], 'Drivetrain': ['RWD', 'AWD']}
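In the accumulation loop above, the first value seen for each key is stored and then appended again, which is why every printed list starts with a repeated entry. One possible alternative, sketched here with collections.defaultdict and dict.fromkeys, avoids that and keeps first-seen order when removing duplicates. The `listings` data is a made-up stand-in for the per-listing `matches` dicts, not scraped output.

```python
from collections import defaultdict

# Hypothetical per-listing results, standing in for the `matches` dicts above.
listings = [
    {'Ext. Color': 'Gray', 'Int. Color': 'Beige',
     'Transmission': 'Automatic', 'Drivetrain': 'AWD'},
    {'Ext. Color': 'Gray', 'Int. Color': 'White',
     'Transmission': 'Automatic', 'Drivetrain': 'AWD'},
    {'Ext. Color': 'Black', 'Int. Color': 'Beige',
     'Transmission': 'Automatic', 'Drivetrain': 'RWD'},
]
keys = ['Ext. Color', 'Int. Color', 'Transmission', 'Drivetrain']

# defaultdict(list) creates the empty list on first access,
# so each value is appended exactly once.
outputs = defaultdict(list)
for matches in listings:
    for key, value in matches.items():
        if key in keys:
            outputs[key].append(value)

# dict.fromkeys drops duplicates while keeping first-seen order.
unique_outputs = {key: list(dict.fromkeys(values))
                  for key, values in outputs.items()}

print(dict(outputs))
print(unique_outputs)
```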
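If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com. If you like, you can also see there how it matches against some sample inputs.
jex.im visualizes regular expressions.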
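Answer 1 (score: 1)
BeautifulSoup's findAll function returns a list of results, so info is a list rather than a single string. You will probably also need to iterate over each item in info.
The items in that list are bs4.Tag objects (not strings), which can be cast to strings so that they fit the finditer API. (This is especially confusing because, when you print info, bs4 renders the objects as if they were strings.)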
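for each in all_matches:
    info = each.findAll('ul', class_='listing-row__meta')
    for item in info:
        pattern = re.compile(r'Ext\. Color:')
        # Cast the bs4.Tag to str before handing it to the regex engine.
        matches = pattern.finditer(str(item))
        for match in matches:
            # Match objects have no .text attribute; group() returns the matched text.
            print(match.group())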
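In this example, info is probably a list of length 1. If you are sure you only want the first result, and there is only one, you can switch to a call that returns a single occurrence, or simply take the first result as in the following line:
info = each.findAll('ul', class_='listing-row__meta')[0]
and then use the rest of the code from the question, passing str(info) to finditer.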
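The type mismatch described in this answer can be reproduced without any network access. The sketch below uses a one-line stand-in for the listing markup (an assumption, not the real page) and html.parser instead of lxml; it shows the TypeError raised for the result list and the str() conversion that avoids it.

```python
import re
from bs4 import BeautifulSoup

# Minimal stand-in for one listing's markup (assumed, not the real page).
html = '<ul class="listing-row__meta"><li><strong>Ext. Color:</strong> Gray</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

info = soup.find_all('ul', class_='listing-row__meta')
pattern = re.compile(r'Ext\. Color:')

error_message = None
try:
    # info is a list-like ResultSet, not a str, so the regex engine rejects it.
    pattern.finditer(info)
except TypeError as error:
    error_message = str(error)
    print(type(info).__name__, '->', error_message)

# Converting each Tag to str first gives the regex something it can scan.
found = [match.group()
         for item in info
         for match in pattern.finditer(str(item))]
print(found)
```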
Answer 2 (score: 1)
Your error can be resolved by casting to str:
matches=pattern.finditer(info)
Change it to:
matches=pattern.finditer(str(info))
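Answer 3 (score: 0)
There is absolutely no need for regex here. The HTML is regular, and in bs4 4.7.1 and later you can use :contains to target the appropriate elements by their text, then use next_sibling to get the adjacent node containing the value. Grab the lists, zip them, and convert to a DataFrame.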
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

headers = ['Make', 'Ext', 'Int', 'Trans', 'Drive']
r = requests.get('https://www.cars.com/for-sale/searchresults.action/?zc=92617&rd=10&stkTypId=28881&mkId=28263&searchSource=RESEARCH_SHOP_INDEX')
soup = bs(r.content, 'lxml')

# Listing titles, one per result.
make = [i.text.strip() for i in soup.select('.listing-row__title')]
# For each label, the value is the text node right after the <strong> element.
# Note: newer soupsieve releases deprecate :contains in favour of :-soup-contains.
ext_color = [i.next_sibling.strip() for i in soup.select('strong:contains("Ext. Color:")')]
int_color = [i.next_sibling.strip() for i in soup.select('strong:contains("Int. Color:")')]
transmission = [i.next_sibling.strip() for i in soup.select('strong:contains("Transmission:")')]
drive = [i.next_sibling.strip() for i in soup.select('strong:contains("Drivetrain:")')]

# zip pairs the lists up positionally, one row per listing.
df = pd.DataFrame(list(zip(make, ext_color, int_color, transmission, drive)), columns=headers)
print(df)
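One caveat with zipping independently collected lists: zip pairs items by position and stops at the shortest list, so a listing that lacks one field would shift every later value in that column. A possible variation, sketched below on a made-up two-listing sample (the second listing deliberately has no "Int. Color:" entry), builds one record per listing instead. The container and list class names are taken from the question; the rest is illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical two-listing sample; the second listing has no "Int. Color:" entry.
html = """
<div class="shop-srp-listings__listing-container">
  <ul class="listing-row__meta">
    <li><strong>Ext. Color:</strong> Gray</li>
    <li><strong>Int. Color:</strong> White</li>
  </ul>
</div>
<div class="shop-srp-listings__listing-container">
  <ul class="listing-row__meta">
    <li><strong>Ext. Color:</strong> Black</li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

rows = []
for listing in soup.select('.shop-srp-listings__listing-container'):
    row = {}
    for label in listing.select('.listing-row__meta strong'):
        # The value is the text node right after the <strong> element.
        value = label.next_sibling
        name = label.get_text(strip=True).rstrip(':')
        row[name] = value.strip() if value else None
    rows.append(row)

print(rows)
```

Passing `rows` to pd.DataFrame would then give one row per listing, with NaN wherever a listing has no value for a column.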