The fields marked in blue are the ones I am trying to scrape.
<div class="txt-block">
<h4 class="inline">Budget:</h4>
"€650,000
"
<span class="attribute">(estimated)</span>
</div>
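I want to grab the data outside the h4 tag, i.e. €650,000. How can I do this in Python with Scrapy CSS selectors?
I am trying the following, but it returns multiple fields.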
item['Budget'] = response.css(".txt-block h4:not(span)::text").extract()
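Answer 0 (score: 0)
Try using following-sibling::text() in your XPath.
Like this: response.xpath('//div[contains(@class, "txt-block")]/h4/following-sibling::text()').get()
It returns the required information. The result keeps its surrounding whitespace, so call .strip() on it before storing it.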
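Answer 1 (score: 0)
Try using:
data = [d.strip() for d in response.css('.txt-block::text').getall() if d.strip()]
The data you want is actually a direct text child of the div tag, so I am using that tag to get it. Note the .getall() call: without it the loop iterates over Selector objects, which have no strip() method.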
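The filtering step in this answer can be checked on its own. The sketch below uses a hypothetical raw_fragments list that stands in for what response.css('.txt-block::text').getall() would likely return for the question's markup: one whitespace-only string per gap between tags, plus the value itself.

```python
# Hypothetical stand-in for response.css('.txt-block::text').getall()
# on the question's HTML: the div's direct text children, whitespace included.
raw_fragments = ["\n", "\n€650,000\n", "\n"]

# Strip each fragment and drop the ones that were whitespace only.
data = [fragment.strip() for fragment in raw_fragments if fragment.strip()]
print(data)
```

If the page has several .txt-block elements, the list will hold one entry per non-empty text node across all of them, so the selector may need to be narrowed first.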
Answer 2 (score: 0)
It seems you are looking for a working demo. Check the following implementation:
import requests
from scrapy import Selector
url = "https://www.imdb.com/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1&pf_rd_r=702AB91P12YZ9Z98XH5T&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1"
res = requests.get(url)
sel = Selector(text=res.text)  # build the selector from the HTML text of the requests response
budget = ' '.join(sel.css(".txt-block:contains('Budget')::text").extract()).strip()
gross = ' '.join(sel.css(".txt-block:contains('Gross USA')::text").extract()).strip()
cumulative = ' '.join(sel.css(".txt-block:contains('Cumulative Worldwide')::text").extract()).strip()
print(f'budget: {budget}\ngross: {gross}\ncumulative: {cumulative}')
Output at the moment:
budget: $25,000,000
gross: $28,341,469
cumulative: $58,500,000
Answer 3 (score: 0)
You need to extract the text into an array and take the value at the desired position. Example:
import scrapy
# Sample HTML to parse
html_text="""
<div class="txt-block">
<h4 class="inline">Budget:</h4>650,000
<span class="attribute">(estimated)</span>
</div>
"""
# Parse text selector
selector=scrapy.Selector(text=html_text)
print(selector)
# Extract div
d=selector.xpath('//div[@class="txt-block"]//text()')
values=d.extract() # Gives an array of text values
print(values)
# Value index 2 is what you need
print(values[2])
Scrapy lacks the tag removal that is available in BeautifulSoup.
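Picking values[2] by position breaks as soon as the markup gains or loses a text node. As a possible alternative, the sketch below looks up the first non-empty fragment that follows a label. The values list is a hypothetical copy of what the extract() call in this answer would return for its sample HTML, and value_after_label is an illustrative helper, not part of Scrapy.

```python
def value_after_label(values, label):
    """Return the first non-empty text fragment that follows `label`, or None."""
    cleaned = [v.strip() for v in values]
    for i, text in enumerate(cleaned):
        if text == label:
            # Skip whitespace-only fragments between the label and its value.
            for candidate in cleaned[i + 1:]:
                if candidate:
                    return candidate
    return None


# Hypothetical result of d.extract() for the sample HTML in this answer.
values = ["\n", "Budget:", "650,000\n", "(estimated)", "\n"]

print(value_after_label(values, "Budget:"))
```

The helper returns None when the label is missing, which lets the caller tell "field not on the page" apart from an empty value.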