URL: https://myanimelist.net/anime/236/Es_Otherwise
I'm trying to scrape the following content from this URL:
I tried:
for i in response.css('span[class = dark_text]'):
    i.xpath('/following-sibling::text()')
Or these XPaths, which don't currently work, so maybe I'm missing something...
aired_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[11]/text()')
producer_xpath = response.xpath("//*[@id='content']/table/tbody/tr/td[1]/div/div[12]/span/a/@href/text()")
licensor_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[13]/a/text()')
studio_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[14]/a/@href/title/text()')
studio_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[17]/text()')
str_rating_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[18]/text()')
ranked_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[20]/span/text()')
japanese_title_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[7]/text()')
source_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[15]/text()')
genre_xpath = [response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[16]/a[{0}]'.format(i)) for i in range(1,4)]
genre_xpath_v2 = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[16]/a/@href/text()')
number_of_users_rated_anime_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[19]/span[3]/text()')
popularity_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[21]/span/text()')
members_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[22]/span/text()')
favorite_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[23]/span/text()')
But I found that some of the text is not inside the span with that class, so I'm looking for a CSS/XPath expression that also picks up the text outside the span.
Answer 0 (score: 1)
It is simpler to iterate over the divs inside the table:
from scrapy import Selector

foundH2 = False
# htmlString holds the raw page HTML, e.g. htmlString = response.text
response = Selector(text=htmlString).xpath('//*[@id="content"]/table/tr/td[1]/div/*')
for resp in response:
    tagName = resp.xpath('name()').extract_first()
    if 'h2' == tagName:
        foundH2 = True
    if foundH2:
        # start adding 'info' after <h2>Alternative Titles</h2> found
        info = None
        if 'div' == tagName:
            for item in resp.xpath('.//text()').extract():
                if 'googletag.' in item:
                    break
                item = item.strip()
                if item and item != ',':
                    info = info + " " + item if info else item
            if info:
                print(info)
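
A more targeted alternative for the label/value pattern in the question (a <span class="dark_text"> label followed by the value as bare text) is a following-sibling lookup. The sketch below is untested and assumes it runs inside a Scrapy callback with the usual response object:

# Minimal sketch: pair each dark_text label with the bare text that follows it.
info = {}
for span in response.css('span.dark_text'):
    label = span.xpath('normalize-space(text())').get()   # e.g. "Aired:"
    # text nodes that are siblings of the span, i.e. not wrapped in any tag
    value = ' '.join(t.strip() for t in span.xpath('following-sibling::text()').getall() if t.strip())
    if label:
        info[label.rstrip(':')] = value
# Fields whose values sit inside <a> tags (Producers, Studios, Genres, ...)
# would need following-sibling::a/text() instead.

Note that this XPath is relative (no leading slash); the leading / in the question's attempt makes the expression absolute from the document root, which is why it matches nothing.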
My point is that BeautifulSoup is faster and better than Scrapy for this.
Answer 1 (score: 0)
If you only want to scrape the information mentioned in the image, you can use
response.xpath('//div[@class="space-it"]//text()').extract()
Or perhaps I haven't understood your question correctly.
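
As a rough usage sketch (assuming, as that answer does, that the space-it divs hold the sidebar rows), the flat text list this selector returns usually needs a little cleanup:

# Sketch only: strip whitespace and drop empty strings and stray commas.
texts = response.xpath('//div[@class="space-it"]//text()').extract()
cleaned = [t.strip() for t in texts if t.strip() and t.strip() != ',']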