Why doesn't this work with pandas string methods?
df['col1'].str.contains(df['col2'])
I keep getting: 'Series' objects are mutable, thus they cannot be hashed.
Update, to clarify: I want to compare the columns row by row, and the partial string must match exactly and in sequence (i.e. as a consecutive substring). For example, for col-1 and col-2 below, I would like the output shown:
col-1 col-2 output
'cat' 'at' True
'aunt' 'at' False
'dog' 'dg' False
'edge' 'dg' True
Answer 0 (score: 3)
You can define a simple function that just tests whether the value in one column is present in the other column:
In [37]:
df = pd.DataFrame({'col1':['mn','mxn','ca','sd','xa','ac'], 'col2':['m','n','x','n','q','y']})

def func(x):
    # True if the single character in col2 appears anywhere among the characters of col1
    return x.col2 in list(x.col1)

df.apply(func, axis=1)

Out[37]:
0     True
1     True
2    False
3    False
4    False
5    False
dtype: bool
For your use case, the following should do what you want - a substring test instead of per-character membership:

return x.col2 in x.col1
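As a quick check against the example data from the question, a minimal sketch (the columns are assumed to be named col1 and col2, matching the code above rather than the hyphenated names in the question):

import pandas as pd

df = pd.DataFrame({'col1': ['cat', 'aunt', 'dog', 'edge'],
                   'col2': ['at', 'at', 'dg', 'dg']})

def func(x):
    # consecutive-substring test rather than per-character membership
    return x.col2 in x.col1

df['output'] = df.apply(func, axis=1)
# expected 'output' values: True, False, False, True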
Answer 1 (score: 0)
You can use a lambda function to perform any row-wise operation on a DataFrame.
For your problem:
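A minimal sketch of such a row-wise lambda, assuming the columns are named 'col-1' and 'col-2' as in the question:

df['output'] = df.apply(lambda x: x['col-2'] in x['col-1'], axis=1)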
Here, the lambda function performs a row-wise string comparison between col-1 and col-2 and stores the result in the 'output' column.
Similarly, the same approach can be used to perform mathematical operations on a DataFrame.
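For example, a sketch of the same apply pattern used for arithmetic (the length_diff column name is purely illustrative):

df['length_diff'] = df.apply(lambda x: len(x['col-1']) - len(x['col-2']), axis=1)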