Checking strings from a Series with str.contains in pandas

Posted: 2014-09-08 14:06:55

Tags: python string pandas

Why doesn't this work with the pandas string method?

df['col1'].str.contains(df['col2'])

I keep getting: 'Series' objects are mutable, thus they cannot be hashed.

Update: to clarify - I want to compare the columns row by row, with the partial string matching exactly and in order. For example, for col-1 and col-2 below, I would expect the output shown:

col-1    col-2    output
'cat'    'at'     True
'aunt'   'at'     False
'dog'    'dg'     False
'edge'   'dg'     True
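For reference, a minimal sketch (using the sample rows above) that produces this expected output with a plain row-wise substring check:

import pandas as pd

df = pd.DataFrame({'col-1': ['cat', 'aunt', 'dog', 'edge'],
                   'col-2': ['at', 'at', 'dg', 'dg']})
# row-wise test: does col-2 appear as a contiguous substring of col-1?
df['output'] = [b in a for a, b in zip(df['col-1'], df['col-2'])]
print(df)
#   col-1 col-2  output
# 0   cat    at    True
# 1  aunt    at   False
# 2   dog    dg   False
# 3  edge    dg    True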

2 answers:

Answer 0 (score: 3):

str.contains expects a single string (or regex) pattern rather than a Series, which is why passing another column raises the hashing error. Instead, you can define a simple function that tests whether the value in one column appears in the other column:

In [37]:

import pandas as pd

df = pd.DataFrame({'col1':['mn','mxn','ca','sd','xa','ac'], 'col2':['m','n','x','n','q','y']})
def func(x):
    # True when the col2 value is one of the characters of col1
    return x.col2 in list(x.col1)
df.apply(func, axis=1)
Out[37]:
0     True
1     True
2    False
3    False
4    False
5    False
dtype: bool

For your use case, where the match must be a contiguous substring rather than a single character, the following should do what you want:

return x.col2 in x.col1
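As a quick check, a sketch of the corrected function on the question's sample rows (hyphen-free column names assumed, matching the answer's example):

df = pd.DataFrame({'col1': ['cat', 'aunt', 'dog', 'edge'],
                   'col2': ['at', 'at', 'dg', 'dg']})

def func(x):
    # contiguous substring test, not character membership
    return x.col2 in x.col1

df.apply(func, axis=1)
# 0     True
# 1    False
# 2    False
# 3     True
# dtype: bool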

Answer 1 (score: 0):

You can use a lambda function to perform any row-wise operation on a dataframe.

For your problem:

df['output'] = df.apply(lambda x: x['col-2'] in x['col-1'], axis=1)

Here, the lambda function performs the row-wise string comparison of col-1 and col-2 and stores the result in the 'output' column.

In the same way, this approach can be used to perform mathematical operations on a dataframe.
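For example, a hypothetical sketch (the numeric columns a and b are assumptions, not from the question):

df2 = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})
# row-wise arithmetic with the same apply/lambda pattern
df2['sum'] = df2.apply(lambda x: x['a'] + x['b'], axis=1)
# 0    11
# 1    22
# 2    33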