Conditional formatting of partial duplicates in Pandas and Excel

Time: 2017-06-29 01:06:49

Tags: python excel csv pandas

I have the following csv data, in a file named reviews.csv:

Movie,Reviewer,Sentence,Tag,Sentiment,Text,
Jaws,John,s1,Plot,Positive,The plot was great,
Jaws,Mary,s1,Plot,Positive,The plot was great,
Jaws,John,s2,Acting,Positive,The acting was OK,
Jaws,Mary,s2,Acting,Neutral,The acting was OK,
Jaws,John,s3,Scene,Positive,The visuals blew me away,
Jaws,Mary,s3,Effects,Positive,The visuals blew me away,
Vertigo,John,s1,Scene,Negative,The scenes were terrible,
Vertigo,Mary,s1,Acting,Negative,The scenes were terrible,
Vertigo,John,s2,Plot,Negative,The actors couldn’t make the story believable,
Vertigo,Mary,s2,Acting,Positive,The actors couldn’t make the story believable,
Vertigo,John,s3,Effects,Negative,The effects were awful,
Vertigo,Mary,s3,Effects,Negative,The effects were awful,
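
Every line of the sample, including the header, ends with a comma, so pandas will most likely parse an empty seventh column. A minimal sketch of loading the data and dropping that column, assuming the layout shown above (the helper name `load_reviews` and the two-row in-memory sample are hypothetical, used here only for illustration):

```python
import io

import pandas as pd

# Two rows of the sample data, kept in memory so the sketch is self-contained.
sample = io.StringIO(
    "Movie,Reviewer,Sentence,Tag,Sentiment,Text,\n"
    "Jaws,John,s1,Plot,Positive,The plot was great,\n"
    "Jaws,Mary,s1,Plot,Positive,The plot was great,\n"
)


def load_reviews(source):
    """Read the review csv and drop columns that contain no values at all."""
    df = pd.read_csv(source)
    # The trailing comma on each line yields an all-empty extra column.
    return df.dropna(axis=1, how="all")


df = load_reviews(sample)
print(list(df.columns))
```

Passing the path `'reviews.csv'` instead of the in-memory sample should work the same way.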

My goal is to convert this csv file into an Excel spreadsheet with conditional formatting. Specifically, I want to apply the following rules:

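  1. If the 'Movie', 'Sentence', 'Tag', and 'Sentiment' values are the same, the whole row should be green.

  2. If the 'Movie', 'Sentence', and 'Tag' values are the same but the 'Sentiment' values differ, the row should be blue.

  3. If the 'Movie' and 'Sentence' values are the same but the 'Tag' values differ, the row should be red.

So I'd like to create an Excel spreadsheet (.xlsx) that looks like this:

[Image: Spreadsheet with color-coded partial duplicates]

I've been looking at the Pandas Styles documentation, as well as the conditional formatting tutorial for XlsxWriter, but I can't seem to put it all together. Here's what I have so far. I can read the csv into a Pandas dataframe, sort it (although I'm not sure that's necessary), and write it back out to an Excel spreadsheet. How do I do the conditional formatting, and where in the code does it go?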

    import pandas as pd

    def csv_to_xls(source_path, dest_path):
        """
        Convert a csv file to a formatted xlsx spreadsheet
        Input: path to the review csv file
        Output: formatted xlsx spreadsheet
        """
        # Read the source file and convert to Pandas dataframe
        df = pd.read_csv(source_path)

        # Sort by movie, then by sentence number
        # (column names taken from reviews.csv above)
        df.sort_values(['Movie', 'Sentence'], ascending=[True, True], inplace=True)

        # Create the xlsx file that we'll be writing to
        orig = pd.ExcelWriter(dest_path, engine='xlsxwriter')

        # Convert the dataframe to Excel, create the sheet
        df.to_excel(orig, index=False, sheet_name='report')

        # Variables for the workbook and worksheet
        workbook = orig.book
        worksheet = orig.sheets['report']

        # Formats for exact match, partial match, and mismatch
        exact = workbook.add_format({'bg_color': '#B7F985'})  # green
        partial = workbook.add_format({'bg_color': '#D3F6F4'})  # blue
        mismatch = workbook.add_format({'bg_color': '#F6D9D3'})  # red

        # Do the conditional formatting somehow

        # close() writes the file; ExcelWriter.save() was removed in pandas 2.0
        orig.close()
    

1 answer:

Answer 0 (score: 2)

Disclaimer: I'm one of the authors of the library.

This is easy to achieve using StyleFrame and DataFrame.duplicated:

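import pandas as pd
# In newer releases the package may need to be imported as `styleframe` (lowercase).
from StyleFrame import StyleFrame, Styler

# Load the data from the question
df = pd.read_csv('reviews.csv')
sf = StyleFrame(df)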

green = Styler(bg_color='#B7F985')
blue = Styler(bg_color='#D3F6F4')
red = Styler(bg_color='#F6D9D3')

sf.apply_style_by_indexes(sf[df.duplicated(subset=['Movie', 'Sentence'], keep=False)],
                          styler_obj=red)
sf.apply_style_by_indexes(sf[df.duplicated(subset=['Movie', 'Sentence', 'Tag'], keep=False)],
                          styler_obj=blue)
sf.apply_style_by_indexes(sf[df.duplicated(subset=['Movie', 'Sentence', 'Tag', 'Sentiment'],
                                           keep=False)],
                          styler_obj=green)

sf.to_excel('test.xlsx').save()

This outputs the following:

[Image: output spreadsheet]
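
For comparison, the same layering of `DataFrame.duplicated` masks can probably be reproduced with the pandas Styler alone, since the question mentions the Pandas Styles documentation. The following is a minimal sketch, not the approach used in the answer above. The helper names `row_colors` and `write_colored_xlsx` are hypothetical, and exporting is assumed to need `openpyxl` and `jinja2` installed. The loosest rule (red) is applied first so that the stricter matches (blue, then green) overwrite it, mirroring the order of the `apply_style_by_indexes` calls.

```python
import pandas as pd

GREEN, BLUE, RED = "#B7F985", "#D3F6F4", "#F6D9D3"


def row_colors(df):
    """Return one background color per row ('' when no rule applies)."""
    colors = pd.Series("", index=df.index)
    # Loosest rule first, so that stricter matches overwrite it.
    levels = [
        (["Movie", "Sentence"], RED),
        (["Movie", "Sentence", "Tag"], BLUE),
        (["Movie", "Sentence", "Tag", "Sentiment"], GREEN),
    ]
    for subset, color in levels:
        # keep=False marks every member of a duplicate group, not only the repeats.
        colors[df.duplicated(subset=subset, keep=False)] = color
    return colors


def write_colored_xlsx(df, dest_path):
    """Write df to dest_path with each row filled in its rule color."""
    colors = row_colors(df)

    def paint(row):
        # row.name is the index label of the row currently being styled.
        color = colors[row.name]
        css = f"background-color: {color}" if color else ""
        return [css] * len(row)

    df.style.apply(paint, axis=1).to_excel(dest_path, index=False, engine="openpyxl")
```

Calling `write_colored_xlsx(pd.read_csv('reviews.csv'), 'test.xlsx')` should then produce a file colored by the three rules. Note that this writes fixed cell fills rather than Excel conditional-formatting rules, so the colors will not update if the values are edited later in Excel.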