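Find the differences between the strings in every two rows of a pandas dataframe

Time: 2020-04-12 14:29:09

Tags: python string pandas difference

I'm new to python and have been struggling with this for a while. I have a file that looks like this: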

    name   seq
1   a1     bbb
2   a2     bbc
3   b1     fff
4   b2     fff
5   c1     aaa
6   c2     acg

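where name is the name of the string and seq is the string itself. I want a new column, or a new dataframe, that gives the number of differences between every two rows, without overlap. For example, I want the number of differences between the sequences for names [a1-a2], then [b1-b2], and finally [c1-c2].

So I need something like this: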

    name   seq   diff  
1   a1     bbb    NA   
2   a2     bbc    1
3   b1     fff    NA
4   b2     fff    0
5   c1     aaa    NA
6   c2     acg    2

Any help is greatly appreciated.

4 answers:

Answer 0 (score: 5)

It looks like you want the jaccard distance between each pair of strings. Here's one way using groupby and scipy.spatial.distance.jaccard:

from scipy.spatial.distance import jaccard
g = df.groupby(df.name.str[0])

df['diff'] = [sim for _, seqs in g.seq for sim in 
              [float('nan'), jaccard(*map(list,seqs))]]

print(df)

  name  seq  diff
1   a1  bbb   NaN
2   a2  bbc   1.0
3   b1  fff   NaN
4   b2  fff   0.0
5   c1  aaa   NaN
6   c2  acg   2.0
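
The output above shows raw counts, whereas the Jaccard distance is defined as a ratio between 0 and 1, so the result of calling jaccard on lists of characters may depend on the installed SciPy and NumPy versions. As a hedged alternative, here is a minimal sketch that keeps the same groupby idea but counts mismatched positions directly. The helper name count_mismatches is an assumption made for illustration, and the sketch assumes that each group's first and last strings have equal length.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["a1", "a2", "b1", "b2", "c1", "c2"],
     "seq": ["bbb", "bbc", "fff", "fff", "aaa", "acg"]},
    index=range(1, 7),
)


def count_mismatches(a, b):
    """Hypothetical helper: number of positions at which a and b differ."""
    return sum(x != y for x, y in zip(a, b))


# Group rows by the first character of name, as in the answer above.
key = df["name"].str[0]
diffs = df.groupby(key)["seq"].agg(
    lambda s: count_mismatches(s.iloc[0], s.iloc[-1])
)

# Write the count on the last row of each group; the other rows stay NaN.
is_last = ~key.duplicated(keep="last")
df["diff"] = key.map(diffs).where(is_last)
print(df)
```

For the sample data, the diff column comes out as NaN, 1, NaN, 0, NaN, 2.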

Answer 1 (score: 4)

An alternative using the Levenshtein distance:

import Levenshtein
s = df['name'].str[0]
out = df.assign(Diff=s.drop_duplicates(keep='last').map(df.groupby(s)['seq']
                    .apply(lambda x: Levenshtein.distance(x.iloc[0],x.iloc[-1]))))

  name  seq  Diff
1   a1  bbb   NaN
2   a2  bbc   1.0
3   b1  fff   NaN
4   b2  fff   0.0
5   c1  aaa   NaN
6   c2  acg   2.0
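
The Levenshtein distance counts insertions and deletions as well as substitutions, so it matches a position-by-position count on the sample data but can differ on other inputs. The sketch below illustrates this with pure-Python reference implementations. The function names hamming and levenshtein are chosen here for illustration, and this is not the code of the Levenshtein package used above.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


# Same result on the question's data ...
print(hamming("aaa", "acg"), levenshtein("aaa", "acg"))
# ... but different when one string is a shifted copy of the other.
print(hamming("abcd", "bcda"), levenshtein("abcd", "bcda"))
```

If the sequences can differ in length or be shifted relative to each other, the two measures answer different questions, so the choice depends on what "difference" should mean for the data.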

Answer 2 (score: 1)

As a first step, I recreated your data with the following:

#!/usr/bin/env python3
import pandas as pd

# Setup
data = {'name': {1: 'a1', 2: 'a2', 3: 'b1', 4: 'b2', 5: 'c1', 6: 'c2'},
        'seq': {1: 'bbb', 2: 'bbc', 3: 'fff', 4: 'fff', 5: 'aaa', 6: 'acg'}}
df = pd.DataFrame(data)

Solution

You can try iterating over the dataframe and comparing the seq value of the previous iteration with that of the current iteration. To compare two strings (stored in the seq column of the dataframe), you can sum a simple generator expression, as in the following function:

def diff_letters(a,b):
    # Counts the positions where a and b differ; assumes b is at least as long as a
    return sum ( a[i] != b[i] for i in range(len(a)) )

Then iterate over the dataframe rows, applying diff_letters to each pair of rows. The iteration code and its printed result are not preserved in this copy of the answer.
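
A minimal sketch of the row iteration this answer describes, assuming the rows come in consecutive pairs and the two strings in each pair have equal length. The variable names previous and diffs are illustrative, and this is a reconstruction of the idea rather than the answerer's own code.

```python
import numpy as np
import pandas as pd

data = {'name': {1: 'a1', 2: 'a2', 3: 'b1', 4: 'b2', 5: 'c1', 6: 'c2'},
        'seq': {1: 'bbb', 2: 'bbc', 3: 'fff', 4: 'fff', 5: 'aaa', 6: 'acg'}}
df = pd.DataFrame(data)


def diff_letters(a, b):
    # Positional comparison; assumes b is at least as long as a.
    return sum(a[i] != b[i] for i in range(len(a)))


diffs = []
previous = None  # seq of the first row of the current pair, if one is pending
for _, row in df.iterrows():
    if previous is None:
        # First row of a pair: nothing to compare against yet.
        diffs.append(np.nan)
        previous = row['seq']
    else:
        # Second row of a pair: compare with the stored first row, then reset.
        diffs.append(diff_letters(previous, row['seq']))
        previous = None

df['diff'] = diffs
print(df)
```

With the sample data, the diff column is NaN, 1, NaN, 0, NaN, 2, which matches the output requested in the question.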

Answer 3 (score: 0)

Check this out:

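import numpy as np
import pandas as pd

data = {'name': ['a1', 'a2', 'b1', 'b2', 'c1', 'c2'],
        'seq': ['bbb', 'bbc', 'fff', 'fff', 'aaa', 'acg']}

df = pd.DataFrame(data, columns=['name', 'seq'])
df['diff'] = np.nan  # the first row of each pair keeps NaN

# Walk the rows two at a time: i is the first row of a pair, i + 1 the second.
i = 0
while i < len(df) - 1:
    item = df.at[i, 'seq']
    diffCntr = 0
    # Count the characters of the second string that occur nowhere in the first.
    # Note: this tests membership, not position, so 'abc' vs 'cba' counts as 0.
    for j in df.at[i + 1, 'seq']:
        if item.find(j) < 0:
            diffCntr += 1
    df.at[i + 1, 'diff'] = diffCntr
    i += 2
print(df)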

The result looks like this:

    name seq    diff
0   a1   bbb    NaN
1   a2   bbc    1.0
2   b1   fff    NaN
3   b2   fff    0.0
4   c1   aaa    NaN
5   c2   acg    2.0