Question

Given the following data frame:

sd = pd.DataFrame({'Site':['A','B','B','C','A','A'],
                   'Station(s)':[',1,2,,',',1,2,,',',,,,',',1,2,,','0,1,2,,',',,2,,'],
                   'Item 1':[1,1,0,0,1,0],
                   'Item 2':[1,0,0,1,1,1]})
sd[['Site','Station(s)','Item 1','Item 2']]

sd

    Site    Station(s)  Item 1  Item 2
0   A        ,1,2,,        1    1
1   B        ,1,2,,        1    0
2   B        ,,,,          0    0
3   C        ,1,2,,        0    1
4   A        0,1,2,,       1    1
5   A        ,,2,,         0    1

I want to end up with this:

    Contractor  President   Site(s)     Station(s)  Item 1  Item 2
0      1           1           A         ,1,2,,       1     1
1      1           0           B         ,1,2,,       1     0
2      0           0           B         ,,,,         0     0
3      0           0           C         ,1,2,,       0     1
4      0           1           A         0,1,2,,      1     1
5      1           1           A         ,,2,,        0     1

results = pd.DataFrame({'Contractor':[1,1,0,0,0,1],
                    'President':[1,0,0,0,1,1],
                   'Site(s)':['A','B','B','C','A','A'],
                   'Station(s)':[',1,2,,',',1,2,,',',,,,',',1,2,,','0,1,2,,',',,2,,'],
                   'Item 1':[1,1,0,0,1,0],
                   'Item 2':[1,0,0,1,1,1]})
results[['Contractor','President','Site(s)','Station(s)','Item 1','Item 2']]

based on this logic:

For each position in pos:

Create a new column in sd with that position's name.
Set its value to 1 for every row that meets both of the following conditions (and to 0 for all other rows):

a. sd['Site'] matches at least one of the values in pos['Site(s)']
b. sd['Station(s)'] contains at least one of the numbers in pos['Station(s)'], but no extra numbers

I started with this, but quickly got stuck:

for i in pos['Position']:
    sd[i]= 1 if lambda x: 'x' if x for x in pos['Site(s)'] if x in sd['Site']

Answer 1

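Because of the way the data is stored (as values in comma-separated strings), the code has to loop through the rows, pick apart the values, iterate through the other DataFrame and pick apart its values, and then compare the two, and so on. I don't see a way to really improve the situation as long as the input keeps holding comma-separated values.

Given those constraints, I think su79eu7k's answer is quite good.

If, however, you believe that "tidy data" (PDF) is better, and you allow us to change the starting point to DataFrames in a tidy format, then there is a different approach which may be more performant, particularly when sd has many rows. The problem with sd.apply(check, axis=1) is that it uses a Python loop to iterate through the rows of sd. Calling check once per row can be relatively slow compared with equivalent code that takes advantage of Pandas' faster vectorized methods (such as merge or groupby). To use merge and groupby, however, the data needs to be in a tidy format.

So suppose that instead of pos and sd we start with tidypos and tidysd. (At the end of this post you'll find a runnable example that converts pos and sd to their tidy equivalents.)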

In [238]: tidypos
Out[238]: 
     Position Site Station
0  Contractor    A       1
1  Contractor    A       2
2  Contractor    B       1
3  Contractor    B       2
4   President    A       0
5   President    A       1
6   President    A       2
7   President    A       3
8   President    A       4

In [239]: tidysd
Out[239]: 
   index Site Station
0      0    A       1
1      0    A       2
2      1    B       1
3      1    B       2
4      3    C       1
5      3    C       2
6      4    A       0
7      4    A       1
8      4    A       2
9      5    A       2

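tidypos and tidysd contain the same information as pos and sd (ignoring the Items, since they play no role in this problem). The main difference is that each row of tidypos and tidysd corresponds to a single "observation", and each observation is independent of the others. Essentially, this boils down to splitting the comma-separated values so that each value ends up on its own row.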
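As a rough illustration of that splitting step, here is a minimal sketch that builds tidysd with DataFrame.explode. It assumes pandas 0.25 or later (where explode is available) and is an alternative to, not a copy of, the list-comprehension version in the runnable example at the end of this answer.

```python
import pandas as pd

sd = pd.DataFrame({'Site': ['A', 'B', 'B', 'C', 'A', 'A'],
                   'Station(s)': [',1,2,,', ',1,2,,', ',,,,', ',1,2,,',
                                  '0,1,2,,', ',,2,,']})

# Turn each comma-separated string into a list of the digit runs it contains.
stations = sd['Station(s)'].str.findall(r'\d+')

tidysd = (sd[['Site']]
          .assign(Station=stations)
          .explode('Station')          # one row per (original row, station)
          .dropna(subset=['Station'])  # an empty list explodes to NaN; drop it
          .rename_axis('index')        # keep the original row label as a column
          .reset_index())
print(tidysd)
```

Station values stay strings here, as they do in the list-comprehension version, so the later merge compares like with like.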

Now we can join the two DataFrames on their common columns, Site and Station:

In [241]: merged = pd.merge(tidysd, tidypos, how='left'); merged
Out[241]: 
    index Site Station    Position
0       0    A       1  Contractor
1       0    A       1   President
2       0    A       2  Contractor
3       0    A       2   President
4       1    B       1  Contractor
5       1    B       2  Contractor
6       3    C       1         NaN
7       3    C       2         NaN
8       4    A       0   President
9       4    A       1  Contractor
10      4    A       1   President
11      4    A       2  Contractor
12      4    A       2   President
13      5    A       2  Contractor
14      5    A       2   President

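Each row of merged now represents a match between one row of tidysd and one row of tidypos. (Rows whose Position is NaN, such as those for index 3, matched nothing and are kept only because of how='left'.) So the existence of a row with a Position means that sd['Site'] matches one of pos['Site(s)'] and that tidysd['Station'] matches tidypos['Station']. In other words, for that row, sd['Station(s)'] must contain a number that is in pos['Station(s)']. The only criterion we are not yet sure of is whether sd['Station(s)'] contains extra numbers that do not appear in pos['Station(s)'].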
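To see which rows of a left join actually matched, the indicator=True option of pd.merge can help. The two small frames below are cut-down stand-ins for tidysd and tidypos, made up for this sketch rather than taken from the full data.

```python
import pandas as pd

# Cut-down stand-ins for tidysd and tidypos (illustrative only).
tidysd = pd.DataFrame({'index': [1, 1, 3, 3],
                       'Site': ['B', 'B', 'C', 'C'],
                       'Station': ['1', '2', '1', '2']})
tidypos = pd.DataFrame({'Position': ['Contractor', 'Contractor'],
                        'Site': ['B', 'B'],
                        'Station': ['1', '2']})

# indicator=True adds a _merge column saying where each row came from.
merged = pd.merge(tidysd, tidypos, how='left', indicator=True)
print(merged)
```

Rows marked left_only are the ones with a NaN Position: they exist in tidysd but have no Site/Station counterpart in tidypos.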

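We can find that out by counting the rows of merged for each index and Position, since each such row corresponds to a different Station. If that count equals the total number of Stations for that index, then sd['Station(s)'] contains no "extra numbers".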
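Put differently, "matched count equals total count" is a subset test on the station sets. A small plain-Python sketch of the same criterion follows; the function name qualifies and the literal sets are made up for illustration.

```python
def qualifies(row_stations, allowed_stations):
    """True if the row lists at least one station and none outside the allowed set."""
    row_stations = set(row_stations)
    # Non-empty, and every station is allowed (no "extra numbers").
    return bool(row_stations) and row_stations <= set(allowed_stations)

contractor = {'1', '2'}
president = {'0', '1', '2', '3', '4'}

print(qualifies(['0', '1', '2'], president))   # True: all three are allowed
print(qualifies(['0', '1', '2'], contractor))  # False: '0' is an extra number
print(qualifies([], contractor))               # False: no stations at all
```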

We can use groupby/nunique to count the number of Stations for each index and Position:

In [256]: pos_count = merged.groupby(['index', 'Position'])['Station'].nunique().unstack(); pos_count
Out[256]: 
Position  Contractor  President
index                          
0                2.0        2.0
1                2.0        NaN
4                2.0        3.0
5                1.0        1.0

We can also compute the total number of Stations for each index:

In [243]: total_count = tidysd.groupby(['index'])['Station'].nunique(); total_count
Out[243]: 
index
0    2
1    2
3    2
4    3
5    1
Name: Station, dtype: int64

Finally, we can assign 1s and 0s to the Contractor and President columns based on the criterion (pos_count[col] == total_count):

pos_count = pos_count.reindex(total_count.index, fill_value=0)
for col in pos_count:
    pos_count[col] = (pos_count[col] == total_count).astype(int)
pos_count = pos_count.reindex(sd.index, fill_value=0)
# Position  Contractor  President
# 0                  1          1
# 1                  1          0
# 2                  0          0
# 3                  0          0
# 4                  0          1
# 5                  1          1

If you really wish, you can then concatenate this result with the original sd to produce the exact desired result:

In [246]: result = pd.concat([sd, pos_count], axis=1); result
Out[246]: 
   Item 1  Item 2 Site Station(s)  Contractor  President
0       1       1    A     ,1,2,,           1          1
1       1       0    B     ,1,2,,           1          0
2       0       0    B       ,,,,           0          0
3       0       1    C     ,1,2,,           0          0
4       1       1    A    0,1,2,,           0          1
5       0       1    A      ,,2,,           1          1

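But again, if you believe that data should be tidy, you should avoid packing multiple rows of data into comma-separated strings.

How to tidy up pos and sd:

You can use the vectorized string methods .str.findall and .str.split to convert the comma-separated strings into lists of values. Then use list comprehensions to iterate over the rows and the lists to build tidypos and tidysd.

Putting it all together,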

import itertools as IT
import pandas as pd

pos = pd.DataFrame({'Station(s)':[',1,2,,','0,1,2,3,4'],
                    'Position':['Contractor','President'],
                    'Site(s)':['A,B','A'],
                    'Item(s)':['1','1,2']})

sd = pd.DataFrame({'Site':['A','B','B','C','A','A'],
                   'Station(s)':[',1,2,,',',1,2,,',',,,,',',1,2,,','0,1,2,,',',,2,,'],
                   'Item 1':[1,1,0,0,1,0],
                   'Item 2':[1,0,0,1,1,1]})

mypos = pos.copy()
mypos['Station(s)'] = mypos['Station(s)'].str.findall(r'(\d+)')
mypos['Site(s)'] = mypos['Site(s)'].str.split(r',')
tidypos = pd.DataFrame(
    [(row['Position'], site, station) 
     for index, row in mypos.iterrows() 
     for site, station in IT.product(
             *[row[col] for col in ['Site(s)', 'Station(s)']])], 
    columns=['Position', 'Site', 'Station'])

mysd = sd[['Site', 'Station(s)']].copy()
mysd['Station(s)'] = mysd['Station(s)'].str.findall(r'(\d+)')

tidysd = pd.DataFrame(
    [(index, row['Site'], station)
     for index, row in mysd.iterrows() 
     for station in row['Station(s)']], 
    columns=['index', 'Site', 'Station'])

merged = pd.merge(tidysd, tidypos, how='left')
pos_count = merged.groupby(['index', 'Position'])['Station'].nunique().unstack()
total_count = tidysd.groupby(['index'])['Station'].nunique()
pos_count = pos_count.reindex(total_count.index, fill_value=0)
for col in pos_count:
    pos_count[col] = (pos_count[col] == total_count).astype(int)
pos_count = pos_count.reindex(sd.index, fill_value=0)
result = pd.concat([sd, pos_count], axis=1)
print(result)

yields

   Item 1  Item 2 Site Station(s)  Contractor  President
0       1       1    A     ,1,2,,           1          1
1       1       0    B     ,1,2,,           1          0
2       0       0    B       ,,,,           0          0
3       0       1    C     ,1,2,,           0          0
4       1       1    A    0,1,2,,           0          1
5       0       1    A      ,,2,,           1          1

Answer 2

This is a rough attempt; you can improve on the code below.

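import numpy as np
import pandas as pd

# Start with every position column set to 0.
sd['Contractor'] = 0
sd['President'] = 0

def check(x):
    # x is one row of sd; flag each position whose rule the row satisfies.
    for p in pos['Position'].tolist():
        if x['Site'] in pos.set_index('Position').loc[p, 'Site(s)'].split(','):
            # Non-empty station values for this row and for position p.
            ss = pd.Series(x['Station(s)'].split(',')).replace('', np.nan).dropna()
            ps = pd.Series(pos.set_index('Position').loc[p, 'Station(s)'].split(',')).replace('', np.nan).dropna()
            # At least one station, and none outside the position's list.
            if not ss.empty and ss.isin(ps).all():
                x[p] = 1

    return x

print(sd.apply(check, axis=1))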


   Item 1  Item 2 Site Station(s)  Contractor  President
0       1       1    A     ,1,2,,           1          1
1       1       0    B     ,1,2,,           1          0
2       0       0    B       ,,,,           0          0
3       0       1    C     ,1,2,,           0          0
4       1       1    A    0,1,2,,           0          1
5       0       1    A      ,,2,,           1          1
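One possible tightening, offered only as a sketch: parse pos once into sets, rather than splitting its strings again for every row of sd. The helper to_set and the rules dict are names made up here; pos is the frame from the runnable example in Answer 1, and the Item columns are left out for brevity.

```python
import pandas as pd

pos = pd.DataFrame({'Station(s)': [',1,2,,', '0,1,2,3,4'],
                    'Position': ['Contractor', 'President'],
                    'Site(s)': ['A,B', 'A']})
sd = pd.DataFrame({'Site': ['A', 'B', 'B', 'C', 'A', 'A'],
                   'Station(s)': [',1,2,,', ',1,2,,', ',,,,', ',1,2,,',
                                  '0,1,2,,', ',,2,,']})

def to_set(text):
    """Split a comma-separated string and drop the empty pieces."""
    return {piece for piece in text.split(',') if piece}

# Parse pos once: position -> (allowed sites, allowed stations).
rules = {row['Position']: (to_set(row['Site(s)']), to_set(row['Station(s)']))
         for _, row in pos.iterrows()}

row_stations = sd['Station(s)'].map(to_set)
for position, (sites, stations) in rules.items():
    # 1 if the site is allowed, the row has stations, and none are extra.
    sd[position] = [int(site in sites and bool(have) and have <= stations)
                    for site, have in zip(sd['Site'], row_stations)]
print(sd)
```

This still loops over the rows in Python, so it is not the vectorized merge/groupby approach; it only avoids re-parsing pos on every row.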

Pandas: creating columns from rows in another data frame using conditions
