Question

I am trying to combine these strings and rows according to some logic:

import numpy as np
import pandas as pd

s1 = ['abc.txt','abc.txt','ert.txt','ert.txt','ert.txt']
s2 = [1,1,2,2,2]
s3 = ['Harry Potter','Vol 1','Lord of the Rings - Vol 1',np.nan,'Harry Potter']
df = pd.DataFrame(list(zip(s1,s2,s3)),
            columns=['file','id','book'])
df

Data preview:

file     id  book
abc.txt  1   Harry Potter
abc.txt  1   Vol 1
ert.txt  2   Lord of the Rings - Vol 1
ert.txt  2   NaN
ert.txt  2   Harry Potter

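I have a column of file names, each with an ID. I also have a 'book' column in which 'Vol 1' sits on a separate row. I know that this 'Vol 1' is only associated with 'Harry Potter' in the given data set. Grouping by 'file' and 'id', how can I merge 'Vol 1' into the row in which the 'Harry Potter' string appears? Note that some 'Harry Potter' rows have no 'Vol 1'; I only want 'Vol 1' appended when it appears within the same file and id group.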

Two attempts:

First: does not work

if (df['book'] == 'Harry Potter' and df['book'].str.contains('Vol 1',case=False) in df.groupby(['file','id'])):
    df.groupby(['file','id'],as_index=False).first()

Second: this applies to every matching string (but I do not want it applied to every 'Harry Potter' string).

df.loc[df['book'].str.contains('Harry Potter',case=False,na=False), 'new_book'] = 'Harry Potter - Vol 1'

This is the output I am looking for:

file     id  book
abc.txt  1   Harry Potter - Vol 1
ert.txt  2   Lord of the Rings - Vol 1
ert.txt  2   NaN
ert.txt  2   Harry Potter

Answer 1

Start with import re (you will use it).

Then create your DataFrame:

df = pd.DataFrame({
    'file': ['abc.txt','abc.txt','ert.txt','ert.txt','ert.txt'],
    'id':   [1, 1, 2, 2, 2],
    'book': ['Harry Potter', 'Vol 1', 'Lord of the Rings - Vol 1',
             np.nan, 'Harry Potter']})

The first step is to add a column, call it book2, containing the book value from the next row:

df["book2"] = df.book.shift(-1).fillna('')

I added fillna('') to replace NaN values with an empty string.
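Note that a plain shift looks at the next row even when that row belongs to a different file and id. As a minimal sketch, assuming the volume should only be picked up from within the same (file, id) group, the shift can be done per group. The names plain_next and grouped_next are illustrative and not part of the answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'file': ['abc.txt', 'abc.txt', 'ert.txt', 'ert.txt', 'ert.txt'],
    'id':   [1, 1, 2, 2, 2],
    'book': ['Harry Potter', 'Vol 1', 'Lord of the Rings - Vol 1',
             np.nan, 'Harry Potter'],
})

# Plain shift: each row sees the next row, whatever group it belongs to.
plain_next = df['book'].shift(-1).fillna('')

# Group-aware shift (an assumption about the intent): the last row of each
# (file, id) group gets '' instead of the first value of the next group.
grouped_next = df.groupby(['file', 'id'])['book'].shift(-1).fillna('')

print(plain_next.tolist())
print(grouped_next.tolist())
```

The two results differ only at the last row of the abc.txt group, which is exactly where a plain shift would leak a value across groups.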

Then define the function to apply to each row:

def fn(row):
    return f"{row.book} - {row.book2}" if row.book == 'Harry Potter'\
        and re.match(r'^Vol \d+$', row.book2) else row.book

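This function checks whether book == 'Harry Potter' and whether book2 matches 'Vol ' followed by a sequence of digits. If so, it returns book and book2 joined by ' - '; otherwise it returns just book.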
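To see which strings the pattern accepts, here is a small check using only the standard re module. The candidates list is made up for illustration:

```python
import re

pattern = r'^Vol \d+$'

# Illustrative inputs; only strings that are exactly "Vol " plus digits match.
candidates = ['Vol 1', 'Vol 12', 'Vol', 'vol 1', 'Lord of the Rings - Vol 1', '']
matches = [bool(re.match(pattern, s)) for s in candidates]

for s, m in zip(candidates, matches):
    print(repr(s), m)
```

The match is case-sensitive and anchored at both ends, so a title that merely contains "Vol 1" and the empty string produced by fillna('') are both rejected.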

Then we apply this function and save the result back under book:

df["book"] = df.apply(fn, axis=1)

The only thing left is to drop:

rows where book matches Vol \d+,
the book2 column.

The code is:

df = df.drop(df[df.book.str.match(r'^Vol \d+$').fillna(False)].index)\
    .drop(columns=['book2'])

fillna(False) is needed because str.match returns NaN where the source content is NaN.
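A small sketch of that behaviour, assuming an object-dtype column as in the question. It also shows the na=False argument of str.match, which is an alternative to the trailing fillna(False) and is not what the answer itself uses:

```python
import numpy as np
import pandas as pd

# Object dtype is forced here so the missing entry behaves as in the question.
book = pd.Series(['Harry Potter - Vol 1', 'Vol 1', np.nan], dtype=object)

raw = book.str.match(r'^Vol \d+$')                    # missing entry is not a plain bool
filled = book.str.match(r'^Vol \d+$').fillna(False)   # the answer's approach
via_na = book.str.match(r'^Vol \d+$', na=False)       # same mask in one call

print(raw.tolist())
print(filled.tolist())
print(via_na.tolist())
```

A mask that still contains NaN cannot be used safely for boolean indexing, which is why one of the two fixes is required before selecting the rows to drop.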

Answer 2

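Assuming that "Vol x" appears on the row right after the title, I would use a helper Series obtained by shifting the book column by -1. It is then enough to concatenate that Series onto the book column wherever the Series starts with "Vol ", and to drop the rows where the book column itself starts with "Vol ". The code could be: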

b2 = df.book.shift(-1).fillna('')
df['book'] = df.book + np.where(b2.str.match('Vol [0-9]+'), ' - ' + b2, '')
print(df.drop(df.loc[df.book.fillna('').str.match('Vol [0-9]+')].index))

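If the order in the DataFrame is not guaranteed, but each Vol x row matches another row with the same file and id, you can split the DataFrame into two parts, one containing the Vol x rows and one containing the other rows, and update the latter from the former: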

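# Split on whether the row is a "Vol x" row
g = df.groupby(df.book.fillna('').str.match('Vol [0-9]+'))
for k, v in g:
    if k:
        df_vol = v          # rows that hold only "Vol x"
    else:
        df = v.copy()       # all other rows; copy so .loc can modify it safely

# Append each volume to the rows sharing its file and id
for _, r in df_vol.iterrows():
    df.loc[(df.file == r.file)&(df.id==r.id), 'book'] += ' - ' + r['book']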
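The same split-and-update idea can also be sketched without an explicit loop, using a left merge on file and id. This is an alternative formulation, not the answer's code, and it assumes at most one "Vol x" row per (file, id) group; otherwise the merge would duplicate title rows. The names is_vol, vols, titles, merged and result are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'file': ['abc.txt', 'abc.txt', 'ert.txt', 'ert.txt', 'ert.txt'],
    'id':   [1, 1, 2, 2, 2],
    'book': ['Harry Potter', 'Vol 1', 'Lord of the Rings - Vol 1',
             np.nan, 'Harry Potter'],
})

# Rows whose book is only a volume marker such as "Vol 1".
is_vol = df['book'].fillna('').str.match(r'Vol [0-9]+')
vols = df.loc[is_vol, ['file', 'id', 'book']].rename(columns={'book': 'vol'})
titles = df.loc[~is_vol]

# Attach the group's volume (if any) to every remaining row of that group.
merged = titles.merge(vols, on=['file', 'id'], how='left')

# Only rows with both a title and a volume are rewritten; NaN titles stay NaN.
has_vol = merged['vol'].notna() & merged['book'].notna()
merged.loc[has_vol, 'book'] = (
    merged.loc[has_vol, 'book'] + ' - ' + merged.loc[has_vol, 'vol']
)

result = merged.drop(columns='vol')
print(result)
```

As with the loop, every titled row in a group that has a volume receives it, so this only restricts the change to 'Harry Potter' if, as the question states, the volume rows occur only alongside that title.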

Answer 3

Use merge, apply, update, and drop_duplicates.

Call set_index on file and id, then merge on the index between the rows of df equal to 'Harry Potter' and the rows equal to 'Vol 1'; apply a join to create the appropriate string and convert the result to a DataFrame:

df.set_index(['file', 'id'], inplace=True)
df1 = df[df['book'] == 'Harry Potter'].merge(df[df['book'] == 'Vol 1'],
          left_index=True, right_index=True).apply(' '.join, axis=1).to_frame(name='book')

Out[2059]:
                          book
file    id
abc.txt 1   Harry Potter Vol 1

Then update the original df with df1 and call drop_duplicates.

Python - Grouping a DataFrame based on certain strings
