Question

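I have a data frame with two columns, "title" and "description". The title column holds a bunch of titles related to clinical laboratory tests. Unfortunately, most of the titles refer to the same test, but small variations in wording and capitalization make them show up as unique.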

values = [('Complete blood picture', 'AB'), ('Complete BLOOD test', 'AB'), ('blood glucose', 'AB'), ('COMplete blood Profile', 'AB')]
labels = ['title', 'description']
import pandas as pd
labtest = pd.DataFrame.from_records(values, columns = labels) # Create data frame
labtest = labtest.apply(lambda x: x.astype(str).str.lower())  # Convert columns to string and lower case
labtest['title'].str.contains("blood")  # Search for blood

Before:

Title                       Description
Complete blood test         AB
COMPLETE Blood test\        AB
Blood glucose               AB
Complete blood picture      AB

After: [this is how I want the data frame to look]

Title                       Description
Blood test                   AB
Blood test                   AB
Blood test                   AB
Blood test                   AB

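I want to search each title for the word "blood" and, if it is found, replace the entire title with "Blood test".

P.S. I am new to Python and to working with text data; so far I have only managed to find the word "blood".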

Answer 1

This is not an exact solution, because I don't know the format of your data. Here is an example using a text file that you can adapt.

If file.txt contains:

Title                       Description
Complete blood test         ABO group
COMPLETE Blood test\        ABO group
Blood glucose               ABO group
Complete blood picture      ABO group

Code:

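track_dublicate = {}
with open('file.txt') as f:
    for line_no, line in enumerate(f):
        if line_no == 0:
            continue  # skip the header row
        parts = line.split()
        key = tuple(parts[-2:])  # last two tokens are the description, e.g. ('ABO', 'group')
        if key not in track_dublicate:
            # first time this description is seen: keep the title tokens
            track_dublicate[key] = parts[:-2]
        else:
            # description seen before: collapse the title to 'Blood test'
            track_dublicate[key] = 'Blood test'

print(track_dublicate)
# You can save this data to a new file if you want.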

output:

{('ABO', 'group'): 'Blood test'}
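
If the data is already in a pandas DataFrame, as in the question, a more direct route may be to build a boolean mask with `str.contains` and overwrite the title on the matching rows with `.loc`. This is only a minimal sketch: the `Urine culture` row is a hypothetical addition (not in the question) to show that non-matching titles are left alone, and `case=False` is assumed to be the matching behavior wanted, so no prior lower-casing is needed.

```python
import pandas as pd

values = [('Complete blood picture', 'AB'), ('Complete BLOOD test', 'AB'),
          ('blood glucose', 'AB'), ('COMplete blood Profile', 'AB'),
          ('Urine culture', 'AB')]  # hypothetical non-blood row, added for illustration
labtest = pd.DataFrame.from_records(values, columns=['title', 'description'])

# Boolean mask: True where the title contains "blood", ignoring case.
# na=False treats missing titles as non-matching.
mask = labtest['title'].str.contains('blood', case=False, na=False)

# Overwrite the whole title on the matching rows only.
labtest.loc[mask, 'title'] = 'Blood test'

print(labtest)
```

Note that this also renames titles such as "blood glucose", which is what the question asks for; if some blood-related tests should keep their own name, the pattern passed to `str.contains` would need to be narrower.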
