How do I apply the following rules to transform data?

Time: 2018-02-12 23:00:04

Tags: python python-3.x data-cleaning

Hello, I have the following text:

blood sampling were recor-
ded.

concentration [Cmax]
values available) in which subjects who administered

values around Cmax SCALE Diabetes (Trial 2) was a randomised, double-
blind, placebo-controlled, parallel group, multicentre,

considered pharmacokinetically relevant. However, gly-
caemic status was confounded by trial as all subjects with

I would like to transform this data into the following format:

blood sampling were recorded. concentration [Cmax] values available) in which subjects who administered values around Cmax SCALE Diabetes (Trial 2) was a randomised, double-blind, placebo-controlled, parallel group, multicentre, considered pharmacokinetically relevant. However, glycaemic status was confounded by trial as all subjects with

So I tried the following:

file = open('test.txt', 'r',encoding='utf-8') 

list_lines = []

for line in file:
    print(line)
    list_lines.append(line.replace('\n', ' ').replace('-\n', ''))

big_line = ''.join(list_lines)

text_file = open('changed.txt', "w",encoding='utf-8')
text_file.write(big_line)
text_file.close()
print('writing document')  

But I got:

blood sampling were recor- ded.  concentration [Cmax] values available) in which subjects who administered  values around Cmax SCALE Diabetes (Trial 2) was a randomised, double- blind, placebo-controlled, parallel group, multicentre,  considered pharmacokinetically relevant. However, gly- caemic status was confounded by trial as all subjects with

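I cannot find a way to automate this task. The following words still come out broken:

recor- ded, double- blind, gly- caemic

I would really appreciate help with this, because I do not know how to proceed. The main problem is that once I apply the first rule I can no longer apply the second, because both rules involve '\n'.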
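To make the ordering problem concrete, here is a minimal sketch on a single line taken from the sample text. The variable names are illustrative only. Replacing '\n' first consumes the newline, so the later '-\n' pattern has nothing left to match.

```python
# One line as the file iterator yields it: a word fragment, a hyphen, a newline.
line = "recor-\n"

# Order used in the attempt above: the newline is consumed first,
# so the second replace never finds '-\n'.
newline_first = line.replace("\n", " ").replace("-\n", "")

# Reverse order: the hyphenated break is removed while '-\n' is still intact.
hyphen_first = line.replace("-\n", "").replace("\n", " ")

print(repr(newline_first))
print(repr(hyphen_first))
```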

2 Answers:

Answer 0: (score: 1)

Here is a solution that uses re.sub with a callback:

re.sub('-?\n+', lambda x: '' if '-' in x.group() else ' ', text)

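which returns:

  blood sampling were recorded. concentration [Cmax] values available) in which subjects who administered values around Cmax SCALE Diabetes (Trial 2) was a randomised, doubleblind, placebo-controlled, parallel group, multicentre, considered pharmacokinetically relevant. However, glycaemic status was confounded by trial as all subjects with

The pattern matches one or more newlines, optionally preceded by a hyphen (-). The callback controls the replacement. If the match contains a hyphen, the break is treated as a word continuing onto the next line, so the whole match is replaced by an empty string. Otherwise, a space is inserted.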
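The same idea as a self-contained, runnable sketch. The function name join_wrapped_lines and the shortened sample string are assumptions made for illustration; only the re.sub call itself comes from the answer.

```python
import re


def join_wrapped_lines(text: str) -> str:
    """Join hard-wrapped lines (hypothetical helper name).

    A hyphen directly before the line break is dropped together with the
    break; any other run of newlines becomes a single space.
    """
    return re.sub(r"-?\n+", lambda m: "" if "-" in m.group() else " ", text)


# Shortened excerpt of the sample text, including one blank line.
sample = (
    "blood sampling were recor-\n"
    "ded.\n"
    "\n"
    "considered pharmacokinetically relevant. However, gly-\n"
    "caemic status was confounded by trial as all subjects with"
)

print(join_wrapped_lines(sample))
```

One caveat with this rule: it removes every line-end hyphen, so a genuine compound that happens to break at its hyphen, such as "double-\nblind", is joined as "doubleblind". Telling the two cases apart would need extra information, for example a word list.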

Answer 1: (score: 0)

Use replace():

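# Reconstructed sketch: the snippet originally posted here was lost in extraction.
# Assumes the input file is test.txt, as in the question.
with open('test.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Join hyphenated line breaks first, while '-\n' is still intact,
# then turn the remaining newlines into spaces.
text = text.replace('-\n', '').replace('\n', ' ')
# Collapse the double spaces left behind by blank lines.
text = text.replace('  ', ' ')
print(text)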

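Output:

  blood sampling were recorded. concentration [Cmax] values available) in which subjects who administered values around Cmax SCALE Diabetes (Trial 2) was a randomised, doubleblind, placebo-controlled, parallel group, multicentre, considered pharmacokinetically relevant. However, glycaemic status was confounded by trial as all subjects with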