Python - regex - splitting a string before a word

Time: 2011-07-15 15:03:23

Tags: python regex string split splice

I am trying to split a string in Python before a specific word. For example, I want to split the following string before "path:".

  • Split the string before "path:"
  • Input: "path:bte00250 Alanine, aspartate and glutamate metabolism path:bte00330 Arginine and proline metabolism"
  • Output: ['path:bte00250 Alanine, aspartate and glutamate metabolism', 'path:bte00330 Arginine and proline metabolism']

I have tried:

rx = re.compile("(:?[^:]+)")
rx.findall(line)

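This does not split the string where I want. The trouble is that the value after "path:" is never known in advance, so I cannot specify the whole word. Does anyone know how to do this?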
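For reference, a minimal sketch of what the attempted pattern returns on the sample input. In `(:?[^:]+)` the `:?` is an optional literal colon, not the non-capturing group syntax `(?:`, so the matches break at every colon rather than before "path:". The variable names are illustrative.

```python
import re

line = ("path:bte00250 Alanine, aspartate and glutamate metabolism "
        "path:bte00330 Arginine and proline metabolism")

# "(:?[^:]+)" means: an optional ':' followed by one or more non-colon characters.
rx = re.compile("(:?[^:]+)")
matches = rx.findall(line)

# Each match ends just before the next colon, so "path" gets attached
# to the end of the previous piece instead of starting a new one.
for m in matches:
    print(repr(m))
```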

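4 answers:

Answer 0: (score: 4)

Using a regular expression to split your string seems a bit overkill: the string split() method may be just what you need.

Anyway, if you really need to match a regular expression in order to split your string, you should use the re.split() method, which splits a string wherever the regular expression matches.

Also, use a correct regular expression for splitting: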

>>> import re
>>> line = 'path:bte00250 Alanine, aspartate and glutamate metabolism path:bte00330 Arginine and proline metabolism'
>>> re.split(' (?=path:)', line)
['path:bte00250 Alanine, aspartate and glutamate metabolism', 'path:bte00330 Arginine and proline metabolism']

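The (?=...) group is a lookahead assertion: the expression matches a space (note the space at the start of the expression) that is followed by the string 'path:', without consuming anything after the space.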
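To make the role of the lookahead concrete, the sketch below compares it with a plain split on ' path:', which consumes the word. The `\s+` variant is an assumption added here for inputs with irregular whitespace; it is not part of the answer above.

```python
import re

line = ("path:bte00250 Alanine, aspartate and glutamate metabolism "
        "path:bte00330 Arginine and proline metabolism")

# Without a lookahead, 'path:' is part of the separator and is consumed.
consumed = re.split(r' path:', line)

# With a lookahead, only the space is the separator; 'path:' stays in the result.
kept = re.split(r' (?=path:)', line)

# Assumed variant: tolerate any run of whitespace before 'path:'.
flexible = re.split(r'\s+(?=path:)', line)

print(consumed)
print(kept)
print(flexible)
```

In the first result the second element has lost its 'path:' prefix, which is why the lookahead form is the one that matches the desired output.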

Answer 1: (score: 2)

You can do ["path:"+s for s in line.split("path:")[1:]] instead of using a regex. (Note that we skip the first element, which is empty and has no "path:" prefix.)
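A minimal runnable sketch of this approach. The `.strip()` call is an addition here, assuming trailing whitespace on each piece is unwanted; without it the first element keeps the space that preceded the second "path:".

```python
line = ("path:bte00250 Alanine, aspartate and glutamate metabolism "
        "path:bte00330 Arginine and proline metabolism")

# split("path:") yields an empty first element because the string starts
# with "path:", so [1:] skips it; the prefix is then put back on each piece.
parts = ["path:" + s.strip() for s in line.split("path:")[1:]]

print(parts)
```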

Answer 2: (score: 0)

in_str = "path:bte00250 Alanine, aspartate and glutamate metabolism path:bte00330 Arginine and proline metabolism"
in_list = in_str.split('path:')
# Joins the pieces back into one string separated by ",path:"; [1:] drops the leading comma.
print(",path:".join(in_list)[1:])

Answer 3: (score: 0)

This can be done without regular expressions. Start with the input string.

We can temporarily replace the target word with a placeholder. The placeholder is a single character, and we split the string on it.
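A minimal sketch of the placeholder step. The names `text`, `word`, and `placeholder`, and the choice of "|" as the placeholder character, are assumptions made for illustration and are not the answer's original code. The placeholder must be a character that never occurs in the input.

```python
text = ("path:bte00250 Alanine, aspartate and glutamate metabolism "
        "path:bte00330 Arginine and proline metabolism")
word = "path:"
placeholder = "|"  # assumed: any single character that never occurs in the text

# Guard against a placeholder that would split the text in the wrong places.
assert placeholder not in text

# Replace every occurrence of the word with the placeholder, then split on it.
pieces = text.replace(word, placeholder).split(placeholder)

print(pieces)
```

The first piece is empty because the text starts with the word, and the pieces no longer carry the "path:" prefix.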

Now that the string is split, we can use a list comprehension to re-attach the original word to each substring.
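A minimal sketch of the re-attachment step, under the same assumptions as before: the names `text`, `word`, `placeholder`, and `result` are hypothetical, and "|" is assumed not to occur in the input. Filtering out empty pieces and calling `strip()` are choices made here to reproduce the output requested in the question.

```python
text = ("path:bte00250 Alanine, aspartate and glutamate metabolism "
        "path:bte00330 Arginine and proline metabolism")
word = "path:"
placeholder = "|"  # assumed: a character that never occurs in the text

# Swap the word for the placeholder and split on it.
pieces = text.replace(word, placeholder).split(placeholder)

# Re-attach the word to each non-empty piece; strip() removes the space
# that was left in front of the next "path:".
result = [word + piece.strip() for piece in pieces if piece.strip()]

print(result)
```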
