Question

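I have a list of chemical reactions in a pandas DataFrame, and I want to break each one down into its individual components. The equations aren't very complicated; here are a couple of examples: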

N2 + CH4 → HCN + NH3

H2+F2→2HF

The goal is to split the strings on + and → and get the following:

['N2', 'CH4', 'HCN', 'NH3']
['H2', 'F2', 'HF']

Here is what I have so far:

import re

import pandas as pd

df = pd.read_csv("foo.csv") # read the csv file

convert=df['Reaction'].to_string() # convert the reaction column to a string object

result = re.split(r'(\+ →)',convert) # attempt to split on the two delimiters

# alternatively I have tried replacing the right arrow with its unicode equivalent like this

# result = re.split(r'(\+\u2192)',convert)

Every time I run this code, I get exactly the same string back, unchanged.

I also tried keeping the column as a list object instead of a string object and then splitting it. When I do that, I get TypeError: expected string or bytes-like object

Answer 1

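Since you are working with a DataFrame, pandas provides the Series.str.split method for this. It can split on several characters at once. In this case some rows also have whitespace around the delimiters, so the pattern takes that into account as well:

df['Reaction_new'] = df['Reaction'].str.split(r'\s?[+→]\s?')

Or, as ctwheels mentioned in the comments, simply:

df['Reaction_new'] = df['Reaction'].str.split(r'\W+')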
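To make the row-wise behaviour concrete, here is a minimal sketch that applies the first pattern to the two reactions from the question. The DataFrame is built inline as a hypothetical stand-in for foo.csv, whose real contents are not shown in the question.

```python
import pandas as pd

# Hypothetical stand-in for foo.csv: the two example reactions from the question.
df = pd.DataFrame({"Reaction": ["N2 + CH4 → HCN + NH3", "H2+F2→2HF"]})

# Split every row on "+" or "→", with optional whitespace on either side.
# A multi-character pattern is interpreted as a regular expression by default.
df["Reaction_new"] = df["Reaction"].str.split(r"\s?[+→]\s?")

# Each row now holds its own list of components.
components = df["Reaction_new"].tolist()
print(components)
```

Note that the coefficient stays attached to its species, so the second row yields '2HF' rather than 'HF'; removing leading coefficients would need an extra step that neither pattern performs.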

Answer 2

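You are splitting on the literal string "+ →" (a plus sign, a space, then an arrow), but that exact sequence never appears in your data.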

You can use [] to match any one of several characters:

result = re.split(r'\s*[+→]\s*',convert)

Also, you shouldn't put a capture group around the delimiter regex, because that causes the delimiters to be included in the result.
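The difference between the three patterns can be seen side by side in a small sketch. It uses only the standard re module and one sample reaction from the question; the variable names are illustrative.

```python
import re

reaction = "N2 + CH4 → HCN + NH3"

# The question's pattern matches only the literal sequence "+", space, "→",
# which does not occur here, so the string comes back as a single element.
unchanged = re.split(r"(\+ →)", reaction)

# A character class with no capture group: the delimiters are discarded.
parts = re.split(r"\s*[+→]\s*", reaction)

# The same class inside a capture group: the delimiters are kept in the result.
with_delims = re.split(r"\s*([+→])\s*", reaction)

print(unchanged)    # ['N2 + CH4 → HCN + NH3']
print(parts)        # ['N2', 'CH4', 'HCN', 'NH3']
print(with_delims)  # ['N2', '+', 'CH4', '→', 'HCN', '+', 'NH3']
```

The capture-group form is only useful if you need to know which side of the arrow each species is on; for a flat list of components, leave the group out.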

Splitting a string on multiple delimiters
