How can I use regular expressions to reorder data records and put them into a separate DataFrame?

Time: 2019-05-15 11:13:19

Tags: python regex pandas dataframe

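I would like to know how to use regular expressions on the DataFrame below to put the values of each row in the correct order. As you can see at indexes 2 and 4, for example, Quantity and Piece appear in the wrong order. Does anyone have an idea how to solve this?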

import pandas as pd

data = [['Total 8\r\r\nQuantity 2\r\r\nPiece 4'], ['Total 8\r\r\nQuantity 2\r\r\nPiece 4'],['Total 8\r\r\nPiece 2\r\r\nQuantity 4'], ['Total 8\r\r\nQuantity 2\r\r\nPiece 4'], ['Total 8\r\r\nPiece 2\r\r\nQuantity 4'],['Total 8\r\r\nQuantity 2\r\r\nPiece 4'], ['Total 8\r\r\nQuantity 2\r\r\nPiece 4'],['Total 8\r\r\nPiece 2\r\r\nQuantity 4'], ['Total 8\r\r\nQuantity 2\r\r\nPiece 4'], ['Total 8\r\r\nPiece 2\r\r\nQuantity 4']]
df = pd.DataFrame(data, columns=['Information'])
df

+-------+--------------------------------------+
| index |             Information              |
+-------+--------------------------------------+
|     0 | Total 8\r\r\nQuantity 2\r\r\nPiece 4 |
|     1 | Total 8\r\r\nQuantity 2\r\r\nPiece 4 |
|     2 | Total 8\r\r\nPiece 2\r\r\nQuantity 4 |
|     3 | Total 8\r\r\nQuantity 2\r\r\nPiece 4 |
|     4 | Total 8\r\r\nPiece 2\r\r\nQuantity 4 |
|     5 | Total 8\r\r\nQuantity 2\r\r\nPiece 4 |
|     6 | Total 8\r\r\nQuantity 2\r\r\nPiece 4 |
|     7 | Total 8\r\r\nPiece 2\r\r\nQuantity 4 |
|     8 | Total 8\r\r\nQuantity 2\r\r\nPiece 4 |
|     9 | Total 8\r\r\nPiece 2\r\r\nQuantity 4 |
+-------+--------------------------------------+


import re

dt = pd.DataFrame(df)
data = []
for item in dt['Information']:
    # Grabs the three numbers purely by position, ignoring their labels
    regex = re.findall(r"(\d+)\D+(\d+)\D+(\d+)", item)
    # Detect which label comes right after Total (results are not used yet)
    quantity = re.findall(r"\bTotal\s?\d\D+(\bQuantity)", item)
    piece = re.findall(r"\bTotal\s?\d\D+(\bPiece)", item)
    regex = map(list, regex)
    data.append(list(map(int, list(regex)[0])))
dftotal = pd.DataFrame(data, columns=['Total', 'Quantity', 'Piece'])
print(dftotal)

With this code I get the following result:

+-------+----------+-------+
| Total | Quantity | Piece |
+-------+----------+-------+
|     8 |        2 |     4 |
|     8 |        2 |     4 |
|     8 |        2 |     4 |
|     8 |        2 |     4 |
|     8 |        2 |     4 |
|     8 |        2 |     4 |
|     8 |        2 |     4 |
|     8 |        2 |     4 |
|     8 |        2 |     4 |
+-------+----------+-------+ 

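How can I get a DataFrame like the one below, with the misordered values from the data array swapped so that every value ends up under the correct column of a single DataFrame?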

+-------+----------+-------+   
| Total | Quantity | Piece |
+-------+----------+-------+
|     8 |        2 |     4 |
|     8 |        4 |     2 |
|     8 |        2 |     4 |
|     8 |        4 |     2 |
|     8 |        2 |     4 |
|     8 |        2 |     4 |
|     8 |        4 |     2 |
|     8 |        2 |     4 |
|     8 |        4 |     2 |
+-------+----------+-------+

2 answers:

Answer 0 (score: 2)

Here is one approach using str.extract.

For example:

import pandas as pd

data = [['Total 8\r\r\nQuantity 2\r\r\nPiece 4'], ['Total 8\r\r\nQuantity 2\r\r\nPiece 4'],['Total 8\r\r\nPiece 2\r\r\nQuantity 4'], ['Total 8\r\r\nQuantity 2\r\r\nPiece 4'], ['Total 8\r\r\nPiece 2\r\r\nQuantity 4'],['Total 8\r\r\nQuantity 2\r\r\nPiece 4'], ['Total 8\r\r\nQuantity 2\r\r\nPiece 4'],['Total 8\r\r\nPiece 2\r\r\nQuantity 4'], ['Total 8\r\r\nQuantity 2\r\r\nPiece 4'], ['Total 8\r\r\nPiece 2\r\r\nQuantity 4']] 
df = pd.DataFrame(data, columns = ['Information'])

df["Total"] = df["Information"].str.extract(r"Total (\d+)")
df["Quantity"] = df["Information"].str.extract(r"Quantity (\d+)")
df["Piece"] = df["Information"].str.extract(r"Piece (\d+)")
df.drop("Information", inplace=True, axis=1)
print(df)

Output:

  Total Quantity Piece
0     8        2     4
1     8        2     4
2     8        4     2
3     8        2     4
4     8        4     2
5     8        2     4
6     8        2     4
7     8        4     2
8     8        2     4
9     8        4     2
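
As a possible refinement of the approach above, the sketch below builds all three columns in one pass and converts them to integers, since str.extract returns strings. It uses a two-row sample instead of the full data, and the names rows, columns and result are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Two sample records: one in the usual order, one with Piece and Quantity swapped
rows = [
    "Total 8\r\r\nQuantity 2\r\r\nPiece 4",
    "Total 8\r\r\nPiece 2\r\r\nQuantity 4",
]
df = pd.DataFrame({"Information": rows})

# Each pattern searches for its own label, so the order of the
# lines inside the string does not matter.
columns = ["Total", "Quantity", "Piece"]
result = pd.DataFrame({
    name: df["Information"]
    .str.extract(rf"{name}\s+(\d+)", expand=False)  # Series of digit strings
    .astype(int)                                     # convert to integers
    for name in columns
})

print(result)  # row 0 holds 8, 2, 4 and row 1 holds 8, 4, 2
```

Note that astype(int) would fail if a label were missing from some record, because the extracted value would then be NaN; in that case pd.to_numeric may be a better fit.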

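Answer 1 (score: 1)

In fact, the raw data is close to a CSV file whose separator is a space. Once the data has been loaded that way, pivoting it is enough to get the desired result.

So I would do it like this: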

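import io
import numpy as np
import pandas as pd

df = pd.read_csv(io.StringIO('\r\r\n'.join((line[0] for line in data))),
                 sep=' ', header=None)

# Every three consecutive lines belong to one record
df['n'] = (df.index / 3).astype(np.int32)

# Pivot: labels (column 0) become columns, values (column 1) fill them
result = df.pivot(index='n', columns=0, values=1)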

The result is the following DataFrame:

0  Piece  Quantity  Total
n                        
0      4         2      8
1      4         2      8
2      2         4      8
3      4         2      8
4      2         4      8
5      4         2      8
6      4         2      8
7      2         4      8
8      4         2      8
9      2         4      8
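
The pivoted result above has its columns in alphabetical order and keeps the helper names n and 0 as axis labels. The sketch below shows one way to get the Total, Quantity, Piece order the question asks for. It is a variation on the answer, not the answer itself: it builds the long table by splitting the strings directly instead of calling read_csv, works on a two-row sample, and the names rows, long_df and wide are made up for illustration.

```python
import pandas as pd

# Two sample records: one in the usual order, one with Piece and Quantity swapped
rows = [
    "Total 8\r\r\nQuantity 2\r\r\nPiece 4",
    "Total 8\r\r\nPiece 2\r\r\nQuantity 4",
]

# Long format: one (record number, label, value) row per non-empty line
long_df = pd.DataFrame(
    [
        (n, *line.split())
        for n, text in enumerate(rows)
        for line in text.splitlines()
        if line.strip()  # '\r\r\n' produces empty lines; skip them
    ],
    columns=["n", "key", "value"],
)
long_df["value"] = long_df["value"].astype(int)

wide = (
    long_df.pivot(index="n", columns="key", values="value")
    .loc[:, ["Total", "Quantity", "Piece"]]    # restore the wanted column order
    .rename_axis(index=None, columns=None)     # drop the helper axis names
)

print(wide)  # row 0 holds 8, 2, 4 and row 1 holds 8, 4, 2
```

Selecting the columns by name after the pivot is what fixes the order; the pivot itself sorts the labels alphabetically.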