我正在尝试将制表符分隔的文本文件读入数据帧。
这是文件在Excel中的显示方式:
CALENDAR_DATE ORDER_NUMBER INVOICE_NUMBER TRANSACTION_TYPE CUSTOMER_NUMBER CUSTOMER_NAME
5/13/2016 0:00 13867666 6892372 S 2026 CUSTOMER 1
导入df:
df = p.read_table("E:/FileLoc/ThisIsAFile.txt", encoding = "iso-8859-1")
现在它没有看到前三列作为列索引的一部分(df [0] =交易类型),并且所有标题都转移以反映这一点。
CALENDAR_DATE ORDER_NUMBER INVOICE_NUMBER
5/13/2016 0:00 13867666 6892372 S 2026 CUSTOMER 1
我正在尝试操作文本文件,然后将其作为最终结果导入mysql数据库。
答案 0 :(得分:5)
您可以将read_csv
与分隔符2和更多空格一起使用:
import pandas as pd
import io
temp=u"""CALENDAR_DATE ORDER_NUMBER INVOICE_NUMBER TRANSACTION_TYPE CUSTOMER_NUMBER CUSTOMER_NAME
5/13/2016 0:00 13867666 6892372 S 2026 CUSTOMER 1"""
#after testing replace io.StringIO(temp) to filename
df =pd.read_csv(io.StringIO(temp), sep=r'\s{2,}', engine='python', encoding = "iso-8859-1")
print (df)
CALENDAR_DATE ORDER_NUMBER INVOICE_NUMBER TRANSACTION_TYPE \
0 5/13/2016 0:00 13867666 6892372 S
CUSTOMER_NUMBER CUSTOMER_NAME
0 2026 CUSTOMER 1
如果分隔符为tabulator
,请使用sep='\t'
。
编辑:
我用你的数据测试它并且它可以工作:
import pandas as pd
df = pd.read_csv('test/AnonymizedData.txt', sep='\t')
print (df)
CUSTOMER_NUMBER CUSTOMER_NAME CUSTOMER_BRANCH_CODE CUSTOMER_BRANCH_NAME \
0 2026 CUSTOMER 1 83 SALES BRANCH 1
1 2359 CUSTOMER 2 76 SALES BRANCH 2
2 100662 CUSTOMER 3 28 SALES BRANCH 3
3 3245 CUSTOMER 4 84 SALES BRANCH 4
4 3179 CUSTOMER 5 28 SALES BRANCH 5
5 39881 CUSTOMER 6 67 SALES BRANCH 6
6 37020 CUSTOMER 7 58 SALES BRANCH 7
7 1239 CUSTOMER 8 50 SALES BRANCH 8
8 2379 CUSTOMER 9 76 SALES BRANCH 9
CUSTOMER_CITY CUSTOMER_STATE ... PRICING_PRODUCT_TYPE_CODE \
0 TOWN 1 CO ... 11
1 TOWN 2 OH ... 11
2 TOWN 3 ME ... 11
3 TOWN 4 IL ... 11
4 TOWN 5 NH ... 11
5 TOWN 6 TX ... 11
6 TOWN 7 NC ... 11
7 TOWN 8 NY ... 11
8 TOWN 9 OH ... 11
PRICING_PRODUCT_TYPE ORGANIZATION_ID ORGANIZATION_NAME PRODUCT_LINE_CODE \
0 DISPOSABLES 83 ORGANIZATIONNAME 891
1 DISPOSABLES 83 ORGANIZATIONNAME 891
2 DISPOSABLES 83 ORGANIZATIONNAME 891
3 DISPOSABLES 83 ORGANIZATIONNAME 891
4 DISPOSABLES 83 ORGANIZATIONNAME 891
5 DISPOSABLES 83 ORGANIZATIONNAME 891
6 DISPOSABLES 83 ORGANIZATIONNAME 891
7 DISPOSABLES 83 ORGANIZATIONNAME 891
8 DISPOSABLES 83 ORGANIZATIONNAME 891
PRODUCT_LINE ROBOTIC_FLAG Unnamed: 52 Unnamed: 53 Unnamed: 54
0 PRODUCTNAME N N NaN 3
1 PRODUCTNAME N N NaN 3
2 PRODUCTNAME N N NaN 2
3 PRODUCTNAME N N NaN 7
4 PRODUCTNAME N N NaN 1
5 PRODUCTNAME N N NaN 4
6 PRODUCTNAME N N NaN 3
7 PRODUCTNAME N N NaN 5
8 PRODUCTNAME N N NaN 3
[9 rows x 55 columns]