Question

I have a dataset in the following format.

row_num;locale;day_of_week;hour_of_day;agent_id;entry_page;path_id_set;traffic_type;session_durantion;hits
"988681;L6;Monday;17;1;2111;""31672;0"";6;7037;\N"
"988680;L2;Thursday;22;10;2113;""31965;0"";2;49;14"
"988679;L4;Saturday;21;2;2100;""0;78464"";1;1892;14"
"988678;L3;Saturday;19;8;2113;51462;6;0;1;\N"

I want it in the following format:

row_num locale day_of_week hour_of_day agent_id entry_page path_id_set traffic_type session_durantion hits
988681 L6 Monday 17 1 2111 31672 0 6 7037 N
988680 L2 Thursday 22 10 2113 31965 0 2 49 14
988679 L4 Saturday 21 2 2100 0 78464 1 1892 14
988678 L3 Saturday 19 8 2113 51462 6 0 1 N

I tried the following code:

import pandas as pd

df = pd.read_csv("C:\Users\Rahhy\Desktop\trivago.csv", delimiter = ";")

But I get an error:

SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

Answer 1

Use replace():

with open("data_test.csv", "r") as fileObj:
    contents = fileObj.read().replace(';',' ').replace('\\', '').replace('"', '')
print(contents)

Output:

row_num locale day_of_week hour_of_day agent_id entry_page path_id_set traffic_type session_durantion hits
988681 L6 Monday 17 1 2111 31672 0 6 7037 N
988680 L2 Thursday 22 10 2113 31965 0 2 49 14
988679 L4 Saturday 21 2 2100 0 78464 1 1892 14
988678 L3 Saturday 19 8 2113 51462 6 0 1 N

Edit:

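You can open the file, read its contents, and replace the unwanted characters. Then write the new contents back to the file and read it with pd.read_csv: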

with open("data_test.csv", "r") as fileObj:
    contents = fileObj.read().replace(';',' ').replace('\\', '').replace('"', '')
# print(contents)

with open("data_test.csv", "w+") as fileObj2:
    fileObj2.write(contents)

import pandas as pd
df = pd.read_csv(r"data_test.csv", index_col=False)
print(df)

Output:

row_num locale day_of_week hour_of_day agent_id entry_page path_id_set traffic_type session_durantion hits
988681 L6 Monday 17 1 2111 31672 0 6 7037 N
988680 L2 Thursday 22 10 2113 31965 0 2 49 14
988679 L4 Saturday 21 2 2100 0 78464 1 1892 14
988678 L3 Saturday 19 8 2113 51462 6 0 1 N
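
One thing to watch with the snippet above: read_csv splits on commas by default, so a file cleaned this way (spaces between values, no commas) is read as a single column unless a separator is passed. Here is a minimal sketch of that, assuming the two sample rows from the question are held in an in-memory string instead of data_test.csv. Recent pandas versions may emit a warning about the extra trailing field.

```python
import io
import pandas as pd

# Two sample rows copied from the question, held in memory for illustration.
raw = (
    'row_num;locale;day_of_week;hour_of_day;agent_id;entry_page;'
    'path_id_set;traffic_type;session_durantion;hits\n'
    '"988681;L6;Monday;17;1;2111;""31672;0"";6;7037;\\N"\n'
    '"988680;L2;Thursday;22;10;2113;""31965;0"";2;49;14"\n'
)

# Same clean-up as in the answer: semicolons become spaces,
# backslashes and double quotes are dropped.
cleaned = raw.replace(';', ' ').replace('\\', '').replace('"', '')

# Default separator (comma): every line ends up in one single column.
one_col = pd.read_csv(io.StringIO(cleaned), index_col=False)

# sep=" " splits on the spaces; index_col=False keeps row_num as a
# regular column, so the 11th value of each row is not used.
df = pd.read_csv(io.StringIO(cleaned), sep=' ', index_col=False)
print(df)
```

With the separator given, each header name gets its own column.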

Answer 2


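Since there are 11 data fields and 10 headers, only the first 10 fields are used. You will have to decide what you want to do with the last one (values: \N, 14).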

Code:

import pandas as pd
from io import StringIO

# Load the file to a string (prefix r (raw) to not use \ for escaping)
filename = r'c:\temp\x.csv'
with open(filename, 'r') as file:
    raw_file_content = file.read()

# Remove the quotes which break the CSV file
file_content_without_quotes = raw_file_content.replace('"','')

# Simulate a file with the corrected CSV content
simulated_file = StringIO(file_content_without_quotes)

# Get the CSV as a table with pandas
# Since the first field in each data row shall not be used for indexing we need to set index_col=False
csv_data = pd.read_csv(simulated_file, delimiter = ';', index_col=False)
print(csv_data['hits']) # print some column
csv_data
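
If you would rather keep the 11th field than drop it, one option is to pass your own column names. This is only a sketch under assumptions: the sample rows are taken from the question and held in a string, the name "extra" for the last column is made up for illustration, and treating \N as a missing value is my guess at what it means.

```python
import io
import pandas as pd

# Two sample rows copied from the question, held in memory for illustration.
sample = (
    'row_num;locale;day_of_week;hour_of_day;agent_id;entry_page;'
    'path_id_set;traffic_type;session_durantion;hits\n'
    '"988681;L6;Monday;17;1;2111;""31672;0"";6;7037;\\N"\n'
    '"988680;L2;Thursday;22;10;2113;""31965;0"";2;49;14"\n'
)

# Remove the quotes, as in the answer above.
content = sample.replace('"', '')

# Reuse the 10 header names and append a hypothetical 11th name.
header = content.splitlines()[0].split(';')
names = header + ['extra']  # 'extra' is an assumed name, not from the data

# skiprows=1 skips the original header line; na_values maps \N to NaN.
df = pd.read_csv(
    io.StringIO(content),
    sep=';',
    names=names,
    skiprows=1,
    na_values=['\\N'],
)
print(df[['hits', 'extra']])
```

This way no field is dropped silently, and you can decide later whether the extra column is worth keeping.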

See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
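
About the SyntaxError in the question itself: in a normal string literal, \U starts a Unicode escape, so the \Users part of the path cannot be decoded. A small sketch of three common ways to write the same Windows path follows. The path is the one from the question, and the read_csv call is left commented out because the file only exists on the asker's machine.

```python
from pathlib import PureWindowsPath

# "C:\Users\..." fails in a normal string literal because \U begins
# a Unicode escape sequence. Any of these spellings avoids that:
raw_path = r"C:\Users\Rahhy\Desktop\trivago.csv"         # raw string
escaped_path = "C:\\Users\\Rahhy\\Desktop\\trivago.csv"  # doubled backslashes
slash_path = "C:/Users/Rahhy/Desktop/trivago.csv"        # forward slashes

# The raw and escaped spellings are the exact same string.
same_string = raw_path == escaped_path

# The forward-slash spelling refers to the same Windows path.
same_path = PureWindowsPath(slash_path) == PureWindowsPath(raw_path)

print(same_string, same_path)

# import pandas as pd
# df = pd.read_csv(raw_path, delimiter=";")
```

The raw string is what the comment in the code above refers to with the r prefix.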

How to open a CSV in Python?
