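I have a file that I want to read into a Pandas DataFrame, and one of its columns contains a complex string. The string holds HTML output, similar to the following: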
"<!DOCTYPE html PUBLIC \\"-//W3C//DTD HTML 4.0 Transitional//EN\\">\n', '<html>\n', '<head>\n', '<meta http-equiv=\\"Content-Type\\" content=\\"text/html; charset=UTF-8\\">\n', '<meta charset=\\"utf-8\\">\n', '<title>An Amazon.com Gift Card you sent has been redeemed</title>\n', '</head>\n', '<body>\n',
So far, I have tried the following:
df = pd.read_csv("<filename>",nrows = 50)
which returns the following .head():
I tried using escapechar=, but I must not be getting the syntax right.
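To be clear, this HTML string is part of a larger CSV file, and the string above is only a single cell in a given row. See the sample CSV row below. The CSV has 24 columns: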
"241279","EMAIL_ADDRESS","EMAIL_ADDRESS","1607be7d4f2d66af","<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"URL\">
<html>
<head>
<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">
<meta charset=\"utf-8\">
<title>An Amazon.com Gift Card you sent has been redeemed</title>
</head>
<body>
<img width=\"1\" height=\"1\" src=\"URL\">
Greetings from Amazon.com,<br><br>
We wanted to let you know you that an Amazon.com Gift Card you sent has been redeemed.<br><br>
The gift card was emailed by Amazon to EMAIL_ADDRESS on DATE.<br><br>
Details:<br><br>
Order # NUMBER<br>
Sent to: EMAIL_ADDRESS<br>
Date sent: DATE<br>
Message: Here is a \"thank you\" for ... <br><br>
Please note: This email was sent from a notification-only address that cannot accept incoming email.
Please do not reply to this message.<br><br>
<img width=\"1\" height=\"1\" src=\"URL\">
</body>
</html>
","DATE 01:47:58","gmail","email",,,"An Amazon.com Gift Card you sent has been redeemed","DATE","DATE","f","23",,"EMAIL_ADDRESS","EMAIL_ADDRESS",,"f","EMAIL_ADDRESS","EMAIL_ADDRESS","9","f"
Answer 0 (score: 0)
Since the default quotechar for pd.read_csv is ", you should use quotechar="'".
Answer 1 (score: 0)
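The escape character in your data is \, which is not the default (pd.read_csv uses no escape character unless you set one). With the following: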
df = pd.read_csv("<filename>", header=None, escapechar='\\')
I get:
>>> df
0 1 2 3 \
0 \n"241279" EMAIL_ADDRESS EMAIL_ADDRESS 1607be7d4f2d66af
4 5 6 \
0 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Tr... DATE 01:47:58 gmail
7 8 9 ... 14 15 16 17 18 19 \
0 email NaN NaN ... 23 NaN EMAIL_ADDRESS EMAIL_ADDRESS NaN f
20 21 22 23
0 EMAIL_ADDRESS EMAIL_ADDRESS 9 f
[1 rows x 24 columns]
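To try the escapechar approach without the original file, here is a minimal sketch that parses a small in-memory sample. The two rows, the example.com addresses, and the variable names are made up for illustration and only mimic the shape of the data in the question (quoted fields, backslash-escaped quotes, and line breaks inside a cell).

```python
import io

import pandas as pd

# Hypothetical sample: every field is quoted, embedded quotes are escaped
# as \" and the HTML cell in the first row spans several lines.
raw = (
    '"1","a@example.com","<html>\n<meta charset=\\"utf-8\\">\n</html>\n","f"\n'
    '"2","b@example.com","<p>Here is a \\"thank you\\"</p>","t"\n'
)

# escapechar="\\" is a single backslash; it tells the parser that \" inside
# a quoted field is a literal quote rather than the end of the field.
df = pd.read_csv(io.StringIO(raw), header=None, escapechar="\\")

print(df.shape)
print(df.iloc[0, 2])
```

For the real file, the same call should work with the path in place of io.StringIO(raw). Note that header=None assumes the file has no header row, as in the answer above.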