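I am trying to read this dataset from Kaggle: Amazon sales rank data for print and kindle books
The file is amazon_com_extras.csv
It has a column named "TITLE" that sometimes contains a comma (","), so every field in this .csv is enclosed in quotes: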
"ASIN","GROUP","FORMAT","TITLE","AUTHOR","PUBLISHER"
"022640014X","book","hardcover","The Diversity Bargain: And Other Dilemmas of Race, Admissions, and Meritocracy at Elite Universities","Natasha K. Warikoo","University Of Chicago Press"
I have read other questions related to this problem, but none of them solved it. For example, I tried:
df = pd.read_csv("amazon_com_extras.csv",engine="python",sep=',')
df = pd.read_csv("amazon_com_extras.csv",engine="python",sep=',',quotechar='"')
But nothing seems to work. I am using Python 3.7.2 and pandas 0.24.1.
Answer 0 (score: 2)
The problem is that pandas treats the character " as a quote character and expects every " inside a cell to be followed by another ", which does not happen in this csv.
To make pandas stop treating " as a quote character, pass the appropriate parameter to the read_csv function.
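The answer above does not clearly name the parameter. One plausible candidate, and this is an assumption rather than something the answer confirms, is `quoting=csv.QUOTE_NONE`, a constant from the standard `csv` module that `read_csv` also accepts. The sketch below uses the standard-library reader on in-memory samples to show what disabling quote handling does: the quotes stay in the values and have to be stripped afterwards, and a comma inside a title now splits the field.

```python
import csv
import io

# Sample row taken from the thread; the header is trimmed to two columns for brevity.
sample = '"ASIN","TITLE"\n"1405246510",""Hannah Montana" Annual 2010"\n'

# QUOTE_NONE makes the reader treat '"' as an ordinary character.
raw_rows = list(csv.reader(io.StringIO(sample), quoting=csv.QUOTE_NONE))


def unwrap(field):
    """Remove one wrapping double quote from each side of a field, if present."""
    if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
        return field[1:-1]
    return field


rows = [[unwrap(field) for field in row] for row in raw_rows]

# Caveat: with quoting disabled, a comma inside a title is a field separator.
split_row = next(csv.reader(['"022640014X","Race, Admissions"'], quoting=csv.QUOTE_NONE))
```

Because of that caveat, this approach alone is unlikely to be enough for a file whose titles contain commas, as the one in the question does.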
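Answer 1 (score: 0)
This happens because some fields in the file contain unescaped quotes inside quoted text.
I don't know of a way to tell the csv parser to handle this without preprocessing.
If you don't care about those rows, you can use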
pd.read_csv("amazon_com_extras.csv", engine="python", sep=',', quotechar='"', error_bad_lines=False)
This suppresses the exception, but it drops the affected rows (they are reported on the console).
An example of such a row:
"1405246510","book","hardcover",""Hannah Montana" Annual 2010","Unknown","Egmont Books Ltd"
Note the quotes.
A more standard csv dialect would instead render it as:
1405246510,"book","hardcover","""Hannah Montana"" Annual 2010","Unknown","Egmont Books Ltd"
For example, you can load the file with LibreOffice and save it again as CSV to get a valid CSV dialect, or use some other preprocessing technique.
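One possible preprocessing technique, sketched here under the assumption that every field in the file is wrapped in double quotes (as in the samples shown), is to split each line on the `","` separator and rewrite it with a standard CSV writer. The helper names `parse_fully_quoted_line` and `rewrite_as_standard_csv` are hypothetical, chosen only for this illustration.

```python
import csv
import io


def parse_fully_quoted_line(line):
    """Split a line in which every field is wrapped in double quotes."""
    line = line.rstrip("\r\n")
    # Drop the outermost quotes, then split on the quote-comma-quote separator.
    if line.startswith('"') and line.endswith('"'):
        line = line[1:-1]
    return line.split('","')


def rewrite_as_standard_csv(src, dst):
    """Copy rows from src to dst, doubling embedded quotes as standard CSV expects."""
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
    for line in src:
        if line.strip():  # skip blank lines
            writer.writerow(parse_fully_quoted_line(line))


raw = (
    '"ASIN","GROUP","FORMAT","TITLE","AUTHOR","PUBLISHER"\n'
    '"1405246510","book","hardcover",""Hannah Montana" Annual 2010","Unknown","Egmont Books Ltd"\n'
)
cleaned = io.StringIO()
rewrite_as_standard_csv(io.StringIO(raw), cleaned)

# The cleaned text now parses with the default dialect.
rows = list(csv.reader(io.StringIO(cleaned.getvalue())))
```

The cleaned output could then be passed to `pd.read_csv` with default settings. This heuristic would mis-split a field that itself contains the sequence `","`, so it is worth checking the column count of each row on real data.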
Answer 2 (score: 0)
Using csv.Sniffer worked for me:
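import csv
import pandas as pd

# Let csv.Sniffer infer the dialect (delimiter, quote character) from a sample of the file.
with open('spotify_dataset.csv') as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read(14734))
# error_bad_lines=False skips rows that still fail to parse (pandas < 2.0; newer versions use on_bad_lines='skip').
df = pd.read_csv('spotify_dataset.csv', engine='python', dialect=dialect, error_bad_lines=False)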