Pandas randomly inserting non-existent delimiters

Time: 2015-09-13 18:06:31

Tags: python pandas delimiter

I'm really scratching my head over this one; it makes no sense to me. I'm using pandas as a simple way to read a TSV file. This is the minimal code:

import pandas as pd
source = pd.read_csv("neimanmarcus.csv", sep="\t")
images = source["image_link"]

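Every line in this file has exactly 53 tab characters. For some reason, pandas believes that roughly 2% of the lines have exactly 72 tab characters. This leads to the following error: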

  

pandas.parser.CParserError: Error tokenizing data. C error: Expected 54 fields in line x, saw 73

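That said, after manual inspection I can find nothing different about the affected lines. Skipping lines would be very problematic in this case, so I've been trying to fix this, but I'm at my wits' end. Sorry if this turns out to be something silly, but here are examples of a "correct" and an "incorrect" line.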

Correct:

sku157001669    Tango Dancer-Print A-Line Dress, Size: 4, TANGO - Carolina Herrera  Carolina Herrera Tango Dancer-Print A-Line Dress Details Carolina Herrera tango dancer-print woven dress. Approx. measurements: 35.5"L center back to hem, 35.5"L center front to hem. V'd jewel neckline. Cap sleeves. Self-tie belt at natural waist; ties at left. Inverted center pleat at A-line skirt. Straight hem. Fit and flare silhouette. Hidden back zip. Cotton/spandex; dry clean. Made in Italy. Model's measurements: Height 5'10"/177cm, bust 34"/86cm, waist 26"/66cm, hips 35.5"/90cm, dress size US 2. Designer About Carolina Herrera: The empress of classically refined looks for both day and evening, Carolina Herrera launched her eponymous line in 1980 after encouragement from her friend, legendary Vogue editor Diana Vreeland. Over the years she has collected a number of fashion's highest accolades as well as a star-studded client list. With both a global focus and adoration for the sum of all things beautiful, Carolina Herrera has been hailed as "Fashion's First Lady." Size: 4. Color: TANGO. Age Group: Adult. Material: 97% COTTON, 3% ELASTANE. Apparel & Accessories > Clothing > Dresses  Women's Apparel > Mid-Length > Daytime Dresses > Mid    1390.00 USD 1390.00 USD     http://www.neimanmarcus.com/en-us/Carolina-Herrera-Tango-Dancer-Print-A-Line-Dress/prod177890243/p.prod     http://images.neimanmarcus.com/product_assets/B/2/W/Y/K/NMB2WYK_mz.jpg  http://images.neimanmarcus.com/product_assets/B/2/W/Y/K/NMB2WYK_az.jpg  Carolina Herrera    07667702164817  prod177890243       new in stock        prod177890243   TANGO   97% COTTON, 3% ELASTANE     4           female  Adult       US::Ground:0.00 USD                                                                                             

Incorrect:

sku158601482    Sleeveless Faux-Wrap Jersey Dress, Women's, Size: 2X, BLACK - Eileen Fisher Eileen Fisher Sleeveless Faux-Wrap Jersey Dress, Women's Details Eileen Fisher jersey dress in your choice of color. Round neckline; sleeveless. Faux-wrap style. Shift silhouette. Viscose/spandex; machine wash. Made in USA of imported materials. Model's measurements: Height 5'10.5"/179cm, bust 32"/81cm, waist 24"/61cm, hips 35.5"/90cm, dress size US 2/4. Necklace not included. Designer Please note: Apparel may be available in more sizes: Shop Eileen Fisher Petite Shop Eileen Fisher Women's About Eileen Fisher: Former interior and graphic designer Eileen Fisher launched her self-named collection in 1984. The acclaimed designer made her mark with clean lines, simple shapes, and a timeless, functional style. Size: 2X. Color: BLACK. Age Group: Adult. Material: " 92% Viscose/8% Spandex F4VF-D3502 / D2502X: Body: 92% Viscose, 8% Spandex Hem: 80% Recycled Polyester, 20% Lycra? F4VF-S1496: Body: 92% Viscose, 8% Spandex Hem Panel: 80% Recycled Polyester, 20% Lycra?. Apparel & Accessories > Clothing > Dresses  Women's Apparel > Women's > Special Sizes > Mid 198.00 USD  198.00 USD      http://www.neimanmarcus.com/en-us/Eileen-Fisher-Sleeveless-Faux-Wrap-Jersey-Dress-Women-s/prod179830418/p.prod      http://images.neimanmarcus.com/product_assets/T/A/6/X/8/NMTA6X8_mz.jpg  http://images.neimanmarcus.com/product_assets/T/A/6/X/8/NMTA6X8_az.jpg  Eileen Fisher   00713259663697  prod179830418       new in stock        prod179830418   BLACK   " 92% Viscose/8% Spandex F4VF-D3502 / D2502X: Body: 92% Viscose, 8% Spandex Hem: 80% Recycled Polyester, 20% Lycra? F4VF-S1496: Body: 92% Viscose, 8% Spandex Hem Panel: 80% Recycled Polyester, 20 Graphic 2X          female  Adult       US::Ground:0.00 USD                                 

Simply calling line.split('\t') works as expected on these lines, so pandas seems to be breaking for some reason.

1 answer:

Answer 0: (score: 2)

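Your data contains unmatched quote characters (it appears to use " to denote inches, as in Height 5'10.5"). This makes the parser think a quoted field has started, and because the quotes are not paired, the tabs that follow are read as part of that field and the row is split incorrectly.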
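One rough way to check this on your own file is to look for lines with an odd number of " characters. The sketch below is only a heuristic, and the helper name find_suspect_lines and the sample rows are made up for illustration. A line with an even number of quotes can still be misparsed, so treat the output as a hint rather than proof.

```python
def find_suspect_lines(lines, quotechar='"'):
    """Return (line_number, tab_count, quote_count) for each line
    that contains an odd number of quote characters."""
    suspects = []
    for number, line in enumerate(lines, start=1):
        quotes = line.count(quotechar)
        if quotes % 2 == 1:
            suspects.append((number, line.count("\t"), quotes))
    return suspects


# Hypothetical rows modeled on the question's data (three fields each).
sample = [
    'sku1\tHeight 5\'10"/177cm, bust 34"/86cm\t4',  # two quotes: paired by chance
    'sku2\tMaterial: " 92% Viscose\t2X',            # one quote: unpaired
]

print(find_suspect_lines(sample))
```

For a real file you could pass an open file handle instead of the sample list, since the helper only iterates over lines.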

Try passing quoting=csv.QUOTE_NONE as an additional argument to read_csv. (You will need to import csv first, or you can just pass quoting=3.)
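A minimal sketch of that call, using a small in-memory TSV rather than the real neimanmarcus.csv. The column names and rows here are invented for illustration. With quoting=csv.QUOTE_NONE the " character is treated as ordinary data, so every row keeps its three fields.

```python
import csv
import io

import pandas as pd

# Hypothetical three-column TSV; the second data row has a field that
# starts with an unpaired double quote, as in the question's bad line.
raw = (
    "sku\tmaterial\tsize\n"
    "a1\t97% COTTON, 3% ELASTANE\t4\n"
    'a2\t" 92% Viscose/8% Spandex\t2X\n'
    "a3\t100% SILK\t6\n"
)

# QUOTE_NONE tells the parser not to give quote characters any special meaning.
df = pd.read_csv(io.StringIO(raw), sep="\t", quoting=csv.QUOTE_NONE)

print(df.shape)
print(df.loc[1, "material"])
```

Note that this also disables legitimate quoting, so it is only appropriate when the file never relies on quotes to protect tabs or newlines inside a field.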