Question

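I am scraping the web and would like to end up with a Pandas DataFrame built from the scraped content. It should be possible to read a UTF-8 string into a Pandas DataFrame, but I am not sure how to do it, and I would like to avoid writing the string out to a CSV file and reading it back. How can I do this?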

For example:

string='term_ID,description,frequency,plot_X,plot_Y,plot_size,uniqueness,dispensability,representative,eliminated\r\nGO:0006468,"protein phosphorylation",4.137%, 4.696, 0.927,5.725,0.430,0.000,6468,0\r\nGO:0050821,"protein stabilization, positive",0.045%,-4.700, 0.494,3.763,0.413,0.000,50821,0\r\n'

I am splitting the string with:

fcsv_content=[x.split(',') for x in string.split("\r\n")]

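However, this does not work because some fields contain commas inside them. What can I do? Can I change the decoding so that this is fixed? For context, I am using robobrowser to decode the web page.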
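To make the failure concrete, here is a minimal sketch that counts the fields produced by the naive split. The name `field_counts` is introduced only for this illustration.

```python
string = 'term_ID,description,frequency,plot_X,plot_Y,plot_size,uniqueness,dispensability,representative,eliminated\r\nGO:0006468,"protein phosphorylation",4.137%, 4.696, 0.927,5.725,0.430,0.000,6468,0\r\nGO:0050821,"protein stabilization, positive",0.045%,-4.700, 0.494,3.763,0.413,0.000,50821,0\r\n'

# Naive split: every comma is treated as a delimiter, even inside quotes
rows = [x.split(',') for x in string.split("\r\n")]

# field_counts is a hypothetical helper name for this illustration
field_counts = [len(r) for r in rows]
print(field_counts)  # [10, 10, 11, 1]
```

The third row has 11 fields because the comma inside "protein stabilization, positive" is split, and the trailing `\r\n` leaves an extra empty row at the end.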

Answer 1

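You can use Python's csv module to read and write your CSV data. It handles cases such as commas inside quoted strings and knows not to split on them. Below is a small example using your input string. As the output shows, the field "protein stabilization, positive" is not split into separate columns, because it is a quoted string.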

import csv

string = 'term_ID,description,frequency,plot_X,plot_Y,plot_size,uniqueness,dispensability,representative,eliminated\r\nGO:0006468,"protein phosphorylation",4.137%, 4.696, 0.927,5.725,0.430,0.000,6468,0\r\nGO:0050821,"protein stabilization, positive",0.045%,-4.700, 0.494,3.763,0.413,0.000,50821,0\r\n'
csv_reader = csv.reader(string.splitlines())
for record in csv_reader:
    print(f'number of fields: {len(record)}, Record: {record}')

Output

number of fields: 10, Record: ['term_ID', 'description', 'frequency', 'plot_X', 'plot_Y', 'plot_size', 'uniqueness', 'dispensability', 'representative', 'eliminated']
number of fields: 10, Record: ['GO:0006468', 'protein phosphorylation', '4.137%', ' 4.696', ' 0.927', '5.725', '0.430', '0.000', '6468', '0']
number of fields: 10, Record: ['GO:0050821', 'protein stabilization, positive', '0.045%', '-4.700', ' 0.494', '3.763', '0.413', '0.000', '50821', '0']
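Since the question asks for a DataFrame, one possible route is to skip the intermediate list entirely and hand the string to pandas through an in-memory buffer. This is a sketch that assumes pandas is installed. The `skipinitialspace=True` option and the `frequency_pct` column are choices made for this illustration, not part of the original question.

```python
import io

import pandas as pd

string = 'term_ID,description,frequency,plot_X,plot_Y,plot_size,uniqueness,dispensability,representative,eliminated\r\nGO:0006468,"protein phosphorylation",4.137%, 4.696, 0.927,5.725,0.430,0.000,6468,0\r\nGO:0050821,"protein stabilization, positive",0.045%,-4.700, 0.494,3.763,0.413,0.000,50821,0\r\n'

# Wrap the string in a file-like object so read_csv can parse it
# without touching the disk. skipinitialspace=True drops the blanks
# that follow some delimiters (for example ' 4.696').
df = pd.read_csv(io.StringIO(string), skipinitialspace=True)

# Optional: turn '4.137%' into the number 4.137
# (frequency_pct is a hypothetical column name)
df['frequency_pct'] = df['frequency'].str.rstrip('%').astype(float)

print(df[['term_ID', 'description', 'frequency_pct']])
```

The quoted description containing a comma stays in a single column, just as with the csv module, and no temporary CSV file is needed.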
