TypeError: 'int' object is not subscriptable when reading from HDFS

Asked: 2021-01-04 20:32:52

Tags: python hadoop hdfs

I am reading a file from HDFS, but I keep getting this error: TypeError: 'int' object is not subscriptable

The .csv file:

CLAIM_NUM,BEN_ST,AGE,MEDICAL_ONLY_IND,TTL_MED_LOSS,TTL_IND_LOSS,TTL_MED_EXP,TTL_IND_EXP,BP_CD,NI_CD,legalrep,depression,cardiac,diabetes,hypertension,obesity,smoker,subabuse,arthritis,asthma,CPT_codes,D,P,NDC_codes
123456789,IL,99,1,2201.26,0,97.16,0,31,4,1,0,0,0,0,0,0,0,0,0,NA,8409~71941,NA,NA
987654321,AL,98,1,568.12,0,20.82,0,42,52,1,0,0,0,0,0,0,0,0,0,NA,7242~8472~E9273,NA,NA

My code:

with hdfs.open("/user/ras.csv") as f: 
    reader = f.read()
    
    for i, row in enumerate(reader, start=1):
        root = ET.Element('cbcalc')
        icdNode = ET.SubElement(root, "icdcodes")
        
        for code in row['D'].split('~'):
            ET.SubElement(icdNode, "code").text = code
        ET.SubElement(root, "clientid").text = row['CLAIM_NUM']
        ET.SubElement(root, "state").text = row['BEN_ST']
        ET.SubElement(root, "country").text = "US"  
        ET.SubElement(root, "age").text = row['AGE']
        ET.SubElement(root, "jobclass").text = "1" 
        ET.SubElement(root, "fulloutput").text ="Y"
        
        cfNode = ET.SubElement(root, "cfactors")
        for k in ['legalrep', 'depression', 'diabetes',
                 'hypertension', 'obesity', 'smoker', 'subabuse']:
            ET.SubElement(cfNode, k.lower()).text = str(row[k])
        
        psNode = ET.SubElement(root, "prosummary")
        
        psicdNode = ET.SubElement(psNode, "icd")
        for code in row['P'].split('~'):
            ET.SubElement(psNode, "code").text = code
            
        psndcNode = ET.SubElement(psNode, "ndc")
        for code in row['NDC_codes'].split('~'):
            ET.SubElement(psNode, "code").text = code 

        cptNode = ET.SubElement(psNode, "cpt")
        for code in row['CPT_codes'].split('~'):
            ET.SubElement(cptNode, "code").text = code

        ET.SubElement(psNode, "hcpcs")
        
        doc = ET.tostring(root, method='xml', encoding="UTF-8")
        
        response = requests.post(target_url, data=doc, headers=login_details)
        response_data = json.loads(response.text)
        if type(response_data)==dict and 'error' in response_data.keys():
            error_results.append(response_data)
        else:
            api_results.append(response_data)

What do I need to change so that I can loop through the csv file and convert the data to xml format for the API calls?

I have tested this code in Python and it seems to work, but once I put the file into HDFS it starts crashing.

1 Answer:

Answer 0 (score: 0)

The problem is (probably; I don't have this library installed) that f.read() is returning a bytes object. When you iterate over that (e.g. with enumerate), you are looking at ints (one per character of the file, depending on context), not any kind of structured "row" object.

Some additional processing is needed before the loop you want to write can start.

Something like this may do what you want:

import pydoop.hdfs as hdfs
from io import TextIOWrapper
from csv import DictReader

# DictReader is not a context manager, so only the file-like objects
# belong in the with statement.
with hdfs.open("/user/ras.csv") as h, \
     TextIOWrapper(h, encoding="utf-8") as w:  # encoding settings unknown; adjust for your data
    dict_reader = DictReader(w)  # the default dialect is probably OK here
    for row in dict_reader:
        ...