I apologize for bringing this question up again, but it still has not been resolved.
It is not a very complex problem, and I am sure it is fairly simple, but I simply cannot see what is wrong.
My code for parsing through the XML file opens and reads it in exactly the format I want - the print statement in the final for loop proves this.
As an example, it outputs:
See-through support handle D0584129 20090106 US
Hinge D0584130 20090106 US
Deadbolt transmission D0584131 20090106 US
This is exactly how I want the data written to the CSV file. However, when I try to actually write these rows to the CSV itself, it only writes the last entry from the XML file, in this form:
Flashlight package,D0584138,20090106,US
Here is my entire code, since it may help in understanding the whole process; the area of interest is the for xml_string in separated_xml loop:
from bs4 import BeautifulSoup
import csv
import unicodecsv as csv

infile = "C:\\Users\\Grisha\\Documents\\Inventor\\2009_Data\\Jan\\ipg090106.xml"

# "separated_xml" separates and reads the data of interest so each piece can be parsed individually
def separated_xml(infile):  # Breaks the file apart from one root element (<?xml...) to the next occurrence of the root element
    file = open(infile, "r")        # Open the xml file
    buffer = [file.readline()]      # Read the first line and store it in a list
    # The first for-loop slices the USPTO XML file into sections that can be read and parsed individually.
    # This is necessary because the parser expects a single root element, but that element appears many
    # times in each file, which causes reading errors.
    for line in file:               # Scan the opened file for root elements
        if line.startswith("<?xml "):
            yield "".join(buffer)   # 1) "yield" emits one chunk per root element; 2) .join connects the list "buffer" back into a single string
            buffer = []             # Blank list to collect the new 'set' of data beginning with this root element
        buffer.append(line)         # Collect each line into the list
    yield "".join(buffer)           # Emit the final chunk
    file.close()

# The second, nested set of for-loops parses the newly separated data into a new list
for xml_string in separated_xml(infile):               # Iterate over the separated chunks and parse each one
    soup = BeautifulSoup(xml_string, "lxml")           # BeautifulSoup parses each string, converting the XML to Unicode
    pub_ref = soup.findAll("publication-reference")    # Begin parsing at every instance of a publication
    lst = []                                           # Empty list to append into
    with open('./output.csv', 'wb') as f:
        writer = csv.writer(f, dialect='excel')
        for info in pub_ref:  # Loop over all instances of publication
            # The final loop finds every instance of invention title, patent number, date, and country to print and append
            for inv_name, pat_num, date_num, country in zip(soup.findAll("invention-title"), soup.findAll("doc-number"), soup.findAll("date"), soup.findAll("country")):
                print(inv_name.text, pat_num.text, date_num.text, country.text)
                lst.append((inv_name.text, pat_num.text, date_num.text, country.text))
                writer.writerow([inv_name.text, pat_num.text, date_num.text, country.text])
I have also tried placing the open and the writer outside the for loop to check where the problem arises, but to no avail. I know the file is writing only one row at a time and repeatedly overwriting the same row (which is why only 1 row is left in the CSV file), I just can't see where.
Any help is much appreciated.
Answer 0 (score: 0)
I believe the root of your problem (the first one that jumps out, anyway) is that your with open
statement sits inside your for loop and uses "wb" mode, which overwrites the file if it already exists. This means that each time your for loop runs, it overwrites everything that came before, and once it finishes you are left with only one row of output.
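To see that effect in isolation, here is a tiny self-contained sketch (made-up rows, using the standard csv module in Python 3 text mode; with unicodecsv and "wb" the behaviour is the same):

import csv

rows = [["a", 1], ["b", 2], ["c", 3]]
for row in rows:
    # Re-opening in write mode truncates the file on every iteration,
    # so only the row from the last iteration survives.
    with open("demo.csv", "w", newline="") as f:
        csv.writer(f).writerow(row)
# demo.csv now contains just: c,3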
There are two ways I can see to handle this. The more correct way is to move the file open statement outside of the outermost for loop. I know you mentioned that you have already tried this, but the devil is in the details. That would make your updated code look like this:
with open('./output.csv', 'wb') as f:
    writer = csv.writer(f, dialect='excel')
    for xml_string in separated_xml(infile):
        soup = BeautifulSoup(xml_string, "lxml")
        pub_ref = soup.findAll("publication-reference")
        lst = []
        for info in pub_ref:
            for inv_name, pat_num, date_num, country in zip(soup.findAll("invention-title"), soup.findAll("doc-number"), soup.findAll("date"), soup.findAll("country")):
                print(inv_name.text, pat_num.text, date_num.text, country.text)
                lst.append((inv_name.text, pat_num.text, date_num.text, country.text))
                writer.writerow([inv_name.text, pat_num.text, date_num.text, country.text])
The hackier, but quicker and simpler, way is to just change the mode in your open call to "ab" (append, binary) instead of "wb" (write binary, which overwrites any existing data). It is much less efficient, since you re-open the file every time through the for loop, but it will probably work.
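For reference, a rough sketch of that append-mode variant, reusing the separated_xml generator, infile, and imports exactly as in the question; only the open mode changes:

for xml_string in separated_xml(infile):
    soup = BeautifulSoup(xml_string, "lxml")
    pub_ref = soup.findAll("publication-reference")
    with open('./output.csv', 'ab') as f:   # 'ab' appends instead of truncating
        writer = csv.writer(f, dialect='excel')
        for info in pub_ref:
            for inv_name, pat_num, date_num, country in zip(soup.findAll("invention-title"), soup.findAll("doc-number"), soup.findAll("date"), soup.findAll("country")):
                writer.writerow([inv_name.text, pat_num.text, date_num.text, country.text])

One caveat: since append mode never truncates, delete or empty output.csv before a fresh run, or rows from earlier runs will remain in the file.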
I hope this helps!
Answer 1 (score: 0)
with open('./output.csv', 'wb') as f:
Just change 'wb' -> 'ab' and it won't overwrite.
This didn't work by itself the first time, but moving the open call before the last 2 loops, in addition to this fix, did it. Thanks to those who helped.