Question

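For GDPR purposes, I am trying to identify all spreadsheets in a directory that contain specific strings in their data.

The code below works for small files, but it takes impractically long for any larger spreadsheet (1000+ rows).

It is worth mentioning that I don't know which columns these strings will appear in, so I can't use cell positions to make the search more efficient.

If there is a better way to do the following, could you please share it?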

    import os
    from openpyxl import load_workbook


    def list_files(dir):
        # Collect the full path of every file below dir
        r = []
        for root, dirs, files in os.walk(dir):
            for name in files:
                r.append(os.path.join(root, name))
        return r


    all_files = list_files("filepath")


    filenames = []

    for f in all_files:
        if not f.endswith((".xls", ".xlsx")):
            continue
        wb = load_workbook(f)

        for sheet in wb.worksheets:
            # Visit every cell of every sheet (this is the slow part)
            for a in range(1, sheet.max_row + 1):
                for b in range(sheet.max_column):
                    if sheet[a][b].value:
                        if str(sheet[a][b].value).upper() in ("STRING_1", "STRING_2", "STRING_3"):
                            filenames.append(f)


    set(filenames)

Answer 1

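"If there is a better way to do the following, could you please share it?"

If you have Linux, there is. If you don't, on Windows you only need a terminal emulator such as Cygwin or Babun.

I created two files, example.xlsx (the newer xls format) and example.csv, and added data to each of them.

In these files I added some strings: dodo, lolo, string1, string2, etc.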

    mortiz@alberta:~/Documents/test$ ls -ltr
    total 20
    -rw-r--r-- 1 mortiz mortiz 4822 Apr 20 16:34 example.xlsx
    -rw-r--r-- 1 mortiz mortiz   68 Apr 20 16:34 example.csv

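Almost every Linux distribution ships a utility called "grep", which lets you search for "strings" inside almost anything.

There are two simple approaches:

Unzip the xlsx file and use grep

When you unzip an xlsx you will find a directory named "xl". Inside it is an XML file named "sharedStrings.xml" that holds the string data.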

    mortiz@alberta:~/Documents/test$ unzip example.xlsx
    Archive:  example.xlsx
      inflating: _rels/.rels
      inflating: docProps/app.xml
      inflating: docProps/core.xml
      inflating: xl/_rels/workbook.xml.rels
      inflating: xl/workbook.xml
      inflating: xl/styles.xml
      inflating: xl/worksheets/sheet1.xml
      inflating: xl/sharedStrings.xml
      inflating: [Content_Types].xml

So far we have only unzipped the file; the grep output appears further down.
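Where unzip and grep are not available, the same idea can be sketched in Python using only the standard library: open the xlsx as a zip archive and search xl/sharedStrings.xml. The function name `xlsx_contains` and its parameters are hypothetical, chosen for illustration. This is a rough sketch that assumes the text lives in the shared-strings table; numbers and inline strings stored in the sheet XML are not checked.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by SpreadsheetML parts such as xl/sharedStrings.xml.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def xlsx_contains(path, needles):
    """Return True if any needle occurs (case-insensitively) in the
    shared-strings table of the xlsx file at path. Hypothetical helper."""
    wanted = [n.lower() for n in needles]
    try:
        with zipfile.ZipFile(path) as archive:
            xml_bytes = archive.read("xl/sharedStrings.xml")
    except (zipfile.BadZipFile, KeyError):
        # Not a zip archive, or a workbook without a shared-strings part.
        return False
    root = ET.fromstring(xml_bytes)
    for item in root.iter(NS + "si"):
        # Join all <t> runs so rich-text entries are compared as one string.
        text = "".join(t.text or "" for t in item.iter(NS + "t")).lower()
        if any(n in text for n in wanted):
            return True
    return False
```

Because only one small XML part is read per workbook, this avoids walking every cell of every sheet.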

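Convert the xlsx to xls and use grep

Unlike xlsx, the xls format is not a zip archive, so grep can search it directly. This command generates an xls version of the xlsx, which we can then search for strings.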

    mortiz@alberta:~/Documents/test$ libreoffice --headless --convert-to xls example.xlsx
    convert /home/mortiz/Documents/test/example.xlsx -> /home/mortiz/Documents/test/example.xls using filter : MS Excel 97

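Now use grep to find the files that contain a specific string

Using this utility to find strings is the simplest and fastest way I know of.

Because you need to know which files contain these strings, use grep with -R (recursive) and -i (case-insensitive, so upper or lower case doesn't matter).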

    mortiz@alberta:~/Documents/test$ grep -Ri lolo *
    example.csv:no data,lolo,foo
    Binary file example.xls matches
    xl/sharedStrings.xml:<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" count="12" uniqueCount="12"><si><t xml:space="preserve">string1</t></si><si><t xml:space="preserve">string</t></si><si><t xml:space="preserve">string2</t></si><si><t xml:space="preserve">no data</t></si><si><t xml:space="preserve">lolo</t></si><si><t xml:space="preserve">foo</t></si><si><t xml:space="preserve">test</t></si><si><t xml:space="preserve">mia</t></si><si><t xml:space="preserve">ami</t></si><si><t xml:space="preserve">nono</t></si><si><t xml:space="preserve">toto</t></si><si><t xml:space="preserve">dodo</t></si></sst>

To print only the file names, add the "l" option; this prints every file that contains the string "lolo":

    mortiz@alberta:~/Documents/test$ grep -Ril lolo *
    example.csv
    example.xls
    xl/sharedStrings.xml

Grep works on xml, csv, xls, or plain-text files, which is why we need to unzip or convert the xlsx :)
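For machines without grep, a rough Python counterpart of `grep -Ril` might look like the sketch below. The name `files_containing` is hypothetical, and the approach assumes each file fits in memory. Like grep, it only sees raw bytes, so it will not find text inside a zipped xlsx unless the file has been unzipped or converted first.

```python
import os


def files_containing(root_dir, needle):
    """Return the sorted paths under root_dir whose raw bytes contain
    needle, ignoring ASCII case. Hypothetical helper for illustration."""
    target = needle.lower().encode("utf-8")
    matches = []
    for folder, _dirs, names in os.walk(root_dir):
        for name in names:
            path = os.path.join(folder, name)
            try:
                with open(path, "rb") as handle:
                    data = handle.read()
            except OSError:
                continue  # unreadable file: skip it
            # bytes.lower() folds ASCII letters only
            if target in data.lower():
                matches.append(path)
    return sorted(matches)
```

Calling `files_containing(".", "lolo")` in the test directory above would be the analogue of `grep -Ril lolo *`.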

Answer 2

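Thanks for your help, Miguel. I wanted to try your solution, but I'm very new to this kind of scripting and I don't have those tools on my work machine.

I came across this Python code, which works for my xlsx files. (The field names I'm searching for are usually in the first row of the spreadsheet; this code returns as soon as it finds a match.)

The process still has room for improvement. Converting the files to csv (or xls?) and parsing each one as a single string would speed things up.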

    import glob

    import xlrd  # note: xlrd 2.0+ reads only .xls; .xlsx needs xlrd < 2.0


    def search_excel(path, phrases):
        """Return path if any string cell contains one of phrases, else False."""
        try:
            wb = xlrd.open_workbook(path)
        except xlrd.biffh.XLRDError:
            # Assumed False if document fails to open (e.g. invalid format/corrupt)
            return False
        for sheet in wb.sheets():
            for row in sheet.get_rows():
                # Only operate on string typed cells
                for cell in filter(lambda x: x.ctype == xlrd.XL_CELL_TEXT, row):
                    try:
                        if any(phrase in cell.value for phrase in phrases):
                            # Return as soon as possible if criteria fulfilled
                            return path
                    except TypeError:
                        pass
        # No cell matched
        return False


    if __name__ == "__main__":
        # Recursively searches all nested directories/files for files
        # (recursive=True only has an effect if the pattern contains "**").
        # Change the path here.
        file_paths = glob.glob("...path...", recursive=True)
        search_phrases = ['string_1', 'STRING_1', 'string_2', 'STRING_2']
        list_comp = [search_excel(path, search_phrases) for path in file_paths]
        # Keep only the paths that matched (drops the False entries)
        matches = [path for path in list_comp if path]
        print(matches)
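The single-string idea mentioned above could look roughly like this for csv files. It is only a sketch: `compile_phrases` and `csv_contains_any` are hypothetical names, and the UTF-8 default encoding is an assumption. One case-insensitive pattern replaces listing both 'string_1' and 'STRING_1'.

```python
import re


def compile_phrases(phrases):
    """Build one case-insensitive pattern matching any phrase literally."""
    if not phrases:
        raise ValueError("at least one phrase is required")
    return re.compile("|".join(re.escape(p) for p in phrases), re.IGNORECASE)


def csv_contains_any(path, pattern, encoding="utf-8"):
    """Read the whole csv file as one string and test it against pattern."""
    with open(path, encoding=encoding, errors="replace") as handle:
        return pattern.search(handle.read()) is not None
```

Compiling the pattern once and reusing it for every file keeps the per-file work to a single read and a single search.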

Efficient way to identify xlsx & csv files in a directory that contain specific strings in their data
