Extracting files with a specific extension from multiple 7-zip archives

Time: 2017-01-26 18:54:09

Tags: windows cygwin extract 7zip compression

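I have a RAR file and a ZIP file. Each of them contains a folder. Inside that folder there are several 7-zip (.7z) files. Inside each 7z there are multiple files with the same extension but different names.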

RAR or ZIP file
  |___folder
        |_____Multiple 7z
                  |_____Multiple files with same extension and different name

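Out of thousands of files, I want to extract only the ones I need: the files whose names contain a certain substring. For example, whether the name of a compressed file contains '[!]', '(U)' or '(J)' is the criterion that decides if the file should be extracted.

I can extract the folder without any problem, so I have this structure: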

folder
   |_____Multiple 7z
                |_____Multiple files with same extension and different name

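I am in a Windows environment, but I have Cygwin installed. How can I easily extract the files I need? Maybe with a single command line.

Please help me with this. Thanks!

UPDATE: Thanks to everyone for helping me. Here are some more specifics to improve the question:

  • The inner 7z files, and the files inside them, can have spaces in their names.
  • Some 7z files contain only a single file, and it does not meet the given criteria. Since it is the only possible file, it has to be extracted too.

Thanks to everyone. The bash solution is the one that helped me. I was not able to test the Python3 solutions because I ran into problems when trying to install the libraries with pip. I don't use Python, so I would have to study it and get past the errors I hit with those solutions. For now, I have found a suitable answer. Thanks everyone.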

4 answers:

Answer 0 (score: 1)

How about using this command line:

7z e c:\myDir\*.7z -oc:\outDir "*(U)*.ext" "*(J)*.ext" "*[!]*.ext" -y

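Where:

  • myDir is your unpacked folder
  • outDir is your output directory
  • ext is your file extension

The -y option forces overwriting, in case the same file name occurs in different archives.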
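If you would rather drive that command from a script, here is a minimal sketch that only assembles the argument list. The function name `build_7z_command` and its parameters are my own placeholders, not part of 7-Zip. Passing the arguments as a list means the wildcard patterns reach 7z without any shell quoting.

```python
def build_7z_command(archive_glob, out_dir, ext, markers=("(U)", "(J)", "[!]")):
    """Return the argument list for a 7z 'e' (extract) call.

    archive_glob, out_dir and ext are placeholders for your own values.
    """
    cmd = ["7z", "e", archive_glob, "-o" + out_dir]
    # One wildcard pattern per marker, e.g. *(U)*.ext
    cmd += ["*{}*.{}".format(marker, ext) for marker in markers]
    cmd.append("-y")  # assume "yes" on overwrite prompts
    return cmd


if __name__ == "__main__":
    print(build_7z_command(r"c:\myDir\*.7z", r"c:\outDir", "ext"))
```

The resulting list can be handed to subprocess.run(cmd) on a machine where 7z is on the PATH.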

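Answer 1 (score: 1)

This solution is based on bash, grep and awk; it works on Cygwin and on Ubuntu.

Since you need to search for (X) [!].ext files first and, only if there are none, look for (X).ext files, I don't think a single expression can handle this logic.

The solution needs some if/else conditional logic that tests the list of files inside the archive and decides which files to extract.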
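That decision logic can be sketched on its own, independent of how the file list is obtained. This is only an illustration in Python; the name `select_files` is hypothetical, and the single-letter pattern is my reading of the "(X).ext" convention described above.

```python
import re


def select_files(names):
    """Pick which archive members to extract, in priority order."""
    # 1. Prefer files tagged [!].
    preferred = [n for n in names if "[!]" in n]
    if preferred:
        return preferred
    # 2. Otherwise take names ending in a single-letter code, e.g. "(J).txt".
    coded = [n for n in names if re.search(r"\([A-Z]\)\.", n)]
    if coded:
        return coded
    # 3. Otherwise, if the archive holds exactly one file, take it.
    if len(names) == 1:
        return list(names)
    return []


if __name__ == "__main__":
    print(select_files(["(J) [b1].txt", "(J) [o1].txt", "(J).txt"]))
```

Keeping the rule in one small function makes it easy to check against a few sample listings before touching any real archive.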

This is the initial structure inside the zip/rar archive that I tested my script on (I wrote a script to prepare this structure):

folder
├── 7z_1.7z
│   ├── (E).txt
│   ├── (J) [!].txt
│   ├── (J).txt
│   ├── (U) [!].txt
│   └── (U).txt
├── 7z_2.7z
│   ├── (J) [b1].txt
│   ├── (J) [b2].txt
│   ├── (J) [o1].txt
│   └── (J).txt
├── 7z_3.7z
│   ├── (E) [!].txt
│   ├── (J).txt
│   └── (U).txt
└── 7z 4.7z
    └── test.txt

The output looks like this:

output
├── 7z_1.7z           # This is a folder, not an archive
│   ├── (J) [!].txt   # Here we extracted only files with [!]
│   └── (U) [!].txt
├── 7z_2.7z
│   └── (J).txt       # Here there are no [!] files, so we extracted (J)
├── 7z_3.7z
│   └── (E) [!].txt   # We had here both [!] and (J), extracted only file with [!]
└── 7z 4.7z
    └── test.txt      # We had only one file here, extracted it

And this is the script that does the extraction:

#!/bin/bash

# Remove the output (if it's left from previous runs).
rm -rf output
mkdir -p output

# Unzip the zip archive.
unzip data.zip -d output
# For rar use
#  unrar x data.rar output
# OR
#  7z x -ooutput data.rar

for archive in output/folder/*.7z
do
  # See https://stackoverflow.com/questions/7148604
  # Get the list of file names, remove the extra output of "7z l"
  list=$(7z l "$archive" | awk '
      /----/ {p = ++p % 2; next}
      $NF == "Name" {pos = index($0,"Name")}
      p {print substr($0,pos)}
  ')
  # Get the list of files with [!].
  extract_list=$(echo "$list" | grep "[!]")
  if [[ -z $extract_list ]]; then
    # If we don't have files with [!], then look for ([A-Z]) pattern
    # to get files with single letter in brackets.
    extract_list=$(echo "$list" | grep "([A-Z])\.")
  fi
  if [[ -z $extract_list ]]; then
    # If we only have one file - extract it.
    # $list is a plain string, so count its non-empty lines.
    if [[ $(grep -c . <<< "$list") -eq 1 ]]; then
      extract_list=$list
    fi
  fi
  if [[ ! -z $extract_list ]]; then
    # If we have files to extract, then do the extraction.
    # Output path is output/7zip_archive_name/
    out_path=output/$(basename "$archive")
    mkdir -p "$out_path"
    echo "$extract_list" | xargs -I {} 7z x -o"$out_path" "$archive" {}
  fi
done

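The basic idea is to use the 7z l command (list files) to look inside each 7zip archive and get the list of its files.

The output of that command is quite verbose, so we use awk to clean it up and keep only the file names.

After that, we filter this list with grep to get either the list of [!] files or the list of (X) files. Then we pass that list to 7zip to extract just the files we need.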
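For readers who find the awk part hard to follow, here is a rough Python equivalent of the same cleanup step. It assumes the listing layout that the awk program relies on: a header line whose last field is "Name", with the file rows sitting between two dashed separator lines. The function name `parse_7z_listing` is made up for this sketch.

```python
def parse_7z_listing(text):
    """Return the file names found in the text output of "7z l"."""
    names = []
    inside = False   # True while between the two dashed separator lines
    name_col = None  # column at which the "Name" header starts
    for line in text.splitlines():
        if "----" in line:
            inside = not inside
            continue
        fields = line.split()
        if not inside and fields and fields[-1] == "Name":
            name_col = line.index("Name")
            continue
        if inside and name_col is not None:
            # Everything from the Name column onward, spaces included.
            names.append(line[name_col:])
    return names
```

Slicing from the column position, instead of splitting on whitespace, is what keeps names with spaces intact.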

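Answer 2 (score: 0)

After a few attempts, this is more or less the final version. The previous one did not work, so I deleted it instead of appending to it. Read to the end, because the final solution may not need everything described here.

On topic: I would use Python. If this is a one-time task it may be overkill, but in any other case you can log every step for later investigation, use regular expressions, orchestrate commands while feeding them input, and capture and process their output, every time. All of that is very simple in Python, provided you have it installed.

Now I will write down how to configure the environment. Not all of it is mandatory, but I went through these steps while trying to install things, and the description of the process may be useful in itself.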

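I have MinGW, the 32-bit version. It is not mandatory for extracting 7zip, though. After installing it, go to C:\MinGW\bin and run mingw-get.exe:

  • In Basic Setup, I installed msys-base (right click, mark for installation, then Apply Changes from the Installation menu). That gives me bash, sed, grep and so on.
  • In All Packages, I installed mingw32-libarchive with dll as the class. Since the Python libarchive package is only a wrapper, you need this dll so that there is an actual binary to wrap.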

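The examples are for Python 3. I used the 32-bit version. You can fetch it from their home page. I installed it in the default directory, which is an odd location, so I would advise installing it in the root of the disk, like mingw.

One more thing: conemu is much better than the default console.

Installing packages in Python is done with pip. From your console, go to the Python home directory; there is a Scripts subdirectory in it. For me it is c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\Scripts. You can search with pip search archive and then install with pip install libarchive-c: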

> pip.exe install libarchive-c
Collecting libarchive-c
  Downloading libarchive_c-2.7-py2.py3-none-any.whl
Installing collected packages: libarchive-c
Successfully installed libarchive-c-2.7

After cd .. and starting python, the new library can be imported and used:

>>> import libarchive
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\__init__.py", line 1, in <module>
    from .entry import ArchiveEntry
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\entry.py", line 6, in <module>
    from . import ffi
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\ffi.py", line 27, in <module>
    libarchive = ctypes.cdll.LoadLibrary(libarchive_path)
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\ctypes\__init__.py", line 426, in LoadLibrary
   return self._dlltype(name)
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\ctypes\__init__.py", line 348, in __init__
    self._handle = _dlopen(self._name, mode)
TypeError: LoadLibrary() argument 1 must be str, not None

So it failed. I tried to fix that, but it failed again with:

>>> import libarchive
read format "cab" is not supported
read format "7zip" is not supported
read format "rar" is not supported
read format "lha" is not supported
read filter "uu" is not supported
read filter "lzop" is not supported
read filter "grzip" is not supported
read filter "bzip2" is not supported
read filter "rpm" is not supported
read filter "xz" is not supported
read filter "none" is not supported
read filter "compress" is not supported
read filter "all" is not supported
read filter "lzma" is not supported
read filter "lzip" is not supported
read filter "lrzip" is not supported
read filter "gzip" is not supported
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\__init__.py", line 1, in <module>
    from .entry import ArchiveEntry
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\entry.py", line 6, in <module>
    from . import ffi
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\ffi.py", line 167, in <module>
    c_int, check_int)
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\ffi.py", line 92, in ffi
    f = getattr(libarchive, 'archive_'+name)
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\ctypes\__init__.py", line 361, in __getattr__
    func = self.__getitem__(name)
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\ctypes\__init__.py", line 366, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: function 'archive_read_open_filename_w' not found
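The first traceback above ends with LoadLibrary() receiving None, which suggests that the wrapper could not locate the libarchive shared library at all. As a quick check before importing a ctypes-based wrapper, you can ask ctypes what it finds. This is only a diagnostic sketch: `locate_library` is a name I made up, and I am assuming the wrapper performs a lookup similar to `ctypes.util.find_library`, which the traceback does not show.

```python
import ctypes.util


def locate_library(name):
    """Return whatever ctypes resolves for a shared library name, or None."""
    return ctypes.util.find_library(name)


if __name__ == "__main__":
    path = locate_library("archive")
    if path is None:
        print("libarchive not found; a ctypes wrapper would likely fail to load it")
    else:
        print("found:", path)
```

If this prints "not found", the problem is with the library installation or the search path, not with the Python package itself.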

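I tried to supply the information directly with the set command, but that failed too... So I moved on to pylzma, since it does not need mingw. The pip install failed: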

> pip.exe install pylzma
Collecting pylzma
  Downloading pylzma-0.4.9.tar.gz (115kB)
    100% |--------------------------------| 122kB 1.3MB/s
Installing collected packages: pylzma
  Running setup.py install for pylzma ... error
    Complete output from command c:\users\texxas\appdata\local\programs\python\python36-32\python.exe -u -c "import setuptools, tokenize;__file__='C:\\Users\\texxas\\AppData\\Local\\Temp\\pip-build-99t_zgmz\\pylzma\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record C:\Users\texxas\AppData\Local\Temp\pip-ffe3nbwk-record\install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_py
    creating build
    creating build\lib.win32-3.6
    copying py7zlib.py -> build\lib.win32-3.6
    running build_ext
    adding support for multithreaded compression
    building 'pylzma' extension
    error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

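Failed again. But this one is easy: I installed the Visual Studio 2015 build tools, and then it worked. I have sevenzip installed, so I created a sample archive. Finally I could start python and run: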

from py7zlib import Archive7z
f = open(r"C:\Users\texxas\Desktop\try.7z", 'rb')
a = Archive7z(f)
a.filenames

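And I got an empty list. Looking closer, the reason becomes clear: pylzma does not take empty files into account. Just so you are aware of it. After putting one character into each of my sample files, the last line gives: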

>>> a.filenames
['try/a/test.txt', 'try/a/test1.txt', 'try/a/test2.txt', 'try/a/test3.txt', 'try/a/test4.txt', 'try/a/test5.txt', 'try/a/test6.txt', 'try/a/test7.txt', 'try/b/test.txt', 'try/b/test1.txt', 'try/b/test2.txt', 'try/b/test3.txt', 'try/b/test4.txt', 'try/b/test5.txt', 'try/b/test6.txt', 'try/b/test7.txt', 'try/c/test.txt', 'try/c/test1.txt', 'try/c/test11.txt', 'try/c/test2.txt', 'try/c/test3.txt', 'try/c/test4.txt', 'try/c/test5.txt', 'try/c/test6.txt', 'try/c/test7.txt']

So... the rest is a piece of cake. Actually, this is part of the original post:

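import os
import py7zlib

DEST = './dest'

for folder, subfolders, files in os.walk('.'):
    for file in files:
        if not file.endswith('.7z'):
            continue
        # A 7z archive: extract the members we need.
        path = os.path.join(folder, file)
        try:
            with open(path, 'rb') as f:
                arch = py7zlib.Archive7z(f)
                for member in arch.getmembers():
                    # Adjust this suffix test to your own criteria.
                    if member.filename.endswith('.py'):
                        target = os.path.join(DEST, member.filename)
                        os.makedirs(os.path.dirname(target), exist_ok=True)
                        with open(target, 'wb') as out:
                            out.write(member.read())
        except py7zlib.FormatError as e:
            print('file ' + path)
            print(str(e))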

As a side note: Anaconda is a great tool, but a full installation takes 500+ MB, so it is overkill here.

Also, let me share a piece of the wmctrl.py tool from my github:

cmd = 'wmctrl -ir ' + str(active.window) + \
      ' -e 0,' + str(stored.left) + ',' + str(stored.top) + ',' + str(stored.width) + ',' + str(stored.height)
print(cmd)
res = getoutput(cmd)

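This way you can orchestrate different commands, wmctrl in this case. The result can then be handled in a way that allows further data processing.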
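To make that idea concrete with nothing but the standard library, here is a small sketch that runs a command and returns its output as a list of lines. The helper name `run_and_capture` is hypothetical, and the demo runs the Python interpreter itself as a stand-in command so that it works on any machine.

```python
import subprocess
import sys


def run_and_capture(args):
    """Run a command and return (exit code, list of output lines)."""
    result = subprocess.run(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # merge error output into the same stream
        universal_newlines=True,    # decode bytes to text
    )
    return result.returncode, result.stdout.splitlines()


if __name__ == "__main__":
    # Stand-in command; swap in something like ["7z", "l", "archive.7z"].
    code, lines = run_and_capture([sys.executable, "-c", "print('hello')"])
    print(code, lines)
```

Once the output is a plain list of strings, filtering it with regular expressions or ordinary list operations is straightforward.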

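Answer 3 (score: 0)

You stated in the bounty footer of the question that using Linux is fine. Also, I don't use Windows, sorry about that. I am using Python3, and you have to be in a Linux environment (I will test this on Windows as soon as I can).

Archive structure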

datadir.rar
          |
          datadir/
                 |
                 zip1.7z
                 zip2.7z
                 zip3.7z
                 zip4.7z
                 zip5.7z

Extracted structure

extracted/
├── zip1
│   ├── (E) [!].txt
│   ├── (J) [!].txt
│   └── (U) [!].txt
├── zip2
│   ├── (E) [!].txt
│   ├── (J) [!].txt
│   └── (U) [!].txt
├── zip3
│   ├── (J) [!].txt
│   └── (U) [!].txt
└── zip5
    ├── (J).txt
    └── (U).txt

Here is how I did it.

import libarchive.public
import os, os.path
from os.path import basename
import errno
import rarfile

#========== FILE UTILS =================

#Make directories
def mkdir_p(path):
    try:
        os.makedirs(path)
    except OSError as exc: # Python >2.5
        if exc.errno == errno.EEXIST and os.path.isdir(path):
            pass
        else: raise

#Open "path" for writing, creating any parent directories as needed.
def safe_open_w(path):
    mkdir_p(os.path.dirname(path))
    return open(path, 'wb')

#========== RAR TOOLS ==================

# List
def rar_list(rar_archive):
    with rarfile.RarFile(rar_archive) as rf:
        return rf.namelist()

# extract
def rar_extract(rar_archive, filename, path):
    with rarfile.RarFile(rar_archive) as rf:
        rf.extract(filename,path)

# extract-all
def rar_extract_all(rar_archive, path):
    with rarfile.RarFile(rar_archive) as rf:
        rf.extractall(path)

#========= 7ZIP TOOLS ==================

# List
def zip7_list(zip7file):
    filelist = []
    with open(zip7file, 'rb') as f:
        for entry in libarchive.public.memory_pour(f.read()):
            filelist.append(entry.pathname.decode("utf-8"))
    return filelist

# extract
def zip7_extract(zip7file, filename, path):
    with open(zip7file, 'rb') as f:
        for entry in libarchive.public.memory_pour(f.read()):
            if entry.pathname.decode("utf-8") == filename:
                with safe_open_w(os.path.join(path, filename)) as q:
                    for block in entry.get_blocks():
                        q.write(block)
                break

# extract-all
def zip7_extract_all(zip7file, path):
    with open(zip7file, 'rb') as f:
        for entry in libarchive.public.memory_pour(f.read()):
            if os.path.isdir(entry.pathname.decode("utf-8")):
                continue
            with safe_open_w(os.path.join(path, entry.pathname.decode("utf-8"))) as q:
                for block in entry.get_blocks():
                    q.write(block)

#============ FILE FILTER =================

def exclamation_filter(filename):
    return ("[!]" in filename)

def optional_code_filter(filename):
    return not ("[" in filename)

def has_exclamation_files(filelist):
    for singlefile in filelist:
        if(exclamation_filter(singlefile)):
            return True
    return False

#============ MAIN PROGRAM ================

print("-------------------------")
print("Program Started")
print("-------------------------")

BIG_RAR = 'datadir.rar'
TEMP_DIR = 'temp'
EXTRACT_DIR = 'extracted'
newzip7filelist = []

#Extract big rar and get new file list
for zipfilepath in rar_list(BIG_RAR):
    rar_extract(BIG_RAR, zipfilepath, TEMP_DIR)
    newzip7filelist.append(os.path.join(TEMP_DIR, zipfilepath))

print("7z Files Extracted")
print("-------------------------")

for newzip7file in newzip7filelist:
    innerFiles = zip7_list(newzip7file)
    for singleFile in innerFiles:
        fileSelected = False
        if(has_exclamation_files(innerFiles)):
            if exclamation_filter(singleFile): fileSelected = True
        else:
            if optional_code_filter(singleFile): fileSelected = True
        if(fileSelected):
            print(singleFile)
            outputFile = os.path.join(EXTRACT_DIR, os.path.splitext(basename(newzip7file))[0])
            zip7_extract(newzip7file, singleFile, outputFile)

print("-------------------------")
print("Extraction Complete")
print("-------------------------")

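Above the main program I have prepared all the required functions. I did not use all of them, but I kept them in case you need them.

I used several Python libraries with python3, but you only need to install libarchive and rarfile with pip; the others are built-in libraries.

Here is a copy of my source tree.

Console output

This is the console output when you run this python file: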

-------------------------
Program Started
-------------------------
7z Files Extracted
-------------------------
(J) [!].txt
(U) [!].txt
(E) [!].txt
(J) [!].txt
(U) [!].txt
(E) [!].txt
(J) [!].txt
(U) [!].txt
(J).txt
(U).txt
-------------------------
Extraction Complete
-------------------------

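Problem

The only problem I have run into so far is that some temporary files show up in the program's root directory. They don't affect the program in any way, but I will try to fix that.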
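Until that is fixed, one possible workaround is to run the extraction step from a throwaway working directory, so that stray files land there and are removed afterwards. This is only a sketch under the assumption that the stray files are written relative to the current working directory, which I have not confirmed. `run_in_scratch_dir` is a hypothetical helper, and any relative paths used inside `work` (such as the archive path) would have to be made absolute first.

```python
import os
import tempfile


def run_in_scratch_dir(work):
    """Call work() with a temporary directory as the current directory."""
    original = os.getcwd()
    with tempfile.TemporaryDirectory() as scratch:
        os.chdir(scratch)
        try:
            return work()
        finally:
            # Leave the scratch directory before it gets deleted.
            os.chdir(original)


if __name__ == "__main__":
    print(run_in_scratch_dir(os.getcwd))
```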

Edit

You have to run

sudo apt-get install libarchive-dev

to install the actual libarchive program. The Python library is just a wrapper. Have a look at the official documentation.