Question

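I have a large file A (consisting of emails), one line for each mail. I also have another file B that contains another set of mails.

Which command would I use to remove all the addresses that appear in file B from file A?

So, if file A contained: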

A
B
C

and file B contained:

B
D
E

then file A should be left with:

A
C

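Now I know this is a question that has probably been asked often, but I only found one command online, and it gave me an error about a bad delimiter.

Any help would be much appreciated! Somebody will surely come up with a clever one-liner, but I'm not a shell expert.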

Answer 1

comm -23 file1 file2

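-23 suppresses the lines that are in both files and the lines that are only in file2, which leaves the lines unique to file1. The files have to be sorted (they are in your example); if they are not, pipe them through sort first...

See the man page for comm.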
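
For unsorted input, the "sort first" step mentioned above could look roughly like this. It is only a sketch: the scratch directory, the file names and the sample data are placeholders, and LC_ALL=C is my assumption for keeping sort and comm on the same collation order.

```shell
#!/bin/sh
# Sketch: remove from A every line that also appears in B, using comm.
# All file names are placeholders created in a scratch directory.
tmp=$(mktemp -d)

printf '%s\n' C A B > "$tmp/A"      # deliberately unsorted
printf '%s\n' E B D > "$tmp/B"

# comm expects both inputs sorted in the same collation order,
# so the C locale is forced for sort and comm alike.
LC_ALL=C sort "$tmp/A" > "$tmp/A.sorted"
LC_ALL=C sort "$tmp/B" > "$tmp/B.sorted"

# -2 drops lines unique to B, -3 drops lines common to both,
# leaving only the lines unique to A.
result=$(LC_ALL=C comm -23 "$tmp/A.sorted" "$tmp/B.sorted")
printf '%s\n' "$result"             # prints A and C, one per line

rm -rf "$tmp"
```

Note that the output comes back in sorted order, not in the original order of A.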

Answer 2

grep -Fvxf <lines-to-remove> <all-lines>

works on non-sorted files
maintains the order
is POSIX

Example:

cat <<EOF > A
b
1
a
0
01
b
1
EOF

cat <<EOF > B
0
1
EOF

grep -Fvxf B A

Output:

b
a
01
b

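Explanation:

-F: use literal strings instead of the default BRE
-x: only consider matches that match the whole line
-v: print the non-matching lines
-f file: take the patterns from the given file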
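
To see why -F and -x matter for email addresses in particular, here is a small sketch comparing the exact form with a plain grep -vf. The addresses and file names are made up for illustration.

```shell
#!/bin/sh
# Sketch: compare exact whole-line matching with loose regex matching.
tmp=$(mktemp -d)

printf '%s\n' 'a.b@example.com' 'axb@example.com' \
              'bob@example.com' 'bob@example.com.au' > "$tmp/A"
printf '%s\n' 'a.b@example.com' 'bob@example.com' > "$tmp/B"

# Literal, whole-line comparison: only exact duplicates are removed.
exact=$(grep -Fvxf "$tmp/B" "$tmp/A")

# Without -F and -x each pattern is a regex matched anywhere in the line:
# "." matches the "x" in axb@..., and bob@example.com matches inside
# bob@example.com.au, so every line of A is dropped.
# grep exits with status 1 when it selects nothing, hence "|| true".
loose=$(grep -vf "$tmp/B" "$tmp/A" || true)

printf 'exact:\n%s\n' "$exact"
printf 'loose:\n%s\n' "$loose"

rm -rf "$tmp"
```

With this data the exact form keeps axb@example.com and bob@example.com.au, while the loose form keeps nothing.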

This method is slower on pre-sorted files than other methods, since it is more general. If speed matters as well, see: Fast way of finding lines in one file that are not in another?

See also: https://unix.stackexchange.com/questions/28158/is-there-a-tool-to-get-the-lines-in-one-file-that-are-not-in-another

Answer 3

awk to the rescue!

This solution doesn't require sorted inputs. You have to provide fileB first.

awk 'NR==FNR{a[$0];next} !($0 in a)' fileB fileA

returns

A
C

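How does it work?

The NR==FNR{a[$0];next} idiom stores the first file in an associative array as keys for a later "contains" test.

NR==FNR checks whether we are scanning the first file, where the global line counter (NR) equals the current file's line counter (FNR).

a[$0] adds the current line to the associative array as a key. Note that this behaves like a set: there won't be any duplicate values (keys).

!($0 in a): we are now in the next file(s). in is a contains test; here it checks whether the current line is in the set we populated in the first step from the first file, and ! negates the condition. What is missing here is the action, which by default is {print} and is usually not written explicitly.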
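
The same one-liner, spelled out with comments and the explicit print action, might look like the sketch below. The array name seen, the scratch directory and the sample files are my own choices for illustration.

```shell
#!/bin/sh
# Sketch: the NR==FNR idiom written out step by step.
tmp=$(mktemp -d)
printf '%s\n' A B C > "$tmp/fileA"
printf '%s\n' B D E > "$tmp/fileB"

result=$(awk '
    # First file on the command line (fileB): NR and FNR advance together.
    NR == FNR {
        seen[$0]    # referencing the element creates the key; no value needed
        next        # skip the rule below for lines of the first file
    }
    # Remaining file(s): print lines whose text is not a key of seen.
    !($0 in seen) { print }
' "$tmp/fileB" "$tmp/fileA")

printf '%s\n' "$result"     # prints A and C, one per line

rm -rf "$tmp"
```

One caveat worth knowing: if fileB is empty, NR==FNR also holds while fileA is being read, so nothing is printed at all.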

Note that this can now be used to remove blacklisted words.

$ awk '...' badwords allwords > goodwords

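With a slight change it can clean multiple lists and create cleaned versions. (The output file name is parenthesized because an unparenthesized concatenation after > is parsed differently by different awk implementations.)

$ awk 'NR==FNR{a[$0];next} !($0 in a){print > (FILENAME".clean")}' bad file1 file2 file3 ...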
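
A quick way to try the multi-file variant is sketched below. The word lists and file names are invented for the demonstration.

```shell
#!/bin/sh
# Sketch: clean several lists against one blacklist, writing NAME.clean files.
tmp=$(mktemp -d)
printf '%s\n' spam junk        > "$tmp/bad"
printf '%s\n' hello spam world > "$tmp/file1"
printf '%s\n' junk mail        > "$tmp/file2"

# FILENAME holds the path of the file being read, so each input
# gets its own output file next to it.
awk 'NR==FNR{a[$0];next} !($0 in a){print > (FILENAME ".clean")}' \
    "$tmp/bad" "$tmp/file1" "$tmp/file2"

clean1=$(cat "$tmp/file1.clean")
clean2=$(cat "$tmp/file2.clean")
printf 'file1.clean:\n%s\n' "$clean1"
printf 'file2.clean:\n%s\n' "$clean2"

rm -rf "$tmp"
```

An input file whose lines are all blacklisted produces no .clean file at all, because awk only creates the output file on the first print to it.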

Answer 4

Another way to do the same thing (also requires sorted input):

join -v 1 fileA fileB

In Bash, if the files are not pre-sorted:

join -v 1 <(sort fileA) <(sort fileB)
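
If process substitution is not available (plain sh), the same thing can be sketched with temporary files. The addresses and file names below are placeholders. Keep in mind that join compares only the first whitespace-separated field by default, which is fine for one-token lines such as email addresses.

```shell
#!/bin/sh
# Sketch: join -v 1 on inputs that are sorted into temporary files first.
tmp=$(mktemp -d)
printf '%s\n' carol@example.com alice@example.com bob@example.com > "$tmp/fileA"
printf '%s\n' eve@example.com bob@example.com > "$tmp/fileB"

# join expects both inputs sorted on the join field in the same collation order.
LC_ALL=C sort "$tmp/fileA" > "$tmp/A.sorted"
LC_ALL=C sort "$tmp/fileB" > "$tmp/B.sorted"

# -v 1 prints only the lines of the first file that have no match in the second.
result=$(LC_ALL=C join -v 1 "$tmp/A.sorted" "$tmp/B.sorted")
printf '%s\n' "$result"

rm -rf "$tmp"
```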

Answer 5

If both files are sorted, you can do this with GNU diff:

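diff file-a file-b --new-line-format="" --old-line-format="%L" --unchanged-line-format="" > file-a.new

Write the output to a separate file (file-a.new is only a placeholder name) and rename it afterwards; redirecting straight to file-a would make the shell truncate file-a before diff reads it.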

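--new-line-format is for lines that are in file b but not in a, --old-.. is for lines that are in file a but not in b, and --unchanged-.. is for lines that are in both. %L makes it so the line is printed exactly.

man diff

for more details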

Answer 6

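For very large files, this variation on @karakfa's excellent answer may be noticeably faster. As with that answer, neither file needs to be sorted, but speed is assured by virtue of awk's associative arrays. Only the lookup file is held in memory.

This formulation also allows only one particular field ($N) of the input file to be used in the comparison.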

# Print lines in the input unless the value in column $N
# appears in a lookup file, $LOOKUP;
# if $N is 0, then the entire line is used for comparison.

awk -v N="$N" -v lookup="$LOOKUP" '
  # Load every line of the lookup file as a key; "> 0" stops on EOF or a read error.
  BEGIN { while ( (getline line < lookup) > 0 ) { dictionary[line] } }
  !($N in dictionary) {print}'

(Another advantage of this approach is that it is easy to modify the comparison criterion, e.g. to trim leading and trailing whitespace.)
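
As an illustration of that last point, a whitespace-trimming variant could be sketched as follows. The trim helper is hypothetical (it is not part of awk), the sample files are placeholders, and for brevity the sketch always compares the whole line instead of field $N.

```shell
#!/bin/sh
# Sketch: compare lines after trimming leading and trailing blanks and tabs.
tmp=$(mktemp -d)
printf '%s\n' 'B' 'D  '        > "$tmp/lookup"
printf '%s\n' 'A' '  B ' 'C'   > "$tmp/input"

result=$(awk -v lookup="$tmp/lookup" '
    # Hypothetical helper: return s without leading and trailing blanks or tabs.
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    BEGIN {
        # Store the trimmed lookup lines as keys.
        while ((getline line < lookup) > 0) dictionary[trim(line)]
        close(lookup)
    }
    # Print the original line when its trimmed form is not a key.
    !(trim($0) in dictionary)
' "$tmp/input")

printf '%s\n' "$result"     # prints A and C, one per line

rm -rf "$tmp"
```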

Answer 7

You can use Python.

Answer 8

You can use - int &

This works for unsorted files as well.

Answer 9

Just to add to the Python answer from the user above, here is a faster solution:

python -c '
# Read the lines to remove into a set for fast membership tests.
with open("partial file") as f:
    lines_to_remove = {line.rstrip() for line in f}

# Set subtraction keeps only the lines absent from the partial file.
# Note: a set drops duplicate lines and does not preserve the file order.
with open("full file") as f:
    remaining_lines = {line.rstrip() for line in f} - lines_to_remove

with open("output file", "w") as f:
    for line in remaining_lines:
        f.write(line + "\n")
'

Harnessing the power of set subtraction.

Answer 10

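To remove the common lines between two files, you can use the grep, comm or join command.

grep is only suitable for small files. Use -v along with -f.

grep -vf file2 file1

This displays the lines of file1 that do not match any line of file2. Add -F and -x if the lines of file2 should be compared as whole literal strings rather than as regular expressions matched anywhere in the line.

comm is a utility command that works on lexically sorted files. It takes two files as input and produces three text columns as output: lines only in the first file; lines only in the second file; and lines in both files. You can suppress the printing of any column with the -1, -2 or -3 option.

comm -1 -3 file2 file1

This displays the lines of file1 that do not match any line of file2.

Finally, there is join, a utility command that performs an equality join on the specified files. Its -v option also allows removing the lines common to both files.

join -v1 -v2 file1 file2

This prints the unpaired lines of both files; use -v1 alone to keep only the lines of file1. Like comm, join expects sorted input.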
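
To make the difference between these invocations concrete, here is a small sketch using the question's sample data. The scratch directory and variable names are placeholders.

```shell
#!/bin/sh
# Sketch: compare comm -1 -3, join -v1 -v2 and join -v1 on the same sorted input.
tmp=$(mktemp -d)
printf '%s\n' A B C > "$tmp/file1"     # already sorted
printf '%s\n' B D E > "$tmp/file2"     # already sorted

# Lines only in file1 (file1 is the second argument here, so -1 and -3
# suppress "only in file2" and "in both").
only1_comm=$(LC_ALL=C comm -1 -3 "$tmp/file2" "$tmp/file1")

# Unpaired lines from both files: everything except the common lines.
both_join=$(LC_ALL=C join -v1 -v2 "$tmp/file1" "$tmp/file2")

# Unpaired lines from file1 only, which is what the question asks for.
only1_join=$(LC_ALL=C join -v1 "$tmp/file1" "$tmp/file2")

printf 'comm -1 -3:\n%s\n' "$only1_comm"
printf 'join -v1 -v2:\n%s\n' "$both_join"
printf 'join -v1:\n%s\n' "$only1_join"

rm -rf "$tmp"
```

With this data, comm -1 -3 and join -v1 both leave A and C, while join -v1 -v2 also prints D and E from file2.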

Answer 11

To get the file that remains after removing the lines which appear in another file:

comm -23 <(sort bigFile.txt) <(sort smallfile.txt) > diff.txt

How to remove the lines which appear on file B from another file A?
