
Thread: Find duplicated filenames

  1. #1
    Join Date
    Aug 2008
    Location
    Mexico and Sweden
    Posts
    1,356

    Default Find duplicated filenames

    openSUSE 13.1 64bit KDE

    I have a naive user who has unknowingly copied image files from one directory to another instead of moving them. Thousands of them. I need to find a utility that can recursively (from his home directory) identify duplicate filenames and perhaps verify that they are the same size (as some might be shrunk versions of the same image with the same name). I don't need a GUI version. But something that works without needing a number of non-SUSE dependencies.

    Thanks in advance.

  2. #2
    Join Date
    Feb 2009
    Location
    Spain
    Posts
    25,547

    Default Re: Find duplicated filenames

    On 2014-05-28 22:46, ionmich wrote:
    >
    > openSUSE 13.1 64bit KDE
    >
    > I have a naive user who has unknowingly copied image files from one
    > directory to another instead of moving them. Thousands of them. I need
    > to find a utility that can recursively (from his home directory)
    > identify duplicate filenames and perhaps verify that they are the same
    > size (as some might be shrunk versions of the same image with the same
    > name). I don't need a GUI version. But something that works without
    > needing a number of non-SUSE dependencies.


    I was thinking about creating a similar tool myself...

    You need to locate files that may have different names, but the exact
    same content, or the same name, but possibly different content?

    To locate the former, I think the trick is to generate a checksum of the
    tree, and compare checksums.

    --
    Cheers / Saludos,

    Carlos E. R.
    (from 13.1 x86_64 "Bottle" at Telcontar)

  3. #3
    Join Date
    Jul 2008
    Location
    Seattle, WA
    Posts
    17,109

    Default Re: Find duplicated filenames

    On Wed, 28 May 2014 20:46:01 +0000, ionmich wrote:

    > openSUSE 13.1 64bit KDE
    >
    > I have a naive user who has unknowingly copied image files from one
    > directory to another instead of moving them. Thousands of them. I need
    > to find a utility that can recursively (from his home directory)
    > identify duplicate filenames and perhaps verify that they are the same
    > size (as some might be shrunk versions of the same image with the same
    > name). I don't need a GUI version. But something that works without
    > needing a number of non-SUSE dependencies.
    >
    > Thanks in advance.


    Maybe:

    find . -type f | awk '{system("md5sum \047" $0 "\047")}' | sort

    That'll give you a list in the format:

    checksum filename

    which you can then manipulate however you like. Since it's sorted by
    checksum, you can identify the duplicates by checking the line before/
    line after each line.

    You can also limit the scope of the find command in whatever way you
    want, for example by grepping its output (say the image files are all
    png files - you could limit the scope that way).
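    The "line before/line after" check can be left to uniq. This is only a minimal sketch, assuming GNU find, md5sum and uniq; the function name dupes_by_checksum and its optional name pattern are made up for illustration. It matches on content only, so it would not catch a shrunk image that kept its original name.

```shell
# Print groups of files whose contents are identical (same MD5 checksum).
# Assumption: GNU find, md5sum and uniq. dupes_by_checksum is a hypothetical name.
# $1 = directory to scan (default: .), $2 = case-insensitive name pattern (default: *)
dupes_by_checksum() {
    dir=${1:-.}
    pattern=${2:-*}
    find "$dir" -type f -iname "$pattern" -exec md5sum {} + |
        sort |
        # An MD5 sum is 32 hex characters: compare only those and keep
        # every line whose checksum occurs more than once.
        uniq -w 32 --all-repeated=separate
}
```

    Called as dupes_by_checksum . '*.png' it would restrict the scan to png files; groups of duplicates are separated by a blank line.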

    Jim



    --
    Jim Henderson
    openSUSE Forums Administrator
    Forum Use Terms & Conditions at http://tinyurl.com/openSUSE-T-C

  4. #4

    Default Re: Find duplicated filenames

    The fdupes utility might be a solution for you

    fdupes --help should give you a hint already.

    man fdupes is not that long either
    "Unfortunately time is always against us" -- [Morpheus]

    .:https://github.com/Jetchisel:.

  5. #5
    Join Date
    Jun 2008
    Location
    Netherlands
    Posts
    25,925

    Default Re: Find duplicated filenames

    And the diff utility might be part of a solution.
    Henk van Velden

  6. #6
    Join Date
    Feb 2009
    Location
    Spain
    Posts
    25,547

    Default Re: Find duplicated filenames

    On 2014-05-29 03:59, Jim Henderson wrote:

    > Maybe:
    >
    > find . -type f | awk '{system("md5sum \047" $0 "\047")}' | sort
    >
    > That'll give you a list in the format:
    >
    > checksum filename


    Interesting.

    Question: as I don't know awk, what does it do above? I did a quick
    test, and I got:

    Code:
    cer@Telcontar:~/tmp/dupe> find . -type f | \
    awk '{system("md5sum \047" $0 "\047")}' | sort
    2c14ab52aeed6dc0d5163280fe0e108b  ./dos/dos_b
    2c14ab52aeed6dc0d5163280fe0e108b  ./uno/uno_b
    c4f42bbe9668a02501c74ba7fffa2d39  ./dos/dos_a
    c4f42bbe9668a02501c74ba7fffa2d39  ./uno/uno_a
    
    cer@Telcontar:~/tmp/dupe> md5sum */* | sort
    2c14ab52aeed6dc0d5163280fe0e108b  dos/dos_b
    2c14ab52aeed6dc0d5163280fe0e108b  uno/uno_b
    c4f42bbe9668a02501c74ba7fffa2d39  dos/dos_a
    c4f42bbe9668a02501c74ba7fffa2d39  uno/uno_a
    The only visible difference is the absence of the preceding "./" :-?
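    For reference, the awk stage does nothing more than wrap each path in single quotes (\047 is the octal code for ') and run md5sum on it through system(), one process per file. That is why both outputs match apart from the "./" that find prepends; the difference is that find descends to any depth while */* looks exactly one level down. A sketch of the same listing without awk, assuming GNU find (checksum_tree is a made-up name):

```shell
# Checksum every regular file below a directory, sorted by checksum.
# Assumption: GNU find and md5sum. checksum_tree is a hypothetical name.
checksum_tree() {
    # find passes the paths to md5sum directly, so no quoting is needed and
    # names containing a single quote (which break the awk/system() variant)
    # are handled as well.
    find "${1:-.}" -type f -exec md5sum {} + | sort
}
```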

    --
    Cheers / Saludos,

    Carlos E. R.
    (from 13.1 x86_64 "Bottle" at Telcontar)

  7. #7
    Join Date
    Aug 2008
    Location
    Mexico and Sweden
    Posts
    1,356

    Default Re: Find duplicated filenames

    Thanks for all the responses. I believe I have not described my problem accurately. English is not my native language. I'll try again using HYPOTHETICAL examples.

    My Mamiya camera produces 80 Megapixel images which I copy to my hard drive. They have long numerical names and come grouped in variously named directories on the camera. I wanted to e-mail some of them to people who might steal them and sell them. I used a utility that shrank them, but since it overwrites the original file I first copied the 80 MP images to a separate directory and reduced them there. I was left with two directories holding identical filenames but different sizes. Years later I find that I have 30,000 images, some large, some small. I want to delete the small ones. If I had been intelligent I would have given the shrink directories names that included "shrunk" instead of various unrelated names like "DCIM" or "photos". Then I could easily search and delete.

    If I can find duplicated filenames (recursively) I can easily delete the directories that have the shrunk versions using file size since ALL the files in any particular directory will be shrunk.
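    Matching on name rather than content could be sketched as below. It assumes GNU find (for -printf) and file names without tabs or newlines; dupe_names is a made-up function name. It only lists candidates and deletes nothing.

```shell
# List every file whose base name occurs more than once below a directory.
# Output columns (tab-separated): base name, size in bytes, full path.
# Assumption: GNU find; no tabs or newlines in file names.
# dupe_names is a hypothetical name.
dupe_names() {
    tab=$(printf '\t')
    find "${1:-.}" -type f -printf '%f\t%s\t%p\n' |
        # Group by base name, then smallest file first within each name.
        LC_ALL=C sort -t "$tab" -k1,1 -k2,2n |
        awk -F '\t' '
            # Same name as the previous line: print the held first line of
            # the group once, then the current line.
            $1 == prev { if (!shown) print held; print; shown = 1 }
            $1 != prev { shown = 0 }
            { prev = $1; held = $0 }
        '
}
```

    Because each group is sorted by size, the shrunk copy comes first, and its path shows which directory holds the reduced versions.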

    I promise that my naive user will be punished.

  8. #8
    Join Date
    Aug 2008
    Location
    Mexico and Sweden
    Posts
    1,356

    Default Re: Find duplicated filenames

    Originally posted by hendersj:

    Maybe:

    find . -type f | awk '{system("md5sum \047" $0 "\047")}' | sort

    That'll give you a list in the format:

    checksum filename

    which you can then manipulate however you like. Since it's sorted by
    checksum, you can identify the duplicates by checking the line before/
    line after each line.

    You can also limit the scope of the find command in whatever way you
    want, either by grepping (say the image files are all png files - you
    could limit the scope that way)

    Jim



    --
    Jim Henderson
    openSUSE Forums Administrator
    Forum Use Terms & Conditions at http://tinyurl.com/openSUSE-T-C
    I tried it in a test directory of only 24 files. You will note that file DSC03267.JPG exists in the current directory and in the subdirectory TEST. Easy enough to match up, but I have 30,000 files of which I suspect some 20% are duplicated.


    Code:
    > find . -type f | awk '{system("md5sum \047" $0 "\047")}' | sort
    00728dad46f6f061858ea7a77e30639c  ./DSC03245.JPG
    00a74b329e44a689e8cc87c27a42c4fc  ./DSC03267.JPG
    0c6b8d49fb512ad4047695c4f35ec092  ./DSC03161.JPG
    1376835a329b1392706688fb9a99a582  ./DSC03264.JPG
    15dbe6cb79df1f8c9e60c6c9980f26bb  ./DSC03240.JPG
    1e38d157827a95f7e09842b84176155c  ./IMG_0555.MOV
    2f3492e09e0d0f841cd800532db9fdf0  ./DSC03243.JPG
    371d89263d00f755f94bfe28a8b4e61b  ./DSC03233.JPG
    45d8ea75de2c35d1f6518a6bc8731634  ./DSC03260.JPG
    48694f2945928d30cde4e0f7ee1e0d18  ./DSC03242.JPG
    56c8028b4618de57c46366c4f1a7e307  ./DSC03238.JPG
    76e67072b6042d88240136111acd81cd  ./DSC03244.JPG
    77476f3dbd8be041b341a7b719a6e849  ./DSC03162.JPG
    77b9b70a8be7609964ba2cdca9ad1f0a  ./DSC03266.JPG
    7eb8070eaf0a2bdc5e0be23383116260  ./DSC03237.JPG
    861aaabf43a62670aa81742e89b305c8  ./DSC03160.JPG
    88e597364347c6617fb5bc69f26e1f4d  ./DSC03163.JPG
    957ffdbf1861425499ea25af76319aac  ./DSC03246.JPG
    9ad4330d7a67a8d17a6a8d3ee75a1d78  ./DSC03265.JPG
    b3cb5d8f0a8ac59387d09c69ffd99c71  ./DSC03239.JPG
    d46129c8d0a2d2bb765c5250700e8e14  ./DSC03158.JPG
    decf372347ecaa4608d9fb967e49b644  ./DSC03235.JPG
    e0d2252acfe4913da7b1ac81a14cfad8  ./DSC03159.JPG
    f4990f3ce0d73e00f5651b6effea42aa  ./DSC03263.JPG
    f61c5f51a93f51c6c1ae95b33e21b43a  ./TEST/DSC03267.JPG

  9. #9
    Join Date
    Jun 2008
    Location
    San Diego, Ca, USA
    Posts
    12,004
    Blog Entries
    2

    Default Re: Find duplicated filenames

    I'm not clear about the actual difficulty, is it that the User's copies are mixed with other, unique files? Are you concerned that some files might be named the same but actually be altogether different files and not simply a compressed or otherwise altered but still same file?

    Only once you specify exactly what your concerns are can a solution be described...

    My guess is that diff could compare and find identical filenames by itself, but depending on how valuable you consider your pics, and on the possibility that your compressed or otherwise modified files might be faulty (it happens, particularly if the data is old), I'd instead recommend
    - Verifying all your original source files are usable
    - Then create new copies of your original source files in a new location, using whatever method you want to verify integrity. If only the filename has been changed, then you can automate using checksums (eg the code from Jim)

    Then, you'd simply delete the "old" file copies with their naming difficulties.
    Assumes more or less that all existing copies of files are in their own directory separate from the source.

    TSU

  10. #10

    Default Re: Find duplicated filenames

    You can try the comm(1) utility (which was written by the one and only RMS), together with find, md5sum and sort.

    Compare Directory1 and Directory2
    Code:
    comm -13 <(find Directory1 -type f -iname '*.jpg' -execdir md5sum {} + |sort) <(find Directory2 -type f -iname '*.jpg' -execdir md5sum {} + | sort)
    For more info about comm(1) see its manual page, which is also very short.
    Code:
    man comm
    Anyway, you can play with those numbers (the -13) with comm and see what makes you happy. Of course that example is just for jpg files, ignoring case, i.e. JPG, JpG, JPg, jPG, jpg.
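    As one example of playing with those numbers: -12 keeps only the lines common to both sorted inputs, whereas -13 prints the lines found only in the second one. Fed with base names instead of checksums, -12 lists the file names present in both directories. A rough sketch, assuming GNU find; common_names is a made-up name, and temporary files stand in for <( ) so that it also runs in a plain sh.

```shell
# Print the .jpg base names (any case) that exist in both directory trees.
# Assumption: GNU find (-printf). common_names is a hypothetical name.
common_names() {
    list_a=$(mktemp)
    list_b=$(mktemp)
    # comm needs both inputs sorted the same way, hence LC_ALL=C throughout.
    find "$1" -type f -iname '*.jpg' -printf '%f\n' | LC_ALL=C sort -u > "$list_a"
    find "$2" -type f -iname '*.jpg' -printf '%f\n' | LC_ALL=C sort -u > "$list_b"
    # -1 drops names only in the first tree, -2 those only in the second,
    # leaving the names present in both.
    LC_ALL=C comm -12 "$list_a" "$list_b"
    rm -f "$list_a" "$list_b"
}
```

    Names are compared exactly, so b.jpg and b.JPG would count as different files here.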
    "Unfortunately time is always against us" -- [Morpheus]

    .:https://github.com/Jetchisel:.
