Find duplicated filenames

openSUSE 13.1 64bit KDE

I have a naive user who has unknowingly copied image files from one directory to another instead of moving them. Thousands of them. I need to find a utility that can recursively (from his home directory) identify duplicate filenames and perhaps verify that they are the same size (as some might be shrunk versions of the same image with the same name). I don’t need a GUI version, but I do need something that works without pulling in a number of non-SUSE dependencies.

Thanks in advance.

On 2014-05-28 22:46, ionmich wrote:
>
> openSUSE 13.1 64bit KDE
>
> I have a naive user who has unknowingly copied image files from one
> directory to another instead of moving them. Thousands of them. I need
> to find a utility that can recursively (from his home directory)
> identify duplicate filenames and perhaps verify that they are the same
> size (as some might be shrunk versions of the same image with the same
> name). I don’t need a GUI version. But something that works without
> needing a number of non-SUSE dependencies.

I was thinking about creating a similar tool myself…

Do you need to locate files that may have different names but the exact
same content, or files with the same name but possibly different content?

To locate the former, I think the trick is to generate a checksum of
every file in the tree and then compare the checksums.
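A minimal sketch of that idea, assuming GNU coreutils (uniq -w compares only the first N characters, which here covers the 32-character md5 field):

find . -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate

Each group in the output is a set of files with identical content, with the groups separated by blank lines.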


Cheers / Saludos,

Carlos E. R.
(from 13.1 x86_64 “Bottle” at Telcontar)

On Wed, 28 May 2014 20:46:01 +0000, ionmich wrote:

> openSUSE 13.1 64bit KDE
>
> I have a naive user who has unknowingly copied image files from one
> directory to another instead of moving them. Thousands of them. I need
> to find a utility that can recursively (from his home directory)
> identify duplicate filenames and perhaps verify that they are the same
> size (as some might be shrunk versions of the same image with the same
> name). I don’t need a GUI version. But something that works without
> needing a number of non-SUSE dependencies.
>
> Thanks in advance.

Maybe:

find . -type f | awk '{system("md5sum \047" $0 "\047")}' | sort

That’ll give you a list in the format:

checksum filename

which you can then manipulate however you like. Since it’s sorted by
checksum, you can identify the duplicates by checking the line before/
line after each line.

You can also limit the scope of the find command in whatever way you
want, for example by grepping (say the image files are all png files - you
could limit the scope that way, as in the sketch below).
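For instance, a sketch of the png case:

find . -type f | grep -i '\.png$' | awk '{system("md5sum \047" $0 "\047")}' | sort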

Jim


Jim Henderson
openSUSE Forums Administrator
Forum Use Terms & Conditions at http://tinyurl.com/openSUSE-T-C

The fdupes utility might be a solution for you. :-)

fdupes --help should give you a hint already.

**man fdupes** is not that long either. :-)
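A typical invocation might look like this (the path is only an example):

fdupes -r -S /home/user

Here -r recurses into subdirectories and -S prints the size of each set of duplicates; note that fdupes compares file contents, and adding -d makes it prompt you about which copies to delete.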

And the diff utility might be part of a solution.

On 2014-05-29 03:59, Jim Henderson wrote:

> Maybe:
>
> find . -type f | awk '{system("md5sum \047" $0 "\047")}' | sort
>
> That’ll give you a list in the format:
>
> checksum filename

Interesting.

Question: as I don’t know awk, what does it do above? I did a quick
test, and I got:


cer@Telcontar:~/tmp/dupe> find . -type f | \
awk '{system("md5sum \047" $0 "\047")}' | sort
2c14ab52aeed6dc0d5163280fe0e108b  ./dos/dos_b
2c14ab52aeed6dc0d5163280fe0e108b  ./uno/uno_b
c4f42bbe9668a02501c74ba7fffa2d39  ./dos/dos_a
c4f42bbe9668a02501c74ba7fffa2d39  ./uno/uno_a

cer@Telcontar:~/tmp/dupe> md5sum */* | sort
2c14ab52aeed6dc0d5163280fe0e108b  dos/dos_b
2c14ab52aeed6dc0d5163280fe0e108b  uno/uno_b
c4f42bbe9668a02501c74ba7fffa2d39  dos/dos_a
c4f42bbe9668a02501c74ba7fffa2d39  uno/uno_a


The only visible difference is the absence of the preceding “./” :-?


Cheers / Saludos,

Carlos E. R.
(from 13.1 x86_64 “Bottle” at Telcontar)

Thanks for all the responses. I believe I have not described my problem accurately. English is not my native language. I’ll try again using HYPOTHETICAL examples.

My Mamiya camera produces 80 Megapixel images which I copy to my hard drive. They have long numerical names and come grouped in variously named directories on the camera. I wanted to e-mail some of them to people who might steal them and sell them. I used a utility that shrank them, but since it overwrites the original file I copied the 80 MP images to a separate directory and reduced them there. I was left with two directories holding identical filenames but different sizes. Years later I find that I have 30,000 images, some large, some small. I want to delete the small ones. If I had been intelligent I would have given the shrunk-image directories names that included “shrunk” instead of various unrelated names like “DCIM” or “photos”. Then I could easily search and delete them.

If I can find duplicated filenames (recursively), I can easily identify the directories holding the shrunk versions by file size, since ALL the files in any particular directory will be shrunk, and delete them.
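Something along these lines might work (an untested sketch; it assumes GNU find and filenames without newlines or glob characters):

find . -type f -printf '%f\n' | sort | uniq -d |
while read -r name; do
    find . -type f -name "$name" -printf '%s\t%p\n'
done

The first find prints only basenames (%f) and uniq -d keeps those that occur more than once; the second find then prints the size (%s) and path (%p) of every copy, so the directories full of shrunk files stand out.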

I promise that my naive user will be punished.

I tried it in a test directory of only 24 files. You will note that file DSC03267.JPG exists in the current directory and in the subdirectory TEST. Easy enough to match up, but I have 30,000 files of which I suspect some 20% are duplicated.

> find . -type f | awk '{system("md5sum \047" $0 "\047")}' | sort
00728dad46f6f061858ea7a77e30639c  ./DSC03245.JPG
00a74b329e44a689e8cc87c27a42c4fc  ./DSC03267.JPG
0c6b8d49fb512ad4047695c4f35ec092  ./DSC03161.JPG
1376835a329b1392706688fb9a99a582  ./DSC03264.JPG
15dbe6cb79df1f8c9e60c6c9980f26bb  ./DSC03240.JPG
1e38d157827a95f7e09842b84176155c  ./IMG_0555.MOV
2f3492e09e0d0f841cd800532db9fdf0  ./DSC03243.JPG
371d89263d00f755f94bfe28a8b4e61b  ./DSC03233.JPG
45d8ea75de2c35d1f6518a6bc8731634  ./DSC03260.JPG
48694f2945928d30cde4e0f7ee1e0d18  ./DSC03242.JPG
56c8028b4618de57c46366c4f1a7e307  ./DSC03238.JPG
76e67072b6042d88240136111acd81cd  ./DSC03244.JPG
77476f3dbd8be041b341a7b719a6e849  ./DSC03162.JPG
77b9b70a8be7609964ba2cdca9ad1f0a  ./DSC03266.JPG
7eb8070eaf0a2bdc5e0be23383116260  ./DSC03237.JPG
861aaabf43a62670aa81742e89b305c8  ./DSC03160.JPG
88e597364347c6617fb5bc69f26e1f4d  ./DSC03163.JPG
957ffdbf1861425499ea25af76319aac  ./DSC03246.JPG
9ad4330d7a67a8d17a6a8d3ee75a1d78  ./DSC03265.JPG
b3cb5d8f0a8ac59387d09c69ffd99c71  ./DSC03239.JPG
d46129c8d0a2d2bb765c5250700e8e14  ./DSC03158.JPG
decf372347ecaa4608d9fb967e49b644  ./DSC03235.JPG
e0d2252acfe4913da7b1ac81a14cfad8  ./DSC03159.JPG
f4990f3ce0d73e00f5651b6effea42aa  ./DSC03263.JPG
f61c5f51a93f51c6c1ae95b33e21b43a  ./TEST/DSC03267.JPG

I’m not clear about the actual difficulty. Is it that the user’s copies are mixed with other, unique files? Are you concerned that some files might have the same name but actually be altogether different files, and not simply compressed or otherwise altered copies of the same file?

If you can specify <exactly> what your concerns are, then a solution can be described…

My guess is that diff could compare and find identical files by itself,
but,
depending on how valuable you consider your pics and on the possibility that your compressed or otherwise modified files might be faulty (it happens, particularly if the data is old),

I’d instead recommend:

  • Verifying that all your original source files are usable
  • Then creating new copies of your original source files in a new location, using whatever method you want to verify integrity. If only the filename has been changed, then you can automate using checksums (e.g. the code from Jim); see the sketch below.

Then you’d simply delete the “old” file copies with their naming difficulties.
This assumes, more or less, that all existing copies of the files are in their own directory, separate from the source.
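A sketch of the copy-and-verify step (all paths are hypothetical):

cd /home/user/originals
find . -type f -exec md5sum {} + > /tmp/manifest.md5
cp -a /home/user/originals/. /mnt/new/originals/
cd /mnt/new/originals
md5sum -c --quiet /tmp/manifest.md5

md5sum -c re-reads every copied file and compares it against the manifest, so a corrupt copy is reported immediately.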

TSU

You can try the comm(1) utility (which was written by the one and only RMS, lol!) together with find, md5sum and sort.

Compare Directory1 and Directory2

comm -13 <(find Directory1 -type f -iname '*.jpg' -execdir md5sum {} + | sort) <(find Directory2 -type f -iname '*.jpg' -execdir md5sum {} + | sort)

For more info about comm(1), see the manual; it is also very short.

man comm

Anyway, you can play with those numbers with comm and see what will make you happy, lol! Of course, that example is just for jpg files, where -iname does not care about case, i.e. JPG, JpG, JPg, jPG, jpg :-)
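As a quick reminder of what those numbers mean: column 1 is lines unique to the first file, column 2 lines unique to the second, column 3 lines common to both, and each -n suppresses a column. For example (a.txt and b.txt are hypothetical sorted listings):

comm -23 a.txt b.txt    # lines only in a.txt
comm -12 a.txt b.txt    # lines common to both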

In your case, where a subdirectory is involved, the options **-maxdepth**, **-mindepth** and **-prune** from find(1) should be able to help.
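For example, a sketch that skips the TEST subdirectory from your earlier test:

find . -path ./TEST -prune -o -type f -print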

man find

Or, if you don’t like reading long man pages in your shell and you are using KDE:

konqueror man:find &

to be more precise.

konqueror man:/usr/share/man/man1/find.1.gz &

I guess that works for nautilus as well, but I don’t know, to be honest. :-)

On Sat 31 May 2014 04:46:01 AM CDT, jetchisel wrote:

Code:

konqueror man:find &

to be more precise.

Code:

konqueror man:/usr/share/man/man1/find.1.gz &


I guess that works for nautilus as well but i don’t know to be honest :-)

Hi
For GNOME there is yelp, so press Alt+F2 and enter: yelp man:find


Cheers Malcolm °¿° SUSE Knowledge Partner (Linux Counter #276890)
openSUSE 13.1 (Bottle) (x86_64) GNOME 3.10.1 Kernel 3.11.10-11-desktop
If you find this post helpful and are logged into the web interface,
please show your appreciation and click on the star below… Thanks!

Thanks to all of you. Problem solved.