App for deleting duplicate files? (mostly .jpg files)

I just ran photorec on a bad drive to recover files, and I ended up with thousands of files, many of them being duplicates but with different file names. I have moved all .jpg files into one directory. Is there an application that can scan all the files in a directory, identify duplicates regardless of file name, and delete them leaving only one copy?

I do not think there is such an application out of the box. This is typically the sort of task Unix/Linux is very good at because it offers a lot of low level tools and a shell to combine them into an ad hoc solution.

I would probably try to sort them by size, then look only at those with the same size and check whether those (pairs, triples?) really are the same (with diff or compare).

The sorting (starting from an ls listing with appropriate options) would take some trial and error, I assume.

I tried a bit. This one will list files and their sizes in decreasing order of size:

ls -Sl --time-style=+%t | while read -r P L U G S N; do echo "$N $S"; done

Try that one first in your directory. And tell me if it helps and if you want to go on with this way of working. Then we can carry on.
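
If that looks OK, a rough sketch of the whole idea could be something like this (just an untested illustration, assuming GNU find and sort, and file names without newlines; it only compares each file with the previous one of the same size, so it is not a full pairwise check):

#!/bin/bash
# List regular files in the current directory as "size name", sorted by size.
# Files can only be duplicates when they have the same size.
find . -maxdepth 1 -type f -printf '%s %p\n' | sort -n |
while read -r size file; do
    if [ "$size" = "$prevsize" ]; then
        # Same size as the previous file: compare the contents byte by byte.
        cmp -s "$prevfile" "$file" && echo "duplicate: $file == $prevfile"
    fi
    prevsize=$size
    prevfile=$file
done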

Hi,

There is a program called

fdupes

I suggest you first back up the directory where you have those files.

I searched for “man fdupes” in Google. Looks like a solution, but as you suggest: handle with care and work on a copy!

Hi,

Right, as a rule of thumb I always back up files/directories whenever I’m editing them, no matter what or how I edit the files.

Now if you run

fdupes directory

it should show you the duplicate files in groups (a blank line separates each group from the others).

For example

fdupes .

should list the duplicate files in the current working directory (pwd/cwd, or whatever you want to call it).

For a recursive search (note the trailing dot):

fdupes -r .
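
Once the listing looks right, fdupes can also do the deleting for you (check “man fdupes” first; the paths below are only placeholders):

# Delete duplicates, prompting you for which copy to keep in each group
fdupes -r -d /path/to/jpg-directory

# Non-interactive: keep the first file of each group and delete the rest
# (only do this on a backed-up copy!)
fdupes -r -d -N /path/to/jpg-directory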

You can compare text files too using some utility/program.

diff file1 file2

returns nothing if the files are identical.
For the OP, who said that the files are mainly pictures, the cmp utility is an alternative.

cmp file1.jpg file2.jpg

It likewise outputs nothing if the files are identical.

You can add a test if you like:

if cmp file1.jpg file2.jpg; then
  printf '%s\n' 'file1.jpg and file2.jpg are identical'
fi

Another solution would be to hash the files and compare the hashes, but IMO there are specific tools written to do exactly what the OP wants, so better to use the existing tools rather than reinvent the wheel.
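
For completeness, the hashing idea fits in one line (a sketch assuming GNU coreutils; MD5 is plenty for spotting identical files here):

# Hash every .jpg, sort by hash, and print the groups with identical hashes.
# uniq -w32 compares only the first 32 characters, i.e. the MD5 hash itself.
md5sum ./*.jpg | sort | uniq -w32 --all-repeated=separate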

The “better” here is your opinion. It will be easier, yes. But it is a nice little exercise to write a small script for it. rotfl!


Hi
Use fdupes with required options…


fdupes - finds duplicate files in a given set of directories
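
To get an idea of the scale first, the summary option is handy (the path is just a placeholder):

# Show how many duplicate files there are and how much space they occupy
fdupes -r -m /path/to/jpg-directory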


Cheers Malcolm °¿° SUSE Knowledge Partner (Linux Counter #276890)
openSUSE Leap 42.3|GNOME 3.20.2|4.4.114-42-default

Here’s another option

https://github.com/dedupeio/dedupe

You can either download the Python library from PyPI or try one of the listed services that use the library to dedupe your data.
Looks to me like it’ll find your duplicates; then it’s up to you what to do with the result (e.g. delete a copy).

TSU

I looked at all suggestions, all of which were good, and
decided that fdupes would probably be best. So I tried
it on a backed-up directory, and it did exactly what I
needed without any difficulty.

After trying it out I felt confident enough to use it on
the entire recovered file tree - all 3,097,177 files. I
figure that it’s saving me lots of time and labor, and if
anything does go wrong I can always re-run photorec on
the bad drive, as it’ll only take about 10 hours.

Thanks to everyone who responded! This was a good
learning experience for me.

I think you made the right decision. Thank you for reporting back. We are all glad we could be of help.