I have written an application to find, compare and flag or delete duplicate files. This application is used to identify and clean out files with identical contents from about 2 million photos and video clips stored in about 200 directories. Total data is about 4 TB on a 5-disk RAID 5 array with a Reiser file system. The application is written in “C” and compiled with -mtune=native -march=native -O3, and it works perfectly.
The application takes about 8 hours to complete on Suse 11.3 and about 2.5 hours on Gentoo, where the system was compiled with the same compiler flags as the application. Both systems are installed as dual boot on the same machine, which has a 64-bit quad-core Phenom II 3000 processor.
In addition, there is a monster of a spreadsheet that takes several minutes to re-calculate on data changes.
My questions:
1.) Is there a plugin for Yast or a script that will download the source RPMs, compile them with user-set compiler flags, link and replace the installed executables on a fully installed and configured system?
2.) Is there a plugin for Yast or a script that will download the source RPMs for updates, compile them with user-set compiler flags, link and replace the installed executables?
The target is to have a Suse / Gnome system optimised for this hardware.
I have been installing and using Suse since 1995. I am too stupid to get Gnome to work with Gentoo.
All information and recommendations will be highly appreciated.
What do you mean by “identical content”? If it means “identical files”, then checksumming the files which have an identical size could be done in a simple bash script. Here’s what I’m using (to just find the duplicates):
#! /bin/bash
# recursively find duplicate files (same size and same md5sum), optionally with the given extension
# note: paths containing whitespace are not handled, and only neighbouring
# entries in the size-sorted list are compared
dir=$1
ext=$2
if [ "x$*" == "x" ] ; then
    exec echo "syntax : $0 directory [extension]"
elif [ ! -d "$dir" ] ; then
    exec echo "directory $dir not found"
elif [ "x$ext" == "x" ] ; then
    # one "size@path" entry per file, sorted by size
    allfiles=(`find "$dir" -type f -ls | awk '{ print $7"@"$11 }' | sort -n`)
else
    allfiles=(`find "$dir" -type f -name "*.$ext" -ls | awk '{ print $7"@"$11 }' | sort -n`)
fi
i=0
while [ $i -lt ${#allfiles[*]} ] ; do
    j=$(($i+1))
    if [ $j -lt ${#allfiles[*]} ] ; then
        e1=${allfiles[$i]} ; e2=${allfiles[$j]}
        # split at the first "@" so that paths containing "@" stay intact
        f1=${e1#*@} ; f2=${e2#*@}
        s1=${e1%%@*} ; s2=${e2%%@*}
        if [ $s1 -eq $s2 ] ; then
            m1=`md5sum "$f1" | awk '{ print $1}'`
            m2=`md5sum "$f2" | awk '{ print $1}'`
            # report the pair only when the checksums match as well
            if [ "$m1" == "$m2" ] ; then
                echo "$f1 = $f2"
            fi
        fi
    fi
    let i++
done
I don’t know if it’s relevant though … and don’t know how/if it would handle 2 million files (probably not).
If however you compare the photos, then I’m sure you have written a fine program.
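On the question of scale: the script above runs md5sum once per neighbouring same-size pair, so a file can be hashed twice, and it only ever compares neighbours. A rough sketch of a variant that hashes each candidate once and then groups by checksum is below. It assumes GNU find, sort, awk, xargs, md5sum and uniq, and file names without newlines; the function name find_dups is made up for illustration, and it has not been tried on anything close to 2 million files.

```shell
#!/bin/bash
# Sketch only: list groups of files with identical size AND identical md5sum.
# Assumes GNU tools; file names must not contain newlines.
find_dups() {
    local dir=$1
    # 1. one "size<TAB>path" line per regular file
    find "$dir" -type f -printf '%s\t%p\n' |
    # 2. sort numerically so that equal sizes become adjacent
    sort -n |
    # 3. pass on only the paths whose size occurs more than once
    awk -F'\t' '
        BEGIN { prev = -1 }
        { size = $1; path = substr($0, length($1) + 2) }
        size == prev { if (held != "") { print held; held = "" } print path; next }
        { prev = size; held = path }' |
    # 4. checksum each remaining candidate exactly once
    tr '\n' '\0' | xargs -0 -r md5sum |
    # 5. sort by checksum and print only lines sharing the 32-char md5 prefix
    sort | uniq -w32 --all-repeated=separate
}
```

Called as find_dups /path/to/photos, it prints each group of identical files as consecutive "checksum  path" lines, with a blank line between groups.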
Hans Linux wrote:
> I have written an application to find, compare and flag or delete
> duplicate files. This application is used to identify and clean out
> files with identical contents from about 2 million photos and video
> clips stored in about 200 directories. Total data is about 4 TB on a 5
> disk RAID 5 array with Reiser file system. The application is written
> in “C” and compiled with -mtune=native -march=native -O3 and works
> perfectly.
> It takes about 8 hours for the application to complete when run on
> Suse 11.3 and about 2.5 hours to complete when run on
> Gentoo, compiled with the same compiler flags as the application. The
> systems are installed as dual boot on a 64 bit quad Phenom II 3000
> processor.
I’m not sure how the questions you ask below affect this? Presumably you
are already compiling your application on the opensuse box?
Have you profiled your application to see where the time is going?
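One cheap first check, before reaching for a real profiler, is to compare elapsed time with CPU time. The sketch below uses bash's time keyword and its TIMEFORMAT variable; the helper name time_split is hypothetical. If wall comes out much larger than user plus sys, the run is mostly waiting (typically on disk I/O) rather than computing, and the compiler flags used for the application itself would matter less.

```shell
#!/bin/bash
# Sketch only: run a command and report wall-clock, user and system seconds.
# The report is written to stderr by bash's "time" keyword.
time_split() {
    # %R = elapsed (wall) seconds, %U = user CPU seconds, %S = system CPU seconds
    local TIMEFORMAT='wall=%R user=%U sys=%S'
    time "$@"
}
```

It would be called as time_split followed by the usual command line of the application, once under each installation, so that the two wall/user/sys splits can be compared.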
> In addition, there is a monster of a spreadsheet that takes several
> minutes to re-calculate on data changes.
>
> My question:
> 1.) Is there a plugin for Yast or a script that will download the
> source RPMs, compile them with user-set compiler flags, link and replace the
> installed executables on a fully installed and configured system?
I don’t know. I guess the opensuse preferred answer is to upload the
source to the build service and compile it there.
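For a single package, the manual route can at least be sketched. The helper below only prints the commands and runs nothing; its name, rebuild_plan, is made up. It assumes that zypper source-install unpacks the source package under /usr/src/packages, that the spec file is named after the package, and that the spec honours the optflags macro. None of that holds for every package, so treat it as a starting point rather than a recipe.

```shell
#!/bin/bash
# Sketch only: print (do not run) the commands to rebuild one package
# with custom compiler flags. Paths and the spec file name are assumptions.
rebuild_plan() {
    local pkg=$1
    local flags=${2:-"-O3 -march=native -mtune=native"}
    # fetch the source RPM and its build dependencies
    echo "zypper source-install $pkg"
    # build binary RPMs, overriding the distribution's default optflags
    echo "rpmbuild -bb --define 'optflags $flags' /usr/src/packages/SPECS/$pkg.spec"
}
```

Running rebuild_plan gzip prints the two commands for gzip. The resulting RPMs would still have to be installed over the stock ones, and a later online update would replace them again unless the package is locked or rebuilt each time.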
The application has been compiled and profiled on the Suse box and on the same box running Gentoo built with CFLAGS=-O2. There are no significant differences in performance between the two.
With a full recompile of the Gentoo system and the application using CFLAGS=-O3 -mtune=native -march=native, performance is about 3.5 times faster than on either the Suse box or the Gentoo box compiled with CFLAGS=-O2.
The cause seems to be the optimisation of the kernel, I/O system, RAID system, memory access and other unknowns.
My present fix is to run the fully optimised Gentoo and the application as a virtual machine with direct access to the hard drive array. It is only slightly slower than Gentoo and the application running directly on the hardware.