Dr. Quadragon ❌

I'm picking apart my client's codebase for refactoring. As expected, there are a lot of duplicates.

But duplicates are easy: fdupes lists them quickly. The real problem is near-duplicates! Files that differ by a couple of lines but are otherwise identical. How do I find them? Is there any tool for that?

8 comments
hkc (Carbonated)

@drq czkawka works great for images and some other file types, not sure about text though

hkc (Carbonated)

@drq just checked, nothing about text similarity checks. I guess you could try comparing each file against the others with `diff`, but that's gonna take a while

Or you can just throw files away until it breaks :blobfoxgooglymlem:

Dr. Quadragon ❌

@hatkidchan I'm not comparing 4587 files manually unless I'm paid a VERY large amount of money.

Moana Rijndael 🍍🍕

@drq
0) generate all possible pairs of files
1) generate a diff for each pair
2) if the diff has only a few lines, print a warning along with the diff
3) otherwise, move on to the next pair
of course this should be done by a script (a rough sketch follows below this post)

:ageblobcat:

@hatkidchan
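
A minimal Python sketch of that pairwise-diff approach, assuming plain-text files under a single directory; the 10-line cutoff and the output wording are arbitrary choices for illustration, not anything the thread settled on:

```python
#!/usr/bin/env python3
# Sketch of the pairwise-diff idea: list files, diff every pair,
# and flag pairs whose diff is small but non-empty.
import sys
import difflib
from pathlib import Path
from itertools import combinations

MAX_DIFF_LINES = 10  # "small diff" cutoff; tune to taste

def read_lines(path: Path) -> list[str]:
    return path.read_text(errors="replace").splitlines()

def main(root: str) -> None:
    files = sorted(p for p in Path(root).rglob("*") if p.is_file())
    texts = {p: read_lines(p) for p in files}          # read each file once
    for a, b in combinations(files, 2):                # 0) all possible pairs
        diff = list(difflib.unified_diff(              # 1) generate a diff
            texts[a], texts[b],
            fromfile=str(a), tofile=str(b), lineterm=""))
        changed = [l for l in diff
                   if l.startswith(("+", "-"))
                   and not l.startswith(("+++", "---"))]
        if 0 < len(changed) <= MAX_DIFF_LINES:         # 2) small diff -> warn
            print(f"near-duplicate? {a} <-> {b} ({len(changed)} changed lines)")
        # 3) otherwise, move on to the next pair

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else ".")
```

Exact duplicates produce an empty diff and are skipped here, since fdupes already covers those.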

Moana Rijndael 🍍🍕

@drq do it in parallel then, the shell has tools for that :ageblobcat:
Anyway, tools for finding duplicates work somehow???

@hatkidchan
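
Moana means shell tools (GNU parallel and the like); as a rough sketch under the same assumptions as above, here is the same pairwise check fanned out with Python's own process pool instead:

```python
#!/usr/bin/env python3
# Same pairwise check as above, but spread across CPU cores with a
# process pool. Files are re-read inside the workers so only the
# paths have to be pickled and sent to them.
import sys
import difflib
from pathlib import Path
from itertools import combinations
from concurrent.futures import ProcessPoolExecutor

MAX_DIFF_LINES = 10  # same arbitrary cutoff as before

def read_lines(path: Path) -> list[str]:
    return path.read_text(errors="replace").splitlines()

def compare(pair):
    a, b = pair
    diff = list(difflib.unified_diff(read_lines(a), read_lines(b), lineterm=""))
    changed = [l for l in diff
               if l.startswith(("+", "-")) and not l.startswith(("+++", "---"))]
    if 0 < len(changed) <= MAX_DIFF_LINES:
        return f"near-duplicate? {a} <-> {b} ({len(changed)} changed lines)"
    return None

def main(root: str) -> None:
    files = sorted(p for p in Path(root).rglob("*") if p.is_file())
    with ProcessPoolExecutor() as pool:                # one worker per core
        for hit in pool.map(compare, combinations(files, 2), chunksize=256):
            if hit:
                print(hit)

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else ".")
```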

Enigma Voice

@drq
4587!/(2*4585!), actually, which works out to 2293*4587 ≈ 10.5 million.
Still a lot, yes.

@mo @hatkidchan
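
For what it's worth, that count checks out; a quick way to verify it:

```python
import math

files = 4587
pairs = math.comb(files, 2)   # 4587! / (2! * 4585!)
print(pairs)                  # 10517991
assert pairs == 2293 * 4587
```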
