Dr. Quadragon ❌

I'm picking apart my client's codebase for refactoring. As expected, there are a lot of duplicates.

But duplicates are easy: fdupes lists them quickly. The real problem is near-duplicates! Files that differ by a couple of lines but are otherwise identical. How do I find them? Is there any tool for that?

8 comments
hkc (Carbonated)

@drq czkawka works great for images and some other file types, not sure about text though

hkc (Carbonated)

@drq just checked, nothing about text similarity checks. I guess you could try comparing each file against the others with `diff`, but that's gonna take a while

Or you can just throw files away until it breaks :blobfoxgooglymlem:

Dr. Quadragon ❌

@hatkidchan I'm not comparing 4587 files manually unless I'm paid a VERY large amount of money.

Moana Rijndael 🍍🍕

@drq
0) generate all possible pairs of files
1) generate a diff for each pair
2) if the diff has only a few lines, print a warning along with the diff
3) otherwise, move on to the next pair
of course this should be done by a script (a rough sketch follows below this post)

:ageblobcat:

@hatkidchan
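
A minimal Python sketch of that pairwise-diff approach, assuming plain-text files under a single directory; the 10-line cutoff and the output wording are arbitrary choices for illustration, not anything the thread settled on:

```python
#!/usr/bin/env python3
# Sketch of the pairwise-diff idea: list files, diff every pair,
# and flag pairs whose diff is small but non-empty.
import sys
import difflib
from pathlib import Path
from itertools import combinations

MAX_DIFF_LINES = 10  # "small diff" cutoff; tune to taste

def read_lines(path: Path) -> list[str]:
    return path.read_text(errors="replace").splitlines()

def main(root: str) -> None:
    files = sorted(p for p in Path(root).rglob("*") if p.is_file())
    texts = {p: read_lines(p) for p in files}          # read each file once
    for a, b in combinations(files, 2):                # 0) all possible pairs
        diff = list(difflib.unified_diff(              # 1) generate a diff
            texts[a], texts[b],
            fromfile=str(a), tofile=str(b), lineterm=""))
        changed = [l for l in diff
                   if l.startswith(("+", "-"))
                   and not l.startswith(("+++", "---"))]
        if 0 < len(changed) <= MAX_DIFF_LINES:         # 2) small diff -> warn
            print(f"near-duplicate? {a} <-> {b} ({len(changed)} changed lines)")
        # 3) otherwise, move on to the next pair

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else ".")
```

Exact duplicates produce an empty diff and are skipped here, since fdupes already covers those.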

Moana Rijndael 🍍🍕

@drq do it in parallel then, the shell has tools for that :ageblobcat:
Anyway, tools for finding duplicates work somehow???

@hatkidchan
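
Moana means shell tools (GNU parallel and the like); as a rough sketch under the same assumptions as above, here is the same pairwise check fanned out with Python's own process pool instead:

```python
#!/usr/bin/env python3
# Same pairwise check as above, but spread across CPU cores with a
# process pool. Files are re-read inside the workers so only the
# paths have to be pickled and sent to them.
import sys
import difflib
from pathlib import Path
from itertools import combinations
from concurrent.futures import ProcessPoolExecutor

MAX_DIFF_LINES = 10  # same arbitrary cutoff as before

def read_lines(path: Path) -> list[str]:
    return path.read_text(errors="replace").splitlines()

def compare(pair):
    a, b = pair
    diff = list(difflib.unified_diff(read_lines(a), read_lines(b), lineterm=""))
    changed = [l for l in diff
               if l.startswith(("+", "-")) and not l.startswith(("+++", "---"))]
    if 0 < len(changed) <= MAX_DIFF_LINES:
        return f"near-duplicate? {a} <-> {b} ({len(changed)} changed lines)"
    return None

def main(root: str) -> None:
    files = sorted(p for p in Path(root).rglob("*") if p.is_file())
    with ProcessPoolExecutor() as pool:                # one worker per core
        for hit in pool.map(compare, combinations(files, 2), chunksize=256):
            if hit:
                print(hit)

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else ".")
```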

Enigma Voice

@drq
4587!/(2*4585!), actually, which works out to 2293*4587 ≈ 10.5 million.
Still a lot, yes.

@mo @hatkidchan
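
For what it's worth, that count checks out; a quick way to verify it:

```python
import math

files = 4587
pairs = math.comb(files, 2)   # 4587! / (2! * 4585!)
print(pairs)                  # 10517991
assert pairs == 2293 * 4587
```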
