CD-HIT on amino acid (aa) protein sequences rather than on nucleotide gene sequences
The goal of the clustering is to have a shared table of functions and their abundance across samples.
The clustering is currently done on nucleotide gene sequences. It could be interesting to do it on amino acid sequences instead. This would likely be faster (a protein sequence is a third the length of its coding sequence) and would allow clustering of proteins that have similar aa sequences, and so presumably similar functions, even if their nucleotide sequences have diverged.
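To illustrate the divergence argument, here is a minimal sketch in Python. The two gene sequences are made up for the example, and the identity function is a plain ungapped position-by-position comparison, which is an assumption for illustration and not the alignment-based identity that CD-HIT computes. The two genes differ only at synonymous codon positions, so they fall well below a 95% nucleotide threshold while their translations are identical.

```python
# Standard genetic code (NCBI translation table 1), built from the usual TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: aa
    for (a, b, c), aa in zip(
        ((a, b, c) for a in BASES for b in BASES for c in BASES), AMINO_ACIDS
    )
}


def translate(gene: str) -> str:
    """Translate an in-frame coding sequence; stops at the first stop codon."""
    protein = []
    for i in range(0, len(gene) - len(gene) % 3, 3):
        aa = CODON_TABLE[gene[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def ungapped_identity(seq_a: str, seq_b: str) -> float:
    """Fraction of matching positions between two equal-length sequences.

    Simplification for illustration only: no alignment, no gaps.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("this sketch only compares sequences of equal length")
    matches = sum(x == y for x, y in zip(seq_a, seq_b))
    return matches / len(seq_a)


# Hypothetical genes: same protein, different synonymous codons.
gene_a = "ATGGCTAAAGAACTG"
gene_b = "ATGGCCAAGGAGTTA"

nt_identity = ungapped_identity(gene_a, gene_b)  # 10 of 15 positions match
aa_identity = ungapped_identity(translate(gene_a), translate(gene_b))

print(f"nucleotide identity: {nt_identity:.2f}")
print(f"amino acid identity: {aa_identity:.2f}")
```

With a 95% threshold these two genes would end up in separate nucleotide clusters but in the same aa cluster, which is the behaviour we want for a shared function table.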
We would still use a strict identity threshold (>95%?) to cluster the aa sequences, since the main goal is a function table shared between samples, not protein families.
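As a rough sketch of what the protein run could look like, the helper below only assembles a `cd-hit` command line (the protein program, as opposed to `cd-hit-est` for nucleotides) without executing it. The function name, file names, thread count and the 0.95 default are placeholders, not settings taken from the current pipeline. The word size `-n 5` is the value CD-HIT documents for protein thresholds between 0.7 and 1.0.

```python
import shlex


def build_cdhit_command(faa_in, out_prefix, identity=0.95, threads=4):
    """Assemble a cd-hit (protein) command as an argument list.

    Hypothetical helper for illustration; all defaults are placeholders.
    """
    if not 0.7 <= identity <= 1.0:
        # -n 5 is only appropriate for thresholds in the 0.7-1.0 range.
        raise ValueError("this sketch only covers identity thresholds from 0.7 to 1.0")
    return [
        "cd-hit",
        "-i", faa_in,          # input protein FASTA
        "-o", out_prefix,      # output representatives (+ out_prefix.clstr)
        "-c", str(identity),   # sequence identity threshold
        "-n", "5",             # word size suited to thresholds 0.7-1.0
        "-d", "0",             # keep sequence names up to the first space in .clstr
        "-T", str(threads),    # number of threads
    ]


# Placeholder file names.
cmd = build_cdhit_command("all_samples_proteins.faa", "proteins_c95")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the input file exists and `cd-hit` is on the PATH.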