BC40 Download


We construct the BC40 dataset (release date: 2020-07-28), in which every entry is publicly available from the PDB. Each week, the PDB clusters all protein chains with MMseqs2 at 30%, 40%, …, 90%, 95%, and 100% sequence identity to remove redundancy. BC40 is built from the clustering at the 40% cutoff, so its proteins share no more than 40% sequence identity with one another. We additionally remove any protein that shares more than 25% sequence identity with our CullPDB dataset. The first line of each MSA file is the original protein sequence.
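Because the first line of each MSA file is the original protein sequence, the query can be recovered without any extra index file. The following is a minimal sketch, assuming each MSA file is plain text with one aligned sequence per line; the function names and the file name `example.aln` are hypothetical and not part of the BC40 release.

```python
from typing import List


def read_msa_rows(path: str) -> List[str]:
    """Return all non-empty lines of an MSA file, in file order.

    Assumption: one aligned sequence per line, plain text.
    """
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def query_sequence(path: str) -> str:
    """Return the original protein sequence, i.e. the first row of the MSA."""
    rows = read_msa_rows(path)
    if not rows:
        raise ValueError(f"{path} contains no sequences")
    return rows[0]
```

For example, `query_sequence("example.aln")` would return the original sequence, and `read_msa_rows("example.aln")[1:]` the remaining aligned rows. If the files turn out to use a header-based format such as FASTA or A3M, the parsing would need to be adapted.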