From a database containing the published HMG protein sequences, we constructed an alignment of the HMG box functional domain based on sequence identity. Because of the large number of sequences (more than 250) and the short length of this domain, several data sets were used. This analysis reveals that the HMG box superfamily can be separated into two clearly defined subfamilies: (i) the SOX/MATA/TCF family, which clusters proteins able to bind to specific DNA sequences; and (ii) the HMG/UBF family, which clusters members that bind DNA non-specifically. The appearance and diversification of these subfamilies largely predate the split between the yeast and metazoan lineages. Particular emphasis was placed on the analysis of the SOX subfamily. For the first time, our analysis clearly identifies the SOX subfamily as structured in six groups of genes, named SOX5/6, SRY, SOX2/3, SOX14, SOX4/22, and SOX9/18. The validity of these gene clusters is confirmed by their functional characteristics and by their sequences outside the HMG box. In sharp contrast, there are only a few robust branching patterns inside the HMG/UBF family, probably because this family diversified much earlier than the SOX family. The only consistent groups that can be detected by our analysis are HMG box 1, vertebrate HMG box 2, insect SSRP, and plant HMG. The various UBF boxes cannot be clustered together, and their diversification appears to be extremely ancient, probably predating the appearance of metazoans.
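
As an illustration of what comparing aligned domains by sequence identity can look like in practice, the following is a minimal Python sketch. It is not the procedure used in this study, which the text above does not describe in detail. The function names (`percent_identity`, `identity_table`), the convention of skipping gapped columns, and the toy sequences are all assumptions introduced for the example; the sequences are invented and are not real HMG box sequences.

```python
from itertools import combinations

GAP = "-"


def percent_identity(a: str, b: str) -> float:
    """Percent identity between two rows of the same alignment.

    Assumption: columns where either row has a gap are skipped.
    Other conventions (e.g. dividing by the full alignment length) exist.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == GAP or y == GAP:
            continue  # ignore gapped columns
        compared += 1
        if x == y:
            matches += 1
    # Return 0.0 when no ungapped column is shared, to avoid dividing by zero.
    return 100.0 * matches / compared if compared else 0.0


def identity_table(alignment: dict) -> dict:
    """Pairwise percent identity for every pair of named aligned rows."""
    return {
        (name1, name2): percent_identity(alignment[name1], alignment[name2])
        for name1, name2 in combinations(alignment, 2)
    }


# Hypothetical toy rows, invented for illustration only.
toy_alignment = {
    "seq_a": "MKRPMNAF",
    "seq_b": "MKRPMNAF",
    "seq_c": "MKKP-NAY",
}

table = identity_table(toy_alignment)
for pair, value in table.items():
    print(pair, round(value, 1))
```

A table of this kind is only a starting point: with a domain this short, small differences in identity rest on very few positions, which is consistent with the need for several data sets noted above.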