Mulan: A Java Library for Multi-Label Learning

Datasets

The following multi-label datasets are formatted for use with Mulan. A table of dataset statistics comes first, followed by the files and their sources.

Statistics

name domain instances attributes(nominal) attributes(numeric) labels cardinality density distinct
bibtex text 7395 1836 0 159 2.402 0.015 2856
birds audio 645 2 258 19 1.014 0.053 133
bookmarks text 87856 2150 0 208 2.028 0.010 18716
CAL500 music 502 0 68 174 26.044 0.150 502
corel5k images 5000 499 0 374 3.522 0.009 3175
corel16k (10 samples) images 13811±87 500 0 161±9 2.867±0.033 0.018±0.001 4937±158
delicious text (web) 16105 500 0 983 19.020 0.019 15806
emotions music 593 0 72 6 1.869 0.311 27
enron text 1702 1001 0 53 3.378 0.064 753
EUR-Lex (directory codes) text 19348 0 5000 412 1.292 0.003 1615
EUR-Lex (subject matters) text 19348 0 5000 201 2.213 0.011 2504
EUR-Lex (eurovoc descriptors) text 19348 0 5000 3993 5.310 0.001 16467
flags images (toy) 194 9 10 7 3.392 0.485 54
genbase biology 662 1186 0 27 1.252 0.046 32
mediamill video 43907 0 120 101 4.376 0.043 6555
medical text 978 1449 0 45 1.245 0.028 94
rcv1v2 (subset1) text 6000 0 47236 101 2.880 0.029 1028
rcv1v2 (subset2) text 6000 0 47236 101 2.634 0.026 954
rcv1v2 (subset3) text 6000 0 47236 101 2.614 0.026 939
rcv1v2 (subset4) text 6000 0 47229 101 2.484 0.025 816
rcv1v2 (subset5) text 6000 0 47235 101 2.642 0.026 946
scene image 2407 0 294 6 1.074 0.179 15
tmc2007 text 28596 49060 0 22 2.158 0.098 1341
yahoo text 5423±1259 0 32786±7990 31±6 1.481±0.154 0.051±0.012 321±139
yeast biology 2417 0 103 14 4.237 0.303 198
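
The last three columns can be recomputed from a dataset's label matrix. The sketch below assumes the usual definitions: cardinality is the average number of labels per instance, density is cardinality divided by the number of labels, and distinct is the number of different label sets that occur. These definitions are consistent with the table (for emotions, 1.869 / 6 ≈ 0.311). The class `LabelStats` and its methods are hypothetical helpers written for illustration, not part of the Mulan API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Mulan: computes the table's last three
// columns from a boolean label matrix (rows = instances, columns = labels).
final class LabelStats {

    /** Average number of relevant labels per instance. */
    static double cardinality(boolean[][] labels) {
        long relevant = 0;
        for (boolean[] row : labels) {
            for (boolean isRelevant : row) {
                if (isRelevant) {
                    relevant++;
                }
            }
        }
        return (double) relevant / labels.length;
    }

    /** Cardinality divided by the total number of labels. */
    static double density(boolean[][] labels) {
        return cardinality(labels) / labels[0].length;
    }

    /** Number of different label combinations that occur in the data. */
    static int distinctLabelSets(boolean[][] labels) {
        Set<String> seen = new HashSet<>();
        for (boolean[] row : labels) {
            seen.add(Arrays.toString(row));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Toy data: 4 instances, 3 labels (made up for this example).
        boolean[][] labels = {
            {true, false, true},
            {true, false, true},
            {false, true, false},
            {false, false, false}
        };
        System.out.printf(Locale.ROOT, "cardinality=%.3f density=%.3f distinct=%d%n",
                cardinality(labels), density(labels), distinctLabelSets(labels));
    }
}
```

For the toy matrix, the five relevant labels over four instances give a cardinality of 1.25, a density of about 0.417, and three distinct label sets.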

Files and Sources

  • birds
    files: Train and test set along with the XML header [birds.rar]
    source: F. Briggs, Yonghong Huang, R. Raich, K. Eftaxias, Zhong Lei, W. Cukierski, S. Hadley, A. Hadley, M. Betts, X. Fern, J. Irvine, L. Neal, A. Thomas, G. Fodor, G. Tsoumakas, Hong Wei Ng, Thi Ngoc Tho Nguyen, H. Huttunen, P. Ruusuvuori, T. Manninen, A. Diment, T. Virtanen, J. Marzat, J. Defretin, D. Callender, C. Hurlburt, K. Larrey, M. Milakov. "The 9th annual MLSP competition: New methods for acoustic classification of multiple simultaneous bird species in a noisy environment", in proc. 2013 IEEE International Workshop on Machine Learning for Signal Processing (MLSP).
  • CAL500
    files: Dataset along with the XML header [CAL500.rar]
    source: Douglas Turnbull, Luke Barrington, David Torres and Gert Lanckriet, Semantic Annotation and Retrieval of Music and Sound Effects, IEEE Transactions on Audio, Speech and Language Processing 16(2), pp. 467-476, 2008.
    More information: http://cosmal.ucsd.edu/cal/projects/AnnRet/
  • corel5k
    files: Train and test sets along with their union and the XML header [corel5k.rar] [corel5k-sparse.rar]
    source: Pinar Duygulu, Kobus Barnard, Nando de Freitas, and David Forsyth, Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary, 7th European Conference on Computer Vision, pp. IV:97-112, 2002.
    More information: http://kobus.ca/research/data/eccv_2002/
  • corel16k
    files: 10 different samples containing the train, test and test3 disjoint sets along with their union and the XML header [corel16k.rar]
    source: "Matching Words and Pictures", by Kobus Barnard, Pinar Duygulu, Nando de Freitas, David Forsyth, David Blei, and Michael I. Jordan, Journal of Machine Learning Research, Vol 3, pp 1107-1135.
    More information: http://kobus.ca/research/data/jmlr_2003/
  • emotions
    files: Train and test sets along with their union and the XML header [emotions.rar]
    source: K. Trohidis, G. Tsoumakas, G. Kalliris, I. Vlahavas. "Multilabel Classification of Music into Emotions". Proc. 2008 International Conference on Music Information Retrieval (ISMIR 2008), pp. 325-330, Philadelphia, PA, USA, 2008.
  • EUR-Lex
    files: Cross-validation splits of the TF-IDF representation of the documents, with the 5000 most frequent features selected, as used in the experiments and therefore usable for a direct comparison. XML header included. [eurlex-directory-codes.rar] [eurlex-subject-matters.rar] [eurlex-eurovoc-descriptors.rar]
    source: Eneldo Loza Mencía and Johannes Fürnkranz. Efficient pairwise multilabel classification for large-scale problems in the legal domain. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD-2008), Part II, pages 50-65, Antwerp, Belgium, 2008. Springer-Verlag.
    More information: Knowledge Engineering Group, TU Darmstadt
  • flags
    files: Train and test sets along with their union, the XML header and a readme file [flags.zip]
    source: The dataset was used for Multi-label Classification in "Gonçalves, Eduardo Corrêa, Alexandre Plastino, and Alex A. Freitas. A Genetic Algorithm for Optimizing the Label Ordering in Multi-Label Classifier Chains. ICTAI 2013." The original data can be found at the UCI repository.
  • genbase
    files: Train and test sets along with their union and the XML header [genbase.rar]
    source: S. Diplaris, G. Tsoumakas, P. Mitkas and I. Vlahavas. Protein Classification with Multiple Algorithms, Proc. 10th Panhellenic Conference on Informatics (PCI 2005), pp. 448-456, Volos, Greece, November 2005.
    note: The first attribute in this dataset is just an identification of the instance. There are several attributes with constant values (yes/no).
  • mediamill
    files: Train and test sets along with their union and the XML header [mediamill.rar]
    source: C.G.M. Snoek, M. Worring, J.C. van Gemert, J.-M. Geusebroek, and A.W.M. Smeulders. The Challenge Problem for Automated Detection of 101 Semantic Concepts in Multimedia. In Proceedings of ACM Multimedia, pp. 421-430, Santa Barbara, USA, October 2006.
    related URL: The Mediamill challenge
  • medical
    files: Train and test sets along with their union and the XML header [medical.rar]
    source: John P. Pestian, Christopher Brew, Pawel Matykiewicz, D. J. Hovermale, Neil Johnson, K. Bretonnel Cohen, and Wodzislaw Duch. 2007. A shared task involving multi-label classification of clinical free text. In Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing (BioNLP '07). Association for Computational Linguistics, Stroudsburg, PA, USA, 97-104.
  • scene
    files: Train and test sets along with their union and the XML header [scene.rar]
    source: M.R. Boutell, J. Luo, X. Shen, and C.M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757-1771, 2004.
  • tmc2007
    files (sparse): Train and test sets along with their union and the XML header [tmc2007.rar]
    A reduced version of this dataset, with the top 500 features retained after feature selection, is also available:
    files: [tmc2007-500.rar]
    source: A. Srivastava, B. Zane-Ulman: Discovering recurring anomalies in text reports regarding complex space systems. In: 2005 IEEE Aerospace Conference. (2005)
    related URL: SIAM Text Mining Workshop 2007
  • yahoo
    files: 11 train and test sets along with their union and the XML header [yahoo.rar]
    source: N. Ueda, K. Saito: Parametric mixture models for multi-labeled text. In Neural Information Processing Systems 15 (NIPS 15), MIT Press, pp. 737-744, 2002.
  • yeast
    files: Train and test sets along with their union and the XML header [yeast.rar]
    source: A. Elisseeff and J. Weston. A kernel method for multi-labelled classification. In T.G. Dietterich, S. Becker, and Z. Ghahramani, (eds), Advances in Neural Information Processing Systems 14, 2001.
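
Each archive above ships an XML header next to the data. The header tells Mulan which attributes of the data file are labels. As a rough sketch, the label names can be read with the standard Java XML parser. This assumes the header is a `labels` root element containing one `label` element per label with a `name` attribute; check the files in the archive you download, since the exact layout is an assumption here. The class `LabelHeader` and the label names in the sample are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Mulan: lists the label names declared
// in an XML header of the assumed form <labels><label name="..."/></labels>.
final class LabelHeader {

    /** Returns the value of the "name" attribute of every label element, in document order. */
    static List<String> labelNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("label");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(((Element) nodes.item(i)).getAttribute("name"));
            }
            return names;
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers need no checked exceptions.
            throw new IllegalArgumentException("Could not parse label header", e);
        }
    }

    public static void main(String[] args) {
        // Made-up header with two labels; real headers list the dataset's labels.
        String header =
                "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
              + "<labels xmlns=\"http://mulan.sourceforge.net/labels\">"
              + "<label name=\"label-a\"></label>"
              + "<label name=\"label-b\"></label>"
              + "</labels>";
        System.out.println(labelNames(header));
    }
}
```

The number of names returned should match the labels column of the statistics table for the corresponding dataset.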
