Mulan: A Java Library for Multi-Label Learning

Datasets

The following multi-label datasets are properly formatted for use with Mulan. We first provide a table of dataset statistics, followed by the files and their sources.

Statistics

name | domain | instances | nominal attributes | numeric attributes | labels | cardinality | density | distinct
bibtex | text | 7395 | 1836 | 0 | 159 | 2.402 | 0.015 | 2856
bookmarks | text | 87856 | 2150 | 0 | 208 | 2.028 | 0.010 | 18716
CAL500 | music | 502 | 0 | 68 | 174 | 26.044 | 0.150 | 502
corel5k | images | 5000 | 499 | 0 | 374 | 3.522 | 0.009 | 3175
corel16k (10 samples) | images | 13811±87 | 500 | 0 | 161±9 | 2.867±0.033 | 0.018±0.001 | 4937±158
delicious | text (web) | 16105 | 500 | 0 | 983 | 19.020 | 0.019 | 15806
emotions | music | 593 | 0 | 72 | 6 | 1.869 | 0.311 | 27
enron | text | 1702 | 1001 | 0 | 53 | 3.378 | 0.064 | 753
EUR-Lex (directory codes) | text | 19348 | 0 | 5000 | 412 | 1.292 | 0.003 | 1615
EUR-Lex (subject matters) | text | 19348 | 0 | 5000 | 201 | 2.213 | 0.011 | 2504
EUR-Lex (eurovoc descriptors) | text | 19348 | 0 | 5000 | 3993 | 5.310 | 0.001 | 16467
genbase | biology | 662 | 1186 | 0 | 27 | 1.252 | 0.046 | 32
mediamill | video | 43907 | 0 | 120 | 101 | 4.376 | 0.043 | 6555
medical | text | 978 | 1449 | 0 | 45 | 1.245 | 0.028 | 94
rcv1v2 (subset1) | text | 6000 | 0 | 47236 | 101 | 2.880 | 0.029 | 1028
rcv1v2 (subset2) | text | 6000 | 0 | 47236 | 101 | 2.634 | 0.026 | 954
rcv1v2 (subset3) | text | 6000 | 0 | 47236 | 101 | 2.614 | 0.026 | 939
rcv1v2 (subset4) | text | 6000 | 0 | 47229 | 101 | 2.484 | 0.025 | 816
rcv1v2 (subset5) | text | 6000 | 0 | 47235 | 101 | 2.642 | 0.026 | 946
scene | image | 2407 | 0 | 294 | 6 | 1.074 | 0.179 | 15
tmc2007 | text | 28596 | 49060 | 0 | 22 | 2.158 | 0.098 | 1341
yeast | biology | 2417 | 0 | 103 | 14 | 4.237 | 0.303 | 198
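
The cardinality, density and distinct columns can be illustrated with a small sketch. It assumes the usual definitions: cardinality is the average number of relevant labels per instance, density is cardinality divided by the number of labels, and distinct is the number of different label combinations that occur. The class name LabelStats and the toy label matrix are hypothetical and are not part of Mulan.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class LabelStats {

    /** Average number of relevant labels per instance. */
    public static double cardinality(boolean[][] labels) {
        long total = 0;
        for (boolean[] row : labels) {
            for (boolean relevant : row) {
                if (relevant) {
                    total++;
                }
            }
        }
        return (double) total / labels.length;
    }

    /** Cardinality divided by the number of labels. */
    public static double density(boolean[][] labels) {
        return cardinality(labels) / labels[0].length;
    }

    /** Number of different label combinations that occur in the data. */
    public static int distinctLabelSets(boolean[][] labels) {
        Set<String> seen = new HashSet<>();
        for (boolean[] row : labels) {
            // The string form of a row identifies its label combination.
            seen.add(Arrays.toString(row));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Toy label matrix (hypothetical data): 4 instances, 3 labels.
        boolean[][] y = {
            {true,  false, true},
            {false, true,  false},
            {true,  false, true},
            {true,  true,  true},
        };
        System.out.println("cardinality = " + cardinality(y));
        System.out.println("density     = " + density(y));
        System.out.println("distinct    = " + distinctLabelSets(y));
    }
}
```

Under these definitions the table rows are internally consistent: for emotions, 1.869 / 6 ≈ 0.311, which matches the listed density.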

Files and Sources

  • CAL500
    files: Dataset along with the XML header [CAL500.rar]
    source: Douglas Turnbull, Luke Barrington, David Torres and Gert Lanckriet, Semantic Annotation and Retrieval of Music and Sound Effects, IEEE Transactions on Audio, Speech and Language Processing 16(2), pp. 467-476, 2008.
    More information: http://cosmal.ucsd.edu/cal/projects/AnnRet/
  • corel5k
    files: Train and test sets along with their union and the XML header [corel5k.rar] [corel5k-sparse.rar]
    source: Pinar Duygulu, Kobus Barnard, Nando de Freitas, and David Forsyth, Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary, 7th European Conference on Computer Vision, pp. IV:97-112, 2002.
    More information: http://kobus.ca/research/data/eccv_2002/
  • corel16k
    files: 10 different samples containing the train, test and test3 disjoint sets along with their union and the XML header [corel16k.rar]
    source: Kobus Barnard, Pinar Duygulu, Nando de Freitas, David Forsyth, David Blei, and Michael I. Jordan, Matching Words and Pictures, Journal of Machine Learning Research, Vol. 3, pp. 1107-1135, 2003.
    More information: http://kobus.ca/research/data/jmlr_2003/
  • emotions
    files: Train and test sets along with their union and the XML header [emotions.rar]
    source: K. Trohidis, G. Tsoumakas, G. Kalliris, I. Vlahavas. "Multilabel Classification of Music into Emotions". Proc. 2008 International Conference on Music Information Retrieval (ISMIR 2008), pp. 325-330, Philadelphia, PA, USA, 2008.
  • EUR-Lex
    files: Cross validation splits of TF-IDF representation of the documents with the first 5000 most frequent features selected, as used in the experiments. Usable for a direct comparison. XML header included. [eurlex-directory-codes.rar] [eurlex-subject-matters.rar] [eurlex-eurovoc-descriptors.rar]
    source: Eneldo Loza Mencía and Johannes Fürnkranz. Efficient pairwise multilabel classification for large-scale problems in the legal domain. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD-2008), Part II, pages 50-65, Antwerp, Belgium, 2008. Springer-Verlag.
    More information: Knowledge Engineering Group, TU Darmstadt
  • genbase
    files: Train and test sets along with their union and the XML header [genbase.rar]
    source: S. Diplaris, G. Tsoumakas, P. Mitkas and I. Vlahavas. Protein Classification with Multiple Algorithms, Proc. 10th Panhellenic Conference on Informatics (PCI 2005), pp. 448-456, Volos, Greece, November 2005.
    note: The first attribute in this dataset is just an identification of the instance. There are several attributes with constant values (yes/no).
  • mediamill
    files: Train and test sets along with their union and the XML header [mediamill.rar]
    source: C.G.M. Snoek, M. Worring, J.C. van Gemert, J.-M. Geusebroek, and A.W.M. Smeulders. The Challenge Problem for Automated Detection of 101 Semantic Concepts in Multimedia. In Proceedings of ACM Multimedia, pp. 421-430, Santa Barbara, USA, October 2006.
    related URL: The Mediamill challenge
  • scene
    files: Train and test sets along with their union and the XML header [scene.rar]
    source: M.R. Boutell, J. Luo, X. Shen, and C.M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757-1771, 2004.
  • tmc2007
    files (sparse): Train and test sets along with their union and the XML header [tmc2007.rar]
    A reduced version of this dataset, obtained by feature selection (top 500 features), is also available:
    files: [tmc2007-500.rar]
    source: A. Srivastava, B. Zane-Ulman: Discovering recurring anomalies in text reports regarding complex space systems. In: 2005 IEEE Aerospace Conference. (2005)
    related URL: SIAM Text Mining Workshop 2007
  • yeast
    files: Train and test sets along with their union and the XML header [yeast.rar]
    source: A. Elisseeff and J. Weston. A kernel method for multi-labelled classification. In T.G. Dietterich, S. Becker, and Z. Ghahramani, (eds), Advances in Neural Information Processing Systems 14, 2002.
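
The genbase note above says that the first attribute is only an instance identifier and that several attributes have constant values. A minimal sketch of one possible preprocessing step follows, assuming the feature values have already been read into a String[][] with the identifier in column 0. The class AttributeCleaner, the method informativeColumns and the toy rows are hypothetical; this is not a Mulan API, and whether to drop such columns is a modelling choice rather than a requirement of the dataset.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AttributeCleaner {

    /**
     * Returns the indices of the columns worth keeping: column 0 (assumed
     * to be the identifier) is skipped, and so is every column that holds
     * a single distinct value across all rows.
     */
    public static List<Integer> informativeColumns(String[][] rows) {
        List<Integer> keep = new ArrayList<>();
        int columns = rows[0].length;
        for (int c = 1; c < columns; c++) { // start at 1 to skip the identifier
            Set<String> values = new HashSet<>();
            for (String[] row : rows) {
                values.add(row[c]);
            }
            if (values.size() > 1) {
                keep.add(c);
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        // Toy rows (hypothetical data): identifier followed by three YES/NO attributes.
        String[][] rows = {
            {"P1", "YES", "NO", "YES"},
            {"P2", "NO",  "NO", "YES"},
            {"P3", "YES", "NO", "NO"},
        };
        System.out.println(informativeColumns(rows));
    }
}
```

In this toy input, column 2 is constant and column 0 is the identifier, so only columns 1 and 3 are kept.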
