| Datasets
The following multi-label datasets are properly formatted for use with Mulan. We first provide a table of dataset statistics, followed by the actual files and their sources.
Statistics
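The cardinality, density and distinct columns in the table below are assumed here to follow the usual multi-label definitions: cardinality is the average number of labels per instance, density is cardinality divided by the number of labels, and distinct is the number of different label sets that occur. The following is a minimal sketch of these definitions on a toy binary label matrix; the function name `label_statistics` and the toy data are illustrative only and are not part of Mulan.

```python
from typing import Sequence, Tuple


def label_statistics(label_rows: Sequence[Sequence[int]]) -> Tuple[float, float, int]:
    """Return (cardinality, density, distinct) for a binary label matrix.

    Each row is one instance; each column is one label (1 = relevant, 0 = not).
    """
    if not label_rows:
        raise ValueError("need at least one instance")
    n_instances = len(label_rows)
    n_labels = len(label_rows[0])
    # Cardinality: average number of relevant labels per instance.
    cardinality = sum(sum(row) for row in label_rows) / n_instances
    # Density: cardinality normalised by the total number of labels.
    density = cardinality / n_labels
    # Distinct: number of different label combinations that occur.
    distinct = len({tuple(row) for row in label_rows})
    return cardinality, density, distinct


# Hypothetical toy data: 4 instances, 3 labels.
toy_labels = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
]

cardinality, density, distinct = label_statistics(toy_labels)
print(f"cardinality={cardinality:.3f} density={density:.3f} distinct={distinct}")
```

Under these definitions, a row of the table can be sanity-checked by dividing its cardinality by its number of labels; small differences from the listed density are expected because the published values are rounded.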
  
          
            
              |  |  |  | attributes |  |  |  |  |  
              | name | domain | instances | nominal | numeric | labels | cardinality | density | distinct |  
              | bibtex | text | 7395 | 1836 | 0 | 159 | 2.402 | 0.015 | 2856 |  
              | birds  | audio | 645 | 2 | 258 | 19 | 1.014 | 0.053 | 133 |  
              | bookmarks | text | 87856 | 2150 | 0 | 208 | 2.028 | 0.010 | 18716 |  
              | CAL500 | music | 502 | 0 | 68 | 174 | 26.044 | 0.150 | 502 |  
              | corel5k | images | 5000 | 499 | 0 | 374 | 3.522 | 0.009 | 3175 |  
              | corel16k (10 samples) | images | 13811±87 | 500 | 0 | 161±9 | 2.867±0.033 | 0.018±0.001 | 4937±158 |  
              | delicious | text (web) | 16105 | 500 | 0 | 983 | 19.020 | 0.019 | 15806 |  
              | emotions | music | 593 | 0 | 72 | 6 | 1.869 | 0.311 | 27 |  
              | enron | text | 1702 | 1001 | 0 | 53 | 3.378 | 0.064 | 753 |  
              | EUR-Lex (directory codes) | text | 19348 | 0 | 5000 | 412 | 1.292 | 0.003 | 1615 |  
              | EUR-Lex (subject matters) | text | 19348 | 0 | 5000 | 201 | 2.213 | 0.011 | 2504 |  
              | EUR-Lex (eurovoc descriptors) | text | 19348 | 0 | 5000 | 3993 | 5.310 | 0.001 | 16467 |  
              | flags | images (toy) | 194 | 9 | 10 | 7 | 3.392 | 0.485 | 54 |  
              | genbase | biology | 662 | 1186 | 0 | 27 | 1.252 | 0.046 | 32 |  
              | mediamill | video | 43907 | 0 | 120 | 101 | 4.376 | 0.043 | 6555 |  
              | medical | text | 978 | 1449 | 0 | 45 | 1.245 | 0.028 | 94 |  
              | NUS-WIDE  | images | 269648 | 0 | 128/500 | 81 | 1.869 | 0.023 | 18430 |  
              | rcv1v2 (subset1) | text | 6000 | 0 | 47236 | 101 | 2.880 | 0.029 | 1028 |  
              | rcv1v2 (subset2) | text | 6000 | 0 | 47236 | 101 | 2.634 | 0.026 | 954 |  
              | rcv1v2 (subset3) | text | 6000 | 0 | 47236 | 101 | 2.614 | 0.026 | 939 |  
              | rcv1v2 (subset4) | text | 6000 | 0 | 47229 | 101 | 2.484 | 0.025 | 816 |  
              | rcv1v2 (subset5) | text | 6000 | 0 | 47235 | 101 | 2.642 | 0.026 | 946 |  
              | scene | image | 2407 | 0 | 294 | 6 | 1.074 | 0.179 | 15 |  
              | tmc2007 | text | 28596 | 49060 | 0 | 22 | 2.158 | 0.098 | 1341 |  
              | yahoo | text | 5423±1259 | 0 | 32786±7990 | 31±6 | 1.481±0.154 | 0.051±0.012 | 321±139 |  
              | yeast | biology | 2417 | 0 | 103 | 14 | 4.237 | 0.303 | 198 |  
Files and Sources
              birds
			   files: Train and test set along with the XML header [birds.rar]
 source: F. Briggs, Yonghong Huang, R. Raich, K. Eftaxias, Zhong Lei, W. Cukierski, S. Hadley, A. Hadley, M. Betts, X. Fern, J. Irvine, L. Neal, A. Thomas, G. Fodor, G. Tsoumakas, Hong Wei Ng, Thi Ngoc Tho Nguyen, H. Huttunen, P. Ruusuvuori, T. Manninen, A. Diment, T. Virtanen, J. Marzat, J. Defretin, D. Callender, C. Hurlburt, K. Larrey, M. Milakov. "The 9th annual MLSP competition: New methods for acoustic classification of multiple simultaneous bird species in a noisy environment", in Proc. 2013 IEEE International Workshop on Machine Learning for Signal Processing (MLSP).
 
              CAL500
 files: Dataset along with the XML header [CAL500.rar]
 source: Douglas Turnbull, Luke Barrington, David Torres and Gert Lanckriet, Semantic Annotation and Retrieval of Music and Sound Effects, IEEE Transactions on Audio, Speech and Language Processing 16(2), pp. 467-476, 2008.
 More information: http://cosmal.ucsd.edu/cal/projects/AnnRet/
 
              corel5k
 files: Train and test sets along with their union and the XML header [corel5k.rar] [corel5k-sparse.rar]
 source: Pinar Duygulu, Kobus Barnard, Nando de Freitas, and David Forsyth, Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary, 7th European Conference on Computer Vision, pp. IV:97-112, 2002.
 More information: http://kobus.ca/research/data/eccv_2002/
 
              corel16k
 files: 10 different samples containing the train, test and test3 disjoint sets along with their union and the XML header [corel16k.rar]
 source:      "Matching Words and Pictures",  by Kobus Barnard, Pinar Duygulu, Nando de Freitas, David Forsyth, David Blei, and Michael I. Jordan, Journal of Machine Learning Research, Vol 3, pp 1107-1135.
 More information: http://kobus.ca/research/data/jmlr_2003/
 
              emotions
 files: Train and test sets along with their union and the XML header [emotions.rar]
 source: K. Trohidis, G. Tsoumakas, G. Kalliris, I. Vlahavas. "Multilabel Classification of Music into Emotions". Proc. 2008 International Conference on Music Information Retrieval (ISMIR 2008), pp. 325-330, Philadelphia, PA, USA, 2008.
 
              EUR-Lex
 files: Cross-validation splits of the TF-IDF representation of the documents, restricted to the 5000 most frequent features, as used in the experiments of the source paper, so results can be compared directly. XML header included. [eurlex-directory-codes.rar] [eurlex-subject-matters.rar] [eurlex-eurovoc-descriptors.rar]
 source: Eneldo Loza Mencía and Johannes Fürnkranz. Efficient pairwise multilabel classification for large-scale problems in the legal domain. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD-2008), Part II, pages 50-65, Antwerp, Belgium, 2008. Springer-Verlag.
 More information: Knowledge Engineering Group, TU Darmstadt
 
              flags
 files: Train and test sets along with their union, the XML header and a readme file [flags.zip]
 source: The dataset was used for Multi-label Classification in "Gonçalves, Eduardo Corrêa, Alexandre Plastino, and Alex A. Freitas. A Genetic Algorithm for Optimizing the Label Ordering in Multi-Label Classifier Chains. ICTAI 2013." The original data can be found at the UCI repository.
 
              genbase
 files: Train and test sets along with their union and the XML header [genbase.rar]
 source: S. Diplaris, G. Tsoumakas, P. Mitkas and I. Vlahavas. Protein Classification with Multiple Algorithms, Proc. 10th Panhellenic Conference on Informatics (PCI 2005), pp. 448-456, Volos, Greece, November 2005.
 note: The first attribute in this dataset is just an identification of the instance. There are several attributes with constant values (yes/no).
 
 
              mediamill
 files: Train and test sets along with their union and the XML header [mediamill.rar]
 source: C.G.M. Snoek, M. Worring, J.C. van Gemert, J.-M. Geusebroek, and A.W.M. Smeulders. The Challenge Problem for Automated Detection of 101 Semantic Concepts in Multimedia. In Proceedings of ACM Multimedia, pp. 421-430, Santa Barbara, USA, October 2006.
 related URL: The Mediamill challenge
 
              medical
 files: Train and test sets along with their union and the XML header [medical.rar]
 source: John P. Pestian, Christopher Brew, Pawel Matykiewicz, D. J. Hovermale, Neil Johnson, K. Bretonnel Cohen, and Wodzislaw Duch. 2007. A shared task involving multi-label classification of clinical free text. In Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing (BioNLP '07). Association for Computational Linguistics, Stroudsburg, PA, USA, 97-104.
 
              NUS-WIDE
			   We provide two versions of the full NUS-WIDE dataset. In the first version, images are represented using the 500-D bag of visual words features provided by the creators of the dataset [1]. In the second version, images are represented using the 128-D cVLAD+ features described in [2]. In both cases, we provide train and test sets (split as described in [1]). The first attribute in all datasets is the image id.
 files: 128-D cVLAD+ [nuswide-cVLADplus.rar] / 500-D bag of visual words [nuswide-bow.rar]
 [1] Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yan-Tao Zheng. "NUS-WIDE: A Real-World Web Image Database from National University of Singapore", ACM International Conference on Image and Video Retrieval. Greece. Jul. 8-10, 2009.
 [2] E. Spyromitros-Xioufis, S. Papadopoulos, Y. Kompatsiaris, G. Tsoumakas, I. Vlahavas, "A Comprehensive Study over VLAD and Product Quantization in Large-scale Image Retrieval", IEEE Transactions on Multimedia, 2014.
 
              scene
 files: Train and test sets along with their union and the XML header [scene.rar]
 source: M.R. Boutell, J. Luo, X. Shen, and C.M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757-1771, 2004.
 
              tmc2007
 files (sparse): Train and test sets along with their union and the XML header [tmc2007.rar]
 A shorter version of this dataset, after feature selection (top 500  features selected) is also available:
 files: [tmc2007-500.rar]
 source: A. Srivastava, B. Zane-Ulman: Discovering recurring anomalies in text reports regarding complex space systems. In: 2005 IEEE Aerospace Conference. (2005)
 related URL: SIAM Text Mining Workshop 2007
 
              yahoo
 files: 11 train and test sets along with their union and the XML header [yahoo.rar]
 source: N. Ueda, K. Saito: Parametric mixture models for multi-labeled text, In Neural Information Processing Systems 15 (NIPS 15), MIT Press, pp. 737-744, 2002.
 
              yeast
 files: Train and test sets along with their union and the XML header [yeast.rar]
 source: A. Elisseeff and J. Weston. A kernel method for multi-labelled classification. In T.G. Dietterich, S. Becker, and Z. Ghahramani, (eds), Advances in Neural Information Processing Systems 14, 2001.
 Links |