Datasets
The following multi-label datasets are formatted for use with Mulan. A table of dataset statistics comes first, followed by the files and their sources.
Statistics
| name | domain | instances | nominal attributes | numeric attributes | labels | cardinality | density | distinct |
|---|---|---|---|---|---|---|---|---|
| bibtex | text | 7395 | 1836 | 0 | 159 | 2.402 | 0.015 | 2856 |
| bookmarks | text | 87856 | 2150 | 0 | 208 | 2.028 | 0.010 | 18716 |
| CAL500 | music | 502 | 0 | 68 | 174 | 26.044 | 0.150 | 502 |
| corel5k | images | 5000 | 499 | 0 | 374 | 3.522 | 0.009 | 3175 |
| corel16k (10 samples) | images | 13811±87 | 500 | 0 | 161±9 | 2.867±0.033 | 0.018±0.001 | 4937±158 |
| delicious | text (web) | 16105 | 500 | 0 | 983 | 19.020 | 0.019 | 15806 |
| emotions | music | 593 | 0 | 72 | 6 | 1.869 | 0.311 | 27 |
| enron | text | 1702 | 1001 | 0 | 53 | 3.378 | 0.064 | 753 |
| EUR-Lex (directory codes) | text | 19348 | 0 | 5000 | 412 | 1.292 | 0.003 | 1615 |
| EUR-Lex (subject matters) | text | 19348 | 0 | 5000 | 201 | 2.213 | 0.011 | 2504 |
| EUR-Lex (eurovoc descriptors) | text | 19348 | 0 | 5000 | 3993 | 5.310 | 0.001 | 16467 |
| genbase | biology | 662 | 1186 | 0 | 27 | 1.252 | 0.046 | 32 |
| mediamill | video | 43907 | 0 | 120 | 101 | 4.376 | 0.043 | 6555 |
| medical | text | 978 | 1449 | 0 | 45 | 1.245 | 0.028 | 94 |
| rcv1v2 (subset1) | text | 6000 | 0 | 47236 | 101 | 2.880 | 0.029 | 1028 |
| rcv1v2 (subset2) | text | 6000 | 0 | 47236 | 101 | 2.634 | 0.026 | 954 |
| rcv1v2 (subset3) | text | 6000 | 0 | 47236 | 101 | 2.614 | 0.026 | 939 |
| rcv1v2 (subset4) | text | 6000 | 0 | 47229 | 101 | 2.484 | 0.025 | 816 |
| rcv1v2 (subset5) | text | 6000 | 0 | 47235 | 101 | 2.642 | 0.026 | 946 |
| scene | image | 2407 | 0 | 294 | 6 | 1.074 | 0.179 | 15 |
| tmc2007 | text | 28596 | 49060 | 0 | 22 | 2.158 | 0.098 | 1341 |
| yeast | biology | 2417 | 0 | 103 | 14 | 4.237 | 0.303 | 198 |
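The cardinality, density, and distinct columns are not defined on this page. The sketch below assumes the definitions commonly used in the multi-label literature: cardinality is the average number of labels per instance, density is cardinality divided by the number of labels, and distinct is the number of different label combinations that occur. The function name `label_statistics` and the toy matrix are illustrative only and are not part of Mulan.

```python
from typing import Sequence, Tuple


def label_statistics(label_matrix: Sequence[Sequence[int]]) -> Tuple[float, float, int]:
    """Return (cardinality, density, distinct) for a 0/1 label matrix.

    Each row is one instance; each column is one label (1 = label present).
    Definitions are the assumed standard ones, not taken from this page.
    """
    n_instances = len(label_matrix)
    if n_instances == 0:
        raise ValueError("label_matrix must contain at least one instance")
    n_labels = len(label_matrix[0])

    # Cardinality: average number of labels assigned per instance.
    cardinality = sum(sum(row) for row in label_matrix) / n_instances
    # Density: cardinality normalised by the size of the label set.
    density = cardinality / n_labels
    # Distinct: number of unique label combinations in the data.
    distinct = len({tuple(row) for row in label_matrix})
    return cardinality, density, distinct


# Hypothetical toy data: 4 instances, 3 labels.
toy = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
]
cardinality, density, distinct = label_statistics(toy)
print(cardinality, density, distinct)
```

Under these definitions the table is internally consistent: for emotions, 1.869 / 6 is approximately 0.311, which matches the density column.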
Files and Sources
- CAL500
files: Dataset along with the XML header [CAL500.rar]
source: Douglas Turnbull, Luke Barrington, David Torres and Gert Lanckriet, Semantic Annotation and Retrieval of Music and Sound Effects, IEEE Transactions on Audio, Speech and Language Processing 16(2), pp. 467-476, 2008.
More information: http://cosmal.ucsd.edu/cal/projects/AnnRet/
- corel5k
files: Train and test sets along with their union and the XML header [corel5k.rar] [corel5k-sparse.rar]
source: Pinar Duygulu, Kobus Barnard, Nando de Freitas, and David Forsyth, Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary, 7th European Conference on Computer Vision, pp. IV:97-112, 2002.
More information: http://kobus.ca/research/data/eccv_2002/
- corel16k
files: 10 different samples containing the train, test and test3 disjoint sets along with their union and the XML header [corel16k.rar]
source: Kobus Barnard, Pinar Duygulu, Nando de Freitas, David Forsyth, David Blei, and Michael I. Jordan, Matching Words and Pictures, Journal of Machine Learning Research, Vol. 3, pp. 1107-1135, 2003.
More information: http://kobus.ca/research/data/jmlr_2003/
- emotions
files: Train and test sets along with their union and the XML header [emotions.rar]
source: K. Trohidis, G. Tsoumakas, G. Kalliris, I. Vlahavas. "Multilabel Classification of Music into Emotions". Proc. 2008 International Conference on Music Information Retrieval (ISMIR 2008), pp. 325-330, Philadelphia, PA, USA, 2008.
- EUR-Lex
files: Cross validation splits of TF-IDF representation of the documents with the first 5000 most frequent features selected, as used in the experiments. Usable for a direct comparison. XML header included. [eurlex-directory-codes.rar] [eurlex-subject-matters.rar] [eurlex-eurovoc-descriptors.rar]
source: Eneldo Loza Mencía and Johannes Fürnkranz. Efficient pairwise multilabel classification for large-scale problems in the legal domain. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD-2008), Part II, pages 50-65, Antwerp, Belgium, 2008. Springer-Verlag.
More information: Knowledge Engineering Group, TU Darmstadt
- genbase
files: Train and test sets along with their union and the XML header [genbase.rar]
source: S. Diplaris, G. Tsoumakas, P. Mitkas and I. Vlahavas. Protein Classification with Multiple Algorithms, Proc. 10th Panhellenic Conference on Informatics (PCI 2005), pp. 448-456, Volos, Greece, November 2005.
note: The first attribute in this dataset is only an instance identifier. Several attributes have constant values (yes/no).
- mediamill
files: Train and test sets along with their union and the XML header [mediamill.rar]
source: C.G.M. Snoek, M. Worring, J.C. van Gemert, J.-M. Geusebroek, and A.W.M. Smeulders. The Challenge Problem for Automated Detection of 101 Semantic Concepts in Multimedia. In Proceedings of ACM Multimedia, pp. 421-430, Santa Barbara, USA, October 2006.
related URL: The Mediamill challenge
- scene
files: Train and test sets along with their union and the XML header [scene.rar]
source: M.R. Boutell, J. Luo, X. Shen, and C.M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757-1771, 2004.
- tmc2007
files (sparse): Train and test sets along with their union and the XML header [tmc2007.rar]
A shorter version of this dataset, after feature selection (top 500 features selected) is also available:
files: [tmc2007-500.rar]
source: A. Srivastava, B. Zane-Ulman: Discovering recurring anomalies in text reports regarding complex space systems. In: 2005 IEEE Aerospace Conference. (2005)
related URL: SIAM Text Mining Workshop 2007
- yeast
files: Train and test sets along with their union and the XML header [yeast.rar]
source: A. Elisseeff and J. Weston. A kernel method for multi-labelled classification. In T.G. Dietterich, S. Becker, and Z. Ghahramani, (eds), Advances in Neural Information Processing Systems 14, 2002.
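Each archive above ships an XML header alongside the data. As a rough illustration of how such a header could be inspected, the sketch below assumes the header is a root `labels` element whose `label` children carry a `name` attribute. That layout, the namespace URI, and the label names `labelA` and `labelB` are assumptions made for this example and should be checked against the actual files.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical header content; the real files may differ in layout.
SAMPLE_HEADER = """<?xml version="1.0" encoding="utf-8"?>
<labels xmlns="http://mulan.sourceforge.net/labels">
  <label name="labelA"></label>
  <label name="labelB"></label>
</labels>
"""


def label_names(xml_text: str) -> List[str]:
    """Return the `name` attribute of every `label` element, in document order."""
    # Parse from bytes so the encoding declaration in the header is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    names = []
    for element in root.iter():
        # Strip any "{namespace}" prefix so the check works with or without one.
        local_tag = element.tag.rsplit("}", 1)[-1]
        if local_tag == "label":
            names.append(element.attrib["name"])
    return names


print(label_names(SAMPLE_HEADER))
```

Matching on the local tag name rather than a fixed namespace keeps the sketch tolerant of headers that declare a different namespace or none at all.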
Links