Datasets used in KDD-2011

Title: Probabilistic Topic Models with Biased Propagation on Heterogeneous Information Networks
Authors: Hongbo Deng, Jiawei Han, Bo Zhao, Yintao Yu, Cindy Xide Lin

DBLP subset


The Digital Bibliography and Library Project (DBLP) is
a collection of bibliographic information on major computer
science journals and proceedings, which can be used to build
a heterogeneous information network with multi-typed objects
along with rich text data, as shown in Figure 1(a). Each paper is
represented by a bag of words that appeared in the abstract
and title of the paper. Besides the rich-text documents, we
also obtain two other types of objects: author and venue
(i.e., conference). In this experiment, we use a subset of the
DBLP records that belongs to four areas: database, data
mining, information retrieval and artificial intelligence, and
contains 28,569 documents, 28,702 authors and 20 conferences.
The abstract is collected for representing each document,
and this data collection has 11,771 unique terms.
Within the heterogeneous information network, we observe
two explicit types of relationships: paper-author and paper-venue,
which together comprise 103,201 links.
Moreover, we use a labeled data set [22] with 4,057 authors,
100 papers and all 20 conferences for quantitative accuracy
evaluation.
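
To make the structure described above concrete, here is a minimal Python sketch of such a network: each paper carries a bag of words, and the two explicit link types (paper-author and paper-venue) are stored as edge lists. The toy records, field names, and the helpers `build_network` and `summarize` are assumptions made for illustration only; they do not reflect the layout of the downloadable MATLAB files.

```python
from collections import Counter

# Toy records: ids, texts, authors, and venues are made up for illustration.
papers = [
    {"id": "p1", "text": "topic models for text mining",
     "authors": ["a1", "a2"], "venue": "KDD"},
    {"id": "p2", "text": "query models for text retrieval",
     "authors": ["a2"], "venue": "SIGIR"},
    {"id": "p3", "text": "mining heterogeneous networks",
     "authors": ["a1", "a3"], "venue": "KDD"},
]


def build_network(records):
    """Return per-paper bags of words plus the two explicit link lists."""
    # Bag of words: term -> count, using naive whitespace tokenization.
    bags = {r["id"]: Counter(r["text"].lower().split()) for r in records}
    # One paper-author link per (paper, author) pair.
    paper_author = [(r["id"], a) for r in records for a in r["authors"]]
    # One paper-venue link per paper.
    paper_venue = [(r["id"], r["venue"]) for r in records]
    return bags, paper_author, paper_venue


def summarize(bags, paper_author, paper_venue):
    """Count objects of each type, unique terms, and total links."""
    vocabulary = set().union(*bags.values())
    return {
        "documents": len(bags),
        "authors": len({a for _, a in paper_author}),
        "venues": len({v for _, v in paper_venue}),
        "unique_terms": len(vocabulary),
        # Both link types are counted together, as in the total given above.
        "links": len(paper_author) + len(paper_venue),
    }


stats = summarize(*build_network(papers))
print(stats)
```

The same kind of summary (documents, authors, venues, unique terms, links) is what the statistics quoted in the paragraph above report for the real subset.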

Distribution of words, documents, authors, venues, and labels
Download the data in MATLAB format

NSF-Awards subset


The NSF Research Awards Abstracts (NSF-Awards) consists
of 129,000 abstracts describing NSF awards for basic
research from 1990 to 2003, which are grouped into more
than 640 research programs. For each NSF award, we obtain
the abstract represented by a bag of words, and the
affiliated investigator(s), forming a heterogeneous information
network. In our test, we extract a subset of documents
that belong to the largest 10 research programs, such as
applied mathematics, economics and geophysics, thus leaving
us with 16,405 documents and 9,989 associated investigators.
Within the heterogeneous information network, there
are a total of 20,717 links between documents and investigators.
Moreover, this data collection has 18,674 unique
terms across all the abstracts.
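
The subset extraction described above can be sketched in a few lines of Python: count documents per research program, keep the documents in the k largest programs, and collect the document-investigator links that remain. The toy awards, field names, and the helpers `largest_programs` and `extract_subset` are hypothetical and only illustrate the idea; this is not the script used to prepare the released data.

```python
from collections import Counter

# Toy awards: program names echo the text above, everything else is made up.
awards = [
    {"id": "d1", "program": "applied mathematics", "investigators": ["i1"]},
    {"id": "d2", "program": "applied mathematics", "investigators": ["i1", "i2"]},
    {"id": "d3", "program": "economics", "investigators": ["i3"]},
    {"id": "d4", "program": "economics", "investigators": ["i4"]},
    {"id": "d5", "program": "geophysics", "investigators": ["i5"]},
]


def largest_programs(records, k):
    """Return the names of the k programs with the most documents."""
    counts = Counter(r["program"] for r in records)
    return {program for program, _ in counts.most_common(k)}


def extract_subset(records, k):
    """Keep documents in the k largest programs and their investigator links."""
    keep = largest_programs(records, k)
    docs = [r for r in records if r["program"] in keep]
    # One link per (document, investigator) pair.
    links = [(r["id"], inv) for r in docs for inv in r["investigators"]]
    investigators = {inv for _, inv in links}
    return docs, links, investigators


docs, links, investigators = extract_subset(awards, k=2)
print(len(docs), "documents,", len(investigators), "investigators,",
      len(links), "links")
```

With the real collection, k would be 10 and the program label of each retained document would serve as its class label.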

Distribution of words, documents, investigators, and labels
Download the data in MATLAB format

Code will be available later…
Date Created: Feb 18, 2011
