The Natural Language Processing Group at the Department of Computer Science and Technology, Tsinghua University (THUNLP), which is also part of the National Lab for Information Science and Technology and the State Key Lab of Intelligent Technology and Systems, works on methodologies and algorithms for computer processing and understanding of human languages, with an emphasis on Chinese. We focus on basic research in language computation as well as application-oriented NLP technologies. In recent years, we have published papers in top conferences and journals in the field, such as ACL, COLING, EMNLP, IJCAI, VLDB, Computational Linguistics, Journal of Quantitative Linguistics, and IEEE Intelligent Systems.

Our research covers a range of topics in natural language processing, including:

NLP based on Huge-scale Naturally Annotated Corpora

  • Word segmentation using punctuation in huge-scale web articles
  • New word detection and related word retrieval from user logs of a Chinese input method
  • Chinese abbreviation extraction from anchor texts in webpages
  • New word detection from search engine user logs

Social Tagging and Keyword Extraction

  • Tag disambiguation
  • Tag suggestions using topic models
  • Tag suggestions via Latent Reason Identification
  • Exploring subsumption relations in social tags
  • Keyword extraction by clustering to find exemplar terms
  • Keyword extraction via topic decomposition

Multilingual Analysis

  • Fast and robust sentence alignment algorithm
  • Bilingual terminology extraction system
  • Statistical method for Uyghur tokenization
  • Uyghur morpheme analysis
  • "Female Script" pinyin input method

Text Classification

  • Feature selection for Chinese text classification
  • Scalable term selection for text classification
  • Efficient text classification using term projection
  • Transfer learning and self-training classification
  • Text classification-based image classification for text
