  • Apr 03, 2017 · code a novel dataset of Yahoo News comment threads (2.4k threads and 10k comments) and 1k threads from the Internet Argument Corpus; and analyze the features characteristic of ERICs. This is one of the largest annotated corpora of online human dialogues, with the most detailed set of annotations.
The casict2011 corpus is provided by a research group at the Institute of Computing Technology, Chinese Academy of Sciences. The corpus contains 2 parts, each containing about 1 million (adding up to...

The corpus of Physical Review Letters, Physical Review, and Reviews of Modern Physics is comprised of over 450,000 articles and dates back to 1893. We are making available two data sets based on this corpus: 1) Citing article pairs: This data set consists of pairs of APS articles that cite each other.

Nov 21, 2020 · This dataset is intended to mobilize researchers to apply recent advances in natural language processing to generate new insights in support of the fight against this infectious disease. The corpus will be updated weekly as new research is published in peer-reviewed publications and archival services like bioRxiv, medRxiv, and others.
  • The Enron Corpus is a database of over 600,000 emails generated by 158 employees of the Enron Corporation in the years leading up to the company's collapse in December 2001. The corpus was generated from Enron email servers by the Federal Energy Regulatory Commission (FERC) during its subsequent investigation
  • INRIA Holiday images dataset. Movie human actions dataset from Laptev et al. ESP game dataset; NUS-WIDE tagged image dataset of 269K images. Bastian Leibe’s dataset page: pedestrians, vehicles, cows, etc.
  • The corpus is provided as it is. The authors do not warrant that the corpus will be free from errors or will be suitable for any particular purpose. The authors of the corpus are not responsible for any direct or indirect problems that may be caused to the user of this corpus. The use of the corpus is limited to research and educational ...

    A multi-view text corpus, constructed from news articles from three online news services. Synthetic Multi-view Datasets. A set of synthetic text datasets for the evaluation of multi-view learning algorithms. Yeast Literature Dataset. A new text corpus, mined from biomedical literature, which refers to the terms used to describe S. cerevisiae ORFs.

    The LJ Speech Dataset. This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.

    Apr 24, 2015 · The LibriSpeech corpus is derived from audiobooks that are part of the LibriVox project, and contains 1000 hours of speech sampled at 16 kHz. We have made the corpus freely available for download, along with separately prepared language-model training data and pre-built language models.

    The Multimedia Commons (MMCommons) initiative is a community formed to coordinate efforts to advance the field of multimedia. Most of our attention is currently directed at making the Yahoo-Flickr Creative Commons 100 Million (YFCC100M) dataset even more useful, by offering a repository that contains supplemental material to this collection, such as content, features, and annotations.

    Nov 04, 2020 · corpora.bleicorpus – Corpus in Blei’s LDA-C format; ... This module is an API for downloading, getting information and loading datasets/models.
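
For readers unfamiliar with the LDA-C layout mentioned above, here is a minimal sketch of reading it with the standard library only. It assumes the usual layout, where each line is one document written as `N term_id:count term_id:count ...` and `N` is the number of unique terms on that line. The function names `parse_ldac_line` and `parse_ldac` and the sample string are hypothetical, introduced only for illustration; this is not gensim's own reader (gensim provides the `BleiCorpus` class for that).

```python
from typing import List, Tuple

Document = List[Tuple[int, int]]


def parse_ldac_line(line: str) -> Document:
    """Parse one LDA-C document line into (term_id, count) pairs."""
    fields = line.split()
    if not fields:
        return []
    declared = int(fields[0])  # number of unique terms declared for this document
    pairs: Document = []
    for field in fields[1:]:
        term_id, count = field.split(":")
        pairs.append((int(term_id), int(count)))
    if declared != len(pairs):
        raise ValueError(
            f"declared {declared} unique terms but found {len(pairs)}"
        )
    return pairs


def parse_ldac(text: str) -> List[Document]:
    """Parse a whole LDA-C file body, skipping blank lines."""
    return [parse_ldac_line(line) for line in text.splitlines() if line.strip()]


# Hypothetical two-document sample, not taken from any real corpus.
sample = "3 0:2 4:1 7:5\n2 1:1 4:3\n"
docs = parse_ldac(sample)
print(docs)
```

The integer term ids only become words once they are mapped through a separate vocabulary file, which this sketch does not handle.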

    Test the installation: Check that the user environment and privileges are set correctly by logging in to a user account, starting the Python interpreter, and accessing the Brown Corpus (see the previous section).
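
The check described above could be scripted roughly as follows. This is a sketch under assumptions, not the official installation test: the function name `check_brown_corpus` and the status strings are hypothetical, and it assumes the Brown Corpus data is expected in one of NLTK's standard data directories.

```python
def check_brown_corpus() -> str:
    """Report whether NLTK and the Brown Corpus are usable from this account."""
    try:
        from nltk.corpus import brown
    except ImportError:
        # NLTK itself is not importable in this environment.
        return "nltk-not-installed"
    try:
        # Accessing the words forces NLTK to locate and open the corpus files.
        first_words = list(brown.words()[:5])
    except LookupError:
        # NLTK is installed but the Brown Corpus data was not found.
        return "brown-data-missing"
    except OSError:
        # The data exists but this account cannot read it (privileges).
        return "data-not-readable"
    return "ok: " + " ".join(first_words)


status = check_brown_corpus()
print(status)
```

If the result is `brown-data-missing`, running `nltk.download("brown")` from the same account is the usual way to fetch the data.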

    BioCreative corpus: Dataset produced by the BioCreative assessment, text passages relevant for GO annotations of human proteins. GENIA corpus: Annotated corpus of literature related to the MeSH terms: Human, Blood Cells, and Transcription Factors. Yapex corpus: Training and test data for the protein tagger (NER) YAPEX.

    In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references. Project Description. The project consisted in recording utterances of hearing-impaired children participating in two types of tasks:

    Easily search for standard datasets and open-access datasets on a broad scope of topics, spanning from biomedical sciences to software security, through IEEE’s dataset storage and dataset search platform, DataPort.

    Jan 08, 2019 · VoxCeleb is a large-scale speaker identification dataset. It contains around 100,000 phrases by 1,251 celebrities, extracted from YouTube videos, spanning a diverse range of accents, professions...

    Jul 29, 2013 · In this paper, we introduce a new spreadsheet corpus obtained from industry for researchers to explore. This dataset was extracted from the Enron email archive, which is a large set of email messages that were made public during the legal investigation concerning the Enron corporation. It differs from the EUSES corpus in a number of ways:

    Current as of 12/04/2020. Texas Solid Waste Activity statewide files consist of three sets of extracts: IHW notice of registration data, IHW summary data, and reference tables affiliated with both NOR and summary data. Some are available for download here—others must be ordered.

    The party and election dates can be used to link the corpus information to the Manifesto Project Main Dataset. Coverage. The corpus currently covers electoral programmes from more than 50 different countries in almost 40 languages. It contains about 2,750 machine-readable programmes.

In the framework of the system, we present to the team a series of experiments on different corpus-level recognition datasets. The team uses a Convolutional Neural Network (CNN) to perform a semantic segmentation of a speech signal. Compared with the previous methods, the proposed method achieves better performance on both test datasets.
Corpora in Stav. This is a repository of biomedical corpora which can be visualized using the Stav online visualization tool. The datasets contain semantic annotations which range from named entities (e.g., genes and drugs) and binary relationships (e.g., protein-protein interactions) to biomedical events (e.g., phosphorylation).
One of the key contributions of this shared task is the introduction of new annotated datasets: the Cambridge English Write & Improve (W&I) corpus and the LOCNESS corpus. Cambridge English Write & Improve: Write & Improve (Yannakoudakis et al., 2018) is an online web platform that assists non-native English students with their writing.
The downside to working in Spanish is the scarcity of annotated data. NLTK’s conll2002 Spanish corpus has just 5,000 sentences. Since a POS tagger is the first step for building a NER tagger, I need to find a good dataset with POS annotations. Do you happen to know where to find a large Spanish dataset? Thank you!