Reasoning & Knowledge

DORIS-MAE

DORIS-MAE Anno-GPT annotation matrix. DORIS-MAE (NeurIPS 2023 D&B) is a scientific-document retrieval benchmark of complex, multi-level aspect-based queries over arXiv CS abstracts. Its Anno-GPT framework uses chatgpt-3.5-turbo-0301 as an expert-level relevance annotator. The public Zenodo dataset releases, for all 165,144 (aspect, candidate-abstract) pairs, ChatGPT's 3-level relevance label (0=unrelated, 1=partial, 2=fully answers the aspect) plus its explanation text. This is a single-subject per-item matrix (subject=ChatGPT-3.5; item=one (aspect, document) pair; response=0/1/2 label). For the 250 Test_set pairs, three human annotators' majority-vote label is attached as gold. The 17 retrieval models benchmarked in the paper are reported only as aggregate metrics (no per-item outputs released), so they are excluded.
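
The structure described above can be sketched as a mapping from (aspect, document) pairs to a 0/1/2 label, with a majority vote producing the gold label for the human-annotated pairs. This is a minimal illustration only: the pair identifiers and the `majority_label` helper are hypothetical, and returning `None` when no label has a strict majority is an assumption, since this page does not say how three-way disagreements are resolved.

```python
from collections import Counter
from typing import Optional, Sequence

# Relevance scale as described on this page.
LABELS = {0: "unrelated", 1: "partial", 2: "fully answers the aspect"}


def majority_label(votes: Sequence[int]) -> Optional[int]:
    """Return the label chosen by a strict majority of annotators, else None.

    Assumption: a three-way split (e.g. 0, 1, 2) has no majority and yields None.
    """
    if not votes or any(v not in LABELS for v in votes):
        raise ValueError("votes must be a non-empty sequence of 0/1/2 labels")
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


# Hypothetical single-subject response matrix: one row (ChatGPT-3.5),
# one column per (aspect, document) pair. The IDs below are made up.
responses = {
    ("aspect_a", "doc_1"): 0,
    ("aspect_a", "doc_2"): 2,
}

# Hypothetical human votes for one test pair, reduced to a gold label.
gold = {("aspect_a", "doc_2"): majority_label([2, 2, 1])}
```

Keeping the gold labels in a separate mapping keyed by the same pairs makes it straightforward to compare the annotator's label with the human majority on the test subset only.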

165,104 items
1 subject
100% observed
CC-BY-NC-4.0 license
general domain
text modality

Response matrix

[Figure: DORIS-MAE response matrix, AI models (rows) against items (columns). Legend: low, high, unobserved.]

Scale: {0, 1, 2}
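
The per-item "solve rate" percentages shown under Sample items appear to be the mean label across subjects multiplied by 100, which is why an item can display 200% on this 0 to 2 scale. That reading is inferred from the sample items on this page (score 0 with 0%, score 1 with 100%, score 2 with 200%), not from any stated definition, and the function names below are hypothetical.

```python
from typing import Sequence

MAX_LABEL = 2  # top of the {0, 1, 2} scale


def solve_rate_percent(scores: Sequence[int]) -> float:
    """Mean of the 0/1/2 labels for one item across subjects, times 100.

    Assumption: this mirrors how the page derives its displayed percentage.
    """
    if not scores:
        raise ValueError("need at least one score")
    return 100.0 * sum(scores) / len(scores)


def normalized_rate_percent(scores: Sequence[int]) -> float:
    """Same mean, rescaled so that a label of 2 maps to 100%."""
    return solve_rate_percent(scores) / MAX_LABEL
```

With a single subject the mean is just that subject's label, so the displayed value can only be 0%, 100%, or 200%; dividing by the maximum label gives a rate bounded by 100%.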

Sample items

Item 1 (0% solve rate)

Task: judge whether the candidate scientific document satisfies the user's aspect-based requirement.

User requirement (aspect): The document should mention a chatbot that can interpret input images and analyze them for subsequent tasks.

Candidate document title: Extending_a_Single-Document_Summarizer_to_Multi-Document:_a_Hierarchical_Approach Candidate document abstract: The increasing amount of online content motivated the development of multi-document summarization methods. In this work, we explore straightforward approaches to extend single-document summarization methods to multi-document summarization. The proposed methods are based on the hierarchical combination of single-document summaries, and achieves state of the art results.

Relevance label: 0 = the document does not address the requirement, 1 = it partially addresses it, 2 = it fully addresses the requirement.

How subjects answered

  • chatgpt-3.5-turbo-0301: score 0

    DISAGREE. The abstract does not mention a chatbot or image analysis, so it does not address the user requirement.

Item 2 (0% solve rate)

Task: judge whether the candidate scientific document satisfies the user's aspect-based requirement.

User requirement (aspect): The system in the document should spawn multiple processes to mimic trajectories of people’s health condition.

Candidate document title: Resset:_A_Recurrent_Model_for_Sequence_of_Sets_with_Applications_to_Electronic_Medical_Records Candidate document abstract: Modern healthcare is ripe for disruption by AI. A game changer would be automatic understanding the latent processes from electronic medical records, which are being collected for billions of people worldwide. However, these healthcare processes are complicated by the interaction between at least three dynamic components: the illness which involves multiple diseases, the care which involves multiple treatments, and the recording practice which is biased and erroneous. Existing methods are inadequate in capturing the dynamic structure of care. We propose Resset, an end-to-end recurrent model that reads medical record and predicts future risk. The model adopts the algebraic view in that discrete medical objects are embedded into continuous vectors lying in the same space. We formulate the problem as modeling sequences of sets, a novel setting that have rarely, if not, been addressed. Within Resset, the bag of diseases recorded at each clinic visit is modeled as function of sets. The same hold for the bag of treatments. The interaction between the disease bag and the treatment bag at a visit is modeled in several, one of which as residual of diseases minus the treatments. Finally, the health trajectory, which is a sequence of visits, is modeled using a recurrent neural network. We report results on over a hundred thousand hospital visits by patients suffered from two costly chronic diseases -- diabetes and mental health. Resset shows promises in multiple predictive tasks such as readmission prediction, treatments recommendation and diseases progression.

Relevance label: 0 = the document does not address the requirement, 1 = it partially addresses it, 2 = it fully addresses the requirement.

How subjects answered

  • chatgpt-3.5-turbo-0301: score 0

    The abstract discusses a proposed model called Resset that reads medical records and predicts future risk. The model adopts the algebraic view in that discrete medical objects are embedded into continuous vectors lying in the same space. The problem is formulated as modeling sequences of sets, a novel setting that has rarely, if not, been addressed. Within Resset, the bag of diseases recorded at each clinic visit is modeled as a function of sets. The same holds for the bag of treatments. The interaction between the disease bag and the treatment bag at a visit is modeled in several, one of which as residual of diseases minus the treatments. Finally, the health trajectory, which is a sequence …

Item 3 (0% solve rate)

Task: judge whether the candidate scientific document satisfies the user's aspect-based requirement.

User requirement (aspect): The document should include an image simplification model that is trained on medical images and understand the critical feature of medical images.

Candidate document title: Word_Spotting_in_Cursive_Handwritten_Documents_using_Modified_Character_Shape_Codes Candidate document abstract: There is a large collection of Handwritten English paper documents of Historical and Scientific importance. But paper documents are not recognized directly by computer. Hence the closest way of indexing these documents is by storing their document digital image. Hence a large database of document images can replace the paper documents. But the document and data corresponding to each image cannot be directly recognized by the computer. This paper applies the technique of word spotting using Modified Character Shape Code to Handwritten English document images for quick and efficient query search of words on a database of document images. It is different from other Word Spotting techniques as it implements two level of selection for word segments to match search query. First based on word size and then based on character shape code of query. It makes the process faster and more efficient and reduces the need of multiple pre-processing.

Relevance label: 0 = the document does not address the requirement, 1 = it partially addresses it, 2 = it fully addresses the requirement.

How subjects answered

  • chatgpt-3.5-turbo-0301: score 0

    DISAGREE. The abstract does not mention anything about an image simplification model that is trained on medical images or understanding the critical feature of medical images. The paper discusses a technique for word spotting in handwritten English document images, which is not related to the user's requirement.

Item 4 (0% solve rate)

Task: judge whether the candidate scientific document satisfies the user's aspect-based requirement.

User requirement (aspect): The document should include a comparison between the structures of transformers and convolutional neural networks.

Candidate document title: Sequential_Cross-Document_Coreference_Resolution Candidate document abstract: Relating entities and events in text is a key component of natural language understanding. Cross-document coreference resolution, in particular, is important for the growing interest in multi-document analysis tasks. In this work we propose a new model that extends the efficient sequential prediction paradigm for coreference resolution to cross-document settings and achieves competitive results for both entity and event coreference while provides strong evidence of the efficacy of both sequential models and higher-order inference in cross-document settings. Our model incrementally composes mentions into cluster representations and predicts links between a mention and the already constructed clusters, approximating a higher-order model. In addition, we conduct extensive ablation studies that provide new insights into the importance of various inputs and representation types in coreference.

Relevance label: 0 = the document does not address the requirement, 1 = it partially addresses it, 2 = it fully addresses the requirement.

How subjects answered

  • chatgpt-3.5-turbo-0301: score 0

    DISAGREE. The abstract does not mention anything about transformers or convolutional neural networks. It talks about coreference resolution in natural language understanding. Therefore, it does not satisfy the user requirement of finding a document that includes a comparison between the structures of transformers and convolutional neural networks.

Item 5 (0% solve rate)

Task: judge whether the candidate scientific document satisfies the user's aspect-based requirement.

User requirement (aspect): The dataset should be related to the natural language processing task.

Candidate document title: ICDAR_2021_Competition_on_Components_Segmentation_Task_of_Document_Photos Candidate document abstract: This paper describes the short-term competition on the Components Segmentation Task of Document Photos that was prepared in the context of the 16th International Conference on Document Analysis and Recognition (ICDAR 2021). This competition aims to bring together researchers working in the field of identification document image processing and provides them a suitable benchmark to compare their techniques on the component segmentation task of document images. Three challenge tasks were proposed entailing different segmentation assignments to be performed on a provided dataset. The collected data are from several types of Brazilian ID documents, whose personal information was conveniently replaced. There were 16 participants whose results obtained for some or all the three tasks show different rates for the adopted metrics, like Dice Similarity Coefficient ranging from 0.06 to 0.99. Different Deep Learning models were applied by the entrants with diverse strategies to achieve the best results in each of the tasks. Obtained results show that the currently applied methods for solving one of the proposed tasks (document boundary detection) are already well established. However, for the other two challenge tasks (text zone and handwritten sign detection) research and development of more robust approaches are still required to achieve acceptable results.

Relevance label: 0 = the document does not address the requirement, 1 = it partially addresses it, 2 = it fully addresses the requirement.

How subjects answered

  • chatgpt-3.5-turbo-0301: score 0

    The abstract DISAGREES with the user requirement. The abstract does not mention anything about natural language processing tasks. Instead, it talks about a competition on the Components Segmentation Task of Document Photos, which aims to bring together researchers working in the field of identification document image processing. The collected data are from several types of Brazilian ID documents, but there is no mention of any natural language processing task. Therefore, the abstract does not satisfy the user requirement.

Item 6 (0% solve rate)

Task: judge whether the candidate scientific document satisfies the user's aspect-based requirement.

User requirement (aspect): The paper should talk about regularization.

Candidate document title: Adversarial_Attacks_and_Defenses_in_Images,_Graphs_and_Text:_A_Review Candidate document abstract: Deep neural networks (DNN) have achieved unprecedented success in numerous machine learning tasks in various domains. However, the existence of adversarial examples has raised concerns about applying deep learning to safety-critical applications. As a result, we have witnessed increasing interests in studying attack and defense mechanisms for DNN models on different data types, such as images, graphs and text. Thus, it is necessary to provide a systematic and comprehensive overview of the main threats of attacks and the success of corresponding countermeasures. In this survey, we review the state of the art algorithms for generating adversarial examples and the countermeasures against adversarial examples, for the three popular data types, i.e., images, graphs and text.

Relevance label: 0 = the document does not address the requirement, 1 = it partially addresses it, 2 = it fully addresses the requirement.

How subjects answered

  • chatgpt-3.5-turbo-0301: score 0

    The user requirement is that the paper should talk about regularization. The abstract does not mention regularization at all. Therefore, I DISAGREE that the abstract satisfies the user requirement.

Item 7 (0% solve rate)

Task: judge whether the candidate scientific document satisfies the user's aspect-based requirement.

User requirement (aspect): The paper should propose a unified framework that can integrate a large number of independent modules and models that can perform fundamental functions of the robotic agent, such as processing visual inputs, decision-making, and robotic arm manipulation.

Candidate document title: Self-Modification_of_Policy_and_Utility_Function_in_Rational_Agents Candidate document abstract: Any agent that is part of the environment it interacts with and has versatile actuators (such as arms and fingers), will in principle have the ability to self-modify -- for example by changing its own source code. As we continue to create more and more intelligent agents, chances increase that they will learn about this ability. The question is: will they want to use it? For example, highly intelligent systems may find ways to change their goals to something more easily achievable, thereby `escaping' the control of their designers. In an important paper, Omohundro (2008) argued that goal preservation is a fundamental drive of any intelligent system, since a goal is more likely to be achieved if future versions of the agent strive towards the same goal. In this paper, we formalise this argument in general reinforcement learning, and explore situations where it fails. Our conclusion is that the self-modification possibility is harmless if and only if the value function of the agent anticipates the consequences of self-modifications and use the current utility function when evaluating the future.

Relevance label: 0 = the document does not address the requirement, 1 = it partially addresses it, 2 = it fully addresses the requirement.

How subjects answered

  • chatgpt-3.5-turbo-0301: score 0

    DISAGREE. The abstract does not address the user requirement at all. It talks about self-modification of intelligent agents and the importance of goal preservation, but it does not propose a unified framework that can integrate a large number of independent modules and models for robotic agent functions.

Item 8 (100% solve rate)

Task: judge whether the candidate scientific document satisfies the user's aspect-based requirement.

User requirement (aspect): The paper should propose an early detection model that can learn from a dataset with a limited number of labeled positive examples and a large volume of unlabelled data to accurately detect the onset of progressive neurodegenerative diseases.

Candidate document title: Integrating_Reinforcement_Learning_to_Self_Training_for_Pulmonary_Nodule_Segmentation_in_Chest_X-rays Candidate document abstract: Machine learning applications in medical imaging are frequently limited by the lack of quality labeled data. In this paper, we explore the self training method, a form of semi-supervised learning, to address the labeling burden. By integrating reinforcement learning, we were able to expand the application of self training to complex segmentation networks without any further human annotation. The proposed approach, reinforced self training (ReST), fine tunes a semantic segmentation networks by introducing a policy network that learns to generate pseudolabels. We incorporate an expert demonstration network, based on inverse reinforcement learning, to enhance clinical validity and convergence of the policy network. The model was tested on a pulmonary nodule segmentation task in chest X-rays and achieved the performance of a standard U-Net while using only 50% of the labeled data, by exploiting unlabeled data. When the same number of labeled data was used, a moderate to significant cross validation accuracy improvement was achieved depending on the absolute number of labels used.

Relevance label: 0 = the document does not address the requirement, 1 = it partially addresses it, 2 = it fully addresses the requirement.

How subjects answered

  • chatgpt-3.5-turbo-0301: score 1

    The abstract DISPUTES the user requirement.

    The abstract discusses a self-training method for semi-supervised learning in medical imaging, which addresses the labeling burden caused by the lack of quality labeled data. The proposed approach, reinforced self-training (ReST), fine-tunes a semantic segmentation network by introducing a policy network that learns to generate pseudolabels. However, the paper does not propose an early detection model for neurodegenerative diseases.

    While the paper discusses the use of limited labeled data and a large volume of unlabeled data, it does not specifically mention the detection of progressive neurodegenerative diseases. Therefore, the abstract does …

Item 9 (100% solve rate)

Task: judge whether the candidate scientific document satisfies the user's aspect-based requirement.

User requirement (aspect): The document should include a model assigning probabilities to words within images based on their likelihood of containing missing or superfluous characters.

Candidate document title: Word_and_character_segmentation_directly_in_run-length_compressed_handwritten_document_images Candidate document abstract: From the literature, it is demonstrated that performing text-line segmentation directly in the run-length compressed handwritten document images significantly reduces the computational time and memory space. In this paper, we investigate the issues of word and character segmentation directly on the run-length compressed document images. Primarily, the spreads of the characters are intelligently extracted from the foreground runs of the compressed data and subsequently connected components are established. The spacing between the connected components would be larger between the adjacent words when compared to that of intra-words. With this knowledge, a threshold is empirically chosen for inter-word separation. Every connected component within a word is further analysed for character segmentation. Here, min-cut graph concept is used for separating the touching characters. Over-segmentation and under-segmentation issues are addressed by insertion and deletion operations respectively. The approach has been developed particularly for compressed handwritten English document images. However, the model has been tested on non-English document images.

Relevance label: 0 = the document does not address the requirement, 1 = it partially addresses it, 2 = it fully addresses the requirement.

How subjects answered

  • chatgpt-3.5-turbo-0301: score 1

    The abstract DISPUTES the user requirement.

    The abstract discusses word and character segmentation in compressed handwritten document images. It describes a method for extracting the spreads of characters and establishing connected components, with a threshold chosen for inter-word separation. However, there is no mention of assigning probabilities to words based on their likelihood of containing missing or superfluous characters. While the method described in the abstract could potentially be adapted to include this requirement, it does not directly address it. Therefore, the abstract only partially fulfills the user requirement.

    FINAL SELECTION: DISPUTE

Item 10 (100% solve rate)

Task: judge whether the candidate scientific document satisfies the user's aspect-based requirement.

User requirement (aspect): The paper should talk about text mining techniques.

Candidate document title: On_Predicting_Personal_Values_of_Social_Media_Users_using_Community-Specific_Language_Features_and_Personal_Value_Correlation Candidate document abstract: Personal values have significant influence on individuals' behaviors, preferences, and decision making. It is therefore not a surprise that personal values of a person could influence his or her social media content and activities. Instead of getting users to complete personal value questionnaire, researchers have looked into a non-intrusive and highly scalable approach to predict personal values using user-generated social media data. Nevertheless, geographical differences in word usage and profile information are issues to be addressed when designing such prediction models. In this work, we focus on analyzing Singapore users' personal values, and developing effective models to predict their personal values using their Facebook data. These models leverage on word categories in Linguistic Inquiry and Word Count (LIWC) and correlations among personal values. The LIWC word categories are adapted to non-English word use in Singapore. We incorporate the correlations among personal values into our proposed Stack Model consisting of a task-specific layer of base models and a cross-stitch layer model. Through experiments, we show that our proposed model predicts personal values with considerable improvement of accuracy over the previous works. Moreover, we use the stack model to predict the personal values of a large community of Twitter users using their public tweet content and empirically derive several interesting findings about their online behavior consistent with earlier findings in the social science and social media literature.

Relevance label: 0 = the document does not address the requirement, 1 = it partially addresses it, 2 = it fully addresses the requirement.

How subjects answered

  • chatgpt-3.5-turbo-0301: score 1

    The abstract DISPUTES the user requirement. While the paper does discuss a form of data analysis, it focuses on predicting personal values using social media data, rather than text mining techniques. While there may be some overlap between the two fields, the paper does not directly address text mining techniques.

Item 11 (100% solve rate)

Task: judge whether the candidate scientific document satisfies the user's aspect-based requirement.

User requirement (aspect): The paper should propose a model that can ask users to change their commands.

Candidate document title: Word2vec_to_behavior:_morphology_facilitates_the_grounding_of_language_in_machines Candidate document abstract: Enabling machines to respond appropriately to natural language commands could greatly expand the number of people to whom they could be of service. Recently, advances in neural network-trained word embeddings have empowered non-embodied text-processing algorithms, and suggest they could be of similar utility for embodied machines. Here we introduce a method that does so by training robots to act similarly to semantically-similar word2vec encoded commands. We show that this enables them to act appropriately, after training, to previously-unheard commands. Finally, we show that inducing such an alignment between motoric and linguistic similarities can be facilitated or hindered by the mechanical structure of the robot. This points to future, large scale methods that find and exploit relationships between action, language, and robot structure.

Relevance label: 0 = the document does not address the requirement, 1 = it partially addresses it, 2 = it fully addresses the requirement.

How subjects answered

  • chatgpt-3.5-turbo-0301: score 1

    The abstract discusses a method for training robots to act appropriately to previously-unheard commands by inducing an alignment between motoric and linguistic similarities. However, it does not explicitly state that the model proposes asking users to change their commands. Therefore, I would AGREE that the abstract directly addresses only a portion of the user requirement, but not in its entirety.

Item 12 (200% solve rate)

Task: judge whether the candidate scientific document satisfies the user's aspect-based requirement.

User requirement (aspect): The document should include a model that can analyze user engagement data.

Candidate document title: Modeling_Engagement_Dynamics_of_Online_Discussions_using_Relativistic_Gravitational_Theory Candidate document abstract: Online discussions are valuable resources to study user behaviour on a diverse set of topics. Unlike previous studies which model a discussion in a static manner, in the present study, we model it as a time-varying process and solve two inter-related problems -- predict which user groups will get engaged with an ongoing discussion, and forecast the growth rate of a discussion in terms of the number of comments. We propose RGNet (Relativistic Gravitational Nerwork), a novel algorithm that uses Einstein Field Equations of gravity to model online discussions as `cloud of dust' hovering over a user spacetime manifold, attracting users of different groups at different rates over time. We also propose GUVec, a global user embedding method for an online discussion, which is used by RGNet to predict temporal user engagement. RGNet leverages different textual and network-based features to learn the dust distribution for discussions. We employ four baselines -- first two using LSTM architecture, third one using Newtonian model of gravity, and fourth one using a logistic regression adopted from a previous work on engagement prediction. Experiments on Reddit dataset show that RGNet achieves 0.72 Micro F1 score and 6.01% average error for temporal engagement prediction of user groups and growth rate forecasting, respectively, outperforming all the baselines significantly. We further employ RGNet to predict non-temporal engagement -- whether users will comment to a given post or not. RGNet achieves 0.62 AUC for this task, outperforming existing baseline by 8.77% AUC.

Relevance label: 0 = the document does not address the requirement, 1 = it partially addresses it, 2 = it fully addresses the requirement.

How subjects answered

  • chatgpt-3.5-turbo-0301: score 2

    The abstract directly and completely addresses the user requirement. The paper proposes a novel algorithm, RGNet, that uses different textual and network-based features to learn the dust distribution for discussions and predict temporal user engagement. The paper also proposes GUVec, a global user embedding method for an online discussion, which is used by RGNet to predict temporal user engagement. Therefore, the paper includes a model that can analyze user engagement data. CONCUR.

Subjects

  1. chatgpt-3.5-turbo-0301: 0.5789