Commonsense Reasoning with Logical Entailment Trees

Developing benchmarks and evaluation methods for logical commonsense reasoning

Overview

Developing benchmarks and reasoning methods for evaluating logical commonsense reasoning in large language models at the BLAST Lab, University of Colorado Boulder.

Timeline: June 2025 – Present
Advisor: Dr. Maria L. Pacheco, CU Boulder
Target Venue: ACL 2026

Research Progress

Completed Work

Benchmark Development

Created a commonsense QA benchmark dataset designed to evaluate logical reasoning capabilities:

  • Logically Composed Options: Question-answer pairs whose answer options are logical compositions of multiple facts, so selecting the correct option requires multi-fact reasoning
  • Multi-Step Reasoning Trees: Structured representations of reasoning chains
  • Compositional Structure: Questions that test compositional and logical reasoning abilities

The benchmark enables systematic evaluation of how language models handle complex commonsense inference tasks that require combining multiple facts and logical steps.
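As a concrete illustration, a single benchmark entry might pair a multiple-choice question with an explicit reasoning tree. The sketch below is hypothetical Python; the field names (question, options, entailment_tree, and so on) are illustrative assumptions, not the dataset's actual schema.

    # Hypothetical sketch of one benchmark entry; field names are illustrative
    # assumptions rather than the dataset's actual format.
    example_entry = {
        "question": "Which item would most likely be kept in a refrigerator?",
        "options": {
            "A": "a carton of milk",   # correct option; requires combining facts
            "B": "a box of nails",
            "C": "a wool sweater",
        },
        "answer": "A",
        # Multi-step reasoning tree: leaf facts jointly entail the conclusion
        # that supports the correct option.
        "entailment_tree": {
            "conclusion": "A carton of milk should be kept in a refrigerator.",
            "premises": [
                "Milk is perishable.",
                "Perishable foods spoil unless kept cold.",
                "A refrigerator keeps food cold.",
            ],
        },
    }

Representing the reasoning chain explicitly in each entry is what makes it possible to score intermediate inference steps rather than only the final answer.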

Baseline Evaluation

Benchmarked reasoning quality using standard LLM prompting strategies:

  • N-Shot Prompting: Tested few-shot learning approaches with varying numbers of examples
  • Chain-of-Thought (CoT) Prompting: Evaluated step-by-step reasoning generation
  • Baseline Performance: Established baseline metrics for reasoning quality on the benchmark

These baselines provide comparison points for evaluating more advanced reasoning methods.
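To make the baseline setup concrete, the sketch below shows one way the n-shot and chain-of-thought prompts could be assembled. It is model-agnostic Python under stated assumptions: generate stands in for whatever LLM completion call is used, and the example question and prompt wording are illustrative, not the exact templates from the experiments.

    # Illustrative few-shot example; the content is made up for demonstration.
    FEW_SHOT_EXAMPLES = [
        (
            "Q: What do people usually use an umbrella for?\n"
            "Options: A) staying dry in the rain  B) cooking dinner  C) writing letters",
            "Answer: A",
        ),
    ]

    def build_prompt(question: str, options: str, n_shot: int = 0, cot: bool = False) -> str:
        """Assemble an n-shot prompt, optionally asking for step-by-step reasoning."""
        parts = [q + "\n" + a for q, a in FEW_SHOT_EXAMPLES[:n_shot]]
        instruction = (
            "Think step by step, then state the final answer."
            if cot
            else "State the final answer."
        )
        parts.append(f"Q: {question}\nOptions: {options}\n{instruction}")
        return "\n\n".join(parts)

    def run_baseline(generate, question: str, options: str, **kwargs) -> str:
        # `generate` is any callable mapping a prompt string to model output.
        return generate(build_prompt(question, options, **kwargs))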

In Progress: Advanced Reasoning Methods

Currently designing and implementing structured evaluation methods:

  • Neuro-Symbolic Framework: Combining informal logic with neural approaches
  • Logical Structure Modeling: Tracking entailment relationships and inference chains
  • Step-Level Evaluation: Assessing inference quality at each reasoning step (sketched after this list)
  • Social Consensus Integration: Incorporating commonsense knowledge patterns
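
A minimal sketch of what step-level evaluation could look like, assuming a hypothetical entailment scorer entail_score(premises, conclusion) -> float (for example, a wrapper around an NLI model). This illustrates the idea of scoring each step of a reasoning chain; it is not the project's actual implementation.

    from typing import Callable, List

    def evaluate_steps(
        steps: List[dict],
        entail_score: Callable[[List[str], str], float],
        threshold: float = 0.5,
    ) -> dict:
        """Score each reasoning step; a step counts as supported if its
        premises entail its conclusion above the threshold."""
        scores = [entail_score(step["premises"], step["conclusion"]) for step in steps]
        return {
            "step_scores": scores,
            "supported_steps": sum(s >= threshold for s in scores),
            "total_steps": len(scores),
        }

Scoring steps individually localizes reasoning failures, which is the kind of nuance a single end-to-end accuracy number cannot provide.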

Research Goals

This work aims to:

  1. Provide better tools for evaluating commonsense reasoning in LLMs
  2. Understand how models perform multi-step logical inference
  3. Bridge symbolic and neural approaches to reasoning
  4. Enable more nuanced assessment beyond traditional QA accuracy metrics

This research is conducted at the BLAST Lab at CU Boulder under the supervision of Dr. Maria L. Pacheco.