Commonsense Reasoning with Logical Entailment Trees
Developing benchmarks and evaluation methods for logical commonsense reasoning
Overview
This project develops benchmarks and evaluation methods for logical commonsense reasoning in large language models, conducted at the BLAST Lab, University of Colorado Boulder.
Timeline: June 2025 – Present
Advisor: Dr. Maria L. Pacheco, CU Boulder
Target Venue: ACL 2026
Research Progress
Completed Work
Benchmark Development
Created a commonsense QA benchmark dataset designed to evaluate logical reasoning capabilities:
- Logically Composed Options: Question-answer pairs with options that require multi-fact reasoning
- Multi-Step Reasoning Trees: Structured representations of reasoning chains
- Compositional Structure: Questions that test compositional and logical reasoning abilities
The benchmark enables systematic evaluation of how language models handle complex commonsense inference tasks that require combining multiple facts and logical steps.
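To make the structure concrete, a single benchmark item might look like the following sketch. The field names (`supporting_facts`, `entailment_tree`, and so on) and the example content are illustrative placeholders, not the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class ReasoningStep:
    premises: list[str]   # IDs of facts or earlier steps this step combines
    conclusion: str       # intermediate conclusion entailed by the premises

@dataclass
class BenchmarkItem:
    question: str
    options: list[str]                    # logically composed answer options
    answer_index: int                     # index of the correct option
    supporting_facts: dict[str, str]      # fact ID -> commonsense fact
    entailment_tree: list[ReasoningStep]  # multi-step reasoning chain

item = BenchmarkItem(
    question="Why would someone carry an umbrella on a cloudy morning?",
    options=[
        "They expect rain and plan to walk outside",
        "They expect sunshine and plan to stay indoors",
    ],
    answer_index=0,
    supporting_facts={
        "f1": "Cloudy skies often precede rain",
        "f2": "Umbrellas keep people dry in the rain",
    },
    entailment_tree=[
        ReasoningStep(
            premises=["f1", "f2"],
            conclusion="Carrying an umbrella is useful when rain is likely",
        )
    ],
)
```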
Baseline Evaluation
Benchmarked reasoning quality using standard LLM prompting strategies:
- N-Shot Prompting: Tested few-shot prompting with varying numbers of in-context examples
- Chain-of-Thought (CoT) Prompting: Evaluated step-by-step reasoning generation
- Baseline Performance: Established baseline metrics for reasoning quality on the benchmark
These baselines provide comparison points for evaluating more advanced reasoning methods.
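For concreteness, here is a minimal sketch of how such baseline prompts might be assembled. The templates and function names are illustrative, not the exact prompts used in the evaluation:

```python
def format_options(options: list[str]) -> str:
    """Render answer options as a numbered list."""
    return "\n".join(f"({i}) {opt}" for i, opt in enumerate(options))

def build_nshot_prompt(examples, question, options):
    """Assemble an n-shot prompt: worked examples followed by the target question."""
    parts = [
        f"Question: {q}\n{format_options(opts)}\nAnswer: ({ans})"
        for q, opts, ans in examples
    ]
    parts.append(f"Question: {question}\n{format_options(options)}\nAnswer:")
    return "\n\n".join(parts)

def build_cot_prompt(question, options):
    """Chain-of-thought variant: elicit step-by-step reasoning before the answer."""
    return (
        f"Question: {question}\n{format_options(options)}\n"
        "Let's think step by step, then state the final answer as (index)."
    )
```

Both baselines score only the final selected option; the entailment-tree annotations described above are what allow finer-grained comparison beyond answer accuracy.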
In Progress: Advanced Reasoning Methods
Currently designing and implementing structured evaluation methods:
- Neuro-Symbolic Framework: Combining informal logic with neural approaches
- Logical Structure Modeling: Tracking entailment relationships and inference chains
- Step-Level Evaluation: Assessing inference quality at each reasoning step (see the sketch after this list)
- Social Consensus Integration: Incorporating commonsense knowledge patterns grounded in social consensus
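As one possible instantiation of step-level evaluation, each step of an entailment tree could be scored with an off-the-shelf NLI model, reusing the `ReasoningStep` structure from the earlier sketch. The example below uses the Hugging Face `transformers` pipeline with the public `roberta-large-mnli` checkpoint; the threshold and the all-steps-must-entail aggregation are illustrative assumptions, not the lab's finalized method:

```python
from transformers import pipeline

# Off-the-shelf NLI model; the checkpoint choice is an illustrative assumption.
nli = pipeline("text-classification", model="roberta-large-mnli")

def entailment_score(premises: list[str], conclusion: str) -> float:
    """Probability that the concatenated premises entail the conclusion."""
    scores = nli({"text": " ".join(premises), "text_pair": conclusion}, top_k=None)
    return next(s["score"] for s in scores if s["label"] == "ENTAILMENT")

def evaluate_tree(tree, facts, threshold=0.5):
    """Score every step; the tree passes only if each step clears the threshold."""
    derived = {}  # step ID -> conclusion, so later steps can cite earlier ones
    step_scores = []
    for i, step in enumerate(tree):
        premise_texts = [facts.get(p) or derived.get(p, p) for p in step.premises]
        step_scores.append(entailment_score(premise_texts, step.conclusion))
        derived[f"s{i}"] = step.conclusion
    return all(s >= threshold for s in step_scores), step_scores
```

Scoring each step, rather than only the final answer, makes it possible to localize exactly where a reasoning chain breaks down.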
Research Goals
This work aims to:
- Provide better tools for evaluating commonsense reasoning in LLMs
- Understand how models perform multi-step logical inference
- Bridge symbolic and neural approaches to reasoning
- Enable more nuanced assessment beyond traditional QA accuracy metrics
This research is conducted at the BLAST Lab at CU Boulder under the supervision of Dr. Maria L. Pacheco.