Commonsense Reasoning with Logical Entailment Trees

Developing benchmarks and evaluation methods for logical commonsense reasoning

Overview

Developing benchmarks and reasoning methods for evaluating logical commonsense reasoning in large language models at the BLAST Lab, University of Colorado Boulder.

Timeline: June 2025 – Present
Advisor: Dr. Maria L. Pacheco, CU Boulder
Target Venue: ACL 2026

Research Progress

Completed Work

Benchmark Development

Created a commonsense QA benchmark dataset designed to evaluate logical reasoning capabilities:

  • Logically Composed Options: Question-answer pairs whose answer options are logical compositions of multiple facts, so selecting the correct option requires multi-fact reasoning
  • Multi-Step Reasoning Trees: Structured representations of reasoning chains
  • Compositional Structure: Questions that test compositional and logical reasoning abilities

The benchmark enables systematic evaluation of how language models handle complex commonsense inference tasks that require combining multiple facts and logical steps.
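As a concrete illustration, a single benchmark entry might pair a multiple-choice question with an explicit reasoning tree. The sketch below is hypothetical Python; the field names (question, options, entailment_tree, and so on) are illustrative assumptions, not the dataset's actual schema.

    # Hypothetical sketch of one benchmark entry; field names are illustrative
    # assumptions rather than the dataset's actual format.
    example_entry = {
        "question": "Which item would most likely be kept in a refrigerator?",
        "options": {
            "A": "a carton of milk",   # correct option; requires combining facts
            "B": "a box of nails",
            "C": "a wool sweater",
        },
        "answer": "A",
        # Multi-step reasoning tree: leaf facts jointly entail the conclusion
        # that supports the correct option.
        "entailment_tree": {
            "conclusion": "A carton of milk should be kept in a refrigerator.",
            "premises": [
                "Milk is perishable.",
                "Perishable foods spoil unless kept cold.",
                "A refrigerator keeps food cold.",
            ],
        },
    }

Representing the reasoning chain explicitly in each entry is what makes it possible to score intermediate inference steps rather than only the final answer.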

Baseline Evaluation

Benchmarked reasoning quality using standard LLM prompting strategies:

  • N-Shot Prompting: Tested few-shot learning approaches with varying numbers of examples
  • Chain-of-Thought (CoT) Prompting: Evaluated step-by-step reasoning generation
  • Baseline Performance: Established baseline metrics for reasoning quality on the benchmark

These baselines provide comparison points for evaluating more advanced reasoning methods.
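To make the baseline setup concrete, the sketch below shows one way the n-shot and chain-of-thought prompts could be assembled. It is model-agnostic Python under stated assumptions: generate stands in for whatever LLM completion call is used, and the example question and prompt wording are illustrative, not the exact templates from the experiments.

    # Illustrative few-shot example; the content is made up for demonstration.
    FEW_SHOT_EXAMPLES = [
        (
            "Q: What do people usually use an umbrella for?\n"
            "Options: A) staying dry in the rain  B) cooking dinner  C) writing letters",
            "Answer: A",
        ),
    ]

    def build_prompt(question: str, options: str, n_shot: int = 0, cot: bool = False) -> str:
        """Assemble an n-shot prompt, optionally asking for step-by-step reasoning."""
        parts = [q + "\n" + a for q, a in FEW_SHOT_EXAMPLES[:n_shot]]
        instruction = (
            "Think step by step, then state the final answer."
            if cot
            else "State the final answer."
        )
        parts.append(f"Q: {question}\nOptions: {options}\n{instruction}")
        return "\n\n".join(parts)

    def run_baseline(generate, question: str, options: str, **kwargs) -> str:
        # `generate` is any callable mapping a prompt string to model output.
        return generate(build_prompt(question, options, **kwargs))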

In Progress: Advanced Reasoning Methods

Currently designing and implementing structured evaluation methods:

  • Neuro-Symbolic Framework: Combining informal logic with neural approaches
  • Logical Structure Modeling: Tracking entailment relationships and inference chains
  • Step-Level Evaluation: Assessing inference quality at each reasoning step (sketched after this list)
  • Social Consensus Integration: Incorporating commonsense knowledge patterns
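
A minimal sketch of what step-level evaluation could look like, assuming a hypothetical entailment scorer entail_score(premises, conclusion) -> float (for example, a wrapper around an NLI model). This illustrates the idea of scoring each step of a reasoning chain; it is not the project's actual implementation.

    from typing import Callable, List

    def evaluate_steps(
        steps: List[dict],
        entail_score: Callable[[List[str], str], float],
        threshold: float = 0.5,
    ) -> dict:
        """Score each reasoning step; a step counts as supported if its
        premises entail its conclusion above the threshold."""
        scores = [entail_score(step["premises"], step["conclusion"]) for step in steps]
        return {
            "step_scores": scores,
            "supported_steps": sum(s >= threshold for s in scores),
            "total_steps": len(scores),
        }

Scoring steps individually localizes reasoning failures, which is the kind of nuance a single end-to-end accuracy number cannot provide.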

Research Goals

This work aims to:

  1. Provide better tools for evaluating commonsense reasoning in LLMs
  2. Understand how models perform multi-step logical inference
  3. Bridge symbolic and neural approaches to reasoning
  4. Enable more nuanced assessment beyond traditional QA accuracy metrics

This research is conducted at the BLAST Lab at CU Boulder under the supervision of Dr. Maria L. Pacheco.