LOGICAL-COMMONSENSEQA

A Benchmark for Logical Commonsense Reasoning

🏆 ACL 2026 Main Conference 📖 Commonsense Reasoning 🏔 CU Boulder · BLAST Lab

Obed Junias¹ · Maria Leonor Pacheco¹
¹University of Colorado Boulder

Abstract. Commonsense reasoning often involves evaluating multiple plausible interpretations rather than selecting a single atomic answer, yet most benchmarks rely on single-label evaluation, obscuring whether statements are jointly plausible, mutually exclusive, or jointly implausible. We introduce LOGICAL-COMMONSENSEQA, a benchmark that reframes commonsense reasoning as logical composition over pairs of atomic statements using plausibility-level operators (AND, OR, and NEITHER/NOR). Evaluating instruction-tuned, reasoning-specialized, and fine-tuned models under zero-shot, few-shot, and chain-of-thought prompting, we find that while models perform reasonably on conjunctive and moderately on disjunctive reasoning, performance degrades sharply on negation-based questions. LOGICAL-COMMONSENSEQA exposes fundamental reasoning limitations and provides a controlled framework for advancing compositional commonsense reasoning.

Motivation

Standard commonsense QA benchmarks ask models to pick a single correct answer. But everyday reasoning rarely works this way — understanding that “it can rain AND be sunny” or “you cannot be in two cities at once” requires logical composition, not just plausibility scoring over singletons.

Existing benchmarks obscure this distinction. LOGICAL-COMMONSENSEQA makes it explicit by encoding the logical relationship between pairs of commonsense statements as a first-class reasoning target.

Logical Operators
  • AND: Both statements are jointly plausible given the question context.
  • OR: At least one statement is plausible; the two are not mutually exclusive.
  • NEITHER / NOR: Neither statement is plausible; both are jointly implausible.

Example question:

Q: What would you do if you were hungry?
S₁: You would eat food.
S₂: You would drink water.
Logical operator: AND
✓ Both are jointly plausible responses to hunger.
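
The operator definitions above can be read as a mapping from the plausibility of each statement to a label. The following is a minimal sketch of that reading, not the authors' labeling code. It assumes binary plausibility judgments, and it assumes OR is assigned only when exactly one statement is plausible so that the three labels do not overlap; the names `Operator` and `label_pair` are hypothetical.

```python
from enum import Enum


class Operator(Enum):
    AND = "AND"
    OR = "OR"
    NEITHER = "NEITHER/NOR"


def label_pair(s1_plausible: bool, s2_plausible: bool) -> Operator:
    """Map two binary plausibility judgments to an operator label.

    Assumption: OR means exactly one statement is plausible, so the
    three labels are mutually exclusive.
    """
    if s1_plausible and s2_plausible:
        return Operator.AND
    if s1_plausible or s2_plausible:
        return Operator.OR
    return Operator.NEITHER


# The example above: both statements are judged plausible responses to hunger.
print(label_pair(True, True).value)
```

Under these assumptions the hunger example maps to AND, matching the label shown above.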

Benchmark Construction

The benchmark is built on top of CommonsenseQA and its underlying ConceptNet knowledge graph:

  1. Pair Sampling — For each question, pairs of answer candidates are sampled and labeled with AND / OR / NEITHER based on their joint plausibility under the original question context.
  2. Annotation Protocol — Labels are derived from the structure of human-annotated ConceptNet triples combined with crowdsourced plausibility judgments.
  3. Compositional Test Set — Items span all three operator types, enabling fine-grained evaluation of conjunction, disjunction, and negation-based commonsense reasoning.
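
The pair sampling step can be illustrated with a short sketch. It is a simplified stand-in under stated assumptions, not the benchmark's construction pipeline: the function name `sample_pairs`, the uniform sampling strategy, and the candidate strings are all hypothetical, and the labels themselves would still come from the annotation protocol described above.

```python
import random
from itertools import combinations


def sample_pairs(candidates, k, seed=0):
    """Sample up to k unordered pairs of distinct answer candidates."""
    pairs = list(combinations(candidates, 2))
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    rng.shuffle(pairs)
    return pairs[:k]


# Hypothetical candidates for "What would you do if you were hungry?"
candidates = ["eat food", "drink water", "go to sleep", "paint a wall", "fly a kite"]
for s1, s2 in sample_pairs(candidates, k=3):
    print(f"S1: {s1} | S2: {s2}")
```

With five candidates there are ten possible unordered pairs, so `k` caps how many of them are kept per question.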

Evaluation Setup

We evaluate a diverse set of model families and prompting strategies:

Category               Models
Instruction-tuned      GPT-4o, Claude 3 Sonnet, Llama-3 Instruct
Reasoning-specialized  o1-mini, DeepSeek-R1
Fine-tuned             RoBERTa-large, DeBERTa-v3

Prompting conditions: Zero-shot · Few-shot · Chain-of-Thought (CoT)
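
The three prompting conditions differ only in what surrounds the test item. The sketch below shows one way they could be assembled. The templates, the wording of the instruction, the idea of asking the model to name the operator directly, and the names `format_item` and `build_prompt` are all assumptions for illustration; the prompts actually used in the paper may differ.

```python
def format_item(question, s1, s2):
    """Render one question with its two statements (hypothetical template)."""
    return (
        f"Q: {question}\n"
        f"S1: {s1}\n"
        f"S2: {s2}\n"
        "Which operator relates S1 and S2: AND, OR, or NEITHER?"
    )


def build_prompt(item, condition, demos=()):
    """Build a prompt for 'zero-shot', 'few-shot', or 'cot'.

    item is (question, s1, s2); demos are (question, s1, s2, label) tuples
    and are only used in the few-shot condition.
    """
    if condition not in {"zero-shot", "few-shot", "cot"}:
        raise ValueError(f"unknown condition: {condition}")
    parts = []
    if condition == "few-shot":
        for q, s1, s2, label in demos:
            parts.append(format_item(q, s1, s2) + f"\nAnswer: {label}")
    query = format_item(*item)
    if condition == "cot":
        query += "\nLet's think step by step, then give the final operator."
    else:
        query += "\nAnswer:"
    parts.append(query)
    return "\n\n".join(parts)


demo = ("What would you do if you were hungry?",
        "You would eat food.", "You would drink water.", "AND")
item = ("Where would you keep milk?", "In a fridge.", "In an oven.")  # made-up item
print(build_prompt(item, "few-shot", demos=[demo]))
```

Keeping the item template identical across conditions means any difference in accuracy can be attributed to the demonstrations or the reasoning instruction rather than to formatting.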

Key Findings
  • Models perform reasonably on AND (conjunctive) reasoning — the easiest operator — especially under chain-of-thought prompting.
  • Performance is moderate on OR (disjunctive) reasoning, with notable variance across model families.
  • Performance degrades sharply on NEITHER/NOR (negation-based) questions — the hardest operator — across all model types and prompting conditions.
  • Chain-of-thought prompting helps for AND and OR but provides limited benefit for negation, suggesting structural reasoning gaps rather than insufficient context.
  • Fine-tuned discriminative models exhibit different failure modes compared to large generative instruction-tuned models.
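
Findings of this kind rest on reporting accuracy separately for each operator rather than as a single aggregate. A minimal sketch of that breakdown follows; the function name `accuracy_by_operator` and the record format are hypothetical, and the toy predictions are made up for illustration and are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_operator(records):
    """Compute accuracy per gold operator.

    records: iterable of (gold_operator, predicted_operator) pairs.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for gold, pred in records:
        total[gold] += 1
        correct[gold] += int(gold == pred)
    return {op: correct[op] / total[op] for op in total}


# Made-up predictions, for illustration only.
toy = [
    ("AND", "AND"), ("AND", "AND"),
    ("OR", "AND"), ("OR", "OR"),
    ("NEITHER", "OR"), ("NEITHER", "AND"),
]
for op, acc in accuracy_by_operator(toy).items():
    print(f"{op}: {acc:.2f}")
```

Grouping by the gold operator keeps a strong score on one operator from masking a weak score on another.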

Contributions
  1. LOGICAL-COMMONSENSEQA benchmark — A novel evaluation dataset that reframes commonsense QA as logical composition over statement pairs, enabling controlled assessment of AND, OR, and NEITHER/NOR reasoning.

  2. Systematic evaluation — Comprehensive benchmarking of instruction-tuned, reasoning-specialized, and fine-tuned models across multiple prompting regimes (zero-shot, few-shot, CoT).

  3. Diagnostic insight — A detailed error analysis revealing that negation-based commonsense reasoning is a persistent, cross-model bottleneck not resolved by scaling or improved prompting.

  4. Foundation for future work — A controlled compositional framework that enables targeted research on advancing logical commonsense reasoning beyond single-label benchmarks.

Presentation

ACL 2026 · 63rd Annual Meeting of the Association for Computational Linguistics

This work is presented at the main conference. If you are attending ACL 2026 and would like to discuss the paper, feel free to reach out.

Citation
@article{junias2026logical,
  title = {LOGICAL-COMMONSENSEQA: A Benchmark for Logical Commonsense Reasoning},
  author = {Junias, Obed and Pacheco, Maria Leonor},
  journal = {arXiv preprint arXiv:2601.16504},
  year = {2026},
  url = {https://arxiv.org/abs/2601.16504}
}

This research is conducted at the BLAST Lab at the University of Colorado Boulder under the supervision of Dr. Maria Leonor Pacheco.