LOGICAL-COMMONSENSEQA
A Benchmark for Logical Commonsense Reasoning
Standard commonsense QA benchmarks ask models to pick a single correct answer. But everyday reasoning rarely works this way — understanding that “it can rain AND be sunny” or “you cannot be in two cities at once” requires logical composition, not just plausibility scoring over singletons.
Existing benchmarks obscure this distinction. LOGICAL-COMMONSENSEQA makes it explicit by encoding the logical relationship between pairs of commonsense statements as a first-class reasoning target.
Example question:
S₂: You would drink water.
The benchmark is built on top of CommonsenseQA and its underlying ConceptNet knowledge graph:
- Pair Sampling: For each question, pairs of answer candidates are sampled and labeled AND, OR, or NEITHER/NOR according to their joint plausibility under the original question context.
- Annotation Protocol: Labels are derived from the structure of human-annotated ConceptNet triples combined with crowdsourced plausibility judgments.
- Compositional Test Set: Items span all three operator types, enabling fine-grained evaluation of conjunctive (AND), disjunctive (OR), and negation-based (NEITHER/NOR) commonsense reasoning.
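The pair-labeling step can be pictured with a small sketch. This is not the paper's pipeline. It assumes each candidate already carries a single boolean plausibility judgment, and it assumes one possible labeling rule (both plausible gives AND, exactly one gives OR, none gives NEITHER). The function names `label_pair` and `sample_pairs` and the example candidates are hypothetical.

```python
from itertools import combinations


def label_pair(plausible_a: bool, plausible_b: bool) -> str:
    """Map two plausibility judgments to an operator label (assumed rule)."""
    if plausible_a and plausible_b:
        return "AND"      # both statements are plausible together
    if plausible_a or plausible_b:
        return "OR"       # exactly one statement is plausible
    return "NEITHER"      # neither statement is plausible


def sample_pairs(candidates: dict[str, bool]) -> list[tuple[str, str, str]]:
    """Enumerate every unordered candidate pair and attach its label."""
    return [
        (a, b, label_pair(candidates[a], candidates[b]))
        for a, b in combinations(candidates, 2)
    ]


# Hypothetical plausibility judgments for one question context.
candidates = {
    "drink water": True,
    "rest in the shade": True,
    "eat sand": False,
    "light a fire": False,
}
pairs = sample_pairs(candidates)
for a, b, label in pairs:
    print(f"{label:8s} {a} / {b}")
```

Enumerating all pairs is the simplest choice for illustration; a real construction would likely subsample pairs to balance the three operator types.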
We evaluate a diverse set of model families and prompting strategies:
| Category | Models |
|---|---|
| Instruction-tuned | GPT-4o, Claude 3 Sonnet, Llama-3 Instruct |
| Reasoning-specialized | o1-mini, DeepSeek-R1 |
| Fine-tuned | RoBERTa-large, DeBERTa-v3 |
Prompting conditions: Zero-shot · Few-shot · Chain-of-Thought (CoT)
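As a rough illustration of how the three prompting conditions differ, the sketch below assembles a prompt for one statement pair. The template wording, the `build_prompt` and `format_item` helpers, and the demonstration format are all assumptions made for this example and are not the prompts used in the paper.

```python
LABELS = ("AND", "OR", "NEITHER")  # operator labels named on this page


def format_item(question, s1, s2):
    """Render one statement pair; the wording is illustrative only."""
    return (
        f"Question: {question}\n"
        f"S1: {s1}\n"
        f"S2: {s2}\n"
        f"Options: {', '.join(LABELS)}"
    )


def build_prompt(question, s1, s2, condition="zero-shot", demos=()):
    """Build a prompt for 'zero-shot', 'few-shot', or 'cot' (hypothetical templates)."""
    if condition not in ("zero-shot", "few-shot", "cot"):
        raise ValueError(f"unknown condition: {condition}")
    parts = []
    if condition == "few-shot":
        # Prepend solved demonstrations, each given as (question, s1, s2, label).
        parts += [f"{format_item(q, a, b)}\nAnswer: {label}" for q, a, b, label in demos]
    # Chain-of-thought asks for reasoning before the label; the others ask for the label directly.
    suffix = ("Let's think step by step, then state the final label."
              if condition == "cot" else "Answer:")
    parts.append(f"{format_item(question, s1, s2)}\n{suffix}")
    return "\n\n".join(parts)
```

For example, `build_prompt(q, s1, s2, "few-shot", demos=[(dq, d1, d2, "AND")])` yields one solved demonstration followed by the unanswered target item.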
- Models perform reasonably on AND (conjunctive) reasoning — the easiest operator — especially under chain-of-thought prompting.
- Performance is moderate on OR (disjunctive) reasoning, with notable variance across model families.
- Performance degrades sharply on NEITHER/NOR (negation-based) questions — the hardest operator — across all model types and prompting conditions.
- Chain-of-thought prompting helps for AND and OR but provides limited benefit for negation, suggesting structural reasoning gaps rather than insufficient context.
- Fine-tuned discriminative models exhibit different failure modes compared to large generative instruction-tuned models.
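A per-operator breakdown like the one summarized above can be computed with a few lines of code. The sketch below assumes predictions are available as (gold label, predicted label) pairs. The function name `accuracy_by_operator` is hypothetical, and the records are toy values chosen to exercise the code, not results from the paper.

```python
from collections import defaultdict


def accuracy_by_operator(records):
    """Return accuracy per gold operator label from (gold, predicted) pairs."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for gold, pred in records:
        total[gold] += 1
        correct[gold] += int(pred == gold)
    return {op: correct[op] / total[op] for op in total}


# Toy records for illustration only; these are not benchmark results.
records = [
    ("AND", "AND"), ("AND", "AND"),
    ("OR", "OR"), ("OR", "AND"),
    ("NEITHER", "OR"), ("NEITHER", "NEITHER"),
]
scores = accuracy_by_operator(records)
for op, acc in scores.items():
    print(f"{op:8s} {acc:.2f}")
```

Reporting accuracy separately per operator, rather than one pooled number, is what makes a gap on a single operator visible.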
- LOGICAL-COMMONSENSEQA benchmark — A novel evaluation dataset that reframes commonsense QA as logical composition over statement pairs, enabling controlled assessment of AND, OR, and NEITHER/NOR reasoning.
- Systematic evaluation — Comprehensive benchmarking of instruction-tuned, reasoning-specialized, and fine-tuned models across multiple prompting regimes (zero-shot, few-shot, CoT).
- Diagnostic insight — A detailed error analysis revealing that negation-based commonsense reasoning is a persistent, cross-model bottleneck not resolved by scaling or improved prompting.
- Foundation for future work — A controlled compositional framework that enables targeted research on advancing logical commonsense reasoning beyond single-label benchmarks.
ACL 2026 63rd Annual Meeting of the Association for Computational Linguistics
This work is presented at the main conference. If you are attending ACL 2026 and would like to discuss the paper, feel free to reach out.
title = {LOGICAL-COMMONSENSEQA: A Benchmark for Logical Commonsense Reasoning},
author = {Junias, Obed and Pacheco, Maria Leonor},
journal = {arXiv preprint arXiv:2601.16504},
year = {2026},
url = {https://arxiv.org/abs/2601.16504}
}
This research is conducted at the BLAST Lab at the University of Colorado Boulder under the supervision of Dr. Maria Leonor Pacheco.