
LOGICAL-COMMONSENSEQA

A Benchmark for Logical Commonsense Reasoning

High CommonsenseQA accuracy does not imply logical commonsense reasoning.

Obed Junias¹ · Maria Leonor Pacheco¹
¹University of Colorado Boulder

ACL 2026 Short Papers · Commonsense Reasoning · BLAST Lab · CU Boulder


Abstract

Commonsense reasoning often involves evaluating multiple plausible interpretations rather than selecting a single atomic answer, yet most benchmarks rely on single-label evaluation, obscuring whether statements are jointly plausible, mutually exclusive, or jointly implausible. We introduce LOGICAL-COMMONSENSEQA, a benchmark that reframes commonsense reasoning as logical composition over pairs of atomic statements using plausibility-level operators (AND, OR, and NEITHER/NOR). Evaluating instruction-tuned, reasoning-specialized, and fine-tuned models under zero-shot, few-shot, and chain-of-thought prompting, we find that while models perform reasonably on conjunctive and moderately on disjunctive reasoning, performance degrades sharply on negation-based questions. LOGICAL-COMMONSENSEQA exposes fundamental reasoning limitations and provides a controlled framework for advancing compositional commonsense reasoning.

Motivation

From single-answer ranking to logical commonsense composition

When we say a model is good at commonsense reasoning, we usually mean it can pick a plausible answer from a set of choices. For example, asked what someone driving a car might have seen, the model picks "automobile accidents" — and we count that as correct.

But when humans reason through a question, we rarely stop at one answer. A situation can have multiple plausible interpretations. And when we evaluate those possibilities, we naturally ask:

Can both of these be true at the same time?

Can at least one of them be true?

Can neither of them be true?

That is the motivation behind this benchmark. We ask: can models compose commonsense plausibility using logical operators — AND, OR, and NEITHER/NOR? The task format stays multiple-choice, so any model that can take a standard QA benchmark can take this one.

Research Gap

Existing benchmarks test parts of the problem, but not all three together

Commonsense QA

CommonsenseQA, SocialIQA, PIQA

Tests everyday plausibility but in a single-answer format. Ambiguity is erased, and logical relationships between answer choices are never evaluated.

Ambiguity-Aware

AmbigQA, ProtoQA

Recognizes that questions can have multiple valid answers, but does not evaluate explicit logical composition over those alternatives.

Logical Reasoning

LogiQA, ReClor, COM2

Tests structured inference with logical operators, but the focus is formal validity — not everyday commonsense plausibility.

LOGICAL-COMMONSENSEQA sits at the intersection: commonsense plausibility + ambiguity + explicit logical composition.

Logical Operators

Each answer option pairs two atomic statements under one of three plausibility-level operators

AND

Both statements are independently plausible given the question context

OR

At least one statement is plausible (the two are not mutually exclusive)

NEITHER/NOR

Neither statement is plausible — both are jointly implausible

MIXED

Different operators across answer choices within the same question — prevents shortcut exploitation

Q: Sammy wanted to go to where the people were. Where might he go?

AND	local events AND social venues	✔ both plausible
OR	local events OR empty parks	✔ at least one plausible
NEITHER/NOR	NEITHER quiet retreats NOR empty parks	✔ neither is plausible
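The operator definitions above can be read as truth functions over two plausibility labels. The following is a minimal sketch of that reading, not the benchmark's released code: the names `Operator`, `holds`, and `render` are hypothetical, and the plausibility labels for the Sammy example are inferred from the table above.

```python
from enum import Enum


class Operator(Enum):
    AND = "AND"
    OR = "OR"
    NEITHER_NOR = "NEITHER/NOR"


def holds(op: Operator, first_plausible: bool, second_plausible: bool) -> bool:
    """Return True when a composite option is correct under `op`."""
    if op is Operator.AND:
        return first_plausible and second_plausible
    if op is Operator.OR:
        return first_plausible or second_plausible
    # NEITHER/NOR: both statements must be implausible.
    return not first_plausible and not second_plausible


def render(op: Operator, first: str, second: str) -> str:
    """Format a composite option the way the example table writes it."""
    if op is Operator.NEITHER_NOR:
        return f"NEITHER {first} NOR {second}"
    return f"{first} {op.value} {second}"


# Assumed labels for the Sammy example (inferred from the table, not released data).
PLAUSIBLE = {
    "local events": True,
    "social venues": True,
    "empty parks": False,
    "quiet retreats": False,
}


def check(op: Operator, first: str, second: str) -> bool:
    """Look up both labels and apply the operator."""
    return holds(op, PLAUSIBLE[first], PLAUSIBLE[second])
```

Under these assumed labels, `check` returns `True` for all three rows of the table, and `check(Operator.AND, "local events", "empty parks")` returns `False` because one clause is implausible.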

Benchmark Construction

Neural generation with deterministic symbolic composition — no model in the composition loop

Candidate Generation

Starting from 5,000 CommonsenseQA questions, GPT-4o-mini over-generates 4–6 plausible and 4–6 implausible atomic answer candidates per question, specifically prompted for multi-step causal and situational reasoning rather than shallow lexical cues.

Refinement and Pruning

Options are filtered to remove trivial answers solvable by keyword matching. Near-miss distractors are deliberately preserved — options that satisfy most contextual constraints but fail on one subtle commonsense violation. Result: 3 correct and 4 incorrect atomic options per question.

Deterministic Symbolic Composition

A symbolic program pairs refined atomic options and assigns operator labels (AND, OR, NEITHER/NOR). This step is fully deterministic — no language model is involved in composition or labeling. We also construct a MIXED setting where different operators appear across choices within the same question, yielding 4,999 additional instances.
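A deterministic composition step of this kind might look like the sketch below. It is an illustration under assumptions, not the authors' program: the page does not say which pairs are kept or how distractors are sampled, so this version simply enumerates every pair of atomic options and labels each composite by rule. The names `Atomic`, `Composite`, and `compose`, and the placeholder option texts, are hypothetical.

```python
from itertools import combinations
from typing import NamedTuple


class Atomic(NamedTuple):
    text: str
    plausible: bool


class Composite(NamedTuple):
    text: str
    operator: str
    correct: bool


def compose(atomics: list[Atomic]) -> list[Composite]:
    """Pair atomic options and label every composite by rule (no model involved)."""
    out = []
    for a, b in combinations(atomics, 2):
        both = a.plausible and b.plausible
        either = a.plausible or b.plausible
        out.append(Composite(f"{a.text} AND {b.text}", "AND", both))
        out.append(Composite(f"{a.text} OR {b.text}", "OR", either))
        out.append(Composite(f"NEITHER {a.text} NOR {b.text}", "NEITHER/NOR", not either))
    return out


# Placeholder texts standing in for 3 plausible and 4 implausible refined options.
atomics = [Atomic(f"plausible-{i}", True) for i in range(3)]
atomics += [Atomic(f"implausible-{i}", False) for i in range(4)]
composites = compose(atomics)
```

With 3 plausible and 4 implausible atomic options there are 21 pairs. Under the rules in this sketch, 3 of them are correct AND options, 15 are correct OR options, and 6 are correct NEITHER/NOR options; the rest can serve as labeled distractors.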

Human validation · Gwet's AC2 = 0.84 (awareness) / 0.91 (consensus) · 250 test questions, two independent annotators

19,996

Total instances

4,999

Per operator (AND / OR / NEITHER / MIXED)

11,996 / 6,000 / 2,000

Train / Dev / Test

Evaluation Setup

Instruction-tuned, reasoning-specialized, and fine-tuned models across multiple prompting regimes

Paradigm	Models	Prompting
Prompted	LLaMA-3.1-8B, LLaMA-3.3-70B, Qwen2.5-7B, Gemini-2.5-Flash, Gemini-3-Flash-Preview	Zero-shot, 1/2/3-shot, CoT
Fine-tuned	Flan-T5-base (seq2seq), DeBERTa-v3-base (encoder), LLaMA-3.1-8B (QLoRA)	Supervised fine-tuning
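To make the prompting regimes in the table concrete, here is one way a multiple-choice instance could be turned into a zero-shot, few-shot, or chain-of-thought prompt. The template, the `Instance` fields, and the chain-of-thought cue are all assumptions for illustration; the exact prompts used in the paper are not given on this page. Four options per question is assumed from the 25% random-chance figure, and the distractor options are built from the Sammy example only to make the sketch runnable.

```python
from dataclasses import dataclass

LETTERS = "ABCD"


@dataclass
class Instance:
    question: str
    options: list[str]  # four composite options (assumed)
    answer: int         # index of the correct option


def format_instance(inst: Instance, with_answer: bool) -> str:
    """Render one question block, optionally with its gold answer letter."""
    lines = [f"Question: {inst.question}"]
    lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(inst.options)]
    lines.append(f"Answer: {LETTERS[inst.answer]}" if with_answer else "Answer:")
    return "\n".join(lines)


def build_prompt(target: Instance, shots: list[Instance], cot: bool = False) -> str:
    """Zero-shot when `shots` is empty; k-shot otherwise; optional CoT cue."""
    parts = [format_instance(s, with_answer=True) for s in shots]
    parts.append(format_instance(target, with_answer=False))
    prompt = "\n\n".join(parts)
    if cot:
        prompt += " Let's think step by step."  # hypothetical cue
    return prompt


# Illustrative instance; only option A comes from the example table.
sammy = Instance(
    question="Sammy wanted to go to where the people were. Where might he go?",
    options=[
        "local events AND social venues",
        "local events AND empty parks",
        "quiet retreats AND empty parks",
        "social venues AND quiet retreats",
    ],
    answer=0,
)
zero_shot = build_prompt(sammy, shots=[])
one_shot = build_prompt(sammy, shots=[sammy])
```

Keeping the answer format identical across regimes means the same answer-letter parser can score every run.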

Key Results

Models collapse on NEITHER/NOR — near or below random chance across model families and prompting strategies

13.1%

LLaMA-3.1-8B · 0-shot

NEITHER/NOR Macro-F1 — below random chance (25%)

13.4%

LLaMA-3.3-70B · 0-shot

Roughly 9× the parameters, essentially no gain over the 8B model

23.5%

Gemini-2.5-Flash · 0-shot

Frontier proprietary model — still near random chance

89.5%

LLaMA-3.1-8B · fine-tuned

Supervision recovers the gap — the failure is learnable

The same model, 59 points lower. LLaMA-3.1-8B scores 72.2% on CommonsenseQA. On LOGICAL-COMMONSENSEQA, the same model drops to 13.1% on NEITHER/NOR — a 59-point fall on the same underlying commonsense knowledge. Single-answer benchmarks were giving us a flattering picture.

Humans vs. Models

Human evaluation on NEITHER/NOR scores 0.70, while the zero-shot LLaMA models score about 0.13 and Gemini-2.5-Flash about 0.24. The task is clearly solvable by people; the collapse is specific to models at inference time.

Few-shot prompting hurts

With 3 in-context examples, LLaMA-3.1-8B drops from 13.1% to ~6% on NEITHER/NOR. Chain-of-thought provides no rescue either, reaching only 8.2%.
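The NEITHER/NOR scores in this section are Macro-F1 values. As a reminder of how that metric behaves, here is a small dependency-free sketch. It assumes the classes being averaged are the answer letters, which this page does not state, and the function name `macro_f1` is ours.

```python
def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# A predictor that always answers "A" on a balanced four-way set.
gold = ["A", "B", "C", "D"]
always_a = ["A", "A", "A", "A"]
collapsed = macro_f1(gold, always_a)
```

In this toy case accuracy is 25% but Macro-F1 is 0.1, because three of the four classes are never predicted. This is a property of the metric, shown here only to explain how a score can fall below the 25% chance level; it is not a claim about what any evaluated model did.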

Analysis

The failure is compositional, not atomic — models know the facts but cannot compose them under operators

To pinpoint where failures arise, we decomposed the task into three components and evaluated each in isolation with LLaMA-3.1-8B. The results show that knowledge is largely present — the breakdown is in composition.

79%

Atomic Plausibility

Classifying individual statements as plausible or implausible, in isolation. The commonsense knowledge is largely intact.

↓

52–69%

Operator Verification

Given gold plausibility labels, does the model determine whether they satisfy the target operator? Even with correct atomic facts, applying the logical relation is already imperfect.

↓

46.8%

NEITHER/NOR + Distractors

Full task with competing composite answer options. Distractor competition causes the sharpest additional drop. The difficulty arises from the interaction of all three components, not any one alone.

Error Patterns

AND: Single-statement dominance. The model anchors on one plausible clause and treats the full option as correct, ignoring whether the second clause is also plausible.

OR: Thematic similarity. Rather than verifying that at least one clause is plausible, the model selects options whose two clauses are thematically related but individually implausible.

NEITHER/NOR: Negation inversion and plausibility dominance. The model selects the most plausible pair of statements despite the operator requiring both to be implausible. Highly plausible content survives even when it should be rejected.
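The plausibility-dominance pattern can be illustrated with a toy comparison between a selector that ignores the operator and one that respects it. Everything here is hypothetical: the scores are made-up numbers rather than model outputs, the four options are built from the Sammy example, and scoring NEITHER/NOR as `1 - max(...)` is just one simple soft reading of "both implausible".

```python
# Made-up plausibility scores in [0, 1]; not model outputs.
SCORES = {
    "local events": 0.9,
    "social venues": 0.8,
    "quiet retreats": 0.2,
    "empty parks": 0.1,
}

# Four NEITHER/NOR options for one question; only the last is correct.
OPTIONS = [
    ("local events", "social venues"),
    ("local events", "empty parks"),
    ("social venues", "quiet retreats"),
    ("quiet retreats", "empty parks"),
]
GOLD = 3


def blind_score(pair: tuple[str, str]) -> float:
    """Operator-blind: reward whichever pair sounds most plausible overall."""
    return SCORES[pair[0]] + SCORES[pair[1]]


def neither_nor_score(pair: tuple[str, str]) -> float:
    """Operator-aware: high only when both statements are implausible."""
    return 1.0 - max(SCORES[pair[0]], SCORES[pair[1]])


def pick(score_fn) -> int:
    """Return the index of the highest-scoring option."""
    return max(range(len(OPTIONS)), key=lambda i: score_fn(OPTIONS[i]))


blind_choice = pick(blind_score)
aware_choice = pick(neither_nor_score)
```

The operator-blind selector picks option 0, the pair with the two most plausible statements, which is exactly the option NEITHER/NOR should reject. The operator-aware selector picks option 3, the gold answer.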

LOGICAL-COMMONSENSEQA separates knowing commonsense facts from composing them under logical constraints. Models are not simply missing knowledge: they fail when negation, operator scope, and distractor competition interact simultaneously.

Takeaways

High CommonsenseQA accuracy does not imply logical commonsense reasoning

Benchmarks can overestimate reasoning

72.2% → 13.1%

Same model, same commonsense knowledge — different question format.

Implication: Single-answer accuracy tells a flattering story about model reasoning ability.

Negation reveals hidden failure

13–23%

All zero-shot models land near or below random chance (25%) on NEITHER/NOR.

Implication: Scaling from 8B to 70B parameters does not fix the problem.

The gap is learnable

89.5%

NEITHER/NOR Macro-F1 after fine-tuning LLaMA-3.1-8B with QLoRA.

Implication: The failure is an inference-time limitation, not a dataset artifact.

Core message

Logical commonsense reasoning requires more than selecting a plausible answer. It requires composing plausibility under explicit operators and resisting distractors that are individually plausible but logically incorrect.

Presentation

ACL 2026 · 64th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

San Diego, California, United States · Poster: July 6, 2026. If you are attending ACL and would like to discuss the paper, feel free to reach out.

View the conference slides →

Citation

@inproceedings{junias-pacheco-2026-logical,
  title      = {{LOGICAL}-{COMMONSENSEQA}: A Benchmark for Logical Commonsense Reasoning},
  author     = "Junias, Obed and Pacheco, Maria Leonor",
  booktitle = "Proceedings of the 64th Annual Meeting of the {A}ssociation for {C}omputational {L}inguistics (Volume 2: Short Papers)",
  month      = jul,
  year       = "2026",
  address    = "San Diego, California, United States",
  publisher = "Association for Computational Linguistics",
  url        = "https://aclanthology.org/2026.acl-short.61/",
  pages      = "746--758",
  ISBN       = "979-8-89176-391-3"
}

This research is conducted at the BLAST Lab at the University of Colorado Boulder under the supervision of Dr. Maria Leonor Pacheco.