Grammar Aware Language Models

Karl V. Muller

Cornell Tech

km2262@cornell.edu

Abstract

Large language models (LLMs) demonstrate strong emergent grammatical fluency, but can still violate the constraints of a language's grammar, for example in code generation or formal reasoning. We investigate whether an explicit, lightweight grammar prior can improve syntactic correctness without modifying the weights of the model itself. We propose a grammar aware scoring framework that augments an LLM's sentence log-likelihood with a part-of-speech (POS) bigram prior trained on tagged text. The resulting model acts as a product of experts, combining the semantic likelihood of a sentence from the LLM with the structural plausibility of that sentence from the POS model. We evaluate this approach on the BLiMP benchmark of minimal grammatical pairs. Our results show that POS based priors consistently improve performance on several syntactic structures governed by local regularities (e.g. subject-verb agreement, intransitives, irregular past participles) while occasionally degrading performance on more complex structures (e.g. negative polarity items, Principle A). Overall, we find that grammar priors are most effective when applied selectively, highlighting both the promise and the limitations of a simple, shallow grammatical bias.

1. Introduction

Despite rapid advances in scale and training data, LLMs remain imperfect with respect to grammatical structure. While attention mechanisms capture long range dependencies across a sentence well, they do not explicitly represent or enforce the syntactic constraints of a language. The grammatical structure an LLM learns is therefore purely emergent, leading to outputs that may be semantically plausible yet structurally ill formed. This gap becomes problematic in downstream applications where syntactic validity is critically tied to semantics, such as programming languages, formal proofs, or mathematics; in these domains, even minor grammar violations can invalidate an output. Moreover, for languages underrepresented in the training data, syntactic generalization degrades even further, motivating explicit mechanisms for steering grammatical structure.

Prior work has incorporated grammatical structure into language model decoding by explicitly constraining the output space, for example through Grammar Constrained Decoding [Zhou et al., 2023] and Grammar Aligned Decoding [Wang et al., 2024]. These approaches restrict generation to tokens licensed by a formal grammar and modify decoding to match the conditional language model distribution. While effective at enforcing well formed sentences, such methods impose a hard constraint on generation and tightly couple decoding to a specific formal grammar.

In contrast, we explore whether grammatical knowledge can be incorporated as a soft probabilistic prior, biasing an LLM toward syntactically plausible output while still allowing the base model to determine the semantics. Rather than restricting token generation or retraining the model, we combine an LLM with an interpretable POS bigram grammar via a product of experts, and we empirically study its effect across a wide range of syntactic structures in the BLiMP benchmark.

2. Background & Related Work

Language modeling is typically formulated as next token prediction, where models are trained to maximize likelihood over large corpora and are expected to implicitly acquire syntactic structure from the data. While modern LLMs exhibit strong grammatical fluency, it remains an open question to what extent they internalize a language's grammatical rules beyond surface level regularities.

To benchmark the abilities of language models to produce grammatically correct output, several targeted evaluation benchmarks have been proposed. The Corpus of Linguistic Acceptability (CoLA) consists of English sentences labeled as grammatical or ungrammatical, enabling acceptability classification and error analysis across grammatical structures [Warstadt et al., 2019]. The Benchmark of Linguistic Minimal Pairs (BLiMP) extends this line of work by providing minimal pairs that isolate specific grammatical structures, allowing a more fine grained evaluation of syntactic generalization [Warstadt et al., 2023]. These benchmarks have revealed that language models learn some systematic grammatical patterns, such as word order and agreement, but struggle with constructions requiring longer distance dependencies.

Beyond evaluation, several approaches have sought to incorporate grammatical structure directly into language model inference. Grammar Constrained Decoding (GCD) enforces hard constraints derived from formal grammars, restricting generation to syntactically valid token sequences [Zhou et al., 2023]. Grammar Aligned Decoding (GAD) similarly integrates grammatical structure by modifying decoding distributions to better match grammar conditioned models [Wang et al., 2024]. While effective at enforcing structure, these methods impose hard generation constraints and tightly couple decoding to a specific formal grammar.

Classical natural language processing has long employed probabilistic models as lightweight and interpretable representations of linguistic structure. In particular, POS tagging abstracts away lexical content while preserving grammatical patterns, and n-gram models over POS sequences capture local syntactic dependencies such as agreement and short range structure [Jurafsky and Martin, 2025]. More generally, combining multiple probabilistic models to jointly evaluate data can be formalized through a product of experts framework, in which agreement among experts sharpens the resulting distribution [Hinton, 1999].

3. Method

Building on probabilistic grammar modeling, we combine a pretrained LLM with a POS bigram grammar using a product of experts formulation, treating grammar as a soft probabilistic prior rather than a hard constraint as in GCD and GAD.

Let y = (w_1, ..., w_T) denote a sentence of length T. A pretrained causal language model assigns the log-likelihood:

s_LM(y) = Σ_{t=1}^{T} log p_LM(w_t | w_{<t})
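As a concrete illustration, the LM score is simply the sum of token-level log-probabilities under any pretrained causal model. The sketch below uses the Hugging Face transformers API and the Qwen3-1.7B checkpoint described in Section 4.1; the model id and tokenization details are implementation assumptions, not part of the method.

```python
# Minimal sketch of s_LM: sum of token log-probabilities under a pretrained
# causal LM (model id and API are assumptions, not part of the formal definition).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
lm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B")
lm.eval()

def s_lm(sentence: str) -> float:
    """Return sum_t log p_LM(w_t | w_<t) for the tokenized sentence."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits
    # Position t predicts token t+1: align logits with the observed next tokens.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    return log_probs.gather(-1, targets.unsqueeze(-1)).sum().item()
```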

Given a POS tagging (z_1, ..., z_T) of the sentence, a POS bigram grammar trained on tagged text assigns the log-likelihood:

s_POS(y) = Σ_{t=1}^{T} log p_POS(z_t | z_{t-1})
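A minimal sketch of the grammar expert is shown below: the sentence is tagged with spaCy and its tag sequence is scored under a bigram model. Here `pos_lm` is assumed to expose NLTK's language-model interface (its training is sketched in Section 4.1), and the use of coarse universal POS tags is an assumption.

```python
# Minimal sketch of s_POS: tag the sentence with spaCy and score the tag
# sequence under a bigram model. `pos_lm` is assumed to follow NLTK's lm
# interface, whose logscore() returns log base 2.
import math
import spacy

nlp = spacy.load("en_core_web_sm")  # tagger choice is discussed in Section 4.1

def s_pos(sentence: str, pos_lm) -> float:
    tags = ["<s>"] + [tok.pos_ for tok in nlp(sentence)]  # left-pad with a start symbol
    score = 0.0
    for prev, curr in zip(tags, tags[1:]):
        score += pos_lm.logscore(curr, [prev]) * math.log(2)  # convert log2 to ln
    return score
```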

We combine these components using a product-of-experts scoring function:

s_GLM(y) = s_LM(y) + λ · s_POS(y)

where λ ≥ 0 controls the influence of the grammar prior. In probability space, this corresponds to multiplying the LLM likelihood by a POS grammar likelihood raised to the power λ. When λ = 0, the model reduces to the base LLM. We evaluate this grammar aware score in a sentence discrimination setting, where grammatical and ungrammatical sentence pairs are compared directly using s_GLM.
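In the discrimination setting, a pair is counted correct when the grammatical sentence receives the higher combined score. A minimal sketch, reusing the illustrative s_lm and s_pos helpers above:

```python
# Minimal sketch of the combined score and the minimal-pair decision rule.
LAMBDA = 0.5  # grammar weight used in Section 4.1

def s_glm(sentence: str, pos_lm, lam: float = LAMBDA) -> float:
    return s_lm(sentence) + lam * s_pos(sentence, pos_lm)

def discriminate(good: str, bad: str, pos_lm) -> bool:
    """True if the grammatical sentence receives the higher combined score."""
    return s_glm(good, pos_lm) > s_glm(bad, pos_lm)
```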

4. Experimentation & Results

4.1 Models and Grammar Priors

We evaluated our grammar aware scoring framework using Qwen3-1.7B, a pretrained causal language model, in inference mode only; no model parameters are updated. As the grammar expert, we used a POS bigram language model trained on POS tagged text derived from the WikiText-2 corpus (wikitext-2-raw-v1). The POS tagging of this corpus was obtained using spaCy, and the bigram model was trained with Kneser-Ney (KN) smoothing to better handle unseen POS bigrams.

We used two different spaCy taggers, a small neural network based tagger (en_core_web_sm) and a transformer based tagger (en_core_web_trf), to construct two POS KN bigram models and account for sensitivity to tagging quality. The grammar weight was fixed at λ = 0.5, which provided a representative balance between language model and grammar scores in preliminary experiments.
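For concreteness, the grammar expert can be built entirely with off-the-shelf tools. The sketch below tags WikiText-2 with spaCy and fits a Kneser-Ney smoothed bigram model; the choice of NLTK for the n-gram model and of coarse universal POS tags (token.pos_) are implementation assumptions.

```python
# Sketch of building the grammar expert: POS-tag WikiText-2 with spaCy and fit
# a Kneser-Ney smoothed bigram model with NLTK (library choice is an assumption).
import spacy
from datasets import load_dataset
from nltk.lm import KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

nlp = spacy.load("en_core_web_sm")  # or "en_core_web_trf" for the transformer tagger

wiki = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
texts = [line for line in wiki["text"] if line.strip()]

# Keep only the POS tag sequence for each line of the corpus.
tag_seqs = [[tok.pos_ for tok in doc] for doc in nlp.pipe(texts)]

# Fit a Kneser-Ney smoothed bigram language model over the tag sequences.
train_ngrams, vocab = padded_everygram_pipeline(2, tag_seqs)
pos_lm = KneserNeyInterpolated(order=2)
pos_lm.fit(train_ngrams, vocab)
```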

4.2 Benchmarks and Evaluation

We evaluated exclusively on the BLiMP benchmark, which consists of minimal pairs of grammatical and ungrammatical English sentences covering 67 syntactic subsets, each containing controlled sentence pairs that differ only in grammaticality. Using this benchmark, we measured sentence discrimination accuracy: the proportion of minimal pairs in a given subset for which the scoring function assigns a higher score to the grammatical sentence than to the ungrammatical one.
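The per-subset evaluation loop is straightforward; the sketch below uses the Hugging Face `datasets` release of BLiMP (the dataset id and field names are assumptions based on the public release) together with the illustrative discriminate() helper from Section 3.

```python
# Sketch of sentence discrimination accuracy on one BLiMP subset.
from datasets import load_dataset

def blimp_accuracy(subset: str, pos_lm) -> float:
    pairs = load_dataset("blimp", subset, split="train")
    correct = sum(
        discriminate(ex["sentence_good"], ex["sentence_bad"], pos_lm)
        for ex in pairs
    )
    return correct / len(pairs)

# Example: one of the 67 syntactic subsets.
print(blimp_accuracy("anaphor_gender_agreement", pos_lm))
```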

4.3 Results Across Syntactic Structures

Table 1: Mean sentence discrimination accuracy on BLiMP grouped by syntactic categories. POS bigram priors help local dependencies (subject/verb agreement, islands, intransitives, and irregular past participle verbs) but continue to underperform on NPI licensing, Principle A, and wh–that alternations.

Phenomenon Type                  | LM    | POS Bigram (sm) | POS Bigram (trf) | Δ (trf)
---------------------------------|-------|-----------------|------------------|--------
Subject–verb Agreement           | 0.835 | 0.847           | 0.858            | +0.023
Wh-movement                      | 0.876 | 0.936           | 0.935            | +0.059
Adjunct + wh Islands             | 0.800 | 0.864           | 0.866            | +0.066
Intransitive                     | 0.752 | 0.824           | 0.820            | +0.068
Irregular Past Participle Verbs  | 0.804 | 0.872           | 0.872            | +0.068
Quantifiers / Existential        | 0.788 | 0.813           | 0.813            | +0.025
NPI Licensing                    | 0.666 | 0.629           | 0.626            | −0.039
Binding (Principle A)            | 0.778 | 0.706           | 0.708            | −0.070
Wh–that Alternations             | 0.604 | 0.521           | 0.521            | −0.083
Overall (67 subsets)             | 0.772 | 0.758           | 0.758            | −0.014

Across all 67 BLiMP subsets, the grammar aware models show a small decrease in mean accuracy relative to the base LLM (−0.014 on average), with a median difference of 0.0. However, this aggregate obscures substantial gains on specific subsets. Approximately 30 of the 67 subsets improve under the grammar prior, with the largest gains observed for subject/verb agreement, wh-movement, and existential-there constructions, including long distance subject gaps.

In contrast, we observe pronounced degradations on subsets whose grammaticality depends on hierarchical or semantic understanding, such as negative polarity item licensing, Principle A binding, and wh-complementizer subsets. These results suggest that POS bigram grammars effectively capture local surface regularities but are less well suited to modeling deeper structural dependencies.

Comparing POS models trained with the small and transformer based spaCy taggers yields only marginal differences, and neither variant changes the slightly negative overall mean effect.

4.4 Summary of Findings

Taken together, these results indicate that POS based grammar priors can meaningfully improve syntactic discrimination for certain classes of constructions but degrade performance for others. A single global grammar weight applied uniformly across constructions masks these differences, highlighting the need for selective or adaptive application of grammatical priors.

5. Discussion

Our experiments paint a nuanced picture of grammar aware language modeling with POS priors. While the BLiMP evaluation showed that a POS bigram expert can substantially improve performance on certain syntactic constructions, it can also degrade results on others, highlighting both the promise and the limitations of such a simple grammar expert.

5.1 Where POS Priors Help

We observe consistent gains on minimal pairs where the syntax is governed by local or surface level regularities, including subject/verb agreement, determiner/noun agreement, intransitive/transitive alternations, and several wh-movement pairs. These improvements are particularly pronounced for long distance wh-subject gaps and existential constructions, suggesting that even simple POS priors can boost LLM predictions when a construction relies on consistent local tag patterns, since some long distance dependencies still surface as local changes in the tag sequence.

This aligns with classical NLP findings that POS n-grams effectively model agreement and short range dependencies [Jurafsky and Martin, 2025], suggesting that LLMs may underweight certain structural signals that are easily recoverable at the POS level.

5.2 Where POS Priors Hurt

In contrast, the POS grammar prior degrades performance on minimal pairs that require hierarchical structure or contextual and semantic understanding. These include Principle A binding constraints, negative polarity item licensing, and wh–that alternations with gaps. In such cases, simple bigram POS sequences fail to capture the relevant grammatical dependencies, and the grammar expert favors syntactically frequent but linguistically invalid continuations. These regressions persist across different taggers and smoothing strategies, indicating that the limitation lies in the representational capacity and simplicity of a POS bigram.

5.3 Limits of Uniform Grammar Weighting

A key finding of the benchmarking is that applying a single global grammar weight across all syntactic minimal pairs is suboptimal. While the POS prior improves roughly half of the BLiMP subsets, the remaining subsets experience regressions large enough to produce a slightly negative average effect overall. This suggests that grammatical structure is not uniformly beneficial and that grammar aware modeling must be adaptive to the current syntax and context being evaluated.

5.4 Implications for Grammar Aware Language Modeling

Framing grammar integration as a product of experts provides a flexible alternative to grammar constrained decoding, preserving the original token space and avoiding retraining. However, our results indicate that a POS bigram grammar expert should be treated as a modular, selectively applied component rather than a universally beneficial constraint. POS level grammar captures important regularities, but its limitations show the need for richer or higher level grammatical representations.

6. Conclusion

We investigated whether lightweight grammatical structure can be incorporated into large language models as a soft probabilistic prior, rather than as a hard decoding constraint. By combining a pretrained LLM with a POS bigram grammar in a product of experts formulation, we evaluated grammar aware scoring across 67 syntactic sets in the BLiMP benchmark.

Our results demonstrate that POS based grammar priors can meaningfully improve performance on a subset of syntactic constructions, particularly those governed by local structural regularities. At the same time, they expose clear limitations on syntax requiring hierarchical or semantic understanding. Taken together, these findings suggest that even a simple grammatical abstraction can complement LLMs, but only when applied selectively and with awareness of its representational limits.

7. Future Work

The limitations observed for POS based grammar priors suggest that richer syntactic representations are needed to capture constraints beyond local category transitions. A natural extension is to incorporate dependency aware grammar priors that more directly reflect hierarchical relationships. For example, one could train n-gram models over dependency labels or over structured tuples such as (head POS, dependency label, child POS), extracted from dependency parsed corpora.

Such representations may better capture syntax that POS bigrams systematically fail on, including binding constraints (relative positions of pronouns and antecedents), wh–that alternations (presence or absence of clause linking dependencies), and negative polarity item licensing (structural relationships between licensors and NPIs). Sentences could be tagged using a dependency parser, transformed into linearized dependency sequences or paths, and scored using a Kneser–Ney smoothed n-gram model. The resulting dependency level likelihood could then be combined with the base language model in the same product of experts framework explored in this work. This approach would retain the interpretability and modularity of grammar priors while providing a syntactic bias more closely aligned with the grammatical constraints underlying the challenging BLiMP subsets.
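As an illustration of what such a dependency-level prior could look like, the snippet below extracts (head POS, dependency label, child POS) tuples with spaCy; these sequences could then be linearized and scored with the same Kneser-Ney n-gram machinery used for the POS bigram expert. The string encoding of the tuples is an assumption for illustration only.

```python
# Illustrative sketch: extract (head POS, dependency label, child POS) tuples
# with spaCy as the basis for a dependency-level grammar prior.
import spacy

nlp = spacy.load("en_core_web_sm")  # the parser supplies head/dependency structure

def dep_tuples(sentence: str) -> list[str]:
    doc = nlp(sentence)
    return [f"{tok.head.pos_}-{tok.dep_}-{tok.pos_}" for tok in doc]

print(dep_tuples("No student has ever lied."))
# e.g. ['NOUN-det-DET', 'VERB-nsubj-NOUN', ...] (exact labels depend on the parser)
```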

References

G. E. Hinton. Products of experts. Proceedings of the Ninth International Conference on Artificial Neural Networks, 1999. URL https://www.cs.toronto.edu/~fritz/absps/icann-99.pdf.

D. Jurafsky and J. H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, with Language Models. 3rd edition, 2025. URL https://web.stanford.edu/~jurafsky/slp3/. Online manuscript released August 24, 2025.

X. Wang et al. Grammar-aligned decoding. arXiv preprint arXiv:2405.21047, 2024. URL https://arxiv.org/abs/2405.21047.

A. Warstadt, A. Singh, and S. R. Bowman. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471, 2019. URL https://arxiv.org/abs/1805.12471.

A. Warstadt, A. Parrish, H. Liu, A. Mohananey, W. Peng, S.-F. Wang, and S. R. Bowman. BLiMP: The Benchmark of Linguistic Minimal Pairs for English. arXiv preprint arXiv:1912.00582, 2023. URL https://arxiv.org/abs/1912.00582.

J. Zhou et al. Grammar-constrained decoding for structured text generation. arXiv preprint arXiv:2305.13971, 2023. URL https://arxiv.org/abs/2305.13971.
