Azhary Arliansyah


Giving AI a Map for Sentences: How Syntax Makes Transformers Smarter

NLP Research Transformers ASTE BiLSTM Sentiment Analysis

Aspect-Based Sentiment Triplet Extraction (ASTE) is currently one of the most intricate challenges in NLP. Standard sentiment analysis evaluates the overall polarity of a sentence. In contrast, ASTE pushes the boundary by extracting complete triplets: an aspect term, an opinion term, and the sentiment polarity linking them.

In this research, the authors tackle the Symmetry Ambiguity problem. The proposed model, the Syntax-Aware Transformer (SA-Transformer), overcomes this by injecting explicit syntactic dependencies and relative distances into the attention mechanism, guiding semantics with structural context [1].


1. Model Architecture Overview

To resolve the Symmetry Ambiguity problem, the SA-Transformer is designed with a dual-branch architecture. It processes the semantic meaning of the sequence and its grammatical (syntactic) structure in parallel, merging the two with a syntax-aware attention mechanism.

[Architecture diagram: the input sentence "The staff was very courteous but food was terrible..." feeds two parallel branches. Semantic branch: GloVe word embeddings ($w_i \in \mathbb{R}^{300}$) → BiLSTM → hidden states $H^{(l)} \in \mathbb{R}^{d_h}$. Syntactic branch: dependency parser → adjacency matrix $Adj(A_{ij})$ and relation matrix $Rel(R_{ij})$ → Adjacent Edge Attention (AEA) → edge representations $E^{(l)} \in \mathbb{R}^{d}$. The SA-Transformer's syntax-aware attention merges both branches and, together with the syntactic distance $Dist_{ij}$, produces syntactic pair representations $P_{ij} \in \mathbb{R}^{d_p}$, which the GCN-based Adjacent Inference Strategy refines into word-pair tag predictions $y_{ij}$. Extracted triplets: [staff, courteous, POS], [food, terrible, NEG].]

*Figure 0: The Overall SA-Transformer Architecture. Detailed input-output flow mapping the raw text tokens explicitly through syntax extraction (A & R matrices), GloVe encodings (w_i), edge pooling (E_{ij}), syntax-aware attention (P_{ij}), to relational prediction tags.*

2. GloVe Embedding Layer

Before syntax is analyzed, the model first maps standard word tokens into dense vector representations using pre-trained GloVe embeddings (300-dimensional). Each token $w_i$ is mapped to a fixed-length vector $e_i \in \mathbb{R}^{300}$. These embedding vectors are then passed as input to the BiLSTM encoder.

[GloVe lookup diagram: each token of "The staff was very courteous" maps to a pre-trained 300-dimensional vector, e.g. $e_1 = [0.418, 0.249, -0.412, \dots]$, $e_2 = [0.287, -0.156, 0.831, \dots]$, $e_3 = [0.156, 0.392, -0.215, \dots]$, $e_4 = [-0.201, 0.517, 0.143, \dots]$, $e_5 = [0.673, -0.321, 0.095, \dots]$.]

*Figure 1: GloVe word embedding lookup. Each token is mapped to a fixed 300-dimensional vector representation.*
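To make the lookup concrete, here is a minimal sketch of how such an embedding matrix can be built. The file name `glove.6B.300d.txt` and the zero-vector fallback for out-of-vocabulary tokens are illustrative assumptions, not details from the paper:

```python
import numpy as np

def load_glove(path, vocab):
    """Read only the vectors for `vocab` from a GloVe text file
    (one token per line, followed by 300 space-separated floats)."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            token, *values = line.rstrip().split(" ")
            if token in vocab:
                vectors[token] = np.asarray(values, dtype=np.float32)
    return vectors

tokens = ["the", "staff", "was", "very", "courteous",
          "but", "the", "food", "was", "terrible"]

# Hypothetical local path; pre-trained vectors are distributed at
# https://nlp.stanford.edu/projects/glove/
glove = load_glove("glove.6B.300d.txt", set(tokens))

# Unknown tokens fall back to a zero vector (one common convention).
E = np.stack([glove.get(t, np.zeros(300, dtype=np.float32)) for t in tokens])
print(E.shape)  # (10, 300): one vector e_i per token w_i
```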

3. Contextual Semantic Encoding (BiLSTM)

After obtaining the GloVe embeddings, these vectors are passed into a Bidirectional LSTM (BiLSTM) to produce sequence-aware semantic representations ($h_i$). A mathematically accurate trace of the LSTM cell used in the context sequence encoder is shown below:

$$f_t^l = \sigma\left(W_f \, [h_{t-1}^l, h_t^{l-1}] + b_f\right)$$

$$i_t^l = \sigma\left(W_i \, [h_{t-1}^l, h_t^{l-1}] + b_i\right)$$

$$\tilde{C}_t^l = \tanh\left(W_C \, [h_{t-1}^l, h_t^{l-1}] + b_C\right)$$

$$C_t^l = f_t^l \odot C_{t-1}^l + i_t^l \odot \tilde{C}_t^l$$

$$o_t^l = \sigma\left(W_o \, [h_{t-1}^l, h_t^{l-1}] + b_o\right)$$

$$h_t^l = o_t^l \odot \tanh(C_t^l)$$

Here $h_t^{l-1}$ is the cell input (the embedding $e_t$ for the first layer) and $\odot$ denotes element-wise multiplication.

*Figure 2: Mathematical trace of the LSTM cell inside the bidirectional encoder.*

A single LSTM cell processes one token at a time. To capture the context of the entire sentence, these cells are chained into a sequence in which forward and backward passes process the input simultaneously:

[Unrolled BiLSTM diagram: the embeddings $e_1 \dots e_5$ for "The staff was very courteous" feed a forward and a backward chain of LSTM cells; at each position the two directions are concatenated ($\oplus$) into $h_1 \dots h_5$.]

*Figure 3: The bidirectional streams traverse the sequence in forward and backward time steps.*

The resulting concatenated hidden state $h_i = [\vec{h_i}; \overleftarrow{h_i}]$ encapsulates sequence memory, yielding the baseline representation $S_i^{(0)} = h_i$.
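A few lines of PyTorch reproduce this encoder. The hidden size of 100 per direction (so $d_h = 200$, matching the $\mathbb{R}^{200}$ vectors in the worked example later) is an assumption; the article does not state the exact dimension:

```python
import torch
import torch.nn as nn

embed_dim, hidden = 300, 100                 # 100 per direction -> d_h = 200
bilstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)

e = torch.randn(1, 10, embed_dim)            # stand-in for the GloVe matrix E
h, _ = bilstm(e)                             # h[:, i] = [forward h_i ; backward h_i]
S0 = h.squeeze(0)                            # baseline representation S_i^(0) = h_i
print(S0.shape)                              # torch.Size([10, 200])
```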

4. Syntactic Skeleton: The Dependency Tree and Matrices

To capture structural relationships, the sequence is passed through a Dependency Parser. The parser extracts the syntactic relations and projects them into an Adjacency Matrix (A) (binary connections) and a Relationship Matrix (R) (the grammatical edge labels).

[Dependency tree for "The staff was very courteous but the food was terrible": det(The-staff), nsubj(staff-was), acomp(was-courteous), advmod(very-courteous), cc(was-but), conj(was-was), det(the-food), nsubj(food-was), acomp(was-terrible).]

*Figure 4: Dependency Tree Visualization mapping standard grammatical relationships.*

Matrices A and R below are $N \times N$ mappings (here $N = 10$ tokens). They form the foundational graphs for the Transformer layers.

Adjacency Matrix (A) — 10×10

$A_{i,j} \in \{0, 1\}$: does an edge exist?

|        | The | staff | was | very | court. | but | the | food | was | terr. |
|--------|-----|-------|-----|------|--------|-----|-----|------|-----|-------|
| The    | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| staff  | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| was    | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| very   | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| court. | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| but    | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| the    | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| food   | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
| was    | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| terr.  | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |

Relationship Matrix (R) — 10×10

$R_{i,j}$: dependency type

|        | The | staff | was   | very   | court. | but | the | food | was   | terr. |
|--------|-----|-------|-------|--------|--------|-----|-----|------|-------|-------|
| The    | -   | det   | -     | -      | -      | -   | -   | -    | -     | -     |
| staff  | det | -     | nsubj | -      | -      | -   | -   | -    | -     | -     |
| was    | -   | nsubj | -     | -      | acomp  | cc  | -   | -    | conj  | -     |
| very   | -   | -     | -     | -      | advmod | -   | -   | -    | -     | -     |
| court. | -   | -     | acomp | advmod | -      | -   | -   | -    | -     | -     |
| but    | -   | -     | cc    | -      | -      | -   | -   | -    | -     | -     |
| the    | -   | -     | -     | -      | -      | -   | -   | det  | -     | -     |
| food   | -   | -     | -     | -      | -      | -   | det | -    | nsubj | -     |
| was    | -   | -     | conj  | -      | -      | -   | -   | nsubj| -     | acomp |
| terr.  | -   | -     | -     | -      | -      | -   | -   | -    | acomp | -     |
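Given a parse, both matrices are mechanical to build. The sketch below hard-codes the nine edges from Figure 4 (0-based indices); in practice they would come from an off-the-shelf dependency parser:

```python
import numpy as np

tokens = ["The", "staff", "was", "very", "courteous",
          "but", "the", "food", "was", "terrible"]

# (token index, token index, label) for each edge in Figure 4, 0-based.
edges = [(0, 1, "det"), (1, 2, "nsubj"), (3, 4, "advmod"), (2, 4, "acomp"),
         (2, 5, "cc"), (2, 8, "conj"), (6, 7, "det"), (7, 8, "nsubj"),
         (8, 9, "acomp")]

n = len(tokens)
A = np.eye(n, dtype=int)                  # self-loops on the diagonal
R = np.full((n, n), "-", dtype=object)    # "-" marks "no relation"
for i, j, label in edges:
    A[i, j] = A[j, i] = 1                 # symmetric binary adjacency
    R[i, j] = R[j, i] = label             # grammatical edge label

print(A[2])  # row for the first "was": [0 1 1 0 1 1 0 0 1 0]
```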

5. Breaking Symmetry Ambiguity with AEA

A standard multi-task model correctly recognizes "staff" and "food" as aspects, but struggles to link the opinions because both are connected to the word "was" through nsubj dependencies. Standard Graph Convolutional Networks (GCNs) treat the conj edge between the two "was" tokens like any other edge, incorrectly letting the "courteous" opinion bleed over onto "food".

The Adjacent Edge Attention (AEA) solves this by dynamically differentiating identical grammatical labels based on their structural neighborhood.

[AEA re-weighting diagram: within the first clause, nsubj(staff-was) keeps weight 0.85 and acomp(was-courteous) 0.92; the conj(was-was) edge between clauses is restricted to 0.12; within the second clause, nsubj(food-was) keeps 0.81 and acomp(was-terrible) 0.89.]

*Figure 5: AEA dynamically suppresses the weight of the "conj" edge to prevent sentiment bleed-over across clauses.*
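The article does not spell out AEA's equations, so the following is only a minimal sketch of the property it describes: an edge's label embedding is contextualized by attention over the edges adjacent to it, so the same label (here conj) receives different representations in different structural neighborhoods. The 8-dimensional embeddings and residual update are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
label_emb = {lbl: rng.normal(size=8)       # hypothetical 8-d label embeddings
             for lbl in ["nsubj", "acomp", "conj", "det", "advmod", "cc"]}

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def aea_edge_rep(edge_label, neighbor_labels):
    """Contextualize one edge against the edges sharing an endpoint with it."""
    q = label_emb[edge_label]
    keys = np.stack([label_emb[l] for l in neighbor_labels])
    att = softmax(keys @ q / np.sqrt(q.size))   # attention over adjacent edges
    return q + att @ keys                       # residual, neighborhood-aware update

# The conj edge between the two "was" tokens, seen from either endpoint:
left = aea_edge_rep("conj", ["nsubj", "acomp", "cc"])   # clause around "staff"
right = aea_edge_rep("conj", ["nsubj", "acomp"])        # clause around "food"
print(np.allclose(left, right))  # False: same label, different representation
```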


6. Syntactic Distance (Shortest Path BFS)

To further assist the Transformer attention layers, the explicit structural distance is computed between tokens using Breadth-First Search (BFS) over the dependency tree.

[BFS shortest-path diagram: very → courteous (advmod) → was (acomp) → was (conj) → food (nsubj), dist = 4 hops.]

*Figure 6: Syntactic relative distance counts grammatical hops over the tree rather than linear word distance.*

The SA-Transformer counts strict structural hops rather than linear sequence distance. A distance of 4 is mapped to the distance embedding $E_{dist}[4]$ and concatenated directly into the attention Key/Value representations.
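Computing this distance is a plain BFS over the adjacency structure from Section 4; a minimal sketch:

```python
from collections import deque

# Undirected dependency edges from Section 4 (0-based token indices).
EDGES = [(0, 1), (1, 2), (2, 4), (2, 5), (2, 8), (3, 4), (6, 7), (7, 8), (8, 9)]
ADJ = {i: set() for i in range(10)}
for a, b in EDGES:
    ADJ[a].add(b)
    ADJ[b].add(a)

def syntactic_distance(adj, src, dst):
    """Hop count of the shortest dependency path (-1 if disconnected)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return -1

# very(3) -> courteous(4) -> was(2) -> was(8) -> food(7): the 4 hops of Figure 6
print(syntactic_distance(ADJ, 3, 7))  # 4
```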


7. SA-Transformer (Syntax-Aware Attention)

The core innovation is the Syntax-Aware Attention mechanism. It injects the edge representations ($E^{(l)}$) from AEA directly into the attention alongside the BiLSTM hidden states ($H^{(l)}$):

$$K_j = h_j W_K + e_{i,j} W_{K_e}, \quad V_j = h_j W_V + e_{i,j} W_{V_e}$$

$$\alpha_{i,j} = \text{softmax}\left(\frac{(h_i W_Q) \cdot K_j^T}{\sqrt{d_k}}\right), \quad S_i^{(l+1)} = \sum_j \alpha_{i,j} V_j$$
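A vectorized NumPy sketch of one such layer follows. The dimensions ($d_h = d_k = 200$, edge dimension 64) are assumptions chosen to match the worked example below; the paper's exact sizes may differ:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_h, d_e, d_k = 10, 200, 64, 200

H = rng.normal(size=(n, d_h))                       # BiLSTM states h_j
A = np.eye(n, dtype=int)                            # adjacency (identity as stand-in)
E = rng.normal(size=(n, n, d_e)) * A[:, :, None]    # e_{i,j} = 0 where A_{i,j} = 0
W_Q, W_K, W_V = (rng.normal(size=(d_h, d_k)) * 0.05 for _ in range(3))
W_Ke, W_Ve = (rng.normal(size=(d_e, d_k)) * 0.05 for _ in range(2))

def syntax_aware_attention(H, E):
    Q = H @ W_Q                                 # (n, d_k)
    K = H @ W_K + E @ W_Ke                      # (n, n, d_k): K_j depends on i via e_{i,j}
    V = H @ W_V + E @ W_Ve
    scores = np.einsum("ik,ijk->ij", Q, K) / np.sqrt(d_k)
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)   # softmax over j, for each query i
    return np.einsum("ij,ijk->ik", alpha, V)    # S_i^(l+1) = sum_j alpha_{i,j} V_j

S1 = syntax_aware_attention(H, E)
print(S1.shape)  # (10, 200)
```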

Worked Example: Attention for "staff" ($i = 2$)

Using the BiLSTM hidden states and AEA edge representations from prior sections, the following traces how the SA-Transformer updates the representation of "staff":

1. Query: $Q_2 = h_2(\text{"staff"}) \cdot W_Q = [0.52, -0.31, 0.74, \dots] \in \mathbb{R}^{200}$
2. Keys (the edge term appears only where $A_{2,j} = 1$):
   - $K_1 = h_1(\text{"The"}) \cdot W_K + e_{2,1}(\text{det}) \cdot W_{K_e}$, score = 0.87
   - $K_3 = h_3(\text{"was"}) \cdot W_K + e_{2,3}(\text{nsubj}) \cdot W_{K_e}$, score = 1.13 (highest)
   - $K_8 = h_8(\text{"food"}) \cdot W_K + 0$ (no edge, $A_{2,8} = 0$), score = 0.41 (blocked)
3. Softmax attention weights: $\alpha_{2,1} = 0.28$ (The), $\alpha_{2,3} = 0.52$ (was), $\alpha_{2,8} = 0.07$ (food), others $\approx 0.13$
4. Weighted sum of values $V_j = h_j W_V + e_{2,j} W_{V_e}$: $S_2^{(1)} = 0.28 \, V_1 + 0.52 \, V_3 + 0.07 \, V_8 + \dots = [0.41, -0.18, 0.63, \dots] \in \mathbb{R}^{200}$

"staff" now encodes syntax: heavily influenced by "was" (nsubj), not by "food".

*Figure 7: SA-Transformer attention flow for "staff". Edge representations from AEA boost syntactically connected words (nsubj→was: α=0.52) while blocking unconnected ones (food: α=0.07).*

After $L$ layers, the Syntactic Pair Representation is formed by concatenating two words' final representations with their distance embedding:

$$P_{i,j} = [S_i^{(L)} ; S_j^{(L)} ; f^d(i,j)]$$
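Assembling the pair representation is then a concatenation. In this sketch the 200/100-dimension split follows the worked example, and the distance-embedding table `E_dist` is a hypothetical lookup:

```python
import numpy as np

rng = np.random.default_rng(2)
S = rng.normal(size=(10, 200))          # final token states S^(L)
E_dist = rng.normal(size=(16, 100))     # hypothetical distance embedding table f^d

def pair_rep(i, j, dist_ij):
    """P_{i,j} = [S_i^(L) ; S_j^(L) ; f^d(i,j)] -> 200 + 200 + 100 = 500 dims."""
    return np.concatenate([S[i], S[j], E_dist[dist_ij]])

P = pair_rep(1, 4, 2)   # ("staff", "courteous"), 2 syntactic hops apart
print(P.shape)          # (500,)
```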


8. Adjacent Inference Strategy & Final Extraction

Each pair representation $P_{i,j}$ from Section 7 is classified into a tag. The following traces the full pipeline for the word pair ("staff", "courteous"):

Step 1: Pair Representation Input

$$P_{\text{staff,court.}} = [S_2^{(L)} ; S_5^{(L)} ; f^d(2,5)]$$

S₂("staff") from SA-Trans [0.41, -0.18, 0.63, ...] ∈ ℝ²⁰⁰ S₅("courteous") from SA-Trans [0.73, 0.29, -0.51, ...] ∈ ℝ²⁰⁰
fd(2,5)=dist2hopsf^d(2,5) = dist 2 hops
[0.12, -0.34, ...] ∈ ℝ¹⁰⁰ P₂,₅ = concat → [···] ∈ ℝ⁵⁰⁰

Step 2: MLP Classification → Initial Logits

The MLP maps $P_{i,j}$ to 6-class logits $c_{i,j}$:

$$c_{\text{staff,court.}} = \text{MLP}(P_{2,5}) = [\underset{N}{0.12}, \underset{A}{-0.85}, \underset{O}{-0.47}, \underset{\textbf{POS}}{\textbf{2.31}}, \underset{NEG}{-1.05}, \underset{NEU}{0.38}]$$

Step 3: GCN Refinement (T=2 iterations)

The GCN aggregates predictions from neighboring cells $(i \pm 1, j)$ and $(i, j \pm 1)$:

$$\tilde{c}_{2,5}^{(t)} = W \cdot c_{1,5}^{(t-1)} + W \cdot c_{3,5}^{(t-1)} + W \cdot c_{2,4}^{(t-1)} + W \cdot c_{2,6}^{(t-1)}$$

- Target $c_{2,5}$ (staff, courteous): POS = 2.31, NEG = -1.05, N = 0.12
- Neighbor $c_{1,5}$ (The, courteous): N = 1.92 (no relation)
- Neighbor $c_{3,5}$ (was, courteous): N = 1.44 (acomp link)
- Neighbor $c_{2,4}$ (staff, very): O = 1.15 (opinion span)
- Neighbor $c_{2,6}$ (staff, but): N = 2.10 (no relation)
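In code the refinement is a small grid convolution: each cell sums its four neighbors' logits through a shared weight matrix for T iterations. A minimal NumPy sketch (zero padding at the grid border is an assumption):

```python
import numpy as np

def adjacent_refine(C, W, T=2):
    """Adjacent inference over the (n, n, 6) logit grid C.

    Each cell aggregates its neighbors (i±1, j) and (i, j±1) through the
    shared matrix W; by linearity, summing neighbors first is equivalent
    to applying W to each neighbor and then summing."""
    C_t = C
    for _ in range(T):
        p = np.pad(C_t, ((1, 1), (1, 1), (0, 0)))    # zero logits off-grid
        neigh = p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]
        C_t = neigh @ W
    return C_t

rng = np.random.default_rng(4)
C = rng.normal(size=(10, 10, 6))             # initial MLP logits c_{i,j}
W = rng.normal(size=(6, 6)) * 0.1
C_tilde = adjacent_refine(C, W, T=2)

probs = np.exp(C + C_tilde)
probs /= probs.sum(axis=-1, keepdims=True)   # softmax(c + c~^(T)) per cell
print(probs[1, 4].round(2))                  # tag distribution for (staff, courteous)
```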

Step 4: Final Softmax → Tag Prediction

After GCN refinement, the final probabilities are computed:

$$P(y_{2,5}) = \text{softmax}(c_{2,5} + \tilde{c}_{2,5}^{(T)}) = [\underset{N}{0.03}, \underset{A}{0.01}, \underset{O}{0.02}, \underset{\textbf{POS}}{\textbf{0.89}}, \underset{NEG}{0.01}, \underset{NEU}{0.04}]$$

$$\Rightarrow y_{2,5} = \textbf{POS} \quad \text{("staff" is linked to "courteous" with positive sentiment)}$$

Similarly, for the pair ("food", "terrible"):

$$P(y_{8,10}) = \text{softmax}(c_{8,10} + \tilde{c}_{8,10}^{(T)}) = [\underset{N}{0.02}, \underset{A}{0.01}, \underset{O}{0.01}, \underset{POS}{0.03}, \underset{\textbf{NEG}}{\textbf{0.91}}, \underset{NEU}{0.02}]$$

$$\Rightarrow y_{8,10} = \textbf{NEG} \quad \text{("food" is linked to "terrible" with negative sentiment)}$$

Complete Word-Pair Prediction Grid ($y_{i,j}$)

Applying this process to every word pair in "The staff was very courteous but the food was terrible" produces the full $10 \times 10$ tagging grid:

|        | The | staff | was | very | court. | but | the | food | was | terr. |
|--------|-----|-------|-----|------|--------|-----|-----|------|-----|-------|
| The    | - | N | N | N | N | N | N | N | N | N |
| staff  | N | - | N | N | POS | N | N | N | N | N |
| was    | N | N | - | N | N | N | N | N | N | N |
| very   | N | N | N | - | N | N | N | N | N | N |
| court. | N | POS | N | N | - | N | N | N | N | N |
| but    | N | N | N | N | N | - | N | N | N | N |
| the    | N | N | N | N | N | N | - | N | N | N |
| food   | N | N | N | N | N | N | N | - | N | NEG |
| was    | N | N | N | N | N | N | N | N | - | N |
| terr.  | N | N | N | N | N | N | N | NEG | N | - |

Legend: POS (staff ↔ courteous), NEG (food ↔ terrible), N (no relation).

*Figure 9: Complete 10×10 word-pair tagging grid for the full sentence. The grid is symmetric — (staff, courteous) and (courteous, staff) both predict POS. Key aspect-opinion relationships are highlighted in green (POS) and red (NEG). All other pairs receive the N (no relation) tag.*

Final Extracted Triplets

Reading the tagged grid, the model extracts the final ASTE triplets:

| Aspect | Opinion   | Sentiment | Grid Cell |
|--------|-----------|-----------|-----------|
| staff  | courteous | POS       | $y_{2,5} = 0.89$ |
| food   | terrible  | NEG       | $y_{8,10} = 0.91$ |
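Decoding the grid into triplets is then a scan over the upper triangle. The sketch below covers the single-word aspect and opinion terms of this example; recovering multi-word spans would additionally use the A and O tags:

```python
import numpy as np

tokens = ["The", "staff", "was", "very", "courteous",
          "but", "the", "food", "was", "terrible"]

# Predicted grid from Figure 9 (symmetric; "N" everywhere else).
grid = np.full((10, 10), "N", dtype=object)
grid[1, 4] = grid[4, 1] = "POS"   # (staff, courteous)
grid[7, 9] = grid[9, 7] = "NEG"   # (food, terrible)

def extract_triplets(grid, tokens):
    """Collect (aspect, opinion, sentiment) from sentiment-tagged cells, i < j."""
    return [(tokens[i], tokens[j], grid[i, j])
            for i in range(len(tokens))
            for j in range(i + 1, len(tokens))
            if grid[i, j] in ("POS", "NEG", "NEU")]

print(extract_triplets(grid, tokens))
# [('staff', 'courteous', 'POS'), ('food', 'terrible', 'NEG')]
```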

9. Experimental Results

The SA-Transformer was tested against three major families of ASTE models using four benchmark datasets from SemEval challenges.

Evaluated Baselines:

  1. Pipeline Methods: TSF [2], CMLA+ [3].
  2. Multitask Methods: BMRC [4], Span-ASTE [5].
  3. Word-Pair Methods: GTS [6], S3E2 [7].

Comparative Results Snapshot (Micro F1-Score)

| Model Family | Representative Model | Rest14 (F1) | Lap14 (F1) | Rest15 (F1) |
|--------------|----------------------|-------------|------------|-------------|
| Pipeline     | CMLA+                | 41.36       | 32.55      | 39.77       |
| Multitask    | Span-ASTE            | 58.74       | 45.41      | 55.43       |
| Word-Pair    | S3E2                 | 59.81       | 48.06      | 55.97       |
| Proposed     | SA-Transformer       | 63.58       | 52.33      | 58.91       |

The architecture demonstrates a substantial boost. SA-Transformer outscores S3E2 by +3.77 F1 points on Rest14, largely because AEA cleanly resolves sentences containing multiple conflicting aspect targets.

References

1. Yuan, L., Wang, J., Yu, L.-C., and Zhang, X. (2024). Encoding Syntactic Information into Transformers for Aspect-Based Sentiment Triplet Extraction. IEEE Transactions on Affective Computing.
2. Peng, H., et al. (2020). Knowing What, How and Why: A Near Complete Solution for Aspect-Based Sentiment Analysis. AAAI.
3. Wang, W., et al. (2017). Coupled Multi-Layer Attentions for Co-Extraction of Aspect and Opinion Terms. AAAI.
4. Chen, S., et al. (2021). Bidirectional Machine Reading Comprehension for Aspect Sentiment Triplet Extraction. AAAI.
5. Xu, L., et al. (2021). Learning Span-Level Interactions for Aspect Sentiment Triplet Extraction. ACL.
6. Wu, Z., et al. (2020). Grid Tagging Scheme for Aspect-Oriented Fine-Grained Opinion Extraction. Findings of EMNLP.
7. Chen, Z., et al. (2021). Semantic and Syntactic Enhanced Aspect Sentiment Triplet Extraction. Findings of ACL.
8. Zhao, Z., et al. (2022). Multi-Task Alignment Scheme for Span-Level Aspect Sentiment Triplet Extraction. ICANN.