Can AI Really Code? How We Measure if Machines are Actually Good at Programming
Evaluation metrics play a vital role in the growth of a research area, since they define the standard for distinguishing good models from bad ones. In code synthesis, the commonly used metrics are BLEU and perfect accuracy, but neither is well suited to evaluating code [3,4]. BLEU was originally designed for natural language and neglects important syntactic and semantic features of code, while perfect accuracy is too strict: it underestimates outputs that differ textually but share the same semantic logic.
To remedy this, Ren et al. [1] introduced a new automatic evaluation metric, dubbed CodeBLEU. It retains BLEU's strength in n-gram matching, and further injects code syntax via Abstract Syntax Trees (ASTs) and code semantics via data-flow.
Why Not BLEU?
While BLEU [2] has been the gold standard for machine translation, it fails to capture the unique properties of source code:
- Limited keywords vs Millions of words: Unlike natural languages with vast vocabularies, code uses a restricted set of keywords. These keywords are more important than other tokens and should gain higher weight in evaluation.
- Tree structure vs Sequential structure: Natural language is typically processed sequentially (left-to-right), but code is fundamentally hierarchical, represented by an AST.
- Unique instructions vs Ambiguous semantics: Natural languages are context-dependent and ambiguous. Code, however, is designed to be deterministic, where variable dependencies (data-flow) define the logic.
The CodeBLEU Formula
CodeBLEU is defined as a weighted combination of four distinct scores:

    CodeBLEU = α · BLEU + β · BLEU_weight + γ · Match_ast + δ · Match_df

Where:
- α, β, γ, δ are hyperparameters that sum to 1.
- BLEU is the standard n-gram match and BLEU_weight is the weighted n-gram match.
- Match_ast is the syntactic AST match.
- Match_df is the semantic data-flow match.
Running Example
Throughout this article, we will trace all four components of CodeBLEU using one consistent example. Consider evaluation of a square function:
Reference Code:

```java
public static int square(int x) {
    int y = x * x;
    return y;
}
```

Candidate Code:

```java
public static int square(int x) {
    int y = x * x;
    return x;
}
```

The only difference is the return statement: the candidate returns the input x instead of the computed result y. This is a subtle but critical semantic error — the kind that standard BLEU struggles to detect.
We will compute all four scores step by step:
- BLEU (standard n-gram overlap)
- BLEU_weight (keyword-boosted n-gram)
- Match_ast (AST subtree match)
- Match_df (data-flow graph match)
...and then combine them into the final CodeBLEU score.
1. Weighted N-Gram Match
The original BLEU compares n-grams between the candidate and the reference and calculates the ratio of matched n-grams. However, it treats all tokens equally. In programming languages, certain tokens (like keywords) are more critical for the program logic than others (like variable names).
CodeBLEU introduces Weighted N-Gram Match to assign different weights to different tokens. In the original paper, keywords are assigned a weight 5 times higher than other tokens.
The weighted n-gram precision is calculated as:

    p_n = Σ_g μ(g) · Count_clip(g) / Σ_g μ(g) · Count(g)

where the sums run over the n-grams g of the candidate, Count_clip is the match count clipped by the reference, and μ(g) denotes the weight assigned to the n-gram. Currently, this weighting is applied only to unigrams (n = 1).
Concrete Example: The Weighting Process
Consider our running example of a square function. We compare a candidate that incorrectly returns the input x instead of the calculated y.
*Figure 1: Token re-weighting in CodeBLEU. Keywords like public are amplified to capture their structural importance.*
The resulting weight distribution ensures that a keyword mismatch (e.g., swapping int for float) penalizes the score more heavily than an identifier mismatch.
Calculation: Weighted vs Standard N-Grams
Let's calculate the score for our square running example.
First, we tokenize both the reference and the candidate code. Each contains exactly 20 tokens.
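As a sanity check, a naive whitespace split recovers those 20 tokens. This is only a sketch that assumes pre-spaced code; real CodeBLEU implementations use a language-aware lexer:

```python
# Hypothetical whitespace tokenization of the reference code (a sketch;
# a proper implementation would use a real Java lexer).
ref = "public static int square ( int x ) { int y = x * x ; return y ; }"
tokens = ref.split()
print(len(tokens))  # 20
```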
1. Standard BLEU (unigram overlap):
- The candidate has an extra `x` and is missing a `y`.
- 19 out of 20 tokens match (with clipping applied).
- BLEU score = 19 / 20 = 0.95 → 95.0

2. Weighted BLEU (keyword weight = 5.0):
- Keywords (weight 5.0): there are 6 keyword occurrences (`public`, `static`, `int`, `int`, `int`, `return`), contributing 6 × 5.0 = 30.0.
- Other tokens (weight 1.0): there are 14 other tokens, contributing 14 × 1.0 = 14.0.
- Total reference weight: 30.0 + 14.0 = 44.0.
- Matched weight: all 6 keywords match perfectly (+30.0). Of the 14 other tokens, 13 match (+13.0). Total matched weight = 43.0.
- Score = 43.0 / 44.0 ≈ 0.977 → 97.7

Notice that BLEU_weight is actually higher than standard BLEU here! Because the error involved swapping identifiers (x for y), the high-weight keywords all matched, pushing the n-gram score up. This perfectly demonstrates why we must also check structural and semantic logic.
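The arithmetic above can be sketched in a few lines. This is a toy reimplementation, not the official CodeBLEU code, and the keyword set is abbreviated to just the keywords appearing in this example:

```python
from collections import Counter

# Toy sketch of CodeBLEU's weighted unigram match. Keywords get weight 5.0,
# all other tokens weight 1.0. KEYWORDS is abbreviated for this example.
KEYWORDS = {"public", "static", "int", "return"}

def unigram_match(reference, candidate, keyword_weight=1.0):
    """Clipped unigram precision; keyword_weight > 1 boosts keywords."""
    weight = lambda tok: keyword_weight if tok in KEYWORDS else 1.0
    ref_counts, cand_counts = Counter(reference), Counter(candidate)
    matched = sum(min(cand_counts[t], ref_counts[t]) * weight(t)
                  for t in cand_counts)
    total = sum(cand_counts[t] * weight(t) for t in cand_counts)
    return matched / total

ref = "public static int square ( int x ) { int y = x * x ; return y ; }".split()
cand = "public static int square ( int x ) { int y = x * x ; return x ; }".split()
print(round(unigram_match(ref, cand), 4))                      # 0.95   (standard)
print(round(unigram_match(ref, cand, keyword_weight=5.0), 4))  # 0.9773 (weighted)
```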
2. Syntactic AST Match
Programming languages have a natural tree structure called the Abstract Syntax Tree (AST). CodeBLEU uses this by matching subtrees between the candidate and the reference.
Each node in the AST represents a construct (e.g., MethodDeclaration, BinaryExpression). The leaves (names of variables and functions) are removed because the syntactic structure is what matters most here.
The AST match score is:

    Match_ast = Count_clip(T_cand) / Count(T_ref)

Where:
- Count(T_ref) is the total number of subtrees in the reference.
- Count_clip(T_cand) is the number of candidate subtrees that match the reference.
Visualization: Structural Comparison
In our candidate, the return statement returns x instead of y. However, because the AST matching strictly focuses on grammatical structure and strips variable name leaves, "returning a local variable" yields the exact same subtree.
*Figure 2: Because leaf node names are excluded, the AST structure for returning `x` or `y` is completely identical. AST ensures structural integrity but misses logical flow.*
Calculation: AST Match Score
For our running example:
- Generating the tree-sitter AST and removing variable name leaves results in exactly 12 subtrees.
- Because the structural logic (Return Statement -> Identifier) is identical between candidate and reference, all 12 subtrees map 1-to-1 perfectly.
- Score = 12 / 12 = 1.0 → 100.0
With BLEU at 95.0 and AST Match at 100.0, the candidate appears almost perfect. This is where semantic data-flow steps in to detect the critical bug.
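The subtree-matching idea can be illustrated with a toy sketch. This is not tree-sitter: trees are hand-written nested tuples of node types, with identifier leaves already stripped the way CodeBLEU strips them (the node names here are simplified placeholders, not real tree-sitter node types):

```python
from collections import Counter

def subtrees(tree):
    """Yield every subtree of a nested-tuple AST as a hashable value."""
    yield tree
    for child in tree[1:]:
        if isinstance(child, tuple):
            yield from subtrees(child)

def ast_match(ref_tree, cand_tree):
    """Clipped subtree matches divided by total reference subtrees."""
    ref, cand = Counter(subtrees(ref_tree)), Counter(subtrees(cand_tree))
    matched = sum(min(cand[t], ref[t]) for t in cand)
    return matched / sum(ref.values())

# With identifier names removed, `return y;` and `return x;` both reduce to
# ('return_stmt', ('identifier',)), so the two trees are identical.
ref_tree  = ('method', ('params', ('param',)),
             ('body', ('decl', ('binop',)), ('return_stmt', ('identifier',))))
cand_tree = ('method', ('params', ('param',)),
             ('body', ('decl', ('binop',)), ('return_stmt', ('identifier',))))
print(ast_match(ref_tree, cand_tree))  # 1.0
```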
3. Semantic Data-flow Match
While AST captures the syntactic structure, it sometimes misses semantic logic. CodeBLEU addresses this by using Data-Flow Graphs to measure the semantic similarity.
In a data-flow graph, nodes represent variables and edges represent the source of their values. For example, in our square function, the variable y depends on x, and the return value should depend on y.
Data-Flow Calculation Steps
- Extract Graph: Identify variable nodes and their relationships (where each value comes from).
- Normalization: Ignore variable names and positions. All variables are renamed to a uniform format (`var_0`, `var_1`, etc.).
- Accuracy Calculation:

    Match_df = Count_clip(DF_cand) / Count(DF_ref)

  where Count(DF_ref) is the number of data-flow edges in the reference and Count_clip(DF_cand) is the number of candidate edges that match.
Visualization: The Semantic Gap
The candidate's semantic error is clearly visible when comparing the data-flow graphs. The return node in the candidate points back to the input x, bypassing the calculation in y.
*Figure 3: Semantic data-flow capture. The missing link between the calculation (y) and its final use (return) reveals a deep logical error.*
Calculation: The Decisive Match
Let's evaluate the semantic match for our running example:
Reference Data-Flows:
- `y` value comes from `x`
- `RET` (return statement) comes from `y`

(Total reference edges = 2)

Candidate Data-Flows:
- `y` value comes from `x`
- `RET` comes from `x`
Comparing the normalized edges:
- The edge `y <- x` matches.
- The reference expected `RET <- y`, but the candidate has `RET <- x`. This is a mismatch!

Score = 1 / 2 = 0.5 → 50.0
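The edge comparison above is easy to sketch in code. Edges are modeled as (target, source) pairs over normalized variable names; this is an illustrative toy, not the official data-flow extractor:

```python
from collections import Counter

def dataflow_match(ref_edges, cand_edges):
    """Clipped matched data-flow edges divided by total reference edges."""
    ref, cand = Counter(ref_edges), Counter(cand_edges)
    matched = sum(min(cand[e], ref[e]) for e in cand)
    return matched / sum(ref.values())

# Running example after normalization (x -> var_0, y -> var_1):
ref_edges  = [("var_1", "var_0"), ("RET", "var_1")]  # y <- x, RET <- y
cand_edges = [("var_1", "var_0"), ("RET", "var_0")]  # y <- x, RET <- x
print(dataflow_match(ref_edges, cand_edges))  # 0.5
```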
Final CodeBLEU Calculation
Now we combine all components to derive the final score. Using evenly distributed weights (α = β = γ = δ = 0.25), we aggregate:

    CodeBLEU = 0.25 × 95.0 + 0.25 × 97.7 + 0.25 × 100.0 + 0.25 × 50.0 ≈ 85.67
Conclusion: A pure string-based BLEU score reported a deceptively high 95.0. By capturing the data-flow anomaly, CodeBLEU heavily penalizes the logical error, dropping the final score to 85.67, which is much closer to how a human reviewer would judge the buggy code.
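The final aggregation is plain arithmetic over the four component scores computed in the worked example:

```python
# Final CodeBLEU aggregation for the running example, equal weights of 0.25.
scores  = {"bleu": 95.0, "bleu_weight": 97.7, "match_ast": 100.0, "match_df": 50.0}
weights = {k: 0.25 for k in scores}
codebleu = sum(weights[k] * scores[k] for k in scores)
print(codebleu)  # ~85.675
```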
4. Experimental Results & Correlation
The effectiveness of CodeBLEU was evaluated across three real-world tasks: Text-to-Code, Code Translation, and Code Refinement. The researchers calculated the Pearson correlation coefficient [3] to check how well CodeBLEU matches human judgment compared to traditional metrics.
Key Performance Gains
CodeBLEU showed significant improvements in matching programmer-assigned scores:
| Task | BLEU & Human | CodeBLEU & Human | Improvement |
|---|---|---|---|
| Text-to-code | 0.967 | 0.977 | +1.0% |
| Code translation | 0.940 | 0.970 | +3.0% |
| Code refinement | 0.923 | 0.979 | +5.6% |
Optimal Hyperparameters
Through ablation studies, the authors found that increasing the weight of the syntactic AST and semantic data-flow matches leads to better human correlation. The recommended configuration for general code synthesis is:

    α = 0.10, β = 0.10, γ = 0.40, δ = 0.40

This gives a total of 80% weight to the structural and logical characteristics of the code, rather than just n-gram token overlap.
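Re-scoring our running example with these heavier structural weights (a quick sketch reusing the component scores computed earlier) shows how they deepen the penalty for the data-flow bug compared to the equal-weight score of about 85.67:

```python
# Component scores from the worked example: BLEU, weighted BLEU, AST, data-flow.
scores  = (95.0, 97.7, 100.0, 50.0)
weights = (0.10, 0.10, 0.40, 0.40)  # alpha, beta, gamma, delta
print(sum(w * s for w, s in zip(weights, scores)))  # ~79.27
```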
Conclusion
CodeBLEU represents a major step forward from simple string matching (BLEU) and rigid logic checks (Perfect Accuracy). By combining weighted n-grams, AST subtree matching, and data-flow dependency checks, it provides a more holistic and human-aligned metric for evaluating code synthesis models.
We hope this metric accelerates the development of more reliable and logically sound code-generation agents.