Can AI Really Code? How We Measure if Machines are Actually Good at Programming
Evaluation metrics play a vital role in the growth of a research area, since they define the standard for distinguishing good models from bad ones. In code synthesis, the commonly used metrics are BLEU and perfect accuracy, but neither is well suited to evaluating code [3,4]. BLEU was originally designed for natural language and neglects important syntactic and semantic features of code, while perfect accuracy is too strict: it underestimates outputs that differ textually but share the same semantic logic.
To remedy this, Ren et al. [1] introduced a new automatic evaluation metric, dubbed CodeBLEU. It retains BLEU's strength in n-gram matching, and further injects code syntax via Abstract Syntax Trees (ASTs) and code semantics via data-flow.
Why Not BLEU?
While BLEU [2] has been the gold standard for machine translation, it fails to capture the unique properties of source code:
- Limited keywords vs Millions of words: Unlike natural languages with vast vocabularies, code uses a restricted set of keywords. These keywords are more important than other tokens and should gain higher weight in evaluation.
- Tree structure vs Sequential structure: Natural language is typically processed sequentially (left-to-right), but code is fundamentally hierarchical, represented by an AST.
- Unique instructions vs Ambiguous semantics: Natural languages are context-dependent and ambiguous. Code, however, is designed to be deterministic, where variable dependencies (data-flow) define the logic.
The CodeBLEU Formula
CodeBLEU is defined as a weighted combination of four distinct scores:

    CodeBLEU = α · BLEU + β · BLEU_weight + γ · Match_ast + δ · Match_df

Where:
- α, β, γ, δ are hyperparameters that sum to 1.
- BLEU is the standard n-gram match and BLEU_weight is the weighted n-gram match.
- Match_ast is the syntactic AST match.
- Match_df is the semantic data-flow match.
Running Example
Throughout this article, we will trace all four components of CodeBLEU using one consistent example. Consider evaluation of a square function:
Reference Code:

```java
public static int square(int x) {
    int y = x * x;
    return y;
}
```

Candidate Code:

```java
public static int square(int x) {
    int y = x * x;
    return x;
}
```

The only difference is the return statement: the candidate returns the input x instead of the computed result y. This is a subtle but critical semantic error — the kind that standard BLEU struggles to detect.
We will compute all four scores step by step:
- BLEU (standard n-gram overlap)
- BLEU_weight (keyword-boosted n-gram)
- Match_ast (AST subtree match)
- Match_df (data-flow graph match)
...and then combine them into the final CodeBLEU score.
1. Weighted N-Gram Match
The original BLEU compares n-grams between the candidate and the reference and calculates the ratio of matched n-grams. However, it treats all tokens equally. In programming languages, certain tokens (like keywords) are more critical for the program logic than others (like variable names).
CodeBLEU introduces Weighted N-Gram Match to assign different weights to different tokens. In the original paper, keywords are assigned a weight 5 times higher than other tokens.
The weighted n-gram precision is calculated as:

    p_n = Σ_g μ(g) · Count_clip(g) / Σ_g μ(g) · Count(g)

where the sums run over the n-grams g of the candidate, Count_clip is the match count clipped by the reference, and μ(g) denotes the weight assigned to the n-gram. Currently, this weighting is applied only to unigrams (n = 1).
Concrete Example: The Weighting Process
Consider our running example of a square function. We compare a candidate that incorrectly returns the input x instead of the calculated y.
*Figure 1: Token re-weighting in CodeBLEU. Keywords like public are amplified to capture their structural importance.*
The resulting weight distribution ensures that a keyword mismatch (e.g., swapping int for float) penalizes the score more heavily than an identifier mismatch.
Calculation: Weighted vs Standard N-Grams
Let's calculate the score for our square running example.
First, we tokenize both the reference and the candidate code. Each contains exactly 20 tokens.
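As a sanity check, a naive whitespace split recovers those 20 tokens. This is only a sketch that assumes pre-spaced code; real CodeBLEU implementations use a language-aware lexer:

```python
# Hypothetical whitespace tokenization of the reference code (a sketch;
# a proper implementation would use a real Java lexer).
ref = "public static int square ( int x ) { int y = x * x ; return y ; }"
tokens = ref.split()
print(len(tokens))  # 20
```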
1. Standard BLEU (unigram overlap):
- The candidate has an extra `x` and is missing a `y`.
- 19 out of 20 tokens match (with clipping applied).
- BLEU score = 19 / 20 = 0.95 → 95.0

2. Weighted BLEU (keyword weight = 5.0):
- Keywords (weight 5.0): there are 6 keyword occurrences (`public`, `static`, `int`, `int`, `int`, `return`), contributing 6 × 5.0 = 30.0.
- Other tokens (weight 1.0): there are 14 other tokens, contributing 14 × 1.0 = 14.0.
- Total reference weight: 30.0 + 14.0 = 44.0.
- Matched weight: all 6 keywords match perfectly (+30.0). Of the 14 other tokens, 13 match (+13.0). Total matched weight = 43.0.
- Score = 43.0 / 44.0 ≈ 0.977 → 97.7

Notice that BLEU_weight is actually higher than standard BLEU here! Because the error involved swapping identifiers (x for y), the high-weight keywords all matched, pushing the n-gram score up. This perfectly demonstrates why we must also check structural and semantic logic.
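The arithmetic above can be sketched in a few lines. This is a toy reimplementation, not the official CodeBLEU code, and the keyword set is abbreviated to just the keywords appearing in this example:

```python
from collections import Counter

# Toy sketch of CodeBLEU's weighted unigram match. Keywords get weight 5.0,
# all other tokens weight 1.0. KEYWORDS is abbreviated for this example.
KEYWORDS = {"public", "static", "int", "return"}

def unigram_match(reference, candidate, keyword_weight=1.0):
    """Clipped unigram precision; keyword_weight > 1 boosts keywords."""
    weight = lambda tok: keyword_weight if tok in KEYWORDS else 1.0
    ref_counts, cand_counts = Counter(reference), Counter(candidate)
    matched = sum(min(cand_counts[t], ref_counts[t]) * weight(t)
                  for t in cand_counts)
    total = sum(cand_counts[t] * weight(t) for t in cand_counts)
    return matched / total

ref = "public static int square ( int x ) { int y = x * x ; return y ; }".split()
cand = "public static int square ( int x ) { int y = x * x ; return x ; }".split()
print(round(unigram_match(ref, cand), 4))                      # 0.95   (standard)
print(round(unigram_match(ref, cand, keyword_weight=5.0), 4))  # 0.9773 (weighted)
```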
2. Syntactic AST Match
Programming languages have a natural tree structure called the Abstract Syntax Tree (AST). CodeBLEU uses this by matching subtrees between the candidate and the reference.
Each node in the AST represents a construct (e.g., MethodDeclaration, BinaryExpression). The leaves (names of variables and functions) are removed because the syntactic structure is what matters most here.
The AST match score is:

    Match_ast = Count_clip(T_cand) / Count(T_ref)

Where:
- Count(T_ref) is the total number of subtrees in the reference.
- Count_clip(T_cand) is the number of candidate subtrees that match the reference.
Visualization: Structural Comparison
In our candidate, the return statement returns x instead of y. However, because the AST matching strictly focuses on grammatical structure and strips variable name leaves, "returning a local variable" yields the exact same subtree.
*Figure 2: Because leaf node names are excluded, the AST structure for returning `x` or `y` is completely identical. AST ensures structural integrity but misses logical flow.*
Calculation: AST Match Score
For our running example:
- Generating the tree-sitter AST and removing variable name leaves results in exactly 12 subtrees.
- Because the structural logic (Return Statement -> Identifier) is identical between candidate and reference, all 12 subtrees map 1-to-1 perfectly.
- Score = 12 / 12 = 1.0 → 100.0
With BLEU at 95.0 and AST Match at 100.0, the candidate appears almost perfect. This is where semantic data-flow steps in to detect the critical bug.
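The subtree-matching idea can be illustrated with a toy sketch. This is not tree-sitter: trees are hand-written nested tuples of node types, with identifier leaves already stripped the way CodeBLEU strips them (the node names here are simplified placeholders, not real tree-sitter node types):

```python
from collections import Counter

def subtrees(tree):
    """Yield every subtree of a nested-tuple AST as a hashable value."""
    yield tree
    for child in tree[1:]:
        if isinstance(child, tuple):
            yield from subtrees(child)

def ast_match(ref_tree, cand_tree):
    """Clipped subtree matches divided by total reference subtrees."""
    ref, cand = Counter(subtrees(ref_tree)), Counter(subtrees(cand_tree))
    matched = sum(min(cand[t], ref[t]) for t in cand)
    return matched / sum(ref.values())

# With identifier names removed, `return y;` and `return x;` both reduce to
# ('return_stmt', ('identifier',)), so the two trees are identical.
ref_tree  = ('method', ('params', ('param',)),
             ('body', ('decl', ('binop',)), ('return_stmt', ('identifier',))))
cand_tree = ('method', ('params', ('param',)),
             ('body', ('decl', ('binop',)), ('return_stmt', ('identifier',))))
print(ast_match(ref_tree, cand_tree))  # 1.0
```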
3. Semantic Data-flow Match
While AST captures the syntactic structure, it sometimes misses semantic logic. CodeBLEU addresses this by using Data-Flow Graphs to measure the semantic similarity.
In a data-flow graph, nodes represent variables and edges represent the source of their values. For example, in our square function, the variable y depends on x, and the return value should depend on y.
Data-Flow Calculation Steps
- Extract Graph: Identify variable nodes and their relationships (where each value comes from).
- Normalization: Ignore variable names and positions. All variables are renamed to a uniform format (`var_0`, `var_1`, etc.).
- Accuracy Calculation:

    Match_df = Count_clip(DF_cand) / Count(DF_ref)

  where Count(DF_ref) is the number of data-flow edges in the reference and Count_clip(DF_cand) is the number of candidate edges that match.
Visualization: The Semantic Gap
The candidate's semantic error is clearly visible when comparing the data-flow graphs. The return node in the candidate points back to the input x, bypassing the calculation in y.
*Figure 3: Semantic data-flow capture. The missing link between the calculation (y) and its final use (return) reveals a deep logical error.*
Calculation: The Decisive Match
Let's evaluate the semantic match for our running example:
Reference Data-Flows:
- `y` value comes from `x`
- `RET` (return statement) comes from `y`

(Total reference edges = 2)

Candidate Data-Flows:
- `y` value comes from `x`
- `RET` comes from `x`
Comparing the normalized edges:
- The edge `y <- x` matches.
- The reference expected `RET <- y`, but the candidate has `RET <- x`. This is a mismatch!

Score = 1 / 2 = 0.5 → 50.0
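The edge comparison above is easy to sketch in code. Edges are modeled as (target, source) pairs over normalized variable names; this is an illustrative toy, not the official data-flow extractor:

```python
from collections import Counter

def dataflow_match(ref_edges, cand_edges):
    """Clipped matched data-flow edges divided by total reference edges."""
    ref, cand = Counter(ref_edges), Counter(cand_edges)
    matched = sum(min(cand[e], ref[e]) for e in cand)
    return matched / sum(ref.values())

# Running example after normalization (x -> var_0, y -> var_1):
ref_edges  = [("var_1", "var_0"), ("RET", "var_1")]  # y <- x, RET <- y
cand_edges = [("var_1", "var_0"), ("RET", "var_0")]  # y <- x, RET <- x
print(dataflow_match(ref_edges, cand_edges))  # 0.5
```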
Final CodeBLEU Calculation
Now we combine all components to derive the final score. Using evenly distributed weights (α = β = γ = δ = 0.25), we aggregate:

    CodeBLEU = 0.25 × 95.0 + 0.25 × 97.7 + 0.25 × 100.0 + 0.25 × 50.0 ≈ 85.67
Conclusion: A pure string-based BLEU score reported a deceptively high 95.0. By capturing the data-flow anomaly, CodeBLEU heavily penalizes the logical error, dropping the final score to 85.67, which is much closer to how a human reviewer would judge the buggy code.
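The final aggregation is plain arithmetic over the four component scores computed in the worked example:

```python
# Final CodeBLEU aggregation for the running example, equal weights of 0.25.
scores  = {"bleu": 95.0, "bleu_weight": 97.7, "match_ast": 100.0, "match_df": 50.0}
weights = {k: 0.25 for k in scores}
codebleu = sum(weights[k] * scores[k] for k in scores)
print(codebleu)  # ~85.675
```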
4. Experimental Results & Correlation
The effectiveness of CodeBLEU was evaluated across three real-world tasks: Text-to-Code, Code Translation, and Code Refinement. The researchers calculated the Pearson correlation coefficient [3] to check how well CodeBLEU matches human judgment compared to traditional metrics.
Key Performance Gains
CodeBLEU showed significant improvements in matching programmer-assigned scores:
| Task | BLEU & Human | CodeBLEU & Human | Improvement |
|---|---|---|---|
| Text-to-code | 0.967 | 0.977 | +1.0% |
| Code translation | 0.940 | 0.970 | +3.0% |
| Code refinement | 0.923 | 0.979 | +5.6% |
Optimal Hyperparameters
Through ablation studies, the authors found that increasing the weight of the syntactic AST and semantic data-flow matches leads to better human correlation. The recommended configuration for general code synthesis is:

    α = 0.10, β = 0.10, γ = 0.40, δ = 0.40

This gives a total of 80% weight to the structural and logical characteristics of the code, rather than just n-gram token overlap.
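Re-scoring our running example with these heavier structural weights (a quick sketch reusing the component scores computed earlier) shows how they deepen the penalty for the data-flow bug compared to the equal-weight score of about 85.67:

```python
# Component scores from the worked example: BLEU, weighted BLEU, AST, data-flow.
scores  = (95.0, 97.7, 100.0, 50.0)
weights = (0.10, 0.10, 0.40, 0.40)  # alpha, beta, gamma, delta
print(sum(w * s for w, s in zip(weights, scores)))  # ~79.27
```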
Conclusion
CodeBLEU represents a major step forward from simple string matching (BLEU) and rigid logic checks (Perfect Accuracy). By combining weighted n-grams, AST subtree matching, and data-flow dependency checks, it provides a more holistic and human-aligned metric for evaluating code synthesis models.
We hope this metric accelerates the development of more reliable and logically sound code-generation agents.