Structured Intent Encoding for AI Image Generation: Purpose Anchoring, Density Boundaries, and Cross-Modal Protocol Transfer
Chaoyang Li, Lei Yang
With the rapid advancement of artificial intelligence (AI), large language models exhibit nearly universal problem-solving capabilities yet cannot autonomously comprehend human intentions. As the externalization of human thinking, prompt engineering embodies a uniquely human core value in the intelligent era. Text-to-image (T2I) research has largely focused on prompt-surface optimization, while the prior question of how user intent should be structurally encoded remains underexplored. We investigate whether 5W3H/PPS (Prompt Protocol Structure), an eight-dimension intent encoding framework previously studied in text generation, retains protocol-level relevance in neural image generation. Using a frozen baseline pilot and two follow-up studies evaluated on three commercial Chinese T2I systems under Chinese-language task specifications, we examine three issues: cross-modal protocol transfer, functional differentiation within the Why dimension, and task-dependent density boundaries in structured intent encoding. We find evidence that PPS supports a task-conditioned intent protocol in image generation, rather than functioning as a uniformly superior prompting method. In the baseline pilot, the structural gap (defined as the difference between dimensional recovery and intent fidelity) persists across all pilot task–condition aggregates, indicating that structurally plausible images can still fail to preserve user-specific intent. A Why-decomposition study shows that purpose-oriented formulations outperform audience-oriented or mixed formulations in high-complexity tasks, whereas audience specification is more useful in lower-complexity settings. A density study further shows that protocol density is non-monotonic: some tasks benefit from full eight-dimension specification, whereas others collapse under over-specification.
Taken together, these findings suggest that structured intent encoding in T2I is better understood as a task-calibrated protocol variable than as a uniformly beneficial prompting strategy. All findings are established on three Chinese commercial T2I systems under Chinese-language specifications and should be read as evidence for a task-calibrated intent protocol within this setting, not as a claim that generalizes to other languages, model families, or image-generation systems.
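
As an illustration of the structural gap defined in the abstract, the following is a minimal sketch, not the authors' evaluation code. It assumes that dimensional recovery and intent fidelity are scored on a common scale (taken here to be 0 to 1) and that the gap is computed as recovery minus fidelity per task–condition aggregate. The names `Aggregate`, `structural_gap`, and `gap_persists`, as well as the task labels and example scores, are hypothetical placeholders and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Aggregate:
    """One task-condition aggregate (hypothetical structure)."""
    task: str
    condition: str
    dimensional_recovery: float  # assumed scale: 0 to 1
    intent_fidelity: float       # assumed scale: 0 to 1


def structural_gap(agg: Aggregate) -> float:
    """Gap = dimensional recovery minus intent fidelity (assumed sign convention)."""
    return agg.dimensional_recovery - agg.intent_fidelity


def gap_persists(aggregates: Iterable[Aggregate]) -> bool:
    """True if every aggregate shows a strictly positive structural gap."""
    return all(structural_gap(a) > 0 for a in aggregates)


# Placeholder values for illustration only; these are NOT results from the paper.
example = [
    Aggregate("task_a", "structured", 0.75, 0.5),
    Aggregate("task_a", "unstructured", 0.5, 0.25),
]

for agg in example:
    print(agg.task, agg.condition, structural_gap(agg))
print("gap persists across aggregates:", gap_persists(example))
```

A positive gap under this convention corresponds to an image that recovers the specified dimensions more fully than it preserves the user's specific intent, which is the situation the abstract describes.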