Prompt engineering methods using ConZIC

This document summarises the methods and results of tuning the CounTX model with prompts generated by ConZIC.

(It is not certain that generating prompts with ConZIC is the optimal approach, because it sometimes produced prompts that did not match the image. However, the method should be general and applicable to other models. A brief summary of the ConZIC framework is given first; use it as a reference when adopting other models.)

1. ConZIC Framework

Image captioning is the task of generating a text description (here used as a prompt) from an image. ConZIC is a zero-shot model for this task.

Prompts are generated by repeating the process in the figure below:

Estimate the masked words with the BERT encoder.
Re-evaluate the estimated words with the three models.
Determine the word for the masked position.

[Figure: ConZIC framework]
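The three steps above can be illustrated with a toy mask, propose and re-score loop. This is only a sketch of the general idea, not ConZIC itself: `refine_caption`, `propose` and `scorers` are hypothetical names, the scorers are trivial stand-ins for real language and image-text models, and the loop picks the best candidate greedily for simplicity.

```python
def refine_caption(tokens, propose, scorers, n_rounds=2):
    """Repeatedly mask each position, propose words, re-score, and fill in."""
    tokens = list(tokens)
    for _ in range(n_rounds):
        for i in range(len(tokens)):
            masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
            candidates = propose(masked, i)  # step 1: estimate masked word

            def total(word):
                # step 2: re-evaluate each candidate with every scorer
                filled = tokens[:i] + [word] + tokens[i + 1:]
                return sum(score(filled) for score in scorers)

            tokens[i] = max(candidates, key=total)  # step 3: fix the word
    return tokens


# Toy stand-ins (assumptions): a fixed vocabulary and two simple scorers.
def toy_propose(masked_tokens, position):
    return ["steel", "pipes", "cat"]

toy_scorers = [
    lambda toks: sum(t in {"steel", "pipes"} for t in toks),  # "matches image"
    lambda toks: len(set(toks)),                              # "avoids repeats"
]

caption = refine_caption(["[MASK]", "[MASK]"], toy_propose, toy_scorers)
print(" ".join(caption))  # steel pipes
```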

2. Generating prompts that match the image recognition model

Method

Generate prompts with ConZIC.

arcled tubular steel.
numerous steel oil pipes
china developing steel pipes.

Make a word list from the prompts.

steel, oil, numerous, tubular, developing, pipes, china, arcled
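A minimal sketch of this step, assuming the captions are plain strings and that lowercasing, whitespace splitting, and stripping punctuation is enough tokenisation. The function name `build_word_list` is hypothetical, and the order of the resulting list may differ from the one shown above.

```python
import string

def build_word_list(prompts):
    """Collect the unique words from the captions, in first-seen order."""
    seen = []
    for prompt in prompts:
        for token in prompt.lower().split():
            word = token.strip(string.punctuation)  # drop trailing "." etc.
            if word and word not in seen:
                seen.append(word)
    return seen

captions = [
    "arcled tubular steel.",
    "numerous steel oil pipes",
    "china developing steel pipes.",
]
words = build_word_list(captions)
print(words)
```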

Calculate the accuracy of each word and select the top words.

steel, oil, numerous, tubular, china
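The selection itself might look like the sketch below. It assumes each word has already been scored by running the counting model with that word as the prompt; how the accuracy is computed is not specified here, so the scores are passed in as a plain dictionary. `select_top_words` and the example values are hypothetical placeholders, not measured results.

```python
def select_top_words(scores, k):
    """Return the k highest-scoring words, best first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]

# Placeholder scores, NOT measured results: replace with real per-word accuracy.
example_scores = {"word_a": 0.2, "word_b": 0.9, "word_c": 0.5, "word_d": 0.7}
print(select_top_words(example_scores, k=2))  # ['word_b', 'word_d']
```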

Generate new prompts by rearranging the selected words.

tubular numerous china steel oil
tubular numerous china oil steel
tubular numerous steel china oil
tubular numerous steel oil china
:
:
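Assuming the new prompts are all orderings of the selected words, they could be enumerated with `itertools.permutations`, as in this sketch (`generate_prompts` is a hypothetical name). With five words this yields 120 candidate prompts.

```python
from itertools import permutations

def generate_prompts(words):
    """Join every ordering of the words into a space-separated prompt."""
    return [" ".join(order) for order in permutations(words)]

top_words = ["tubular", "numerous", "china", "steel", "oil"]
candidates = generate_prompts(top_words)

print(len(candidates))  # 120
for prompt in candidates[:4]:
    print(prompt)
# tubular numerous china steel oil
# tubular numerous china oil steel
# tubular numerous steel china oil
# tubular numerous steel oil china
```

The number of candidates grows factorially with the number of words, so keeping the word list short keeps the evaluation step tractable.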

Calculate the accuracy of each prompt and select the best prompt.
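One way to sketch this step is to measure each prompt by its mean absolute counting error over images whose true count is known, and keep the prompt with the lowest error. The metric is an assumption, since the accuracy measure is not defined here, and `predict_count` is a hypothetical callable standing in for a call to the counting model; the toy predictor and samples below are placeholders only.

```python
def mean_abs_error(prompt, samples, predict_count):
    """Average |predicted - true| over (image, true_count) pairs."""
    errors = [abs(predict_count(image, prompt) - true) for image, true in samples]
    return sum(errors) / len(errors)

def select_best_prompt(candidates, samples, predict_count):
    """Return the candidate prompt with the lowest mean absolute error."""
    return min(candidates, key=lambda p: mean_abs_error(p, samples, predict_count))

# Toy stand-in for the real model (assumption): 10 objects per prompt word.
def toy_predict_count(image, prompt):
    return 10 * len(prompt.split())

toy_samples = [("img_a", 20), ("img_b", 22)]      # placeholder data
toy_candidates = ["one", "one two", "one two three"]
print(select_best_prompt(toy_candidates, toy_samples, toy_predict_count))  # one two
```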

Result

Compare the accuracy of the following prompts on the pipe images.

"the pipes"
The prompts generated by the above method.

[Figure: result]

3. Use cases

When the number of objects is known, this method can be used to find prompts that help the model count them correctly.