Innovations in Computer Vision: Evaluation of ChatGPT, Gemini, and Copilot for Image Analysis

Authors

Minango Negrete, P. D., Zambrano Vizuete, Óscar M., Minango Negrete, J. C., Minaya Andino, C. A., & León Galeas, C. J.

DOI:

https://doi.org/10.37431/conectividad.v6i2.284

Keywords:

ChatGPT, Gemini, Copilot, AI, Natural Language Processing

Abstract

In recent years, Large Language Models (LLMs) have grown exponentially and evolved rapidly: from their beginnings as simple tools conceived to understand text, they have become multimodal systems capable of generating creative and complex content. This innovation has been driven by major advances in neural network architectures and by the availability of large datasets. The main objective of this study is to compare three of the most widely used LLMs, ChatGPT, Gemini, and Copilot, on the image-to-text (I2T) task. The ability of each model to describe different types of images in a detailed and precise way was evaluated, including artistic paintings, urban scenes, and images containing instructions. The results show that all three models perform at a high level, with Gemini standing out thanks to its ability to integrate visual and textual information more efficiently. The study also shows that LLMs continue to evolve, so even more significant advances in their ability to understand and generate natural language can be expected. This evolution is likewise expected to allow these models to be applied more widely in people's daily lives, automating processes and helping to improve the development of virtual assistants.
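
The article page does not include code; the following is a minimal sketch, assuming the OpenAI Python SDK, of how an image-to-text (I2T) request of the kind described above might be issued to one of the evaluated assistants. The model name (gpt-4o), prompt wording, and file names are illustrative assumptions, not the authors' experimental protocol.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe_image(path: str, prompt: str = "Describe this image in detail.") -> str:
    """Send a local image to a multimodal model and return its textual description."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Hypothetical files standing in for the three image categories named in the abstract:
# an artistic painting, an urban scene, and an image containing instructions.
for img in ["painting.jpg", "street_scene.jpg", "instructions.jpg"]:
    print(describe_image(img))
```

The same images and prompt would then be submitted to the other two assistants through their respective interfaces so that the resulting descriptions can be compared side by side.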

Published

2025-05-16

How to Cite

Minango Negrete, P. D., Zambrano Vizuete, Óscar M., Minango Negrete, J. C., Minaya Andino, C. A., & León Galeas, C. J. (2025). Innovations in Computer Vision: Evaluation of ChatGPT, Gemini, and Copilot for Image Analysis. CONECTIVIDAD, 6(2), 251–262. https://doi.org/10.37431/conectividad.v6i2.284