Innovations in Computer Vision: Evaluation of ChatGPT, Gemini, and Copilot for Image Analysis
DOI:
https://doi.org/10.37431/conectividad.v6i2.284
Keywords:
ChatGPT, Gemini, Copilot, AI, Natural Language Processing
Abstract
In recent years, Large Language Models (LLMs) have grown exponentially and evolved rapidly, from their beginnings as simple tools conceived to understand text to today's multimodal systems capable of generating creative and complex content. This innovation has been driven by major advances in neural network architectures and by the availability of large datasets. The main objective of this study is to compare three of the most widely used LLMs, ChatGPT, Gemini, and Copilot, on the task of converting images to text (I2T). The capacity of each model to describe different types of images in a detailed and precise way was evaluated, including artistic paintings, urban scenes, and images containing instructions. The results show that all three models perform at a high level, with Gemini standing out thanks to its ability to integrate visual and textual information more efficiently. The study also indicates that LLMs continue to evolve, so even more significant advances in their ability to understand and generate natural language can be expected. This evolution is likewise expected to allow these models to be applied more widely in everyday life, automating processes and helping to improve the development of virtual assistants.
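As an illustration of the kind of image-to-text (I2T) request evaluated in the study, the sketch below shows how an image description could be obtained from a vision-capable model through the OpenAI Python SDK. This is not the authors' evaluation protocol; the model identifier, prompt wording, and image URL are illustrative placeholders.

```python
# Minimal sketch of an image-to-text (I2T) request via the OpenAI Python SDK.
# Assumptions: "gpt-4o" stands in for any vision-capable model, and the image
# URL and prompt are hypothetical examples, not the study's actual inputs.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this image in detail: subjects, setting, and style."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/painting.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)  # the model's textual description
```

A comparable request could be sent to Gemini or Copilot through their respective APIs or chat interfaces; the comparison in the study rests on the descriptions returned, not on any particular client library.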
License
Copyright (c) 2025 Instituto Superior Tecnológico Universitario Rumiñahui

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Originals published in the electronic edition, under the journal's first-publication rights, belong to the Instituto Superior Tecnológico Universitario Rumiñahui; therefore, the source must be cited in any partial or total reproduction. All contents of the electronic journal are distributed under a Creative Commons Attribution-NonCommercial 4.0 International (CC-BY-NC 4.0) license.