[3] Fan, Zhihao, et al. "TCIC: Theme concepts learning cross language and vision for image captioning." arXiv preprint arXiv:2106.10936 (2021).
[4] Chen, Yen-Chun, et al. "UNITER: Universal image-text representation learning." European Conference on Computer Vision. Springer, Cham, 2020.
[5] Li, Xiujun, et al. "Oscar: Object-semantics aligned pre-training for vision-language tasks." European Conference on Computer Vision. Springer, Cham, 2020.
[6] Zhang, Pengchuan, et al. "VinVL: Revisiting visual representations in vision-language models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
