Vision-Language Pre-training (VLP) has attracted rapidly growing attention from both the computer vision and NLP communities, driven in particular by the emergence of multimodal foundation models ...