DriveSeg is a dataset created for research on road-scene awareness (e.g., for self-driving cars). Every frame of the video carries pixel-by-pixel semantic labels over the entire image, with 12 categories: vehicle, pedestrian, road, sidewalk, bicycle, motorcycle, building, terrain (horizontal vegetation), vegetation (vertical vegetation), pole, traffic light, and traffic sign.
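For working with the annotations in code, a label map like the one below is handy. Note that the integer encoding here is an assumption for illustration only; the actual class indices should be checked against the dataset's documentation.

```python
# Hypothetical index-to-class mapping for the 12 DriveSeg categories.
# The ordering is assumed, not the dataset's official encoding.
DRIVESEG_CLASSES = {
    0: "vehicle",
    1: "pedestrian",
    2: "road",
    3: "sidewalk",
    4: "bicycle",
    5: "motorcycle",
    6: "building",
    7: "terrain",        # horizontal vegetation
    8: "vegetation",     # vertical vegetation
    9: "pole",
    10: "traffic light",
    11: "traffic sign",
}
```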
The manually annotated version provides 5,000 frames at 1080p@30Hz, and the semi-automatically annotated version provides 20,000 frames at 720p@30Hz. Even outside autonomous driving, it should be useful for segmentation research. (I plan to use it for separating specific objects from an image, replacing the background, and improving the performance of segmentation algorithms; see the sketch below.)
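As a minimal sketch of that object-separation / background-replacement idea: assuming the annotations are stored as single-channel PNG masks whose pixel values are class indices, and that pedestrians use index 1 (both assumptions; the file names below are also hypothetical), one can cut out a class and composite it over a new background like this:

```python
import numpy as np
from PIL import Image

PEDESTRIAN_ID = 1  # assumed class index; verify against the dataset docs

# Load a frame and its per-pixel class mask (hypothetical file names).
frame = np.array(Image.open("frame_000123.png").convert("RGB"))
mask = np.array(Image.open("frame_000123_mask.png"))  # H x W of class indices

# Binary mask selecting only pedestrian pixels.
is_pedestrian = mask == PEDESTRIAN_ID

# Replace everything except pedestrians with a new background,
# resized to match the frame (PIL resize takes (width, height)).
background = np.array(
    Image.open("beach.jpg").convert("RGB").resize((frame.shape[1], frame.shape[0]))
)
composite = np.where(is_pedestrian[..., None], frame, background)

Image.fromarray(composite.astype(np.uint8)).save("composite.png")
```

The same binary-mask trick also gives ground truth for evaluating a segmentation model: compare its predicted mask against `is_pedestrian` per class.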