"Sources" can be things like input video files, or "functions" which operate upon other sources. A finished music video is represented by a function combining lots of video clips, which themselves could be functions pulling bits of video from a longer video. A source provides some minimal information about itself (length, framerate, frame size) and is able to provide a frame given a frame number.
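As a rough sketch of what a source might look like (all names here are made up for illustration, and a frame is just a small grid of grey values standing in for real image data): a source carries its length, framerate and frame size, and hands back a frame for a given frame number. A "function" such as clip is then simply another source built on top of an existing one.

```python
from dataclasses import dataclass
from typing import Callable, List

Frame = List[List[int]]  # rows of greyscale pixel values; a stand-in for real image data


@dataclass(frozen=True)
class Source:
    length: int        # number of frames
    framerate: float   # frames per second
    width: int         # frame size in pixels
    height: int
    get_frame: Callable[[int], Frame]

    def frame(self, n: int) -> Frame:
        """Return frame number n, counting from 0."""
        if not 0 <= n < self.length:
            raise IndexError(f"frame {n} out of range 0..{self.length - 1}")
        return self.get_frame(n)


def numbered(length: int, framerate: float = 25.0,
             width: int = 4, height: int = 2) -> Source:
    """Stand-in for a video file: every pixel of frame n has the value n."""
    return Source(length, framerate, width, height,
                  lambda n: [[n] * width for _ in range(height)])


def clip(src: Source, start: int, length: int) -> Source:
    """A "function" source: `length` frames of `src`, beginning at frame `start`."""
    if start < 0 or length < 0 or start + length > src.length:
        raise ValueError("clip extends outside its source")
    return Source(length, src.framerate, src.width, src.height,
                  lambda n: src.frame(start + n))
```

With this, clip(numbered(100), 10, 5) is a five-frame source whose frame 0 is frame 10 of the longer one. Nothing is copied; frames are only fetched when asked for.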
An interface lets the user build up a collection of sources (with a "preview" window for viewing the frames of any one of them), pick one source as the "output", and write its frames to a file. Useful functions to have: one-after-another (with fading between them), side-by-side, rescale, change framerate, generate black, generate timecode, pass through transcode filter.
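A few of these functions can be sketched quite briefly. The following is only an illustration under some assumptions: frames are small grids of grey values (0 is black), the names solid, sequence and side_by_side are invented here, and the fade is a plain linear cross-fade over an overlap of `fade` frames, which is one possible reading of "fading between them".

```python
from typing import Callable, List, NamedTuple

Frame = List[List[int]]  # rows of greyscale pixels, 0 = black (an assumption)


class Source(NamedTuple):
    length: int
    framerate: float
    width: int
    height: int
    frame: Callable[[int], Frame]


def solid(value: int, length: int, framerate: float = 25.0,
          width: int = 2, height: int = 2) -> Source:
    """Constant-colour source; solid(0, ...) is "generate black"."""
    return Source(length, framerate, width, height,
                  lambda n: [[value] * width for _ in range(height)])


def blend(a: Frame, b: Frame, t: float) -> Frame:
    """Linear mix of two equal-sized frames: t=0 gives a, t=1 gives b."""
    return [[round((1 - t) * pa + t * pb) for pa, pb in zip(ra, rb)]
            for ra, rb in zip(a, b)]


def sequence(a: Source, b: Source, fade: int = 0) -> Source:
    """a then b, cross-fading the last `fade` frames of a into the first of b."""
    if (a.framerate, a.width, a.height) != (b.framerate, b.width, b.height):
        raise ValueError("sources must match; rescale or change framerate first")
    if not 0 <= fade <= min(a.length, b.length):
        raise ValueError("fade is longer than one of the sources")
    start_b = a.length - fade  # output frame at which b begins

    def frame(n: int) -> Frame:
        if n < start_b:
            return a.frame(n)
        if n < a.length:  # inside the overlap: mix the two sources
            t = (n - start_b + 1) / (fade + 1)
            return blend(a.frame(n), b.frame(n - start_b), t)
        return b.frame(n - start_b)

    return Source(a.length + b.length - fade, a.framerate, a.width, a.height, frame)


def side_by_side(a: Source, b: Source) -> Source:
    """Place b to the right of a; runs for as long as the shorter source."""
    if (a.framerate, a.height) != (b.framerate, b.height):
        raise ValueError("sources must share framerate and height")
    return Source(min(a.length, b.length), a.framerate, a.width + b.width, a.height,
                  lambda n: [ra + rb for ra, rb in zip(a.frame(n), b.frame(n))])
```

Because every function returns another source, they nest freely: side_by_side(sequence(x, y, fade=10), z) is itself something that can be previewed or chosen as the output. The mismatch errors are where rescale and change-framerate would come in.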
A simple text format could describe the set of functions that produce a video. Since frames would normally be accessed sequentially, a source that provides frames from a file would be easy to implement; for more complex operations, caching could be provided transparently (or non-transparently, as a "function").
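One possible shape for such a text format, purely as a guess: one line per source, written as `name = function arg ...`, where an argument that names an earlier line refers to that source. The sketch below evaluates that made-up format. To keep it short, frames are just text labels, the "file" function fakes a video file rather than reading one, and caching is done the non-transparent way, as an ordinary function wrapping another source.

```python
from functools import lru_cache
from typing import Callable, Dict, NamedTuple


class Source(NamedTuple):
    length: int
    frame: Callable[[int], str]  # frames are just labels here, to keep the sketch small


def file_source(name: str, length: str) -> Source:
    """Stand-in for reading a video file: frame n is the label "name:n"."""
    return Source(int(length), lambda n: f"{name}:{n}")


def clip(src: Source, start: str, length: str) -> Source:
    first = int(start)
    return Source(int(length), lambda n: src.frame(first + n))


def sequence(a: Source, b: Source) -> Source:
    return Source(a.length + b.length,
                  lambda n: a.frame(n) if n < a.length else b.frame(n - a.length))


def cache(src: Source, size: str = "32") -> Source:
    """Caching as an ordinary function: remember the most recently used frames."""
    return Source(src.length, lru_cache(maxsize=int(size))(src.frame))


FUNCTIONS = {"file": file_source, "clip": clip, "sequence": sequence, "cache": cache}


def load(script: str) -> Dict[str, Source]:
    """Evaluate lines of the form `name = function arg ...`.

    An argument naming an earlier line is replaced by that source; anything
    else is passed to the function as a string.
    """
    env: Dict[str, Source] = {}
    for line in script.splitlines():
        line = line.split("#", 1)[0].strip()  # allow comments and blank lines
        if not line:
            continue
        name, expr = (part.strip() for part in line.split("=", 1))
        func, *args = expr.split()
        env[name] = FUNCTIONS[func](*(env.get(arg, arg) for arg in args))
    return env


EXAMPLE = """
movie  = file movie.avi 1000   # hypothetical input file
intro  = clip movie 0 3
chorus = clip movie 500 2
output = cache sequence_placeholder
"""
```

The last line of EXAMPLE is deliberately left unevaluable as written: each line takes a single function, so nesting means naming the intermediate step first (`joined = sequence intro chorus`, then `output = cache joined`). A richer format might allow nested expressions instead.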
This'd also make a much cooler Haskell example than the Pictures stuff that first-year UKC students do.
Avisynth does this under Windows.