[Tfug] Version Control

Bexley Hall bexley401 at yahoo.com
Wed Mar 27 23:45:18 MST 2013


Hi John,

>>> If you are looking for a monolithic VCS that internally knows about
>>> every imaginable file type ever created, or envisioned, I suggest that
>>> you seek suggestions from somewhere other than a Unix mailing group.
>>
>> I didn't ask for a VCS that did all of those things.  But, I sure
>> as hell don't want a VCS that thinks everything other than "source
>> code written in USASCII" is not worth versioning!
>
> I've been following this thread and, while it seems likely that you
> didn't *mean* this, I think that what you have written would very
> easily lead someone to think that you are indeed looking for either a
> monolithic VCS or a VCS that comes stock with a great many data type
> specific modules.  Indeed, it seemed to me that this was the sort of
> product you have been arguing for.

To clarify:  I want to be able to track the history of *any*
"electronic object" (i.e., file).  I don't want to have to
"worry" that <whatever> will choke on or, worse, *alter* an
object that it encounters that doesn't fit with *its* idea
of what objects should be (i.e., "text").

[E.g., I shouldn't have to examine a huge hierarchy to *ensure*
it doesn't contain files of "some type" with which it will have
problems.  I shouldn't have to move my schematics into a
different hierarchy from my source code; my PDF datasheets
into a different hierarchy from my schematics, etc.]

That need not be a "monolithic" product.  It can be composed
of multiple modules (pick your favorite term:  plug-in?)
interfacing to a common framework ("engine").  The "engine"
would ensure all objects are treated similarly -- as "versioning"
should be a common process applied to all objects that are
presented to it.
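
To make that concrete, here's a minimal sketch (shell; the
*_diff_module names are placeholders, not real tools) of how such an
"engine" might store and version every object identically while
handing the *presentation* of a difference to a type-specific module:

    #!/bin/sh
    # Sketch only: the engine treats all objects alike; only the way a
    # difference is *shown* is delegated to a per-type module.
    old=$1
    new=$2
    case "$new" in
      *.c|*.h|*.txt) diff -u "$old" "$new" ;;               # text: ordinary diff
      *.pdf)         pdf_diff_module "$old" "$new" ;;       # hypothetical plug-in
      *.sch)         schematic_diff_module "$old" "$new" ;; # hypothetical plug-in
      *)             cmp -s "$old" "$new" || echo "objects differ (no module)" ;;
    esac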

Do you consider Firefox to be a "monolithic browser"?  Or,
Eclipse a "monolithic IDE"?  Is Ubuntu a "monolithic OS"?

Recall, I can do this in my bastardized CVS already.  I don't
want to "move backwards" -- nor have to reinvent/reimplement
all of this -- unless I am getting something *significant*
in return!

> Also, all of the VCS that you or anybody else has mentioned *do*
> version everything you toss at them.  Clearly, most only meaningfully
> track changes in text files (as you are well aware), but that doesn't
> mean they don't track versions of binary files.

Let's do away with the term "binary" since *everything* is a "binary
file" -- even a text file!  :-/  The VCS discussed here look at files
as "text" and "non-text".  That's a really crude distinction.  There
are many "non-text" file formats that are in widespread use, well
documented, "open" formats, etc.  Yet, they are treated as "second
(or *third*!) class" objects -- unworthy of the type of support
that is provided to "text" objects ("favored son").

And, there is no reason for that other than "we haven't found a need
to develop handlers/modules/plugins for them -- why don't *you*
write one?"  (Hint:  I already *have* for my "home grown" approach)

All these tools do for "non text" files is store them and record
a version number/log file entry.  A file system and text editor
would do the same!  (where's the added value?)

> I think it is important to remember that UNIX-land has *long* had a
> history of working with data that originates in text form.  A few
> examples:

Of course!  But the world isn't UNIX!  Do you think your paycheck is
printed by a shell script?  And your year-to-date wages totaled
by adding successive entries (lines) in a "text database" onto which
each week's pay has been appended?  :>

Furthermore, expressing something as text just makes it easy
to *process*.  It doesn't make it easy to *understand* (remember,
we're supposed to burn CPU cycles to make things easier on
developers -- and easier on *users* than developers!!).

Here's a gesture definition from my gesture library.  This is what
you'd see in the output of "cvs diff" on the library:
	Version 1:	232 475 218 683 288 591
	Version 2:	165 660 218 683 288 591
What does it tell you -- besides "version 2 is different from
version 1"?  OTOH, once *rendered* you'd say, "Oh!  I see..."

(I picked this example because I render using gnuplot)
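
E.g., all it takes to *see* the change is a couple of lines of shell
(assuming those six numbers are three X/Y pairs; the file names are
made up):

    # Hypothetical file names; each holds one version of the gesture as X/Y pairs.
    printf '232 475\n218 683\n288 591\n' > gesture.v1
    printf '165 660\n218 683\n288 591\n' > gesture.v2
    gnuplot -persist -e "plot 'gesture.v1' w linespoints t 'v1', 'gesture.v2' w linespoints t 'v2'"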

I.e., while something in "text" might be easier to *process*
(i.e., making life easy for the algorithms) it burdens the
*human* consumer (developer or "nontechnical user").

I can represent a TIFF as a text file -- but looking at that
text won't tell you whether you are viewing an image of a horse
or a boat!  (i.e., the fact that the image has been *changed* from that
of a horse to that of a boat will be lost on you).  You have to
take a second step to figure out what the difference "means".

> * graphs in Gnuplot are not generated (usually) interactively or
> graphically but rather via simpler scripts
>
> * LaTeX/TeX document output (PS or PDF) is generated from a text file.
>   Similarly, there are many other document formats that work this way,
> such as (X)HTML, DocBook (XML), and so on.

And if you look at the diff between two PS files, it tells you little
more than "version 1 is different from version 2".  I.e., you just
get an output that makes sense to something with the cognizance of
a computer!

OTOH, rendering the PS and comparing the *outputs* makes the differences
very apparent!  ("Ah, the margins are slightly different; this line
breaks after "foo" instead of "bar"; the gradient fill is light to
dark instead of dark to light, etc.")
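
(One crude way to do that by hand -- assuming Ghostscript and
ImageMagick are available, and with made-up file names for a
one-page document:)

    # Rasterize both revisions, then flag the pixels that changed.
    gs -q -dBATCH -dNOPAUSE -sDEVICE=png16m -r150 -sOutputFile=v1.png doc.v1.ps
    gs -q -dBATCH -dNOPAUSE -sDEVICE=png16m -r150 -sOutputFile=v2.png doc.v2.ps
    compare v1.png v2.png delta.png    # ImageMagick: differing regions are highlighted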

> Even though the eventual output is binary, the source is not.  I have
> written many documents in LaTeX along with a much longer manual using
> DocBook.  Add to that webpages and other random bits.  I've stored all
> of these in a VCS at one time or another and since the source is text
> you don't need to go through any extra steps to have your VCS track
> fine line-by-line changes.

Sure!  I abandoned Ventura Publisher when it abandoned the text
format for representing documents (an early "tagged" format).
But that was because the tools/tricks I had developed to provide
features that VP did not include would no longer work.

But, I don't want to be restricted to using *just* a tool that
produces text files simply because it would make life easier
for a VCS that *prefers* to process text.

> My point is not that this is *the* way or how everybody should do
> things, rather that UNIX has a long history of this and this history
> might offer some explanation as to why many of these
> binary-diff-versioning features are not present.  Maybe.  :)

I still contend that lack of support for non-text formats is
a consequence of developers' focus on "writing code".  If the
folks writing these tools were all PHOTOGRAPHERS, then the
tool would have been designed to allow comparing *photos*.
And, if you wanted to use it to track *text*, you would first
"print" the text, then photograph it, etc.  That would work to
*show* you what had changed -- but wouldn't be of much help
"harvesting" those changes.

[This is the problem I have with my crude way of handling
schematics, PCB layouts, CAD drawings, etc. -- sure, I can
see that this screw is now a Torx head instead of a Phillips
head... but there's no way to "cut and paste" that information
from the drawing(s)/diff to a real "drawing"!  I have to cut
and paste via my *head* (which is subject to errors) -- another
thing machines are supposed to minimize for us!]

>> [Perforce has been recommended by several of my peers as the "least
>> bad" option given my constraints.  MS's "Team Foundation" has seen
>> nothing but admonitions against its use!  I'll build a perforce
>> server once I understand the best platform on which to host it.
>> Then, start checking in parts of my repository to see what working
>> in it is like, when it bogs down, how it handles crashes, etc.
>> For yucks, I may similarly check in the same portion of the repository
>> to git and svn -- on the same server and client.  Then, have *real*
>> data by which to compare their performance on "nontrivial" data sets]
>
> **Yes**  In the end this is the only way to really answer your question.
> I'm sure you have been able to rule out a number of products, but for
> many of the requirements you have mentioned, a side-by-side test is
> the only way to get a truly useful comparison.

I've read dozens of blog entries, white papers, "whiny rants", etc.
All in all, they say absolutely nothing!  Everyone has an axe to
grind or a product to pitch (even if it is "free").  Often the
"I don't like being told I *have* to do something" attitude is
thinly veiled...  The same sort of thing that pops up when you
have to deal with "coding standards"...

I've been told which products to "avoid at all costs".  I have a
sh*tload of firsthand experience with (my bastardized) cvs.  And,
I don't have to concoct bogus "examples" (and wonder what hidden
lazinesses/biases have crept into them) with which to exercise
a VCS candidate.  I can use my existing repository to stress the
tools and see how they behave.

But, I have to make sure I design the experiment *before* starting
on it.  I want to be able to compare apples to apples so I need
to ensure I have "servers" and "clients" of each candidate.  And,
host them on different OS's as well as the same set of iron (else
someone will always "complain" that their favorite was tested in
an environment in which it doesn't excel -- hence the reason it
did so poorly, etc.)

I figure I can set up a single machine (x86) and install Windows,
then Solaris (Intel) and BSD and use this to host each server.
Use a second machine (I don't think I have two identical machines)
to host the Windows, Solaris and BSD *clients* -- in addition
to trying to run the client on the server!

That way, I should be able to isolate the effects of hardware
variations on the client and server ends.  And, the "local client"
would let me eliminate the effect of network transport delays.

Then, just see how each combination handles various repositories
and actions on those repositories.  I.e., what constitutes "too
big" for each VCS?

[Of course, there is still the issue of "$FavoriteCVS was crippled
because you didn't provide enough RAM in the server; the disk was
too slow; cache was too small; yadayadayadayada..."  This is why
all benchmarks are silly -- except those that apply to *your*
actual usage patterns!]

I'm most excited about Perforce as it, at least, appears to
have approached the problem systematically.  E.g., a *real*
database used for metadata -- instead of ad hoc "text files"
littering the hierarchy.  (I will also want to test deploying
the repository on a read-only medium.  IMO, this is a "belts
and braces" approach to ensuring integrity!)  I see they
have published the schema that they use so, perhaps, I can
access and augment it to fit my needs.  (it will be nice to be
able to extract data from the DBMS directly instead of having
to rummage through the filesystem)

>> How many FOSS *programmers* are busy crafting extensions to
>> $Favorite_VCS that support PDF objects?  DWG drawings?  MathCAD
>> notebooks?  PFM font definitions?  etc.?  (if so, where are they
>> all hiding their works??)  Making an extensible "product" is
>> pretty useless if the extensions don't exist!
>>
>> ...
>> Again, this just reinforces my point that FOSS software is written
>> by folks concerned with aspects of WRITING SOFTWARE and little more.
>> (alternatively, perhaps they lack skillsets in other disciplines?
>> how else do you explain the lack of support for other *popular* file
>> types??)
>
> Hmmm... good questions.  I suspect that you may be correct for an
> average free-time FOSS developer.  Such a person likely does not have
> sufficient need to bother working on something like this, especially
> since whichever VCS they choose to use *will* keep track of different
> versions of any binary files they toss in.  I suppose this is enough.

I think it is more fundamental than that.  How many FOSS developers
design hardware?  Write formal specifications?  Produce camera ready
artwork/adverts?  etc.

I.e., they can be completely clueless to the issues involved.  And,
on top of that, have no "personal stake" in creating a solution that
addresses these issues.

E.g., I care about CAD drawings because I *make* CAD drawings -- and
have to version track them.  Ditto schematics, specs, artwork, etc.
(tracking written correspondence, etc. is just icing on the cake)

OTOH, I *don't* write music!  So, whatever tools exist to assist in
writing music (undoubtedly creating "special files" in the process)
are of no interest to me.

Similarly, I only use AutoCAD for CAD(D) work.  So, I don't bother
trying to handle CATIA or MicroStation files.  Ditto for Schematic
and PCB tools, etc.  As Tom said: force everyone to standardize
on one tool for each type of job.  (If a client asks me to use a
different tool, I make it very painful -- $$ -- for them to
continue down that path... because it costs me A LOT to handle
something "different")

So, I may have more "vested interests" than the average FOSS
developer, but I am just as "selfish" with my time (i.e., I
haven't bothered doing anything to solve this problem for others
*or* for solving it "properly" for myself!)

I suspect a "commercial concern" will be more likely to handle
more "file types" as their motivation will be some request from
a potential cu$tomer.  (E.g., Perforce apparently has a merge/diff
that can handle many graphic image formats, presenting the results
*visually*)

The fact that so many mainstream tools run on Windows complicates
things.  E.g., I can't render an AutoCAD DWG on one of my UN*X
boxes and pipe the output into any sort of diff -- since AutoCAD
runs on a Windows machine (hence the reason I stipulated running
the VCS server on a *BSD box)

> From my own experience, in one of my slightly larger projects, I
> dumped most everything in the VCS (which was using CVS originally,
> then I switched it to Subversion).  Nearly all of it was text of one
> form or another.  This included all of the source, of course, but also
> the ChangeLog and README documents, the contents of the website, and
> some binary resources.  The binary parts were some bitmap resources,
> the website images, and a small number of pre-built binaries to make
> things a bit easier for users.

Sure!  CVS has been "adequate/effective" at managing *software* over
the years.  And, (true) BLOBs just get checked in and checked out.
The problem has always been those things that aren't text but are
still *common* objects (photos, scanned images, PDFs of datasheets,
etc.).  In those cases, all CVS did was act like a (slower!) disk
drive since all it could tell me was "version X is different from
version Y" (D'uh...)

> Since the amount of binary data was so small, I didn't need to bother
> with anything beyond text change tracking provided by the VCS.  It
> also kept track of versions of the binary files and that was enough.
> I mention this because I (and this project) likely fall into that
> category of FOSS developers who are sufficiently served by the
> available tools and therefore don't need to extend the tool to handle
> other/newer binary file formats.

But organizations have *lots* of different objects that need
versioning!  A tool that just works for the "programmers" isn't
going to help the "hardware jocks" or the accountants (who want
to version spreadsheets, ledgers, reports, etc.) or the personnel
folks (who want to version policies, records, etc.) or the IT
guys (who want to keep track of machine and system configurations),
etc.

When one "department" needs to interface with another, its nice if
that can be through a *common* tool -- instead of having to
get someone to explain the current version of their *particular*
tool to them in enough detail that they can use it effectively.

Imagine writing a manual and having to learn whatever VCS the
"testing" folks used to record their test suites; yet another
to access the "most recent versions" of the illustrations the
art department prepared; etc.  Then, when Tech Support wants
to access your manual, *they* have to learn about *your*
VCS...

And, as more and more businesses (esp small) "discover" the need
to maintain histories in a structured manner (instead of the
electronic equivalent of a "shoebox full of receipts"), this
will become even more pressing.

I'm more of a "business"/organization than a "FOSS" and, thus,
am worried about more issues than just writing software (perhaps
10% of my time/effort?).  I *shirley* don't want to release all
of this stuff in its kludgey condition!  Given that alternative,
I'd just elide all the history and pretend it was all "born
fully formed"  :>  (which, I think, would be a serious compromise
as I have greatly benefited from examining the evolution of much
of the software that I've "inherited")

But, there's no incentive for me to adopt a tool that just favors
one aspect of the process over all others!



