Digital technology promises to support exciting new kinds of study and handling of text; and evidently there is a vast proliferation of different materials available on the Internet. But elation can quickly turn to dismay when we confront the realities of e-texts on the Internet: far from being a well organized and managed collection of usefully edited materials, such as we are accustomed to finding in libraries, the Internet proves to be a wild nobody's-land with a changeable climate and an uneven terrain....
The actualities of the Internet belie the promises of the hype; still its offerings are formidable when they can be realized. And if locating, selecting, and configuring electronic texts for use in a variety of ways involves us in practical difficulties, this very fact creates an opportunity for skilled scholars and professionals to provide an added value and contribute to development of long-term solutions.
These guidelines are presented to help evaluate electronic text as you may find it, either offered to you directly or discovered and downloaded from the Internet. The same considerations of quality will apply even more particularly if you are responsible for the production and publication of text in electronic form.
Cost-benefit considerations are always foremost when finding and evaluating electronic texts. These are factors to consider:
The usual way of locating texts at present is via the Worldwide Web; other methods include gopher and direct ftp (file transfer protocol) from a known site (both these older methods are also supported by web browsers). Depending on the capabilities and settings of your client software (for example Netscape) and the configuration of the site you are looking at, a text will either appear to you as an item on a list (such as a directory listing of files on the remote host), or be displayed directly on screen (in which case you can usually discern its filename by looking at your client, for example in Netscape's "location" window). Your client software will give you a way to save the file on your system in either case.
These are the categories of file formats you will encounter:
ASCII is the underlying format of almost all existing text file types. In fact, ASCII is a character set, not a file type: "ASCII" stands for "American Standard Code for Information Interchange." It specifies the numerical codes for 128 characters (the number provided by a "seven bit word," i.e. a unit of seven binary digits): the 52 upper- and lower-case English letters, the numerals, a selection of punctuation marks and a series of "control characters" (allocated to controlling how a system handles a text, for example by providing a carriage return and line feed). A text file in "ASCII format," then, is a plain file containing only ASCII characters. Since ASCII has been subsumed in the ISO 646 standard character set, other text formats are generally built as extensions or specialized uses of it. With the exception of certain of the proprietary text-file types (which may use an "8-bit" or extended ASCII character set and/or a binary code of some kind), all the other categories described here, including both HTML and SGML, are various ways of using ASCII to provide encoding along with the "raw text."
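As a rough illustration of the seven-bit limit described above, the following Python sketch tests whether a run of bytes stays within the 128 ASCII codes (0 through 127). The function names and sample data are made up for the example; nothing here comes from a particular text-handling package.

```python
def non_ascii_positions(data: bytes) -> list[int]:
    """Return the offsets of every byte that falls outside 7-bit ASCII (0-127)."""
    return [i for i, b in enumerate(data) if b > 127]


def is_plain_ascii(data: bytes) -> bool:
    """True when every byte fits in seven bits."""
    return not non_ascii_positions(data)


# Illustrative samples: one plain line, one word carrying an 8-bit character.
sample_clean = b"A plain line of text.\r\n"
sample_8bit = "caf\u00e9".encode("latin-1")  # e-acute is byte 0xE9 in Latin-1

print(is_plain_ascii(sample_clean))      # True
print(non_ascii_positions(sample_8bit))  # [3]
```

To check a downloaded file, read it in binary mode (`open(path, "rb").read()`) and pass the result to `is_plain_ascii`.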
HTML is, of course, the file format of web pages. An HTML file is an ASCII file (actually a more-or-less valid SGML file; see below) containing markup which can be used by an HTML browser such as Netscape, Microsoft Internet Explorer or Lynx to format the text for the screen or to refer to external data such as other HTML, image or sound files. If you are looking at a web page formatted in Netscape, you are looking at an HTML file--you can use the "Document Source" option in the "View" menu to see the underlying HTML code. The Worldwide Web works as a "web" because HTML files can invoke one another, thus enabling a browser to link from one to the next with a mouse click.
virtues:
An HTML file can be read and printed directly with a web browser, now a
ubiquitous piece of software. It supports a limited kind of "multimedia."
Most browsers support string searching; i.e. you can search the entire
file for a word or phrase. A simple automated routine can be used to strip
the HTML tags and create an unformatted ASCII file (some of the browsers
will even do this for you).
liabilities:
HTML markup describes how a page should be displayed in a browser, but
does not serve well to indicate any structures or phenomena in the text
other than those proper to web pages (such as paragraphs, certain levels
of headers, certain types of lists). It is thus unsuited for use in any
other kind of application, such as a word processor, text-analysis package
or database. HTML supports pointing within and between texts; but since
text structures are largely unidentified as such, even in its native environment
it becomes unwieldy for texts longer than several hundred words. HTML also
does not in itself support a range of common non-standard characters (characters
outside of the ASCII character set). Driven forward by corporations such
as Netscape and Microsoft, and coordinated by the Internet Engineering
Task Force (IETF), the HTML standard is still evolving; but it is not clear
that much legacy HTML is likely to be converted to include richer markup--and
in any case, HTML will always be HTML.
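The "simple automated routine" for stripping HTML tags mentioned among the virtues can be sketched in Python with the standard library's `html.parser` module. This is a minimal sketch, not a full converter: the class name, function name and sample page are invented for the example, and it makes no attempt to insert line breaks where block elements such as paragraphs end.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect character data and discard the tags around it."""

    def __init__(self):
        # convert_charrefs=True turns references such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup: str) -> str:
    """Return the text content of an HTML string with all tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


page = "<h1>Title</h1><p>Some <b>bold</b> text &amp; more.</p>"
print(strip_html(page))  # TitleSome bold text & more.
```

Note how the heading and the paragraph run together in the output: once the tags are gone, so is the little structural information HTML recorded.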
These formats include ad hoc text encoding, along with encoding schemes such as COCOA (named after an early general-purpose concordancing program for which it was designed). Supplementary documentation about a file will usually indicate such markup; or it will be evident from visual inspection. Many such schemes are not standardized except within the context of a given project. The best texts using such encoding schemes have been carefully composed and proofed for consistency and integrity; some programs such as the Oxford Concordance Program or the University of Toronto's TACT package (for text analysis) are able to use such kinds of encoding to provide a reference scheme (so that, for example, a computer-generated concordance can indicate chapter and line number where citations occur).
At the opposite extreme, some "home brew" schemes of this kind are created by researchers for single uses when they find they can't do the kinds of processing they want with a plain ASCII file. (For example, many of us have used *asterisks around a text* as a kludge for rudimentary markup, unrecognized by any application program, to indicate emphasis or set off a text for another purpose.) To be sure, such texts are not likely to be generally available on the Internet; when they are found they can usually be considered a form of less-than-clean ASCII.
virtues:
These schemes can be very useful in the context of their development, but
more or less useless outside it. In the best cases, such as COCOA markup,
they can be converted into another more immediately useful form such as
a rudimentary SGML.
liabilities:
Ad hoc schemes can be very capricious and inconsistent. Even when consistent
they can get in the way. Such markup can sometimes be stripped by a scripted
routine or simple program, in which case you will usually have a plain
ASCII file of some kind.
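Taking the asterisk convention mentioned earlier as an example, a scripted routine of this kind might look like the Python sketch below. It assumes one particular home-brew rule (text between two asterisks on the same line marks emphasis); the tag name `emph` and the function names are hypothetical, chosen only for illustration.

```python
import re

# Assumed convention: a run of text between two asterisks marks emphasis.
EMPHASIS = re.compile(r"\*([^*\n]+)\*")


def asterisks_to_tags(text: str, tag: str = "emph") -> str:
    """Rewrite *marked* spans as rudimentary start and end tags."""
    return EMPHASIS.sub(rf"<{tag}>\1</{tag}>", text)


def strip_asterisks(text: str) -> str:
    """Drop the asterisks and keep the words, giving plain ASCII."""
    return EMPHASIS.sub(r"\1", text)


sample = "a *very* good text"
print(asterisks_to_tags(sample))  # a <emph>very</emph> good text
print(strip_asterisks(sample))    # a very good text
```

The pattern cannot tell emphasis from any other use of the asterisk (a footnote mark, a multiplication sign), which is exactly the kind of inconsistency that makes ad hoc schemes risky to convert automatically.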
Most of us work with electronic text within a particular application. If you use WordPerfect, for example, the program saves the file in its own format, with its own kind of encoding for features such as bold or italic type, changes in fonts or font size, headers and footers, etc. Sometimes you will be offered these texts or will encounter these files on the net--the provider has decided that the value added by the formatting exceeds the price paid by its dependence on the application program.
This is the case with PostScript files (Adobe PostScript is a proprietary code which controls the printing of documents; Adobe Systems licenses its use to a wide range of printer manufacturers). This is a fairly common way of providing documentation over the net, providing a certain measure of security for distributors who want to release texts in print without incurring the cost of printing and shipping, but who don't care to see the e-text manipulated directly. Another example is Adobe's .pdf format for its Acrobat application. By giving away copies of the Acrobat reader (a client program which displays .pdf files), Adobe hopes to create a market for networked publication of files which can include layout, fonts, graphics and hyperlinking better than that afforded by HTML. Unfortunately, if you want to do anything with a .pdf file not supported in Acrobat, you're out of luck.
virtues:
If you use the particular application which supports the format in question,
and it can do what you want to do, you're in business. Or, with some of
these formats (most of the word processing formats), you can call the file
up in the application program and then convert the format, for example
using the "Save As" option to save it in plain ASCII, thereby
getting it ready for another kind of encoding. The better word processors
also support file format conversion to read each other's formats, which
may work more or less well.
liabilities:
These are pretty clear. Quite often, a file like this is completely unusable.
PostScript files can be printed, for example, if you have a PostScript
printer; but that's about it. (Some might argue a PostScript file isn't
"electronic text" at all, but rather a kind of "electronic
print.") If the application is obsolete, or merely unavailable, proprietary
encoding is usually more trouble than it's worth.
As is clear from the foregoing examples, the great need for text encoding to support various applications of electronic text--plain ASCII is seldom enough--has spawned a variety of proprietary and non-proprietary schemes, which has created problems of its own. Standard Generalized Markup Language (SGML) is not really an encoding scheme in its own right: it is actually a standard (ISO 8879:1986) for designing and specifying encoding schemes--so-called "markup languages," or "tag sets" because they take the form of tags to be embedded in the text.
These tag sets are defined formally: every SGML text, if it is valid SGML, conforms to a DTD ("Document Type Definition"). An SGML DTD is a set of rules (its syntax is defined in the standard) which specifies what tags are to be called and how they may be used. Different DTDs are used to define tags for different document types--for example, aircraft maintenance manuals, transcriptions of medieval manuscripts, or sections of the tax code. Because documents of a single type all conform to a single DTD, once a system is set up to process one of them, any of them can be handled the same way.
In fact, HTML is an SGML tag set and has its own DTD (HTML 2.0 is the current standard DTD; the next version will be HTML 3.2). Most HTML actually encountered on the Web, however, is not valid; that is, it does not fully conform to the rules set out by the DTD. As far as the Worldwide Web is concerned this is okay, because browsers such as Netscape and Explorer only display the text and provide simple operations like inserting images in line and linking to other files. Since strict conformity to the formal rules is not much of an issue to most creators of HTML, but only how the text looks and works in the browser, we generally spare ourselves from learning technicalities of the tag set and the validation of instances. If HTML texts were to support more sophisticated functions, however (for example, if HTML provided tags to mark bibliographic citations for indexing), validation would become an issue.
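To make the idea of validation concrete, here is a deliberately simplified Python sketch. It is not an SGML parser and does not read a real DTD: the `ALLOWED_CHILDREN` table is a hypothetical stand-in for one, covering an invented "poem" document type, and the checker ignores attributes, tag omission and everything else the standard provides. It only shows what it means for a text to conform, or fail to conform, to formally stated rules about which tags may appear where.

```python
import re

# Hypothetical rule set standing in for a DTD: each element lists the
# elements allowed directly inside it.
ALLOWED_CHILDREN = {
    "poem": {"title", "stanza"},
    "title": set(),
    "stanza": {"line"},
    "line": set(),
}

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>")


def check(document: str, root: str = "poem") -> list[str]:
    """Return a list of problems; an empty list means the toy rules are met."""
    problems = []
    stack = []  # elements currently open, innermost last
    for match in TAG.finditer(document):
        closing, name = match.group(1) == "/", match.group(2)
        if name not in ALLOWED_CHILDREN:
            problems.append(f"undeclared element <{name}>")
        elif closing:
            if stack and stack[-1] == name:
                stack.pop()
            else:
                problems.append(f"unexpected </{name}>")
        else:
            if not stack and name != root:
                problems.append(f"<{name}> cannot be the root")
            elif stack and name not in ALLOWED_CHILDREN[stack[-1]]:
                problems.append(f"<{name}> not allowed inside <{stack[-1]}>")
            stack.append(name)
    problems.extend(f"<{name}> never closed" for name in stack)
    return problems


good = "<poem><title>T</title><stanza><line>a</line></stanza></poem>"
bad = "<poem><line>a</line></poem>"
print(check(good))  # []
print(check(bad))   # ['<line> not allowed inside <poem>']
```

A browser that merely displays text would show both samples without complaint; only software that relies on the structure (an indexer, say) needs the second one rejected.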
virtues:
Because it allows for validation against the standard, SGML is by far the
most versatile and stable form of text encoding presently available. Formally,
SGML texts are application- and platform-independent--an SGML tag set supports,
at least in principle and often in practice, sophisticated screen display
and navigation, analytical functions, indexing, formatting for printing;
and when you can't process the text the way you want in one SGML-conformant
software application, you can use another. Since SGML tag sets can be defined
by users independently of software companies, there is a vast, and exciting,
range of ways to encode texts. Markup can be used to describe virtually
anything at all; its conformance to the SGML standard assures that it can
be used in any application program supporting SGML. This means SGML is
also suitable for providing "archival" encoding. If a repository
wants to put its texts up on the Web, SGML can be readily converted into
HTML (the reverse conversion is quite cumbersome); alternatively, it can
be indexed for searching, provided with layout for printing, or adapted
to other uses.
liabilities:
Although a great deal of the electronic text extant in the world is already in
SGML (its advantages in large-scale document management are so considerable
that many large corporations and industries have already moved to it--the
Department of Defense, the IRS, the Boeing Corporation, the governments
of Canada and Switzerland), it is still fairly rare to encounter it as
a publication format in its native form. (One major exception is hypertext
documentation of computer hardware.) More to the point, although SGML texts
on the net may come with some documentation and support on using them (sometimes
including links to free browser software), even then it is likely to be
quite opaque to the inexperienced user. In order to make good use of SGML,
expertise in the standard and its implementation is essential. Users and
repositories experienced in creating SGML texts will be the ones best able
to use SGML texts created by others.
In other words, while SGML is powerful and flexible, it requires considerable support--in the form of expertise even more than software. This entails an investment of time and resources which can be daunting and expensive at the beginning. As a result, the use of SGML tends to create user communities around cores of expertise, distributed within more-or-less formally defined institutional or cultural settings where its benefits are manifest. The U.S. Defense industry, which uses an SGML DTD to manage DOD procurement and project management (called the CALS DTD for "Continuous Acquisition and Life-Cycle Support"), is one such user community; much more loosely, the Worldwide Web, which "speaks" HTML, constitutes another. Use of SGML in Humanities disciplines was spurred by the Text Encoding Initiative (TEI), which in 1987 set about developing tag sets and guidelines for application-independent text encoding for scholarly texts: the TEI is at the core of another such community. Another is now emerging in the world of research libraries and archives with the release of the Encoded Archival Description (EAD), an SGML tag set designed for the creation of archival finding aids and electronic access tools.
There is no way any file format, as such, can guarantee quality, and the quality of texts varies extremely widely. At one extreme are texts which could be published in any medium; at the other are texts with various errors, inconsistencies and gaps. The only rule is caveat emptor. Visual inspection of a file can usually tell you a lot; if you plan to do extensive work with a given file, however (or are responsible for vouching for it), careful inspection and sampling are advised.
Certain kinds of authentication (see below) can assure quality, if not absolutely guarantee it. One problem sometimes encountered with texts on the Internet is a defective file: a creator has set out to do more than proves feasible, and quits before finishing, leaving significant (sometimes large) portions missing from the text. Only careful, informed inspection, and/or the authentication of a trustworthy source, can assure that the text you are looking at isn't one of these.
When dealing with a text which has been converted from some other medium (probably print, possibly a manuscript), concerns for quality may be divided into two categories: the quality of the transcription, and the kind and quality of the encoding.
Traditional criteria for judging the quality of a transcription apply equally, of course, to an electronic text. "Is the transcription accurate and complete?" is the main question. Taking a close look at the collection policies of the repository can indicate to what degree you can take this on faith or not. Sometimes a text comes with documentation, which is in itself a positive indicator that its creator(s) have taken the work seriously; it may also relate how carefully the transcription was created and proofed.
In a different category from plain errors in transcription, sometimes you may see "acquired errors" in a text. These usually take the form of extra non-information, "garbage" created when a text has been converted from one format into another, changing what is good (but perhaps irrelevant) encoding in one format into meaningless hash in another. This might warn you off a text--it's a sign a text hasn't been proofed and may well have other problems. Cleaning up the traces of earlier encoding is sometimes possible, but often tedious and more work than it is worth.
Another problem bearing on quality of transcription has to do with the way in which non-standard (non-ASCII) characters are represented. This will have a great deal to do with the text format itself and the application, if any, on which it is based. To the extent that the text uses "non-standard" characters ("non-standard" is shorthand for saying "fairly uncommon in American English and therefore not well supported by standards in the computer industry"; at the near end of the spectrum will be a language using the Latin alphabet with diacritical marks, such as French or Spanish; Chinese, Japanese or Arabic will be at the far end), there will generally be a trade-off between a standard file type such as ASCII, which offers a generically useful text but represents such writing systems poorly, and more proprietary systems, which may handle the writing system well but which suffer from "lock-in" of the data. Many languages have developed formal or informal standards for transcription into ASCII; in such cases, the questions are how workable these methods are in practice, and how closely the text at hand conforms to them. The best guides for dealing with these issues will be found within the community of users who have to grapple with the problems of representing a particular character set.
SGML has a way of referring to characters outside its base character set (which is usually, though not necessarily, ISO 646, i.e. ASCII), but depends on application programs to convert these references (called "text entities") into representations (the characters themselves, printed or displayed). This is the most workable standards-based solution, at least until the emergence of Unicode, the international standard encoding for character sets, and the integration of Unicode support into SGML systems.
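The conversion step that SGML leaves to application programs can be sketched in Python. The sketch below uses `html.unescape` from the standard library, which knows the named character references defined for HTML; those names are used here only as a convenient stand-in for whatever entity set a given SGML DTD actually declares. The sample strings are invented for the example.

```python
import html

# An all-ASCII source string that refers to three non-ASCII characters by name.
encoded = "caf&eacute; &mdash; r&eacute;sum&eacute;"

# Resolve the references into the characters themselves.
decoded = html.unescape(encoded)
print(decoded)

# The reverse direction: force a string back into pure ASCII, replacing each
# non-ASCII character with a numeric character reference.
ascii_bytes = "caf\u00e9".encode("ascii", "xmlcharrefreplace")
print(ascii_bytes)  # b'caf&#233;'
```

The stored text never leaves ASCII; whether the reader sees the accented letters depends entirely on the program doing the display.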
Another transcription issue is how well an electronic text serves to represent the structure and organization of the original, such elements of the text as lines of verse or stanzas, paragraphs or chapters--since the text stream itself, the string of ASCII characters, commonly does not in itself represent this "meta-information." If the text is not encoded (if there is no markup), such information must be represented through other means such as systematic uses of spacing and typographic indicators. It is essential that such indicators be implemented with perfect consistency, or they fail to be useful for any application beyond reading the plain text. If they are consistent, they may potentially be converted automatically into markup by a text-processing script (consult your local hacker). Typographic markup is by nature ambiguous, however, and the text may well have to be encoded by hand.
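A text-processing script of the kind mentioned might look like the following Python sketch. It assumes one specific typographic convention, namely that stanzas are separated by blank lines and each verse line sits on its own line; the tag names `stanza` and `line` are hypothetical, and the sketch does not escape characters such as "<" or "&" that might occur in the text itself.

```python
import re


def verse_to_markup(plain: str) -> str:
    """Wrap blank-line-separated stanzas in <stanza> and each line in <line>."""
    out = []
    # One or more blank (or whitespace-only) lines end a stanza.
    for block in re.split(r"\n\s*\n", plain.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        out.append("<stanza>")
        out.extend(f"  <line>{ln}</line>" for ln in lines)
        out.append("</stanza>")
    return "\n".join(out)


plain = "Line one\nLine two\n\nLine three\n"
marked = verse_to_markup(plain)
print(marked)
```

If the transcriber sometimes used a blank line for a page break rather than a stanza break, the script would mark up both the same way, which is why consistency in the plain text matters so much.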
A good plain ASCII text is recognizable to the experienced user by the consistency of its handling of such classic problems as the representation of structure and non-standard characters. One character to watch out for in particular is the em-dash "--", which is not included in 7-bit ASCII. The savvy creator of an ASCII text may represent the em-dash with two hyphens surrounded by spaces: " -- ". This is convenient because it assures a program which performs automated processing of the text (such as a concordancer or text analysis program) will not treat two words with an em-dash between them as one long word.
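The effect of the surrounding spaces is easy to see with a very naive word splitter, sketched here in Python. Real concordancers tokenize in more sophisticated ways; splitting on whitespace alone is an assumption made to keep the example short.

```python
def naive_words(text: str) -> list[str]:
    """Split on whitespace only, as a very simple concordancer might."""
    return text.split()


print(naive_words("the sea--the sea"))    # ['the', 'sea--the', 'sea']
print(naive_words("the sea -- the sea"))  # ['the', 'sea', '--', 'the', 'sea']
```

Without the spaces, "sea--the" is counted as a single word; with them, the dash becomes a separate token that is easy to discard.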
If you are looking at an SGML text in a browser, on the other hand, your em-dash may appear as a real em-dash "--"; if you are looking at the source code you may see an ASCII reference to the character, the text entity "&mdash;". That is, if there is good quality text encoding (markup), problems of representing non-standard characters, or structure, can take care of themselves to whatever extent the text encoding provides solutions. Since the quality of encoding is essentially dependent on how useful it is to you, the questions become highly relative to your situation. You will want to ask:
When not immediately evident on inspection, all these will become clear as you work with a text.
One question you will want to ask when you locate a text is whether and how it is authenticated. When an electronic text carries the imprimatur of some reputable person or institution, this in itself serves as some indication and assurance of quality.
This means looking for information about the text, what is called "metadata" (data about the data) in the form of documentation within or alongside the text (for example in a text file, often called "readme," which sits in the subdirectory with the file on a Unix system). In the case of Humanities texts, the Text Encoding Initiative has taken a major step forward by including provision for a "TEI Header," a required element within every TEI SGML text, designed specifically for purposes of authentication: the TEI Header documents a text's sources, creator(s), sponsor(s), conditions of use, methods of creation and encoding, revision history and other facts important to the user of a text. Any text encoded in TEI SGML should contain its authentication within it.
Many texts available electronically, but by no means all, are in the public domain. Legally, even if a work has itself passed into the public domain, its encoding can be protected by copyright. The site where you find a text will usually provide guidance as to its distribution and use policies. The application of copyright law to networked information is a vexed area not yet much addressed in the law or the courts; so be judicious, and if copyright protection is claimed for a text (for example on a web page), be careful that your use of it falls clearly within "fair use."