- Open Source Software, Meta-data, and Taxonomy
Melanie Gardner, National Agricultural Library, Mgardner@nal.usda.gov
John Kane, National Agricultural Library, Jkane@nal.usda.gov
Tim Lynch, Cornell University, Tim.Lynch@cornell.edu
- Case Study: AgNIC
Agricultural Networked Information Center
-Objectives for AgNIC
-Requirements to meet those objectives
-Role of Open Source Software, Metadata, and Taxonomy in meeting requirements
and attaining our objectives
- AgNIC - The Organization
-An alliance of land grant institutions, cooperative extension agencies
and other related agricultural institutions
-Goal: Provide comprehensive network access to agricultural resources
- AgNIC - The Service
-A web portal to all things agricultural
-www.agnic.org
- Power Point Slide
(see slide graphic)
- Objectives for the new AgNIC
-Browseable hierarchy of subject categories
-Advanced search options (by language, geopolitical region, etc.)
-Increase participation by alliance members
- Adding resources
- Extending functionality
- Administration
- Key Requirements for New AgNIC
-Well-structured information
-"Tagged" elements: title, author, language, geopolitical region,
subject term, etc.
-Infrastructure designed to keep pace with Web growth; flexible to meet
evolving alliance member needs
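As a minimal sketch of what "tagged" elements could look like in practice, the record below uses the element names listed on this slide. The class name, the `matches` helper, and the sample values are hypothetical and are not part of AgNIC.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResourceRecord:
    """One information resource described by tagged elements (illustrative only)."""
    title: str
    author: str
    language: str
    geopolitical_region: str
    subject_terms: List[str] = field(default_factory=list)

    def matches(self, **criteria: str) -> bool:
        """Return True when every requested element holds the requested value."""
        for element, wanted in criteria.items():
            value = getattr(self, element)
            if isinstance(value, list):
                if wanted not in value:
                    return False
            elif value != wanted:
                return False
        return True


# Hypothetical sample record; the values are invented for illustration.
record = ResourceRecord(
    title="Sugarbeet Production Guide",
    author="Example Extension Service",
    language="en",
    geopolitical_region="United States",
    subject_terms=["sugarbeets"],
)
print(record.matches(language="en", subject_terms="sugarbeets"))
```

Because each element is a named field, an advanced search by language or region reduces to a comparison on that field rather than a free-text match.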
- View from 10,000 feet
-We need standards:
-Comprehensive description of overall subject domain
-Consistent methods to describe each information resource
-System built on extensible software base:
- Responsive to alliance needs
- Configurable to information standards set by alliance
- View from 5,000 Feet
-We need:
-A taxonomy of our subject domain
-A robust and extensible metadata description for our information resources
-A system based on Open Source Software
- Power Point Slide
The standards for administering the information provided by AgNIC
must themselves be administered by alliance members.
The software that forms the foundation of AgNIC must, likewise, be open
to configuration by alliance members.
- Open Source Software
-"Give away the recipes; open a restaurant"
-Software source code has traditionally been seen as a crown jewel,
to be closely guarded
- Open Source Examples
-Microsoft's Internet Explorer? - no
-Apache Web Server? - yes
-Windows? - no
-Linux? - yes
-ROADS (Resource Organisation
And Discovery in Subject-based Services)?- yes
- Open Source - Is it any good?
-Highly reliable
- Sendmail
- Internet Domain Name System
-Peer reviewed by a "thousand eyes"
- Quick bug fixes
- Highly secure
- Open Source - Is it free?
-Software purchase cost is typically a small percentage of total cost
of development
-Open Source Software is more about reducing risk and crafting systems
that meet users' needs
- For more on Open Source
-Open Source: Software Gets Honest www.opensource.org
-Eric Steven Raymond's Home Page www.tuxedo.org/~esr/
- Meta-data
- What is it #1?
-Descriptive data that provides information about or documentation of
other data managed within an application or environment
-Data that describes other data
- What is it #2?
-When written as an unhyphenated word with an initial capital letter, it is
a registered trademark.
Jack E. Myers coined the term in 1969. After a search turned up no use
of either "metadata" or "meta data", he used the
term in a 1973 product brochure and registered it as a U.S. trademark.
It refers to products developed by his Metadata Company. http://wombat.doc.ic.ac.uk/foldoc/index.html
-Data about data
- Example
-Card index catalog
-Citations
-Reference or bibliography for a document
-Catalog of videos
-Phone book
-Inventory records
-Map notation (e.g. scale)
-Medical records (?)
- Types
-Descriptive
-Structural
- navigation and presentation
-Administrative
- info management
- date and method of creation and ownership or access authorization
- Web Problems up to now:
-The glut of information on the web is making meta-data part of
everyone's vocabulary
-But when it's used it's often not used consistently or reliably
- not a concern for most authors
- or used in a very superficial way
- no standard formats or semantics
- used to promote rather than to describe
- Web Solutions:
-Processes
- creation process (authors) .. too optimistic?
- management process (info managers) ... too expensive?
- archiving process (libraries) .. too late?
-Crosswalks
- the equating of one set of meta-data with another
- because terminology is often discipline-specific, it can be difficult
to come to agreement on definitions.
-Standards
-RDF and Handles
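The crosswalk idea in the list above can be sketched as a lookup table that equates the fields of one meta-data set with the elements of another. The example below maps a small, illustrative subset of MARC field tags to Dublin Core element names; it is not a complete or authoritative crosswalk, and the sample record (including the local tag "999") is invented.

```python
# Illustrative subset of a MARC-to-Dublin-Core crosswalk (an assumption for
# this sketch; a real crosswalk covers many more fields and subfields).
MARC_TO_DC = {
    "100": "creator",      # main entry, personal name
    "245": "title",        # title statement
    "520": "description",  # summary note
    "650": "subject",      # subject added entry, topical term
}


def crosswalk(marc_record: dict) -> dict:
    """Translate {MARC tag: [values]} into {Dublin Core element: [values]}."""
    dc_record: dict = {}
    for tag, values in marc_record.items():
        element = MARC_TO_DC.get(tag)
        if element is None:
            continue  # no equivalent in this subset, so the field is dropped
        dc_record.setdefault(element, []).extend(values)
    return dc_record


# Hypothetical sample record.
sample = {
    "245": ["Sugarbeet Production Guide"],
    "650": ["Sugarbeets", "Irrigation"],
    "999": ["local note"],
}
dc = crosswalk(sample)
print(dc)
```

Fields without an agreed equivalent are silently lost here, which is the practical face of the difficulty noted above in agreeing on definitions across disciplines.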
- Meta-data review
-2 resources available on the web:
- Meta-data Standards
-Directory Interchange Format (DIF)
-Digital Object Identifier (DOI)
-Dublin Core
-Encoded Archival Description (EAD)
-Federal Geographic Data Committee (FGDC)
-Government Information Locator Service (GILS)
-Internet Anonymous FTP Archive (IAFA/WHOIS++ Templates)
-Instructional Management System (IMS)
-Machine Readable Catalogue Format (MARC)
-Open Archives Metadata Set (OAMS)
-Platform for Internet Content Selection (PICS)
-Summary Object Interchange Format (SOIF)
-Text Encoding Initiative (TEI)
-Handle System/RDF
- Directory Interchange Format
-The Directory Interchange Format (DIF) is a de facto standard with
five mandatory fields and 28 optional fields. The DIF is compatible
with the U.S. federally mandated Federal Geographic Data Committee's
(FGDC) Content Standard for Digital Geospatial Metadata (CSDGM).
http://gcmd.gsfc.nasa.gov/difguide/difman.html
- Digital Object Identifier
-The DOI is an identification system for intellectual property in
the digital environment. Developed in 1997 by the International DOI
Foundation on behalf of the Association of American Publishers, its
goals are to provide a framework for managing intellectual content,
link customers with publishers, facilitate electronic commerce, and
enable automated copyright management. It uses a persistent identifier
resolution system called the Handle System developed by CNRI. http://www.doi.org/index.html
- Dublin Core
-'Dublin Core' is shorthand for the Dublin Metadata Core Element
Set, which is a core list of 15 metadata elements agreed at the OCLC/NCSA
Metadata Workshop in March 1995. The workshop was organised by OCLC
and the National Center for Supercomputing Applications (NCSA) to promote
development of a metadata record that describes networked electronic
information. The Dublin Core is a simple information resource description
that provides a basis for semantic interoperability between other, often
more complicated, formats. http://purl.oclc.org/dc/
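As a hedged illustration of a simple Dublin Core description, the sketch below lists the 15 element names and renders a description as HTML meta tags using the common "DC." name prefix. The function name and the sample values are assumptions made for this example.

```python
from html import escape

# The 15 elements of the Dublin Core element set.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def dc_meta_tags(description: dict) -> list:
    """Render {element: value} as HTML <meta> tags, rejecting unknown elements."""
    tags = []
    for element, value in description.items():
        if element not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {element}")
        tags.append(
            f'<meta name="DC.{element}" content="{escape(value, quote=True)}">'
        )
    return tags


# Hypothetical description of a web resource.
tags = dc_meta_tags({"title": "AgNIC", "language": "en"})
print("\n".join(tags))
```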
- Encoded Archival Description
-EAD consists of an SGML DTD, a Tag Library, Guidelines for its
use, and examples. The Library of Congress Network Development/MARC
Standards Office acts as the maintenance agency and the Society of American
Archivists is the owner of the emerging standard, and is responsible
through a committee representing the archival community for ongoing
oversight and development. It was developed for encoding the descriptive
guides, often called “finding aids”, used with archives and manuscript collections. http://lcweb.loc.gov/ead/
- Federal Geographic Data Committee
-The Federal Geographic Data Committee (FGDC) initiated work in
June 1992 on a common set of terminology and definitions for the documentation
of geospatial data. The resulting standard was approved by the committee
in June 1994 as the Content Standards for Digital Geospatial Metadata.
Strictly speaking, the name of the format is the Content Standards for
Digital Geospatial Metadata (CSDGM); however, it is more commonly referred
to as the FGDC standard. In 1995 a clearinghouse was established through
which geospatial data is made available to the public. http://www.fgdc.gov/
- Government Information Locator Service
-GILS was set up by the US Federal Government in order to provide
the general public and its own employees with a means of locating government
information. GILS was intended to force each agency to provide a set
of locators that "together cover all of its information dissemination
products" (Executive Office of the President, Office of Management
and Budget, OMB Bulletin no. 95-01, Dec. 7, 1994). http://www.gils.net
- Internet Anonymous FTP Archive (IAFA/WHOIS++ Templates)
-IAFA template formats were drawn up for the various categories
of information present on FTP archives. The recently developed directory
service software, whois++, offers the possibility of searching across
multiple databases. Experimental work is being done using the Common
Indexing Protocol (CIP), which gathers together a 'centroid' or summary
from a number of databases to form an 'index server'. The index server
contains an index of all unique attribute values contributed by the
centroids, and searches can be referred from one index server to another
by interlinking the servers in a mesh. The underlying philosophy is
that it must be the information providers who create metadata records
if indexing of the Internet is to be a viable proposition. http://www.roads.lut.ac.uk
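The centroid and index server idea described above can be sketched in a few lines. This is a simplification for illustration only: it does not follow the WHOIS++ or CIP wire formats, and the class, function, and database names are hypothetical.

```python
from collections import defaultdict


def build_centroid(records: list) -> dict:
    """Summarise a database as the set of unique values seen per attribute."""
    centroid = defaultdict(set)
    for record in records:
        for attribute, value in record.items():
            centroid[attribute].add(value.lower())
    return dict(centroid)


class IndexServer:
    """Holds centroids from several databases and refers queries to them."""

    def __init__(self) -> None:
        self.centroids: dict = {}

    def register(self, database_name: str, centroid: dict) -> None:
        self.centroids[database_name] = centroid

    def refer(self, attribute: str, value: str) -> list:
        """Return the databases whose centroid contains the attribute value."""
        return sorted(
            name
            for name, centroid in self.centroids.items()
            if value.lower() in centroid.get(attribute, set())
        )


# Two hypothetical databases with invented records.
index = IndexServer()
index.register("db_a", build_centroid([{"subject": "Maize"}, {"subject": "Wheat"}]))
index.register("db_b", build_centroid([{"subject": "Wheat"}, {"subject": "Dairy"}]))
print(index.refer("subject", "wheat"))
print(index.refer("subject", "maize"))
```

The index server never stores the records themselves, only the unique values, so a query is answered with a referral to the databases worth searching.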
- Instructional Management System (IMS)
-IMS was initiated with EDUCAUSE (EDUCOM). IMS Global Learning Consortium,
Inc. is developing and promoting open specifications for facilitating
online distributed learning activities such as locating and using educational
content, tracking learner progress, reporting learner performance, and
exchanging student records between administrative systems. http://www.imsproject.org/aboutims.html
- Machine Readable Catalogue Format (MARC)
-MARC originated in the 1960s as a means of exchanging library catalogue
records. It was in response to the need for a standardized format for
co-operating libraries to exchange and share catalogue records. It also
met the requirements of national bibliographies for a format for their
printed bibliographies, and it was used by bibliographic agencies for
their supply of records to libraries. As library systems became computerized,
MARC was used in library automation software as the basis for manipulating
library records for display and indexing. http://lcweb.loc.gov/marc/
- Open Archives Metadata Set (OAMS)
-The OAMS is a simple set of eight elements announced as part of
the Santa Fe Convention in 1999 by the Open Archives Initiative
(OAI). The metadata set supports the OAI's stated goals to promote
and support technical solutions, especially in the area of interoperability,
for organizations that wish to establish and maintain open e-print archives.
http://www.openarchives.org/
- Platform for Internet Content Selection (PICS)
-Specifications which enable people to distribute metadata about
content in the form of "labels". Computers can then process the labels
in the background according to settings previously specified by the
user, filtering out undesirable material or directing users to sites
that may be of special interest to them. The PICS specification was
originally designed to allow parents and teachers to screen out materials
unsuitable for children using the Internet. Rather than simply censoring
the information itself, as various legislative bodies have suggested,
PICS gives responsibility of control to users. http://www.w3.org/PICS/
- Summary Object Interchange Format (SOIF)
-SOIF was designed as part of the Harvest Architecture developed
at the University of Colorado at Boulder. Records in SOIF are designed
to be generated by Harvest gatherers and then used for user searches
by Harvest brokers. In March 1996, Netscape Communications announced
that they were also going to use SOIF in their catalog server product
and a number of other search engine manufacturers are said to be looking
at supporting it. (Note: there is no further development of the Harvest
search engine.) http://www.tardis.ed.ac.uk/harvest/
- Text Encoding Initiative (TEI)
-TEI was published in 1994 and consists of a set of generic guidelines
for the representation of textual materials in electronic form (SGML/XML).
TEI is sponsored by the Association for Computers and the Humanities,
the Association for Computational Linguistics, and the Association for
Literary and Linguistic Computing. TEI text must be preceded by a TEI
header that was formulated as part of the project by the Committee on
Text Documentation comprising librarians and archivists from Europe
and North America. The overall layout is grounded in a cataloguing tradition.
Headers can be used to describe networked resources which are not
necessarily themselves TEI encoded. http://www.tei-c.org/
- Commonalities
-All are some form of metadata implementation for web resources
-Most if not all are looking at some sort of interoperability or crosswalking
- And most are doing it by mapping to and from the Dublin Core.
-Most if not all are using XML.
-Each is made more useful by the Handle System (or something like it)
and RDF.
- Handle system
“The Handle System® (described as a confederated name service) is a
distributed computer system which stores names, or handles, of digital
items and which can quickly resolve those names into the information
necessary to locate and access the items. It was designed by CNRI
(Corporation for National Research Initiatives) as a general purpose
global system for the reliable management of information on networks
such as the Internet over long periods of time”. The “handle” consists
of two parts: a naming authority and a
unique ID. Information associated with the system is changed as needed
to reflect the current state of the identified resource without changing
the handle, thus allowing the name of the item to persist. http://www.handle.net/
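The two-part structure of a handle, and the way the name persists while the associated information changes, can be illustrated with a small sketch. The handle value and URLs are invented, and the in-memory dictionary only stands in for the Handle System; it is not the actual resolution protocol.

```python
from typing import NamedTuple


class Handle(NamedTuple):
    naming_authority: str
    local_id: str


def parse_handle(handle: str) -> Handle:
    """Split a handle at the first '/' into naming authority and unique ID."""
    authority, separator, local_id = handle.partition("/")
    if not separator or not authority or not local_id:
        raise ValueError(f"not a two-part handle: {handle!r}")
    return Handle(authority, local_id)


# Hypothetical handle and locations, for illustration only.
handle = "1234.5/example-report"
locations = {handle: "http://old.example.org/report.html"}

before = locations[handle]
# The resource moves: only the stored information changes, not the handle.
locations[handle] = "http://new.example.org/reports/1.html"
after = locations[handle]

print(parse_handle(handle), before, after)
```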
- The Resource Description Framework (RDF)
RDF is based on the assumption that not everyone on the Web is going
to use the same meta-data. It provides a standard way to represent metadata
in the form of statements about properties and relationships of almost
anything, provided it has a Web address. RDF provides a framework in
which vocabularies can be developed to suit specific needs. The RDF
language allows each document containing metadata to clarify which vocabulary
is being used by assigning each vocabulary a Web address. One of the
best-known schemas is the Dublin Core. http://www.w3.org/RDF/
- RDF
-RDF is an attempt to use those commonalities to make meta-data sets
interoperable.
-RDF model.. Simple 3-part record .. and a vocabulary
- Resource > Property Type > Value
- Vocabulary .. any meta-data set
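The 3-part record above (Resource > Property Type > Value) can be sketched as plain tuples, with each vocabulary identified by a Web address. The Dublin Core namespace address is the published one; the second vocabulary, the helper function, and the sample values are hypothetical.

```python
# Each vocabulary is identified by a Web address.
DC = "http://purl.org/dc/elements/1.1/"   # Dublin Core element set
AG = "http://example.org/agvocab/"        # hypothetical local vocabulary

# Statements as (resource, property type, value); values are illustrative.
statements = [
    ("http://www.agnic.org/", DC + "title", "AgNIC"),
    ("http://www.agnic.org/", DC + "language", "en"),
    ("http://www.agnic.org/", AG + "cropFocus", "sugarbeets"),
]


def values(resource: str, property_type: str) -> list:
    """Return every value stated for a property type of a resource."""
    return [
        value
        for subject, prop, value in statements
        if subject == resource and prop == property_type
    ]


print(values("http://www.agnic.org/", DC + "title"))
print(values("http://www.agnic.org/", AG + "cropFocus"))
```

Because every property type carries its vocabulary's address, statements from different meta-data sets can sit side by side without their element names colliding.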
- RDF
(see slide graphic)
- Suggestions
-Begin using a meta-data standard that best fits the community you
serve.
-If there isn't one, or the document is for web access, use the Dublin
Core.
-If you're not using Dublin Core, be aware of how the format you are
using "maps" to the Dublin Core.
-Begin learning XML if you don't know it already.
-Take advantage of automatic meta-data generating tools like http://www.ukoln.ac.uk/metadata/dcdot/
- Taxonomy
You've got data... can anyone else find it?
- Major Points Today
-PROBLEM: Information retrieval
-SOLUTION: Controlled vocabulary
-AgNIC Prototype and controlled vocabulary
- Information Retrieval Problems
-Synonyms - different words with same meaning
- spelling variants
- acronyms
-Sugar Beet vs. Sugarbeet vs. Sugar Beets vs. Sugarbeets vs. Beta vulgaris
-GIS vs. geographical information system vs. geographic information system
- Information Retrieval Problems
-Homographs - words with same spelling and different meanings
- Tillers (machinery) vs. Tillers (plant part)
- Lime (fruit) vs. Lime (soil amendment)
- Turkey (animal) vs. Turkey (meat) vs. Turkey (country)
- What is a controlled vocabulary?
A controlled vocabulary is a set of allowed terms that can be used
to describe the subject content of a resource.
- Controlled Vocabularies
-List of terms
-Thesaurus
- NISO Z39.19 Thesaurus
-One term represents one concept
-Meaning of term is clear
-Hierarchical structure
-Synonyms are linked
-Related terms, Scope Notes, Definitions
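The thesaurus features listed above (one term per concept, hierarchy, linked synonyms, related terms, scope notes) can be modelled with a small data structure. This is a sketch only: the choice of "sugarbeets" as the preferred descriptor, the hierarchy, and the scope note are assumptions made for the example, not entries from the AgNIC prototype.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Term:
    """A descriptor: one term standing for one concept."""
    name: str
    broader: Optional[str] = None          # hierarchical link to a broader term
    related: List[str] = field(default_factory=list)
    scope_note: str = ""


# Hypothetical descriptors and hierarchy.
descriptors: Dict[str, Term] = {
    "crops": Term("crops"),
    "root crops": Term("root crops", broader="crops"),
    "sugar": Term("sugar"),
    "sugarbeets": Term(
        "sugarbeets",
        broader="root crops",
        related=["sugar"],
        scope_note="Use for the crop; illustrative note only.",
    ),
}

# Synonyms (non-descriptors) linked to the descriptor to use instead.
use_references: Dict[str, str] = {
    "sugar beet": "sugarbeets",
    "beta vulgaris": "sugarbeets",
}


def preferred(term: str) -> Optional[str]:
    """Return the descriptor for a term, following a synonym link if needed."""
    key = term.lower()
    if key in descriptors:
        return key
    return use_references.get(key)


def hierarchy(term: str) -> List[str]:
    """Return the descriptor followed by each successively broader term."""
    path = []
    current = preferred(term)
    while current is not None:
        path.append(current)
        current = descriptors[current].broader
    return path


print(hierarchy("Beta vulgaris"))
```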
- Thesaurus example
(see slide graphic)
- Thesaurus Example
(see slide graphic)
- Thesaurus enables "Smart Searching"
A thesaurus can be a TOOL in the foreground or background of the system
to aid information retrieval; a searcher does not even have to be
aware of the thesaurus component.
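A background use of the thesaurus can be sketched as a lookup that silently replaces the searcher's wording with the descriptor used for indexing. The lead-in terms come from the synonym examples earlier in this section; the preferred descriptors, the document identifiers, and the indexing are invented for illustration.

```python
# Lead-in terms (non-descriptors) pointing to the descriptor used in indexing.
# Which form is preferred is an assumption made for this sketch.
LEAD_IN = {
    "sugar beet": "sugarbeets",
    "sugar beets": "sugarbeets",
    "sugarbeet": "sugarbeets",
    "beta vulgaris": "sugarbeets",
    "gis": "geographic information systems",
    "geographical information system": "geographic information systems",
    "geographic information system": "geographic information systems",
}

# Hypothetical resources indexed with descriptors; homographs carry qualifiers.
documents = {
    "doc1": ["sugarbeets"],
    "doc2": ["geographic information systems"],
    "doc3": ["lime (fruit)"],
    "doc4": ["lime (soil amendment)", "sugarbeets"],
}


def smart_search(query: str) -> list:
    """Map the query to its descriptor, then match it against the index."""
    term = query.strip().lower()
    descriptor = LEAD_IN.get(term, term)
    return sorted(
        doc_id
        for doc_id, subjects in documents.items()
        if descriptor in subjects
    )


print(smart_search("Beta vulgaris"))
print(smart_search("GIS"))
```

The searcher types any variant and gets the same result set, without ever seeing the thesaurus.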
- "Smart Searching"
(see slide graphic)
- "Smart Searching"
(see slide graphic)
- Agricultural Classification Prototype
-16 top headings
-1,961 terms
- 1,098 descriptors (can be used for resource description)
- 863 non-descriptors (lead-in terminology that points to descriptor)
- Power Point Slide
The Gateway to Educational Materials http://www.geminfo.org/