Open Source Software Meta-data and Taxonomy



  1. Open Source Software Meta-data and Taxonomy
    Melanie Gardner, National Agricultural Library, Mgardner@nal.usda.gov
    John Kane, National Agricultural Library, Jkane@nal.usda.gov
    Tim Lynch, Cornell University, Tim.Lynch@cornell.edu

  2. Case Study: AgNIC
    Agricultural Networked Information Center
    -Objectives for AgNIC
    -Requirements to meet those objectives
    -Role of Open Source Software, Metadata, and Taxonomy in meeting requirements and attaining our objectives

  3. AgNIC - The Organization
    -An alliance of land grant institutions, cooperative extension agencies and other related agricultural institutions
    -Goal: Provide comprehensive network access to agricultural resources

  4. AgNIC - The Service
    -A web portal to all things agricultural
    -www.agnic.org

  5. Power Point Slide
    (see slide graphic)

  6. Power Point Slide
    (see slide graphic)

  7. Power Point Slide
    (see slide graphic)

  8. Objectives for the new AgNIC
    -Browseable hierarchy of subject categories
    -Advanced search options (by language, geopolitical region, etc)
    -Increase participation by alliance members
    • Adding resources
    • Extending functionality
    • Administration

  9. Key Requirements for New AgNIC
    -Well-structured information
    -"Tagged" elements: title, author, language, geopolitical region, subject term, etc.
    -Infrastructure designed to keep pace with Web growth; flexible to meet evolving alliance member needs
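
    As a minimal sketch of what "tagged" elements make possible, the Python below filters records by language or region. The Resource class, the search helper, and the sample records are hypothetical illustrations and are not taken from AgNIC.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """One catalogued resource, described by tagged elements."""
    title: str
    author: str
    language: str
    region: str
    subjects: list[str] = field(default_factory=list)


def search(resources, **criteria):
    """Return resources whose tagged elements match every criterion."""
    return [
        r for r in resources
        if all(getattr(r, name) == value for name, value in criteria.items())
    ]


# Hypothetical sample records, invented for illustration only.
catalog = [
    Resource("Sugar beet production guide", "A. Author", "en",
             "United States", ["sugar beets"]),
    Resource("Manual de riego", "B. Autor", "es", "Mexico", ["irrigation"]),
]

spanish = search(catalog, language="es")
print([r.title for r in spanish])
```

    Because every element is tagged, a search by language or geopolitical region becomes a simple field comparison rather than a guess from free text.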

  10. View from 10,000 feet
    -We need standards:
    -Comprehensive description of overall subject domain
    -Consistent methods to describe each information resource
    -System built on extensible software base:
    • Responsive to alliance needs
    • Configurable to information standards set by alliance

  11. View from 5,000 Feet
    -We need:
    -A taxonomy of our subject domain
    -A robust and extensible metadata description for our information resources
    -A system based on Open Source Software

  12. Power Point Slide
    The standards for administering the information provided by AgNIC must themselves be administered by alliance members.
    The software that forms the foundation of AgNIC must, likewise, be open to configuration by alliance members.

  13. Open Source Software
    -"Give away the recipes; open a restaurant"
    -Software source code has traditionally been seen as a crown jewel, to be closely guarded

  14. Open Source Examples
    -Microsoft's Internet Explorer? - no
    -Apache Web Server? - yes
    -Windows? - no
    -Linux? - yes
    -ROADS (Resource Organisation And Discovery in Subject-based Services)?- yes

  15. Open Source - Is it any good?
    -Highly reliable
    • Sendmail
    • Internet Domain Name System
    -Peer reviewed by a "thousand eyes"
    • Quick bug fixes
    • Highly secure

  16. Open Source - Is it free?
    -Software purchase cost is typically a small percentage of total cost of development
    -Open Source Software is more about reducing risk and crafting systems that meet users' needs

  17. For more on Open Source
    -Open Source: Software Gets Honest www.opensource.org
    -Eric Steven Raymond's Home Page www.tuxedo.org/~esr/

  18. Meta-data

  19. What is it #1?
    -Descriptive data that provides information about or documentation of other data managed within an application or environment
    -Data that describes other data

  20. What is it #2?
    -When it's an unhyphenated word with initial capital letter it's a registered trademark.
    Jack E. Myers coined the term in 1969. After a search turned up no use of either "metadata" or "meta data", he used the term in a 1973 product brochure and made it a registered U.S. trademark. It refers to products developed by his Metadata Company. http://wombat.doc.ic.ac.uk/foldoc/index.html

  21. Example
    -Card index catalog
    -Citations
    -Reference or bibliography for a document
    -Catalog of videos
    -Phone book
    -Inventory records
    -Map notation (e.g. scale)
    -Medical records (?)

  22. Types
    -Descriptive
    • discovery and retrieval
      • MARC, Dublin Core
    -Structural
    • navigation and presentation
      • DTD..relationships
    -Administrative
    • info management
      • date and method of creation and ownership or access authorization

  23. Web Problems up to now:
    -The glut of information on the web is making meta-data part of everyone's vocabulary
    -But when it's used it's often not used consistently or reliably
    • not a concern for most authors
    • or used in a very superficial way
    • no standard formats or semantics
    • used to promote rather than to describe

  24. Web Solutions:
    -Processes
    • creation process (authors) .. too optimistic?
    • management process (info managers) ... too expensive?
    • archiving process (libraries) .. too late?
    -Crosswalks
    • the equating of one set of meta-data with another
    • because terminology is often discipline-specific, it can be difficult to agree on definitions
    -Standards
    -RDF and Handles
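
    A crosswalk, as listed above, can be pictured as a lookup table from one element set to another. The sketch below is only an illustration: the left-hand element names belong to a made-up local schema, and the right-hand names are Dublin Core elements.

```python
# Hypothetical crosswalk: element names of an invented local schema
# mapped to Dublin Core element names.
LOCAL_TO_DC = {
    "doc_title": "title",
    "written_by": "creator",
    "topic": "subject",
    "lang": "language",
    "issued": "date",
}


def crosswalk(record, mapping):
    """Translate a record's element names; collect what cannot be mapped."""
    converted, unmapped = {}, {}
    for name, value in record.items():
        if name in mapping:
            converted[mapping[name]] = value
        else:
            unmapped[name] = value
    return converted, unmapped


# Hypothetical record, invented for illustration only.
local_record = {
    "doc_title": "Lime application rates",
    "written_by": "A. Author",
    "topic": "soil amendments",
    "shelf_code": "X-12",  # no equivalent in this mapping
}

dc_record, leftovers = crosswalk(local_record, LOCAL_TO_DC)
print(dc_record)
print(leftovers)
```

    The unmapped leftovers are the practical face of the definition problem: elements that one community uses and another has no agreed term for.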

  25. Meta-data review
    -2 resources available on the web:

  26. Meta-data Standards
    -Directory Interchange Format (DIF)
    -Digital Object Identifier (DOI)
    -Dublin Core
    -Encoded Archival Description (EAD)
    -Federal Geographic Data Committee (FGDC)
    -Government Information Locator Service (GILS)
    -Internet Anonymous FTP Archive (IAFA/WHOIS++ Templates)
    -Instructional Management System (IMS)
    -Machine Readable Catalogue Format (MARC)
    -Open Archives Metadata Set (OAMS)
    -Platform for Internet Content Selection (PICS)
    -Summary Object Interchange Format (SOIF)
    -Text Encoding Initiative (TEI)
    -Handle System/RDF

  27. Directory Interchange Format
    -The Directory Interchange Format (DIF) is a de facto standard with five mandatory fields and 28 optional fields. The DIF is compatible with the U.S. federally mandated Federal Geographic Data Committee's (FGDC) Content Standard for Digital Geospatial Metadata (CSDGM).
    http://gcmd.gsfc.nasa.gov/difguide/difman.html

  28. Digital Object Identifier
    -The DOI is an identification system for intellectual property in the digital environment. Developed in 1997 by the International DOI Foundation on behalf of the Association of American Publishers, its goals are to provide a framework for managing intellectual content, link customers with publishers, facilitate electronic commerce, and enable automated copyright management. It uses a persistent identifier resolution system called the Handle System developed by CNRI. http://www.doi.org/index.html

  29. Dublin Core
    -'Dublin Core' is shorthand for the Dublin Metadata Core Element Set, a core list of 15 metadata elements agreed at the OCLC/NCSA Metadata Workshop in March 1995. The workshop was organised by OCLC and the National Center for Supercomputing Applications (NCSA) to promote development of a metadata record that describes networked electronic information. The Dublin Core is a simple information resource description that provides a basis for semantic interoperability between other, often more complicated, formats. http://purl.oclc.org/dc/

  30. Encoded Archival Description
    -EAD consists of an SGML DTD, a Tag Library, Guidelines for its use, and examples. The Library of Congress Network Development/MARC Standards Office acts as the maintenance agency. The Society of American Archivists owns the emerging standard and is responsible, through a committee representing the archival community, for ongoing oversight and development. EAD was developed for encoding the guides, often called “finding aids”, that describe archives and manuscript collections. http://lcweb.loc.gov/ead/

  31. Federal Geographic Data Committee
    -The Federal Geographic Data Committee (FGDC) initiated work in June 1992 on a common set of terminology and definitions for the documentation of geospatial data. The resulting standard was approved by the committee in June 1994 as the Content Standard for Digital Geospatial Metadata (CSDGM). Strictly speaking that is the name of the format, but it is more commonly referred to as the FGDC standard. In 1995 a clearinghouse was established through which geospatial data is made available to the public. http://www.fgdc.gov/

  32. Government Information Locator Service
    -GILS was set up by the US Federal Government to provide the general public and its own employees with a means of locating government information. GILS was intended to force each agency to provide a set of locators that "together cover all of its information dissemination products" (Executive Office of the President, Office of Management and Budget, OMB Bulletin no. 95-01, Dec. 7, 1994). http://www.gils.net

  33. Internet Anonymous FTP Archive (IAFA/WHOIS++ Templates)
    -IAFA template formats were drawn up for the various categories of information present on FTP archives. The recently developed directory service software, whois++, offers the possibility of searching across multiple databases. Experimental work is being done using the Common Indexing Protocol (CIP), which gathers together a 'centroid', or summary, from each of a number of databases to form an 'index server'. The index server contains an index of all unique attribute values contributed by the centroids, and searches can be referred from one index server to another by interlinking the servers in a mesh. The underlying philosophy is that it must be the information providers who create metadata records if indexing of the Internet is to be a viable proposition. http://www.roads.lut.ac.uk
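
    The centroid idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the whois++ or CIP wire format: the IndexServer class, its method names, and the sample databases are all hypothetical.

```python
from collections import defaultdict


def centroid(records):
    """Summarise a database as the set of unique values per attribute."""
    summary = defaultdict(set)
    for record in records:
        for attribute, value in record.items():
            summary[attribute].add(value.lower())
    return dict(summary)


class IndexServer:
    """Holds centroids and refers a query to the databases that may match."""

    def __init__(self):
        self.centroids = {}

    def add(self, database_name, records):
        self.centroids[database_name] = centroid(records)

    def refer(self, attribute, value):
        """Return names of databases whose centroid contains the value."""
        value = value.lower()
        return sorted(
            name for name, summary in self.centroids.items()
            if value in summary.get(attribute, set())
        )


# Hypothetical databases, invented for illustration only.
server = IndexServer()
server.add("db_a", [{"subject": "Soils", "language": "en"}])
server.add("db_b", [{"subject": "Poultry", "language": "en"}])
print(server.refer("subject", "soils"))
print(server.refer("language", "EN"))
```

    The index server never stores full records; it only knows which databases are worth asking, which is what lets searches be referred across a mesh of servers.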

  34. Instructional Management System (IMS)
    -IMS was initiated with EDUCAUSE (EDUCOM). IMS Global Learning Consortium, Inc. is developing and promoting open specifications for facilitating online distributed learning activities such as locating and using educational content, tracking learner progress, reporting learner performance, and exchanging student records between administrative systems. http://www.imsproject.org/aboutims.html

  35. Machine Readable Catalogue Format (MARC)
    -MARC originated in the 1960s as a means of exchanging library catalogue records. It was in response to the need for a standardized format for co-operating libraries to exchange and share catalogue records. It also met the requirements of national bibliographies for a format for their printed bibliographies, and it was used by bibliographic agencies for their supply of records to libraries. As library systems became computerized, MARC was used in library automation software as the basis for manipulating library records for display and indexing. http://lcweb.loc.gov/marc/

  36. Open Archives Metadata Set (OAMS)
    -The OAMS is a simple set of eight elements announced as part of the Santa Fe Convention in 1999 by the Open Archives Initiative (OAI). The metadata supports the OAI’s stated goals to promote and support technical solutions, especially in the area of interoperability, for organizations that wish to establish and maintain open e-print archives. http://www.openarchives.org/

  37. Platform for Internet Content Selection (PICS)
    -Specifications which enable people to distribute metadata about content in the form of "labels". Computers can then process the labels in the background according to settings previously specified by the user, filtering out undesirable material or directing users to sites that may be of special interest to them. The PICS specification was originally designed to allow parents and teachers to screen out materials unsuitable for children using the Internet. Rather than simply censoring the information itself, as various legislative bodies have suggested, PICS gives responsibility of control to users. http://www.w3.org/PICS/

  38. Summary Object Interchange Format (SOIF)
    -SOIF was designed as part of the Harvest Architecture developed at the University of Colorado at Boulder. Records in SOIF are designed to be generated by Harvest gatherers and then used for user searches by Harvest brokers. In March 1996, Netscape Communications announced that they were also going to use SOIF in their catalog server product and a number of other search engine manufacturers are said to be looking at supporting it. (Note: there is no further development of the Harvest search engine.) http://www.tardis.ed.ac.uk/harvest/

  39. Text Encoding Initiative (TEI)
    -TEI was published in 1994 and consists of a set of generic guidelines for the representation of textual materials in electronic form (SGML/XML). TEI is sponsored by the Association for Computers and the Humanities, the Association for Computational Linguistics, and the Association for Literary and Linguistic Computing. TEI text must be preceded by a TEI header that was formulated as part of the project by the Committee on Text Documentation, comprising librarians and archivists from Europe and North America. The overall layout is grounded in a cataloguing tradition. Headers can be used to describe networked resources which are not necessarily themselves TEI encoded. http://www.tei-c.org/

  40. Commonalities
    -All are some form of metadata implementation for web resources
    -Most if not all are looking at some sort of interoperability or crosswalking
    • And most are doing it by mapping to and from the Dublin Core.
    -Most if not all are using XML.
    -Each is made more useful by the Handle System (or something like it) and RDF.

  41. Handle system
    “The Handle System® (described as a confederated name service) is a distributed computer system which stores names, or handles, of digital items and which can quickly resolve those names into the information necessary to locate and access the items. It was designed by CNRI (Corporation for National Research Initiatives) as a general purpose global system for the reliable management of information on networks such as the Internet over long periods of time”. The “handle” consists of two parts: a naming authority and a unique ID. Information associated with the system is changed as needed to reflect the current state of the identified resource without changing the handle, thus allowing the name of the item to persist. http://www.handle.net/
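
    The two-part structure of a handle, and the way the name persists while its location changes, can be sketched as follows. This is a toy illustration, not the Handle System protocol: the parse_handle helper, the sample handle, the example.org addresses, and the in-memory table are all hypothetical.

```python
from typing import NamedTuple


class Handle(NamedTuple):
    naming_authority: str
    local_id: str


def parse_handle(text):
    """Split 'authority/id' at the first slash; the id may contain slashes."""
    authority, separator, local_id = text.partition("/")
    if not separator or not authority or not local_id:
        raise ValueError(f"not a two-part handle: {text!r}")
    return Handle(authority, local_id)


# Hypothetical resolver table: the handle stays fixed while the
# location stored against it is updated.
locations = {}
handle = parse_handle("12345/report-001")
locations[handle] = "http://old.example.org/report.html"
locations[handle] = "http://new.example.org/docs/report.html"
print(handle.naming_authority, handle.local_id, locations[handle])
```

    Anyone who cited the handle before the move still reaches the resource, because only the stored location changed.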

  42. The Resource Description Framework (RDF)
    RDF is based on the assumption that not everyone on the Web is going to use the same meta-data. It provides a standard way to represent metadata in the form of statements about properties and relationships of almost anything, provided it has a Web address. RDF provides a framework in which vocabularies can be developed to suit specific needs. The RDF language allows each document containing metadata to clarify which vocabulary is being used by assigning each vocabulary a Web address. One of the best-known schemas is the Dublin Core. http://www.w3.org/RDF/

  43. RDF
    -RDF is an attempt to use those commonalities to make meta-data sets interoperable.
    -RDF model.. Simple 3-part record .. and a vocabulary
    • Resource > Property Type > Value
    • Vocabulary .. any meta-data set
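
    The 3-part record can be sketched with plain Python tuples. This is an illustration of the model only, not RDF syntax; the property values and the values helper are assumptions made for the example, and the vocabulary is identified by the Dublin Core element namespace address.

```python
DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element namespace

# Each statement is (resource, property type, value).
# The values are illustrative assumptions, not real catalogue data.
statements = [
    ("http://www.agnic.org/", DC + "title", "AgNIC"),
    ("http://www.agnic.org/", DC + "subject", "agriculture"),
    ("http://www.agnic.org/", DC + "language", "en"),
]


def values(statements, resource, property_type):
    """Return every value stated for a resource and property type."""
    return [v for r, p, v in statements if r == resource and p == property_type]


print(values(statements, "http://www.agnic.org/", DC + "title"))
```

    Because the property type carries the vocabulary's Web address, statements from different meta-data sets can sit side by side without their element names colliding.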

  44. RDF
    (see slide graphic)

  45. Suggestions
    -Begin using a meta-data standard that best fits the community you serve.
    -If there isn't one, or the document is for web access, use the Dublin Core.
    -If you're not using Dublin Core, be aware of how the format you are using "maps" to the Dublin Core.
    -Begin learning XML if you don't know it already.
    -Take advantage of automatic meta-data generating tools like http://www.ukoln.ac.uk/metadata/dcdot/

  46. Taxonomy
    You've got data... can anyone else find it?

  47. Major Points Today
    -PROBLEM: Information retrieval
    -SOLUTION: Controlled vocabulary
    -AgNIC Prototype and controlled vocabulary

  48. Information Retrieval Problems
    -Synonyms - different words with same meaning
    • spelling variants
    • acronyms
    -Sugar Beet vs. Sugarbeet vs. Sugar Beets vs. Sugarbeets vs. Beta vulgaris
    -GIS vs. geographical information system vs. geographic information system

  49. Information Retrieval Problems
    Homographs - words with same spelling and different meanings
    • Tillers (machinery) vs. Tillers (plant part)
    • Lime (fruit) vs. Lime (soil amendment)
    • Turkey (animal) vs. Turkey (meat) vs. Turkey (country)

  50. What is a controlled vocabulary?
    A controlled vocabulary is a set of allowed terms that can be used to describe the subject content of a resource.

  51. Controlled Vocabularies
    -List of terms
    -Thesaurus

  52. NISO Z39.19 Thesaurus
    -One term represents one concept
    -Meaning of term is clear
    -Hierarchical structure
    -Synonyms are linked
    -Related terms, Scope Notes, Definitions
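
    These properties suggest a simple data structure. The sketch below is one hypothetical way to model a thesaurus entry and its synonym links; the Term and Thesaurus classes and the sample hierarchy are assumptions for illustration, not the structure prescribed by Z39.19.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """A descriptor: the one allowed term for one concept."""
    label: str
    scope_note: str = ""
    broader: list[str] = field(default_factory=list)
    narrower: list[str] = field(default_factory=list)
    related: list[str] = field(default_factory=list)
    used_for: list[str] = field(default_factory=list)  # linked synonyms


class Thesaurus:
    def __init__(self, terms):
        self.terms = {t.label.lower(): t for t in terms}
        # Lead-in index: every synonym points at its descriptor.
        self.lead_in = {s.lower(): t.label for t in terms for s in t.used_for}

    def descriptor(self, word):
        """Return the allowed term for a word, or None if unknown."""
        key = word.lower()
        if key in self.terms:
            return self.terms[key].label
        return self.lead_in.get(key)


# Hypothetical entry, invented for illustration only.
thesaurus = Thesaurus([
    Term(
        "sugar beets",
        broader=["root crops"],
        used_for=["sugar beet", "sugarbeet", "sugarbeets", "Beta vulgaris"],
    ),
])
print(thesaurus.descriptor("Sugarbeet"))
```

    Every spelling variant resolves to the same descriptor, so a resource is described once and found under any of its names.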

  53. Thesaurus example
    (see slide graphic)

  54. Thesaurus Example
    (see slide graphic)

  55. Thesaurus enables "Smart Searching"
    A thesaurus can be a TOOL in the foreground or background of the system to aid information retrieval; a searcher does not even have to be aware of the thesaurus component.
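
    A background thesaurus of this kind might work roughly as sketched below: the searcher's word is mapped to its descriptor, narrower terms are added, and the expanded set is matched against indexed resources. The tables, function names, and sample documents are hypothetical and do not describe the AgNIC implementation.

```python
# Hypothetical thesaurus fragments, invented for illustration.
USE = {  # non-descriptor -> descriptor
    "gis": "geographic information systems",
    "geographical information system": "geographic information systems",
    "sugarbeet": "sugar beets",
}
NARROWER = {  # descriptor -> narrower descriptors
    "root crops": ["sugar beets", "carrots"],
}


def expand(query):
    """Return the set of descriptors to search for a user's query."""
    term = USE.get(query.lower(), query.lower())
    expanded, pending = set(), [term]
    while pending:
        current = pending.pop()
        if current not in expanded:
            expanded.add(current)
            pending.extend(NARROWER.get(current, []))
    return expanded


def smart_search(query, documents):
    """Match documents indexed under any expanded descriptor."""
    wanted = expand(query)
    return [title for title, subjects in documents if wanted & set(subjects)]


# Hypothetical indexed documents: (title, subject descriptors).
documents = [
    ("Beet harvest timing", ["sugar beets"]),
    ("Mapping soil surveys", ["geographic information systems"]),
]
print(smart_search("Sugarbeet", documents))
print(smart_search("root crops", documents))
```

    The searcher types whatever word comes to mind; the mapping to descriptors and narrower terms happens without the thesaurus ever being shown.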

  56. "Smart Searching"
    (see slide graphic)

  57. "Smart Searching"
    (see slide graphic)

  58. Agricultural Classification Prototype
    -16 top headings
    -1,961 terms
    • 1,098 descriptors (can be used for resource description)
    • 863 non-descriptors (lead-in terminology that points to descriptor)

  59. Power Point Slide
    The Gateway to Education Materials http://www.geminfo.org/