Recently, I read a paper named . Here are some notes about it.
#Abstract
Extraction
- The automatic extraction of metadata and other information from scholarly documents
- In academic digital libraries, search engines, and document management systems
- To allow for the management and categorization of documents and for search to take place
A Web-accessible API can simplify this extraction
by providing a single point of operation for extraction that can be incorporated into multiple document workflows without the need for each workflow to implement and support its own extraction functionality
CiteSeerExtractor
a RESTful API for scholarly information extraction that exploits the fact that there is duplication in scholarly big data and makes use of a near duplicate matching backend
the duplicate document matching results in a difference of 8.46% in the time required to extract header and citation information from approximately 3.5 million documents compared to a baseline.
#Introduction
Scholarly big data
refers to the vast amount of data produced as the result of scholarly undertaking and includes journals, conference proceedings, theses, books, patents and experimental data.
three V’s => big data
- volume
Microsoft Academic contains over 50 million records for academic documents, and about 43% of the articles published between 2008 and 2011 are freely available online
- velocity
as reported in 2010, the annual growth rates of several popular academic databases between 1997 and 2006 ranged from 2.7 to 13.5%
- variety
the different types of scholarly output that are produced
furthermore
- value
- veracity
- viscosity
- vulnerability
services for managing and providing access to scholarly big data
- Google Scholar
- Microsoft Academic
- CiteSeerχ
- the ArXiv
they all rely on automatic metadata extraction.
Web-accessible API => simplify this extraction
- provide a single point of operation that can be incorporated into multiple document and scientific workflows
- allow for easier processing of data
duplication is common in scholarly publications
at big data scale, methods are needed to improve extractor performance
CiteSeerExtractor
a Web service for scholarly information extraction that deals with the issue of big data by storing metadata after it is extracted
when a new paper is submitted, check if it matches with a previously submitted document
The document matching algorithm is able to deal with matches that are not bitwise identical and that might have minor differences.
#Related Work
ParsCit
users submit the plain text of a paper and the service returns the parsed citations
GROBID
a library for extracting metadata from scholarly documents
- header metadata, citation metadata & parse the metadata
includes a RESTful API -> access the service from other programs
- match extracted metadata with Crossref
- if core metadata, such as the title or first author, is matched,
- then the system attempts to retrieve the full publisher metadata.
FreeCite
- based on ParsCit
- users submit a single citation string or list of citation strings and they are parsed and tagged.
CiteSeerExtractor
- provides a generic framework that can easily be extended to allow for additional extractors to be incorporated (Section IV)
none of these services specifically try to address the challenges of big data
by making use of near duplicate matching
Some tools make use of Web services to perform or improve metadata extraction showing how Web services can be incorporated into metadata extraction workflows
- PDFMeat
- Mendeley
- Gao’s similar system
#API Design
Resource Oriented Architecture
CiteSeerExtractor
a RESTful Web service based on the Resource Oriented Architecture (ROA).
benefits:
- being lightweight
- scalable
- easily accessible
ROA’s main concepts:
- resources:
something that is important enough that it is worth being referenced
- identifiers:
a URI that is unique for the resource and that allows one of the representations of the resource to be accessed, where a representation of a resource is some view of that resource
- representations of resources
- the links between resources
ROA’s main properties:
- addressability: information is exposed through URIs
- statelessness: HTTP requests are independent of each other and can happen in isolation
- connectedness: there are links between content
- a uniform interface: HTTP provides a uniform interface
- Resources
Documents (PDF, PS, TXT)
- Submit a POST request to the extractor URL
- Create a new document resource in CiteSeerExtractor
- the text from the document is automatically extracted (PDFBox for PDF, ps2txt for PS)
- returns different status codes (201 on success, with an XML/JSON document; 503 on error, over size)
<?xml version="1.0" encoding="UTF-8"?>
<CSXAPIMetadata>
  <file>base url/extractor/token/file</file>
  <header>base url/extractor/token/header</header>
  <citations>base url/extractor/token/citations</citations>
  <body>base url/extractor/token/body</body>
  <text>base url/extractor/token/text</text>
</CSXAPIMetadata>
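As a rough illustration of how a client might consume the response above, here is a minimal Python sketch that parses the XML into a mapping from representation name to URL. The host `example.org` and the token `abc123` are made up for the example, and `parse_resource_links` is a hypothetical helper, not part of CiteSeerExtractor.

```python
import xml.etree.ElementTree as ET

# Sample response modelled on the XML above. The host name and token are
# invented for illustration; a real response would carry the server's own.
RESPONSE = """<?xml version="1.0" encoding="UTF-8"?>
<CSXAPIMetadata>
  <file>http://example.org/extractor/abc123/file</file>
  <header>http://example.org/extractor/abc123/header</header>
  <citations>http://example.org/extractor/abc123/citations</citations>
  <body>http://example.org/extractor/abc123/body</body>
  <text>http://example.org/extractor/abc123/text</text>
</CSXAPIMetadata>"""


def parse_resource_links(xml_text):
    """Return a dict mapping each representation name to its URL."""
    # Encode to bytes so the XML declaration's encoding is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    return {child.tag: (child.text or "").strip() for child in root}


links = parse_resource_links(RESPONSE)
print(links["header"])
```

Because each representation comes back as a link, a client does not need to know the URL scheme in advance; it can simply follow whichever link it needs.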
- identifiers
a unique and random identifier is assigned when a resource has successfully been created
- violates the ROA practice of having well-named resources
- simplifies the resource naming procedure
- since the resources are for the most part temporary, this was considered a reasonable approach
- representations
different views of a resource
in CiteSeerExtractor, representations are the different types of information extracted from the original document, as well as the document itself
to access a resource in CiteSeerExtractor, an HTTP GET request is made to http://$url/extractor/resource id/representation and, on success, an HTTP 200 OK status code is returned
- file: The original document that was submitted.
- header: The header of the document, including the title, authors, abstract, venue and any other information that may be extracted.
- citations: The citations extracted from the document.
- body: The main body text of the document, excluding the citations.
- text: The full text of the document as extracted by an appropriate text extraction tool.
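The URL pattern for representations can be sketched in Python as follows. This is only an illustration using the standard library: `representation_url` and `fetch_representation` are hypothetical helper names, and the base URL in the usage line is a placeholder rather than a real CiteSeerExtractor endpoint.

```python
from urllib.parse import quote
from urllib.request import urlopen

# The five representation names listed in the notes above.
REPRESENTATIONS = ("file", "header", "citations", "body", "text")


def representation_url(base_url, resource_id, representation):
    """Build <base>/extractor/<resource id>/<representation>."""
    if representation not in REPRESENTATIONS:
        raise ValueError(f"unknown representation: {representation!r}")
    resource = quote(resource_id, safe="")
    return f"{base_url.rstrip('/')}/extractor/{resource}/{representation}"


def fetch_representation(base_url, resource_id, representation):
    """Issue an HTTP GET and return (status code, body bytes).

    urlopen raises urllib.error.HTTPError for non-2xx responses, so a
    normal return here corresponds to the 200 OK case.
    """
    url = representation_url(base_url, resource_id, representation)
    with urlopen(url) as response:
        return response.status, response.read()


# Placeholder host and token, for illustration only.
print(representation_url("http://example.org", "abc123", "citations"))
```

Only `representation_url` is exercised here; `fetch_representation` needs a running server to call.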
Addressability, Statelessness, Connectedness, and Uniformity
HTTP Methods
CiteSeerExtractor supports different HTTP methods and output formats (XML, JSON).
Architecture
- stand-alone
- able to run in isolation
- able to be integrated with a number of services
Figure 2 shows the overarching architecture of CiteSeerExtractor
As can be seen from the figure, the RESTful API is the entry point and communicates directly with the Python Web Server, which is responsible for handling the creation of resources and for serving various representations of those resources. Security and permissions can also be controlled and implemented at the Python Web Server level.
the RESTful API:
handle the creation of resources and serve various representations of those resources.
provides the functionality as described in Section III.
Python Web Server
CiteSeerExtractor is run as a stand-alone Web server and is implemented using the web.py framework.
Extractors
- Text Extractor: PDFBox for pdf, ps2txt for ps
- Citation Extractor: ParsCit
- Header Extractor: based on a tool that classifies various aspects of a header using a support vector machine
- Body Extractor: the body of text, excluding the citations.
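One way to picture a framework where additional extractors can be plugged in is a simple registry keyed by representation name. The Python sketch below is an assumption about how such a design could look, not the paper's implementation: the `register` decorator, the `EXTRACTORS` table, and the toy "References" heuristic are all invented for illustration and stand in for real tools such as ParsCit.

```python
# Hypothetical registry: representation name -> extractor function.
EXTRACTORS = {}


def register(name):
    """Decorator that adds an extractor function to the registry."""
    def decorator(func):
        EXTRACTORS[name] = func
        return func
    return decorator


@register("body")
def extract_body(text):
    # Toy heuristic (an assumption): the body is everything before the
    # last line that starts with "References".
    marker = text.rfind("\nReferences")
    return text if marker == -1 else text[:marker]


@register("citations")
def extract_citations(text):
    # Toy heuristic: each non-empty line after "References" is a citation.
    marker = text.rfind("\nReferences")
    if marker == -1:
        return []
    lines = text[marker:].splitlines()[2:]  # skip "" and the heading line
    return [line for line in lines if line.strip()]


def run_extractor(name, text):
    """Look up an extractor by name and apply it to the document text."""
    return EXTRACTORS[name](text)


sample = "Title\nSome body text.\nReferences\n[1] A. Author. Paper.\n[2] B. Author. Other."
print(run_extractor("citations", sample))
```

With this shape, adding a new extractor is a matter of writing one function and registering it under a new name; the dispatch code does not change.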
File Store
Documents and their associated text representations
Files on the Web server can be accessed, have their permissions configured, or be deleted
##Duplicate Matching Backend
NoSQL backend
Duplicate Matching Backend
The purpose of this backend is to store metadata that has already been extracted and retrieve the metadata if a document submitted is a near duplicate of a document that has previously been submitted.
performing duplicate matching does not have a detrimental effect on the performance of the system.
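The store-then-look-up idea can be sketched in a few lines of Python. This is a deliberately simple stand-in and not the paper's near duplicate matching algorithm: it assumes that lowercasing the text and collapsing punctuation and whitespace before hashing is enough to treat two submissions as the same document, and it uses an in-memory dict where the real system uses a NoSQL backend. `fingerprint` and `MetadataCache` are hypothetical names.

```python
import hashlib
import re


def fingerprint(text):
    """Hash a normalized form of the text (assumed matching rule)."""
    # Lowercase and collapse every run of non-alphanumerics to one space,
    # so whitespace, case and punctuation differences are ignored.
    normalized = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class MetadataCache:
    """In-memory stand-in for the duplicate matching backend."""

    def __init__(self):
        self._store = {}

    def get_or_extract(self, text, extract):
        """Return (metadata, was_cached); run `extract` only on a miss."""
        key = fingerprint(text)
        if key in self._store:
            return self._store[key], True
        metadata = extract(text)
        self._store[key] = metadata
        return metadata, False


cache = MetadataCache()
first = cache.get_or_extract("A  Paper\nBy Someone", lambda t: {"title": "A Paper"})
second = cache.get_or_extract("a paper by someone.", lambda t: {"title": "A Paper"})
print(first[1], second[1])
```

The second submission differs in case, spacing and punctuation, yet it maps to the same key, so the extractor is skipped. Documents with larger differences would need a genuinely fuzzy match, which this sketch does not attempt.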
Near Duplicate Matching Algorithm
