Research Proposal: Mining the Process of Extracting

Pages: 10 (2900 words)  ·  Bibliography Sources: 10  ·  Level: Doctorate  ·  Topic: Education - Computers


[. . .] Thus, summarized data can be obtained by recognizing keywords in a vast online database. An automated process is needed to extract keywords from news articles. After news articles are collected in HTML from an internet site, a keyword extraction module extracts candidate keywords, and a cross-domain comparison module then performs the keyword extraction. In more detail: the relational database contains tables for 'term occur fact', 'document', 'TFIDF weight' and 'dictionary'. The 'document' table first stores the downloaded news pages, and nouns are then extracted from the documents held in that table (Sergio, 2002).

Next, the 'term occur fact' table is updated with the words appearing in each document. This table is then used to compute TFIDF weights for the words, and the results are stored in the 'TFIDF weight' table. Finally, a candidate keyword list is built from the 'TFIDF weight' table, which ranks the words (Sergio, 2002).
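The TFIDF computation that these tables store can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation; the function name and the plain whitespace tokenization are assumptions.

```python
import math
from collections import Counter

def tfidf_keywords(documents, top_k=5):
    """Rank candidate keywords in each document by TFIDF weight."""
    # Per-document term frequencies (the role of the 'term occur fact' table)
    term_counts = [Counter(doc.lower().split()) for doc in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears
    df = Counter()
    for counts in term_counts:
        df.update(counts.keys())
    keyword_lists = []
    for counts in term_counts:
        total = sum(counts.values())
        # TF * IDF; the result corresponds to the 'TFIDF weight' table
        weights = {t: (c / total) * math.log(n_docs / df[t])
                   for t, c in counts.items()}
        ranked = sorted(weights, key=weights.get, reverse=True)
        keyword_lists.append(ranked[:top_k])  # candidate keyword list
    return keyword_lists
```

Words that occur often in one document but rarely across the collection receive the highest weights, which is exactly why TFIDF ranking surfaces topical keywords rather than common vocabulary.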

Topic tracking follows a given news event through a database of news stories. The grouping of lexically similar terms into so-called lexical chains is known as lexical chaining. A multi-vector topic tracking system extracts significant locations, names, and ordinary terms into separate sub-vectors of the document representation; the sub-vectors are then compared to compute the similarity between two or more documents. A number of characteristics influencing the topic tracking system have been studied. First, a particular attribute, such as words or phrases, must be chosen that is suitable for representing that aspect of a given event (Sergio, 2002).
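The multi-vector comparison can be sketched as per-field cosine similarities that are averaged into one document score. This is a hedged illustration under stated assumptions: the field names ('names', 'terms'), equal field weights, and count-vector representation are all choices of this sketch, not details from the source.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def multi_vector_similarity(doc_a, doc_b, weights=None):
    """Compare documents represented as sub-vectors (e.g. 'names',
    'locations', 'terms') and average the per-field similarities."""
    fields = doc_a.keys() & doc_b.keys()
    weights = weights or {f: 1.0 for f in fields}
    total = sum(weights[f] for f in fields)
    return sum(weights[f] * cosine(Counter(doc_a[f]), Counter(doc_b[f]))
               for f in fields) / total
```

Keeping names and locations in separate sub-vectors means two stories about the same people in the same places score high even when their general vocabulary differs, which is the point of the multi-vector design.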

C. Summarization

Text summarization extracts the worth and significance of a lengthy document. Text summarization software can summarize a lengthy document in roughly the time a human needs just to read its first paragraph. Summarization reduces the length of a text while retaining its sense and meaning. The dilemma for computers is that they do not process the semantics and meanings of words; they only deal with identifying people, places, and times (Haralampos, 2001).

Humans mostly create text summaries by first reading the full text to get an overall idea and then writing a summary focused on its core points. Since computers lack the language capabilities of humans, other methods must be devised. Text summarization tools use sentence extraction to identify, statistically, the sentences that describe the central idea of the text. Position information also serves as an important cue in text summarization (Haralampos, 2001).

Summarization tools may select sentences that follow key phrases such as "in conclusion," because the key points are usually stated there. Headings and subtopic markers are also used by summarization tools to select main points. An example of a text summarization tool is Microsoft Word's AutoSummarize function. Most text summarization tools ask the user to specify the percentage of the text to extract as a summary. Topic tracking and categorization tools use summarization to condense the documents collected on a particular topic. When organizations, medical personnel, or researchers receive enormous numbers of documents in their areas of interest, summarization tools reduce the time needed to sort through them. Ultimately, individuals can access relevant information based on their interests (Haralampos, 2001).
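The statistical sentence extraction described above, combined with a position cue, can be sketched as follows. This is a minimal assumption-laden example: the stop-word list, the first-sentence bonus of 0.5, and the scoring formula are all illustrative choices, not any tool's actual method.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and"}

def summarize(text, n_sentences=2):
    """Extractive summary: score sentences by content-word frequency,
    with a small bonus for the leading (position) sentence."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    words = [w for w in re.findall(r'[a-z]+', text.lower())
             if w not in STOPWORDS]
    freq = Counter(words)
    def score(idx_sent):
        idx, sent = idx_sent
        toks = [w for w in re.findall(r'[a-z]+', sent.lower())
                if w not in STOPWORDS]
        base = sum(freq[w] for w in toks) / (len(toks) or 1)
        return base + (0.5 if idx == 0 else 0.0)  # position bonus
    top = sorted(enumerate(sentences), key=score, reverse=True)[:n_sentences]
    # Present the selected sentences in their original order
    return " ".join(s for _, s in sorted(top))
```

Sentences built from the text's most frequent content words are assumed to carry its central idea, which is the statistical shortcut that stands in for real semantic understanding.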

An automatic summarization process has three steps: (1) the preprocessing step, which produces a well-structured representation of the original text; (2) the processing step, which converts the structured text into a summary structure; and (3) the generation step, which extracts the final summary from the summary structure. Summarization methods are categorized by the level of linguistic space they operate in and fall into two broad groups: (a) shallow approaches, which deal with the syntax and surface representation of the text and extract its salient features in a straightforward way; and (b) deeper approaches, which deal with the semantics of the text and rely on linguistic processing (Liritano and Ruffolo, 2001).
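The three-step pipeline can be made concrete with a deliberately trivial sketch in which each stage is a placeholder: splitting into sentences stands in for preprocessing, length-based ranking for processing, and selection for generation. None of these stage implementations come from the source; only the three-stage shape does.

```python
def preprocess(raw_text):
    # Step 1: produce a structured form of the original text
    # (here, simply a list of sentences)
    return [s.strip() for s in raw_text.split(".") if s.strip()]

def process(sentences):
    # Step 2: convert the structured text into a summary structure
    # (here, sentences ranked by a placeholder criterion: length)
    return sorted(sentences, key=len, reverse=True)

def generate(summary_structure, n=1):
    # Step 3: extract the final summary from the summary structure
    return ". ".join(summary_structure[:n]) + "."

def automatic_summarization(text, n=1):
    return generate(process(preprocess(text)), n)
```

A real shallow system would replace the ranking criterion with frequency or position scores, and a deeper system would replace it with linguistic analysis, but the three-stage skeleton stays the same.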

The preprocessing step of the first approach aims to reduce the dimensionality of the document text and consists of: (i) stop-word elimination - irrelevant common words with no significant meaning, for example "the," "a," etc., are removed from the text; (ii) case folding - converting characters from upper to lower case or vice versa; (iii) stemming - syntactically similar words are aggregated, which serves to obtain the root of each word while emphasizing the semantics. The vector model is the most commonly used text model: once preprocessing is done, each sentence is treated as an N-dimensional vector.
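The three preprocessing operations can be chained into one small function that returns each sentence's sparse term vector. As a caveat, the stop-word list and the crude suffix-stripping stemmer below are illustrative stand-ins; a real system would use a full list and a proper stemmer such as Porter's.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are"}

def naive_stem(word):
    """Deliberately naive suffix stripping (illustrative only; a real
    system would use e.g. the Porter stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess_sentence(sentence):
    """Case folding, stop-word elimination, stemming; the resulting
    Counter is the sentence's sparse N-dimensional term vector."""
    tokens = re.findall(r"[a-z]+", sentence.lower())    # case folding
    tokens = [t for t in tokens if t not in STOPWORDS]  # stop words
    return Counter(naive_stem(t) for t in tokens)       # stemming
```

After this step, "processes" and "processing" collapse onto one dimension of the vector, which is exactly the dimensionality reduction the preprocessing step is meant to achieve.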

MATLAB supports the simulation of fuzzy logic, so it can be used to implement text summarization based on fuzzy logic. First, text characteristics such as sentence length are chosen, and the inputs of the fuzzy system are the degrees to which each sentence exhibits these characteristics. The knowledge base of the system is then supplied with the rules of summarization. A value between zero and one is then obtained at the output for every sentence, based on the sentence's characteristics and the rules in the knowledge base. This output value represents the extent to which a particular sentence matters for the final summary (Liritano and Ruffolo, 2001).
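The fuzzy scoring idea can be illustrated without MATLAB. The sketch below is a toy Mamdani-style system: the two input features, the triangular membership functions, and the two hand-written rules are all assumptions of this example, chosen only to show how rule firing strengths defuzzify into a sentence score in [0, 1].

```python
def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b,
    falling to zero at c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def fuzzy_sentence_score(length_ratio, keyword_ratio):
    """Toy fuzzy inference: fuzzify two sentence features, fire two
    rules, and defuzzify by weighted average to a score in [0, 1]."""
    # Fuzzification of the inputs
    length_medium = tri(length_ratio, 0.2, 0.5, 0.8)
    keywords_high = tri(keyword_ratio, 0.3, 1.0, 1.7)
    keywords_low = tri(keyword_ratio, -0.7, 0.0, 0.7)
    # Knowledge base:
    #   IF length is medium AND keywords are high THEN important (1.0)
    #   IF keywords are low THEN unimportant (0.0)
    r1 = min(length_medium, keywords_high)  # -> important
    r2 = keywords_low                       # -> unimportant
    total = r1 + r2
    return (r1 * 1.0 + r2 * 0.0) / total if total else 0.0
```

A medium-length sentence full of keywords scores near one and is kept for the summary; a sentence with no keywords scores near zero and is dropped, mirroring the behavior described above.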

D. Categorization

If a document is segmented into a structured set of pre-defined divisions, its essential core elements can be identified. When a document is categorized digitally, the software utility assesses all of the data as a single unit; unlike the information extraction procedure, none of the utility's sub-processes interprets the information within the data. Instead, the segmentation process operates on the words appearing in the data and identifies the elements the database comprises. The categorization process employs a thesaurus: the structured set of pre-defined divisions used to identify the topics of documents. After assessing a document through segmentation, the categorization process ranks it on the basis of the depth and number of topics present in the document (Gupta and Lehal, 2009).
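The thesaurus-driven ranking can be sketched as counting, per pre-defined topic, how many of that topic's terms appear in the document. The two topics and their term sets below are invented for illustration; a real thesaurus would be far larger and hierarchical.

```python
# Illustrative thesaurus: pre-defined divisions mapped to their terms
THESAURUS = {
    "medicine": {"patient", "diagnosis", "treatment", "clinical"},
    "finance": {"stock", "market", "profit", "shares"},
}

def categorize(text, thesaurus=THESAURUS):
    """Rank pre-defined topics by how many of their thesaurus terms
    appear in the document (topics with no matches are dropped)."""
    words = set(text.lower().split())
    scores = {topic: len(words & terms)
              for topic, terms in thesaurus.items()}
    return sorted((t for t, s in scores.items() if s > 0),
                  key=lambda t: scores[t], reverse=True)
```

Note that, as the text says, no sub-process interprets meaning: the ranking rests entirely on surface word matches against the thesaurus.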

As with summarization, if a topic tracking utility is used alongside the categorization utility, it can help a person find the relevant information in a database as required. Topic tracking collects different documents and then ranks them by their relevance to the required information, judged by the amount of relevant content found in each document. Categorization can be employed in a number of fields. For instance, service and complaint management in customer service uses categorization to retrieve relevant results (Gupta and Lehal, 2009).

Categorization procedures allow customer service centers to segment a database according to its content and topics, so that customers find it convenient to browse the relevant topics. Text categorization aims to divide a document according to the different topical elements it comprises; it also compares documents with one another for content relevance (Gupta and Lehal, 2009).

One way to learn categorization algorithms is to induce the procedure from already-classified documents; these algorithms can then be used to categorize unclassified documents. For instance, let D and C be two sets containing n and p elements respectively, where D is a set of categorized documents belonging to the classes represented by C. The objective of learning the categorization process is to ascertain which element of D corresponds to which element of C; hence, the n documents are classified into the p classes. The collected data is then run through a feature selection process for preparation (Shantanu and Shourya, 2008).
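Learning a mapping from documents D to classes C can be illustrated with a nearest-centroid classifier, one of the simplest supervised text categorization schemes. This is a sketch under stated assumptions: the centroid representation and whitespace tokenization are choices of this example, not the method of the cited work.

```python
from collections import Counter, defaultdict

def train_centroids(labeled_docs):
    """Learn one average term-frequency vector (centroid) per class
    from the categorized set D of (text, label) pairs."""
    sums = defaultdict(Counter)
    counts = Counter()
    for text, label in labeled_docs:
        sums[label].update(text.lower().split())
        counts[label] += 1
    return {label: Counter({t: c / counts[label] for t, c in vec.items()})
            for label, vec in sums.items()}

def classify(text, centroids):
    """Assign an unclassified document to the class in C whose
    centroid shares the most (weighted) terms with it."""
    tokens = Counter(text.lower().split())
    def overlap(label):
        return sum(centroids[label][t] * c for t, c in tokens.items())
    return max(centroids, key=overlap)
```

In practice the tokens would first pass through the feature selection step mentioned above, so that the centroids are built only over discriminative terms.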

Data collected consists of text that comprises of [END OF PREVIEW]


Mining the Process of Extracting. (2012, January 9).