GENERATING QUERIES TO CRAWL HIDDEN WEB USING KEYWORD SAMPLING AND RANDOM FOREST CLASSIFIER

Shwetanshu Rohatgi, Sabarni Kundu

Abstract


One of the most challenging aspects of information retrieval systems is crawling and indexing the deep web. The deep web is the part of the World Wide Web that is not publicly visible and therefore cannot be indexed by conventional search engines. It holds a huge amount of scholarly data, images and videos which, if indexed, could serve research purposes and help curb illegal activities. We propose an efficient hidden web crawler that uses keyword sampling and association rules to find the most important and relevant keywords, which are then used to generate queries that extract information from databases and web forms. Further, we use the random forest technique to classify the search results. Our web crawler can efficiently overcome the prior challenges stated in this paper.
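The two keyword-selection techniques named above, TF-IDF keyword sampling and Apriori-style association rules, can be sketched in a few lines of standard-library Python. This is an illustrative sketch only: the function names, the toy documents, and the exact scoring formula are assumptions for demonstration, not the authors' implementation.

```python
import math
from collections import Counter
from itertools import combinations

def tfidf_keywords(docs, k=3):
    """Score terms by TF-IDF across a sample of documents and return
    the top-k terms as candidate query keywords. Terms that appear in
    every document get an IDF of zero and thus never rank highly."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    df = Counter()                      # document frequency per term
    for tokens in tokenized:
        df.update(set(tokens))
    scores = Counter()
    for tokens in tokenized:
        tf = Counter(tokens)
        for term, count in tf.items():
            # tf * idf, summed over the sampled documents
            scores[term] += (count / len(tokens)) * math.log(n_docs / df[term])
    return [term for term, _ in scores.most_common(k)]

def frequent_pairs(docs, min_support=2):
    """Apriori-style first pass: keep keyword pairs that co-occur in at
    least `min_support` documents. Frequent pairs can seed multi-term
    queries against a hidden-web search form."""
    counts = Counter()
    for doc in docs:
        terms = sorted(set(doc.lower().split()))
        counts.update(combinations(terms, 2))
    return {pair: c for pair, c in counts.items() if c >= min_support}
```

For example, over a sample like `["deep web crawler", "hidden web forms crawler", "surface web index"]`, the ubiquitous term `web` is suppressed by its zero IDF, while `frequent_pairs` surfaces `("crawler", "web")` as a co-occurring pair worth issuing as a two-term query.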

Keywords


Deep Web; Dark Web; Random Forest Classifier; Apriori Intuition; Keyword Sampling; TF-IDF; NLP; Database Querying


References


A comparative study on web crawling for searching hidden web by IJCSIT

Trupti V. Udapure, Ravindra D. Kale and Rajesh C. Dharmik, "Study of Web Crawler and its Different Types", IOSR Journal of Computer Engineering (IOSR-JCE), e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 16, Issue 1, Ver. VI (Feb. 2014), pp. 01-05.

Ali Mesbah, Arie van Deursen, Stefan Lenselink, "Crawling Ajax-Based Web Applications through Dynamic Analysis of User Interface State Changes", ACM Transactions on the Web (TWEB), v.6 n.1, pp. 1-30, March 2012.

BERGMAN, M. 2000. The deep Web: Surfacing the hidden value. BrightPlanet, www.completeplanet.com/Tutorials/DeepWeb/index.asp.

BrightPlanet. 2014. Clearing confusion: Deep Web vs. Dark Web. https://brightplanet.com/2014/03/clearing-confusion-deep-web-vs-dark-web.asp

C. J. Kaufman, Rocky Mountain Research Laboratories, Boulder, Colo., personal communication, 1992.

A. Bergholz, B. Chidlovskii, "Crawling for Domain-Specific Hidden Web Resources", in Proc. of the 4th Int. Conf. on Web Information Systems Engineering, 2003.

S. Liddle, D. Embley, Del Scott and S. Ho Yau, "Extracting Data Behind Web Forms", in Proc. of the 28th Int. Conf. on Very Large Data Bases, China, 2005.

S. Raghavan and H. Garcia-Molina. Crawling the hidden web. In VLDB, 2001.

LUO Xin; XIA De-lin; YAN Pu-liu. Improved feature selection method and TF-IDF formula based on word frequency differentia. Computer Applications, 2005, 25(9): 2031-2033.

Markus Hegland. The Apriori Algorithm – a Tutorial. CMA, Australian National University, WSPC/Lecture Notes Series, 22-27. March 30, 2005.

L. Barbosa and J. Freire, “Siphoning hidden-web data through keyword-based interfaces,” in Proceedings of the 19th Brazilian Symposium on Databases SBBD, 2004.

Cho, J., Garcia-Molina, H., & Page, L. (1998). Efficient crawling through URL ordering. Computer Networks and ISDN Systems, 30(1–7), 161–172.

De Bra, P.M.E. & Post, R.D.J. (1994). Information retrieval in the World-Wide Web: Making client-based searching feasible. In Proceedings of the First World-Wide Web Conference (pp. 183-192). New York: ACM Press.

L. Breiman. Random forests. Machine learning, 45(1):5–32, 2001.




DOI: https://doi.org/10.26483/ijarcs.v8i9.4936





Copyright (c) 2017 International Journal of Advanced Research in Computer Science