Document Type : Review Article

Authors

Monash University

Abstract

In this paper, we develop a non-visual automatic wrapper to extract data records from search engine results pages, which contain important information for computer users. Our wrapper consists of a series of data filters that detect and remove irrelevant data from the web page. In the filtering stages, we incorporate two main algorithms: one checks the similarity of data records, and the other detects and extracts the correct data region based on component sizes. To evaluate the performance of our algorithm, we carry out experimental and deletion tests. Experimental tests show that our wrapper outperforms existing state-of-the-art wrappers such as ViNT and DEPTA. Deletion studies, in which our novel techniques are replaced with state-of-the-art conventional techniques, show that our wrapper design is efficient and can robustly extract data records from search engine results pages. With its speed advantage, our wrapper could be beneficial for processing large amounts of website data, which could be helpful in meta search engine development.

Keywords
