A MapReduce-based Efficient H-bucket PMR Quadtree Spatial Index
Keywords:
Hadoop, MapReduce, Quadtree, Spatial IndexAbstract
Majority of the MapReduce-Hadoop based indexes are based on either non-disjoint decomposition or the data-dependent disjoint decomposition of space. Quadtree index based regular disjoint decomposition in MapReduce takes different forms of spatial data as point data. Lines, curves, polygons and other higher dimensional data are transformed to point data through a mapping process. Though, the mapping makes index-building quite easy, but it is not suitable for answering search queries. This paper proposes H-bucket PMR Quadtree, a parallel implementation of the existing bucket-PMR Quadtree to handle curvilinear or polygonal map data, in MapReduce. The proposed index uses a two-level of indexing: a global index that indexes the decomposed dataset among cluster nodes to support parallel index building and a local bucket-PMR Quadtree index maintained by each participating cluster node. The proposed index is compared with the state-of-the-art MapReduce based R+-tree indexing and the default key-value storage (non-indexed) Hadoop towards index build-time and spatial queries, such as line search and range search queries. The experimental results demonstrate the effectiveness of the proposed index in MapReduce environment.
References
Achakeev, D. et al., 2012. Sort-based parallel loading of R-trees. In 1st ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data, ACM. pp. 62–70.
Antonin Guttman, 1984. R-trees: A dynamic index structure for spatial searching. ACM Sigmod Record, 14(2), pp.47–57.
Apache, 2015. Hadoop. http://Hadoop.apache.org, (Accessed on April 16, 2015).
Beckman, N. et al., 1990. The R*-tree: an efficient and robust access method for points and rectangles. In Proceedings of the ACM SIGMOD Conference on Management of Data. pp. 322–331.
Bestul, T., 1992. Parallel paradigms and practices for spatial data. University of Maryland.
Blelloch, G.E. & Little, J.J., 1994. Parallel solutions to geometric problems on the scan model of computation. Journal of Computer and System Sciences, 48(1), pp.90–115.
Cary, A. et al., 2010. Leveraging Cloud Computing in Geodatabase Management. In Proceeding of the 2010 IEEE International Conference on Granular Computing. pp. 73–78.
Dean, J. & Ghemawat, S., 2013. MapReduce: SimplifiedDataProcessing on Large Clusters. Communications of the ACM - 50th anniversary issue: 1958-2008, 51(1), pp.107–113.
Eldawy, A. & Mokbel, M.F., 2013. A Demonstration of SpatialHadoop: An Efficient MapReduce Framework for Spatial Data. Proceedings of the VLDB Endowment, 6(12), pp.1230–1233.
ESRI, 2016. Tiger Products - Geography U.S. Census Bureau. www.esri.com/data/download/census2000-tigerline.
Hayes, J.P. & Mudge, T.N., 1989. Hypercube Supercomputer. In Proceedings of IEEE. pp. 1829–1841.
Hoel, E.G. & Samet, H., 1992. A qualitative comparison study of data structures for large line segment databases. SIGMOD Record, 21(2), pp.205–214.
Hoel, E.G. & Samet, H., 1995a. Benchmarking spatial join operations with spatial output. In Proceedings of the 21st International Conference on Very Large Databases. pp. 606–618.
Hoel, E.G. & Samet, H., 2003. Data-parallel polygonization. Parallel Computing, 29(10), pp.1381–1401.
Hoel, E.G. & Samet, H., 1995b. Data-parallel primitives for spatial operations using PM Quadtrees. In Proceedings of Computer Architectures for Machine Perception, doi:10.1109/CAMP.1995.521049.
Hoel, E.G. & Samet, H., 1994a. Data-parallel spatial join algorithms. In Proceedings of the 23rd International. Conference on Parallel Processing, doi:10.1109/ICPP.1994.82.
Hoel, E.G. & Samet, H., 1994b. Performance of Data-Parallel Spatial Operations. In Proceedings of the 20th International Conference on Very Large Data Bases. pp. 156–167.
Jeong, S., Kim, S.-W. & Choi, B.-U., 2016. Effective indexing and searching with dimensionality reduction on high-dimensional space. International Journal of Computer Systems Science and Engineering, CRL Publishing, 31(4).
Jun, F. et al., 2014. HQ-Tree: A Distributed Spatial Index Based on Hadoop. In China communications, 11(7), pp.128–141.
Kamel, I. & Faloutsos, C., 1992. Parallel R-trees. SIGMOD Record, 21(2), pp.195–204.
Liao, H., Han, J. & Fang, J., 2010. Multi-dimensional Index on Hadoop Distributed File System. In Fifth IEEE International Conference on Networking, Architecture, and Storage, IEEE Computer Society. pp. 240–249.
Liu, Y. et al., 2009. Parallel bulk-loading of spatial data with MapReduce: An R-Tree case. Wuhan University Journal of Natural Sciences, 16(6), pp.513–519.
McCreadie, R., MacDonald, C. & Ounis, I., 2012. MapReduce indexing strategies: Studying scalability and efficiency. Information Processing and Management, 48(5), pp.873–888.
Nelson, R.C. & Samet, H., 1986. A consistent hierarchical representation for vector data. Computer Graphics, 20, pp.197–206.
Nievergelt, J., Hinterberger, H. & Sevcik, K., 1984. The grid file: An adaptable, symmetric multikey file structure. ACM Transactions on Database Systems (TODS), 9(1), pp.38–71.
Papadopoulos, A. & Manolopoulos, Y., 2003. Parallel bulk-loading of satial data. Parallel computing, 29(10), pp.1419–1444.
Samet, H., 1990. The Design and Analysis of Spatial Data Structures,
Schnitzer, B. & Leutenegger, S.T., 1999. Master-Client R-Trees: A new parallel R-Tree architecture. In Proceedings of the 11th International Conference on Scientific and Statistical Database Management. pp. 68–77.
Sellis, T., Roussopoulos, N. & Faloutsos, C., 1987. The R+-Tree: A dynamic index for multi-dimensional objects. In Proceedings of the 13th VLDB Conference. pp. 507–518.
Singh, H. & Bawa, S., 2017. A MapReduce-based scalable discovery and indexing of structured big data. Future Generation Computer Systems, 73(August 2017), pp.32–43.
Singh, H. & Bawa, S., 2016. Spatial Data Analysis with ArcGIS and MapReduce. In Proceedings of International conference on Conference Computing, Communication and Automation, IEEE. p. doi:10.1109/CCAA.2016.7813687.
Tan, K.-L., Ooi, B.C. & Abel, D.J., 2000. Exploiting Spatial Indexes for Semijoin-Based Join Processing in Distributed Spatial Database. IEEE Transactions on Knowledge and Data Engineering, 12(6), pp.920–937.
Wang, Y. & Weng, S., 2010. Research and implementation on spatial data storage and operation based on Hadoop platform. In Second IITA International Conference on Geoscience and Remote Sensing (IITA-GRS), IEEE, Vol. 2. pp. 275–278.
Xun, L. & Wenfeng, Z., 2013. Parallel Spatial Index Algorithm Based on Hilbert Partition. In International Conference on Computational and Information Sciences, IEEE. pp. 876–879.
Zhang, C., Li, F. & Jestes, J., 2012. Efficient Parallel kNN Joins for Large Data in MapReduce. In Proceedings of the 15th International Conference on Extending Database Technology, ACM. pp. 38–49.
Zheng, X. et al., 2016. An optimization model of Hadoop cluster performance prediction based on Markov process. International Journal of Computer Systems Science and Engineering, CRL Publishing, 31(2).
Zhong, Y. et al., 2012. Towards Parallel Spatial Query Processing for Big Spatial Data. In 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), IEEE. pp. 2085–2094.
Published
Issue
Section
License
Copyright (c) 2020 International Journal of Computer Systems Science and Engineering
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
The submission of a paper implies that, if accepted for publication, it will not be published elsewhere in the same form, in any language, without the prior consent of the publisher. Before publication, authors are requested to assign copyright to CRL Publishing Ltd. This allows CRL to sanction photocopying, and to authorize the reprinting of issues or volumes according to demand. Authors' traditional rights will not be jeopardized by assigning Copyright in this way, as they retain the right to reuse the material following publication, and to veto third-party publication.