DESIGN AND DEVELOPMENT OF SIMULATION TOOL FOR TESTING SEO COMPLIANCE OF A WEB PAGE – A CASE STUDY

Abstract: Efficient search engines enable the end user to obtain the target information as quickly and as accurately as possible. Search engine optimization (SEO) is the act of optimizing a web page in particular, and a website in general, so as to acquire the highest possible ranking in search engine results. The ranking is based on pre-defined criteria, which SEO encapsulates in order to send the right signals to the search engine. The basic methodology adopted for this purpose focuses on enriching the website content and improving readability by making the web pages more search engine friendly. Several optimization tools exist for rendering a website search engine friendly, but all of them operate like a black box whose internal details are not revealed to the end user. To address this, the authors have designed and developed a simulation tool for testing W3C compliance, which is a necessary condition for making a web page search engine friendly. Browsers do not complain about violations of W3C compliance rules; however, such violations can render the web page invisible to the search engine, which in turn results in less traffic to the website. W3C and CSS compliance are among the most basic rules that a website designer cannot afford to ignore. The tool accepts an HTML document and computes its W3C compliance score. It analyzes the HTML code and generates several reports on rule violations, if any. Further, the tool has a provision for converting the document into a W3C-compliant document based on the specified ruleset. Finally, the tool has been tested by applying it to the case study of a hypothetical organization.


INTRODUCTION
In the era of the Internet and the World Wide Web, digital marketing has emerged as the most promising and powerful trend in marketing, focusing on retaining existing customers and acquiring new customers through digital content. Academic institutions are no exception. Owing to the intense competition among educational institutions for quality students, institutions are also adopting these technologies to establish their brand names and reach aspirants across the globe. A vast amount of digital content is already present on the Internet, and more is generated at an exponential rate. Because of this competition, every organization wants to see itself at the top of the results generated by search engines. It has been predicted that in the coming decade only those organizations that quickly adapt to these technologies will survive; doing so gives them an upper hand over their competitors, in keeping with the Darwinian principle of survival of the fittest. Currently, most academic institutions are adopting Search Engine Optimization (SEO) to render their sites search engine friendly and thereby move up to higher positions in search engine results.
Search engine optimization is a digital marketing methodology dealing with the measures taken to improve the ranking of a website in search engine results, thereby increasing page views. SEO can be achieved with the help of a set of rules pertaining to keyword selection, image optimization and rich web content, to name a few, which in turn render the website search engine friendly. A handful of freelancing companies provide these services. SEO focuses on improving web visibility through a set of specific keywords and is, in essence, a guiding theory for increasing web traffic. The keyword is the most crucial factor and influences search engine results to a large extent. The generic optimization principle known as the early termination technique is employed by the majority of search engines; it computes results based on indexing and employs pruning techniques, which reduces the huge computational effort involved in exhaustive traversal.
The most challenging task for a search engine is the tremendous workload it must handle: on average, a typical search engine needs to address hundreds of millions of queries per day involving billions of documents. Search engines address this issue by using millions of machines distributed over multiple data centers, with the intention of reducing query processing cost. SEO techniques primarily fall into two distinct categories:
• On-site optimization
• Off-site optimization
On-site optimization demands meticulous attention during website development, targeting domain names, file names, directory structure, title, keyword selection and keyword density. Another area that requires meticulous attention is keyword consistency within the headings, anchor tags, alternate text, descriptions etc.
Off-site optimization, on the other hand, focuses on including the website on various social networking sites to increase traffic to the website, using a reputed web hosting service, blog publishing, link building etc. The parameter employed for measuring off-site optimization is the Click-Through-Rate (CTR) of a website, which is computed by dividing the number of times the link is clicked by the number of times the page containing the link is displayed. For example, if the page containing the link is displayed 10,000 times and the link is clicked 100 times, then the CTR for the page is (100/10000)*100 % = 1%. SEO is based on four optimization techniques: structure optimization, keyword optimization, content optimization and link optimization. In the current work, the authors focus on content optimization based on a few key rules and validate the HTML document against those rules. Based on the validation process, a compliance score is assigned to the document. The tool also contains a provision for converting invalid HTML documents into their valid counterparts. Currently, the tool operates with only a small set of rules, which can easily be extended to incorporate additional rules.
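The CTR arithmetic can be illustrated with a short Python sketch. The function name and its parameters are hypothetical and are not part of the authors' tool; the sketch takes CTR to be clicks divided by displays, expressed as a percentage, as in the worked example (100 clicks, 10,000 displays).

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Return CTR as a percentage: clicks divided by displays, times 100."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return clicks / impressions * 100.0


# Figures from the worked example in the text.
ctr = click_through_rate(clicks=100, impressions=10_000)
print(f"CTR = {ctr:.1f}%")
```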

LITERATURE REVIEW
With the rapid evolution of information technology and the explosion in the number of websites, SEO technology is continuously gaining importance and is catching the attention of every website developer. SEO techniques focus on a few parameters pertaining to keyword selection, back linking and rich dynamic content, and aim at gaining a better ranking in search engine results [1,2]. Website promotion is just one application of SEO technology; its applications are manifold. Apart from improving page ranking, numerous applications exploit search engine optimization techniques, such as spamming applications and six sigma management applications, for which SEO provides a conceptual background [2-4]. Guidelines for building search engine friendly websites that optimize search queries have been presented in [5][6][7]. Further, the authors present a review of search engine optimization tools as a part of the digital marketing regime, highlighting different categories of SEO tools that aid website promotion. A search engine filtering system based on a two-tier link extractor is presented to decrease the number of irrelevant pages in search results [8]. Search engine optimization based on a soft computing technique is presented by Wang et al. [9], where the back propagation learning method of a neural network is employed for speedy retrieval of data from the web. Web caching based on a semantic web technique is employed for designing a cluster of search engines with the goal of reducing the load on the web server and the latency time [10]. Different algorithms, namely the lowest relative value algorithm, the LUV algorithm and the least weighted usage algorithm, are employed to study their relative merits and demerits.

Research Gaps
Most of the work on search engine optimization focuses on on-site and off-site optimization techniques, with the primary focus on keyword optimization. No simulation tools are available for revealing the basic operation of these optimization tools. To account for this, the current research focuses on the design and development of a simulation tool for testing the SEO compliance of a website. The scope of the tool is restricted to partial content optimization. The paper provides first-hand information for new researchers in this area.

CONCEPTUAL MODEL DESIGN
A. Problem Definition
XYZ institute is a leading management institute in south western Maharashtra accredited by NACC with "A" grade. The institute has an in-house website development team which is actively involved in the maintenance of the website. The management of the institute is interested in improving the website's ranking in search engine queries.

B. Proposed Solution
The problem can be addressed at two levels. The first alternative is to hire a freelancer, who works for 25% less than an SEO company. However, this can be risky and time consuming, as the freelancer may not be aware of the specific website strategy and may struggle to understand the exact needs. Finding a good freelancer is itself time consuming and is more often than not a trial and error process.

Current Status
There is a plethora of free and open source tools available on the Internet which enable analysis of the website content and give the SEO ranking of the website. The tools generate reports pertaining to broken links analysis and suggest improvements for making the website SEO friendly.
As a second alternative, an in-house tool can be designed and developed for verifying W3C and CSS compliance, and the verified pages can then be input to the available tools for further improvement. This helps the website rank near the top of search engine results, thereby increasing the volume of traffic directed to the website by search engines. This demands technical expertise, which may be expected from XYZ institute owing to the availability of its in-house website development team.
At a first level, search engine optimization can be achieved by targeting search terms, which commonly include the title, description, keywords, headings and alternate text. A few prominent rules for SEO are listed below:

C. Few Prominent Rules for SEO
Optimization rules exist on HTML filename and several HTML elements such as title, anchor element etc.
• Matching filenames with a page title.
• Using hyphens for separating keywords.
• File size restriction to 101K, since text above it is chopped by most of the search engines.
• No use of frames and JavaScript embedded in HTML.
• Usage of more text than HTML elements.
• Using a meta tag with the attribute "name" set to "keywords" and the "content" attribute listing about 25-30 keywords.
• Using a meta tag with the attribute "name" set to "description" and the "content" attribute listing about 150 characters.
• Limiting title size to maximum 9 words or 60 characters.
• Adding title to anchor element.
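A few of these rules can be checked mechanically. The following Python sketch uses the standard library `html.parser` module; the class name, function name and report keys are hypothetical and are not taken from the authors' tool. As assumptions, the title rule is treated as satisfied only when both limits (9 words and 60 characters) hold, and "about 150 characters" is treated as an upper bound.

```python
from html.parser import HTMLParser


class SeoTagCollector(HTMLParser):
    """Collects the title text, named meta tags and anchors lacking a title."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.anchors_without_title = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # tag and attribute names arrive lower-cased
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""
        elif tag == "a" and not attrs.get("title"):
            self.anchors_without_title += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def check_seo_rules(html: str) -> dict:
    """Return a small report for the title, description, keyword and anchor rules."""
    collector = SeoTagCollector()
    collector.feed(html)
    title = collector.title.strip()
    description = collector.meta.get("description", "")
    return {
        # Assumption: both limits must hold for the title to pass.
        "title_ok": 0 < len(title) <= 60 and len(title.split()) <= 9,
        # Assumption: 150 characters is used as a hard upper bound.
        "description_ok": 0 < len(description) <= 150,
        "has_keywords": bool(collector.meta.get("keywords", "").strip()),
        "anchors_without_title": collector.anchors_without_title,
    }


sample = """<html><head><title>Admissions at XYZ Institute</title>
<meta name="description" content="Management programmes offered by XYZ Institute.">
<meta name="keywords" content="management, MBA, admissions">
</head><body><a href="courses.html">Courses</a></body></html>"""
report = check_seo_rules(sample)
print(report)
```

The sample page passes the title, description and keyword checks, while its single anchor is reported because it carries no title attribute.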

D. W3C Compliance Test
Browsers do not penalize violations of W3C compliance, but search engines do. One useful link for validating a web page is validator.w3.org

Rules of W3C Compliance
1. Every tag must be closed explicitly. Empty tag must use terminating slash.
2. The head tag is mandatory.
3. All tags must be in lower case.
4. All attributes must be enclosed in double quotes.
5. Form tags cannot be nested.
6. Use entity references wherever necessary.
7. CSS rules should be in lower case.

E. Application Architecture
The application architecture and operation of the SEO analysis tool are presented in Figure 1. The steps involved in testing SEO compliance are as follows:
1. Accept a webpage from the end user.
2. Check the number of significant rules (a rule is considered significant if the corresponding element exists in the input web page).
3. Distribute the percentage among the significant rules equally.
4. Analyze the webpage and generate the SEO compliance report.
5. Modify the webpage to render it more SEO friendly.
The tool operates by accepting an HTML document from the end user. The HTML document is then analyzed to check for the presence of some prominent tags. The seven rules for W3C compliance stated above are considered. Some of these rules may be insignificant in a given context, depending on the structure of the HTML document input by the end user. The primary focus is on locating un-terminated tags, unquoted attributes, a missing head element etc. The score is distributed equally among the rules that apply to the tags present, and finally the SEO compliance score for the current ruleset is computed.
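The equal distribution of the percentage among the significant rules (steps 2 and 3) can be sketched in Python as follows. The function name and the significance flags are hypothetical; how the authors' tool decides that a rule is significant is not detailed here, so the flags are simply supplied by hand for an assumed page with no form elements and no CSS.

```python
def rule_share(significant: dict[int, bool]) -> float:
    """Return the percentage reserved for each significant rule (100 / N)."""
    n = sum(1 for is_significant in significant.values() if is_significant)
    if n == 0:
        raise ValueError("at least one rule must be significant")
    return 100.0 / n


# Hypothetical flags: rules 5 (forms) and 7 (CSS) do not apply to this page.
flags = {1: True, 2: True, 3: True, 4: True, 5: False, 6: True, 7: False}
share = rule_share(flags)
print(f"{share:.1f}% per significant rule")
```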

F. Implementation of Rules
All seven rules carry equal priority, and the percentage score reserved for each rule is given by
$$\text{Percentage} = \frac{100}{N}$$
where $N$ is the number of significant rules. The net score is computed using the formula
$$\sum_{i=1}^{N} S_i$$
The computation of the score for the first four rules is described below.

Implementation of Rule 1:
The rule for matching tags (no un-terminated tags) is
$$S = E + T$$
where $S$, $E$ and $T$ refer to the number of start tags, end tags and terminating tags, respectively.
The score awarded after testing Rule 1 is
$$\text{Score}_1 = \text{Percentage} \times \frac{E + T}{S}$$
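A minimal Python sketch of the Rule 1 computation is given below. It is not the authors' Visual Basic 6 code: the regular expressions and the function name are assumptions, tags ending in "/>" are counted as terminating tags as well as start tags, and the ratio is capped at 1 so that stray end tags cannot push the score above the rule's share. Attribute values containing "<" or ">" are not handled.

```python
import re

START_TAG = re.compile(r"<[A-Za-z][^<>]*>")        # S: includes tags such as <hr />
END_TAG = re.compile(r"</[A-Za-z][^<>]*>")         # E: explicit end tags
SELF_TERMINATED = re.compile(r"<[A-Za-z][^<>]*/>")  # T: tags closed with a slash


def rule1_score(html: str, percentage: float) -> float:
    """Score for Rule 1: percentage * (E + T) / S, capped at the full share."""
    s = len(START_TAG.findall(html))
    e = len(END_TAG.findall(html))
    t = len(SELF_TERMINATED.findall(html))
    if s == 0:
        return 0.0  # no tags at all; the rule cannot be evaluated
    return percentage * min(1.0, (e + t) / s)


# Hypothetical page: the <br> tag and the first <p> tag are never closed.
page = "<html><head><title>t</title></head><body><p>one<br><p>two</p><hr /></body></html>"
score1 = rule1_score(page, percentage=25.0)
print(score1)
```

For this page S = 8, E = 5 and T = 1, so three quarters of the 25% share is awarded.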

Implementation of Rule 2:
Implementation of rule 2 is trivial since the full/zero score is awarded based on the presence/absence of a head node.

Implementation of Rule 3:
$$\text{Score}_3 = \text{Percentage} \times \frac{L}{T}$$
where $T$ has the same meaning as above and $L$ is the number of tags in lower case.
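The Rule 3 score can be sketched in Python as follows. As an assumption, the denominator is taken to be the total number of tags (start and end tags together), and a tag counts as lower case when its name contains no upper-case letters. The regular expression and the function name are hypothetical.

```python
import re

TAG_NAME = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)")  # names of start and end tags


def rule3_score(html: str, percentage: float) -> float:
    """Score for Rule 3: percentage * (lower-case tags) / (all tags)."""
    names = TAG_NAME.findall(html)
    if not names:
        return 0.0  # no tags at all; the rule cannot be evaluated
    lower = sum(1 for name in names if name == name.lower())
    return percentage * lower / len(names)


# Hypothetical page with an upper-case HR tag and an upper-case HTML end tag.
page = "<html><body><p>text</p><HR /></body></HTML>"
score3 = rule3_score(page, percentage=25.0)
print(round(score3, 2))
```

Here five of the seven tags are in lower case, so the rule earns five sevenths of its share.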

Implementation of Rule 4:
$$\text{Score}_4 = \text{Percentage} \times \frac{Q}{A}$$
where $Q$ is the total number of quoted attributes and $A$ is the total number of attributes. The final SEO compliance score is then given by
$$S = S_1 + S_2 + \dots + S_7$$
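The Rule 4 score can be sketched in the same way. In this Python sketch an attribute counts as quoted only when its value is enclosed in double quotes, following Rule 4 of the W3C ruleset above; the regular expressions and the function name are assumptions, and attributes without a value are ignored.

```python
import re

START_TAG = re.compile(r"<[A-Za-z][^<>]*>")
# One attribute: name = value, where the value is double-quoted,
# single-quoted or unquoted. The captured group is the raw value.
ATTRIBUTE = re.compile(r"""[\w:-]+\s*=\s*("[^"]*"|'[^']*'|[^\s"'<>]+)""")


def rule4_score(html: str, percentage: float) -> float:
    """Score for Rule 4: percentage * Q / A."""
    values = [v for tag in START_TAG.findall(html) for v in ATTRIBUTE.findall(tag)]
    if not values:
        return 0.0  # no attributes; the rule cannot be evaluated
    quoted = sum(1 for v in values if v.startswith('"'))
    return percentage * quoted / len(values)


# Hypothetical fragment: two of the four attributes use double quotes.
page = """<a href="index.html" title=Home><img src='logo.png' alt="logo"></a>"""
score4 = rule4_score(page, percentage=25.0)
print(score4)
```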

RESULTS AND DISCUSSIONS
The model proposed above is implemented in Visual Basic 6 and is applied to the case study of the XYZ educational institute stated above. The partial source code for the implementation of the tool is listed in Appendix A. The tool is tested with different test cases to check its validity and scope. The tool is also compared with the markup validation service provided by W3C at validator.w3.org. The results are depicted in Figure 2. Figure 5(a) depicts the ruleset currently employed by the tool. On analyzing the HTML document entered by the end user for the presence and absence of certain tags, the evaluation criteria computed by the tool are depicted in Figure 5(b). Figure 5(c) depicts an in-depth analysis of the HTML tag set. Based on compliance with the rules, the score is computed and displayed to the end user, as depicted in Figures 6(a) and 6(b) for two different HTML documents.

Figure 8. Report Generated by the W3C Validation Tool in Text Format
Converting the Invalid HTML Document into a Valid HTML Document
The structure of the HTML document "seo.html" input to the tool is shown in Figure 9. As the HTML content reveals, the head tag is missing, some tags (the hr tag and the html end tag) are in uppercase, and the br tags are not terminated.

Figure 10(a)-(b). Report Generated by the W3C Validation Tool on Performing Tag Analysis
On selecting the "Convert" option from the main menu, a new file with the name <filename>_new.html is created taking care of all the rules, where <filename> is the primary filename of the HTML file entered by the end user. The structure of the new document created by the tool is shown in Figure 11.
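A rough Python sketch of such a conversion step is shown below. It is not the authors' implementation: it covers only three of the fixes described for "seo.html" (lower-casing tag names, terminating empty tags and adding a missing head element), and the set of empty tags, the function names and the regular expressions are assumptions made for illustration.

```python
import re
from pathlib import Path

EMPTY_TAGS = {"br", "hr", "img", "meta", "link", "input"}  # assumed list


def output_name(path: str) -> str:
    """Build the <filename>_new.html name from the input file name."""
    p = Path(path)
    return str(p.with_name(p.stem + "_new.html"))


def _fix_tag(match: re.Match) -> str:
    slash, name, rest = match.group(1), match.group(2).lower(), match.group(3)
    # Add a terminating slash to empty tags that lack one.
    if not slash and name in EMPTY_TAGS and not rest.rstrip().endswith("/"):
        rest = rest.rstrip() + " /"
    return f"<{slash}{name}{rest}>"


def convert(html: str) -> str:
    """Lower-case tag names, terminate empty tags and add a missing head."""
    html = re.sub(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>", _fix_tag, html)
    if not re.search(r"<head\b", html):
        html = re.sub(r"(<html[^<>]*>)", r"\1<head></head>", html, count=1)
    return html


fixed = convert("<html><body>one<BR>two<HR></body></HTML>")
print(output_name("seo.html"))
print(fixed)
```

Writing the result to disk is then a matter of saving `fixed` under the name returned by `output_name`.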