Today there are search engine and internet marketing services; in fact a new industry has materialised to exploit the fear of low search rankings.

This is not a new trend. Back when simply resubmitting your website to the engines kept your site at the top of the index, there was an accompanying boom in resubmitting "companies". As we know, these were just men in back bedrooms with a host of CGI and Perl submission scripts and a timetable.

Search engine optimisation, or "SEO", is the latest incarnation of this bedroom profiteering. The important difference is that now the webmasters are not just passively involved, but are being forced to adopt totally artificial and unsocial practices that ultimately serve only to help damage the Internet!

SEO is supposedly the methodology and processes related to designing search engine "friendly" web content. The basic premise is something like "If I follow all the engines' formatting and connectivity criteria, then my website will rank higher than a comparable website that does not".

All other things being equal, this seems quite positive. Given that the quality of a search engine's database (index) directly affects its output, webmasters optimising their content so that search engines can correctly categorise the Internet should logically improve the speed and quality of "the crawl".

SEO then, logically, should be good for the search providers: being able to maintain an efficient index should use less raw processing power, require less equipment and thus less energy. It must also be good for the users, who can quickly and intuitively find what they want from a reliable source. Sounds reasonable, right?

Well, that's the happy version. The fact is that initially this may be true, and you may gain a short term advantage. But once we have all optimised our content for analysis and (in so doing) ignored our users, we will be back where we started, and the search providers will just think up some even more ridiculous "laws" by which to "judge" us. Like sheep we will all follow those as well, thus the causal paradox is perpetuated and the users feel abused!

Even this is a vast oversimplification; the true nature of SEO is a lot more complicated. The heart of the problem, and the real issue here, is related to the search providers' task, which is to strip mine the information junk yard otherwise known as the Internet. It may be full of interesting stuff but it is also full of garbage, and they need to devise intelligent techniques to mine the interesting stuff!

The current "solution" is literally for the search engines to use their hegemonic standing to bully the webmasters into organising their work in ways that have the primary effect of allowing quick "analysis", so the engines can categorise the website. This has the secondary effect of requiring content to be designed "for" analysis, which typically translates to highly distributed connectivity, ie the website being effectively divided into "micro sites", which makes the maintenance of links and content more troublesome!

This is not necessarily a bad thing; most of these imposed linking and design methodologies are positive and beneficial for a lot of subjects. My problem is that this is unilaterally enforced, and it is this type of issue that is generating all the money for the SEO boys.

However, this will soon be of no consequence. To understand the problem with this type of SEO operation, it is necessary to think about how we can approximate and simulate the human process of mining information and knowledge.

Let us assume we have set our crawlers to work, automatically indexing pages (at random, by looking at previous indexing, and guided by user requests). We then format the resulting text: ASCII is usually used and validation follows this. Search engines tend to ignore some tags and make use of good ones that help identify the content. At this point we would have reduced the Internet to a corpus, ie the collection of all HTML documents about no particular subject.

We would then set about item normalisation, ie identification of tokens (words), characterisation of tokens (tagging meaning to words), and finally running stemming algorithms to remove suffixes (and/or prefixes) to derive the final database of terms. This can be efficiently and compactly represented in lower dimensional term spaces (Google are still essentially using inverted file structures).

Imagine each document of a corpus as a point in an N-dimensional space, with one dimension per term. Here the literal word matching type of search is lost, but we acquire more of a semantic flavour, where closely related information can be grouped into clusters of documents bearing similarities. However, N-dimensional vector spaces are of no help to the users.

After applying our algorithms to the corpora, we get a term-by-document matrix, where terms and documents are represented by vectors; a query can also be represented by a vector. So we have a query and our corpora (represented as vectors, both having the same dimensions), and we can now start matching the query against all the available documents using the cosine angle between the two vectors.

But we now have a new artificial "problem". We know the general answer to the question "which websites best match my search terms": this information now exists in our mathematical object, at a high level of abstraction, ie the cosine angles for all terms against the query vector. This is expressed as a vector corresponding to the sought column, and therefore the document we are after. All we need do is present this to the user, right? Well....

The issue is that a search engine needs to generate a linear index, ie convert the vectors corresponding to the smallest cosine angles into a human readable format. Until such time as someone thinks of a better way to do it, all engines output lists. Like your shopping list, it has a start, a middle and an end, and therein lies the problem: how to order the list!

The hypothesis seems simple: order information that might look chaotic at first, using the fact that closely associated documents tend to be relevant to similar requests. However, the Internet (being a scale free network) is so vast that it is not possible to present a chosen feature space that represents the x closest documents to the convergence point in a given cluster, measured by the common Euclidean distance. This is what should then be presented to the user in a more intelligible (semantic) display.

The engines could just present the returns as produced by the matching algorithms after decomposition. A document grouped using probabilistic/fuzzy patterns taken directly from the cluster might belong to more than one class, but a strength (degree of membership) value, measured as a probability on the [0,1] interval, is quite adequate.

The reason singular value decomposition works for ordering is that when the occurrence of two terms (say tomato and potato) is very high, this is reflected in the term-by-document matrix, which shows that only x of the n terms are used very frequently.

The idea is that since a term such as pepper is used/mentioned very little, its axis/dimension does not affect the search space much, making the space flat and relevant only in the other two dimensions.

However, the engines' demonic creators can't do this because they are still essentially using an inverted file structure. They still want absolute correctness in their indexes and returned results, which means trouble, because this assumes your index is perfect, incapable of being manipulated, and that you can somehow order the returns in a meaningful way!

So the returned results can't generally represent the documents that match semantically, and we now need to account for some subjective quantities that cannot be derived directly from the corpora. The engines attempt to deal with this through a cocktail of criteria that rank the returns in such a way that it is more likely that the "better" results are closer to the top of the list.

There are many ways of doing this. The current trend is to use inference about the quality of web sites where possible, because such quantities are beyond the direct control of the content creators and the webmasters.

PageRank provides a more sophisticated way of citation counting, embodied in the concept of link analysis: it uses a relative value of importance for a page, measured by the average number of citations per reference item.

PageRank is currently one of the main ways to determine who gets to the top of the listings. But soon this will all become irrelevant, when the engines stop using inverted file structures, because they can just use the grouping generated using probabilistic/fuzzy patterns resulting from the convergence point in a given cluster, measured by the common Euclidean distance.

When the changeover from inverted file structures occurs, there will be two direct consequences. First, the corpora will be capable of holding vastly more representative and more detailed data than is currently possible. Second, the corpora will no longer be indexed as is currently done: they will embody semantic meaning and value, where some subjective quantities can be derived directly from the corpora without the need for cocktails or totally artificial rules.

The effect is that corpora will be more accurate and incapable of manipulation, thus variations of SEO that involve indirect manipulation of the index will become pointless overnight.

It is worth noting that the search providers are becoming increasingly pessimistic about website promotion in all forms. They currently penalise many things that can affect the results, such as duplicated content (which can be perfectly legitimate) and satellite sites, ie one webmaster interlinking seemingly separate but highly relevant websites.

They may well start penalising webmasters that promote their websites through articles they submit for third party distribution, as they do for people that post their site's information to bulletin boards!

Being banned from the top search engines can effectively destroy your business, if not directly through loss of visibility then indirectly, in that people tend to judge you on whether you are organised enough to be listed!

The criteria are continually changing as the amoral SEO boys attempt to pervert the results. These "laws" are not always clear and there are no appeals; we are all subject to the providers upending a drum and then dispensing swift and hard "judgements" that can doom us at any time!

The part that irks the most is that as the indexes converge (Google's index is used directly by 2 of the 3 top engines, and 5 others indirectly use it for their rankings), a ban by any one of these engines is enforced by them all.
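The matching step described above (a term-by-document matrix, with the query compared to every document by the cosine angle between vectors) can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not how any real engine works: the function names (`tokenize`, `build_matrix`, `cosine`, `rank`) and the three sample documents are made up for the example, whitespace splitting stands in for proper item normalisation and stemming, raw term counts are used as vector entries, and no singular value decomposition is applied.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real item normalisation)."""
    return text.lower().split()


def build_matrix(docs):
    """Return (terms, columns); columns[j] is the term-count vector of docs[j]."""
    counts = [Counter(tokenize(d)) for d in docs]
    terms = sorted(set().union(*counts))
    columns = [[c[t] for t in terms] for c in counts]
    return terms, columns


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(query, docs):
    """Return (score, document index) pairs, best match first."""
    terms, columns = build_matrix(docs)
    q_counts = Counter(tokenize(query))
    q = [q_counts[t] for t in terms]  # query expressed in the same term space
    scores = [(cosine(q, col), j) for j, col in enumerate(columns)]
    # Highest cosine (smallest angle) first; ties fall back to document order.
    return sorted(scores, key=lambda s: (-s[0], s[1]))


# Hypothetical three-document corpus, echoing the tomato/potato/pepper example.
docs = [
    "tomato potato soup",
    "potato salad with pepper",
    "pepper steak",
]
ranking = rank("tomato potato", docs)
for score, j in ranking:
    print(f"{score:.3f}  {docs[j]}")
```

Note that the sort at the end is exactly the "linear index" problem the article complains about: the cosine scores give a similarity to the query, but turning them into a list with a start, a middle and an end is a separate decision, and tied or near-tied documents have to be ordered by some other criterion.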