Search Engine Optimisation

There are today search engine and internet marketing services; in fact, a new industry has materialised to exploit the fear of low search rankings.

This is not a new trend. Back when simply resubmitting your website to the engines resulted in keeping your site at the top of the index, there was an accompanying boom in resubmitting "companies". As we know, these were just men in back bedrooms with a host of CGI and Perl submitting scripts and a timetable.

Search engine optimisation, or "SEO", is the latest incarnation of this bedroom profiteering. The important difference is that now the webmasters are not just passively involved but are being forced to adopt totally artificial and unsocial practices that ultimately serve only to help damage the Internet!

SEO is supposedly the methodology and processes related to designing search engine "friendly" web content. The basic premise is something like "If I follow all the engines' formatting and connectivity criteria, then my website will rank higher than a comparable website that does not".

All other things being equal, this seems quite positive. Given that the quality of a search engine's database (index) directly affects its output, webmasters optimising their content so that search engines can correctly categorise the Internet should logically improve the speed and quality of "the crawl".

SEO then, logically, should be good for the search providers: being able to maintain an efficient index should use less raw processing power, require less equipment and thus less energy. This must also be good for the users, being able to quickly and intuitively find what they want from a reliable source. Sounds reasonable, right?

Well, that's the happy version. The fact is that initially this may be true, and you may gain a short-term advantage. But once we have all optimised our content for analysis and (in so doing) ignored our users, we will be back to where we started, and the search providers will just think up some even more ridiculous "laws" by which to "judge" us, and like sheep we will all do that as well. Thus the causal paradox is perpetuated and the users feel abused!

Even this is a vast oversimplification; the true nature of SEO is a lot more complicated. The heart of the problem, and the real issue here, is related to the search providers' task, which is to strip mine the information junk yard otherwise known as the Internet. It may be full of interesting stuff but also plenty of garbage, and they need to devise intelligent techniques to mine the interesting stuff!

The current "solution" is literally for the search engines to use their hegemonic standing to bully the webmasters into organising their work in ways that have the primary effect of allowing quick "analysis" so they can categorise the website. This has the secondary effect of requiring content to be designed "for" analysis, which typically translates to highly distributed connectivity, ie the website being effectively divided into "micro sites", which makes the maintenance of links and content more troublesome!

This is not necessarily a bad thing; most of these imposed linking and design methodologies are positive and beneficial for a lot of subjects. My problem is that this is unilaterally enforced, and it is this type of issue that is generating all the money for the SEO boys.

However, this will soon be of no consequence. To understand the problem with this type of SEO operation, it is necessary to think about how we can approximate and simulate the human process of mining information and knowledge.

Let us assume we have set our crawlers to work, automatically indexing pages (at random, looking at previous indexing and guided by user requests). We then format the resulting text: ASCII is usually used and validation follows this. Search engines tend to ignore some tags and make use of good ones that help identify the content. At this point we would have reduced the Internet to a corpus, ie the collection of all HTML documents about no particular subject.

We would then set about item normalisation, ie identification of tokens (words), characterisation of tokens (tagging meaning to words), and finally running stemming algorithms to remove suffixes (and/or prefixes) to derive the final database of terms. This can be efficiently and compactly represented in lower-dimensional term spaces (Google are still essentially using inverted file structures).

Imagine each document of a corpus as a point in an N-dimensional space, with one dimension per term. Here the literal word matching type of search is lost, but we acquire more of a semantic flavour, where closely related information can be grouped into clusters of documents bearing similarities. However, N-dimensional vector spaces are of no help to the users.

After applying our algorithms to the corpora, we get a term-by-document matrix, where terms and documents are represented by vectors; a query can also be represented by a vector. So we have a query and our corpora (represented as vectors, both having the same dimensions), and we can now start matching the query against all the available documents using the cosine of the angle between the two vectors.

But we now have a new artificial "problem". We know the general answer to the question "which websites best match my search terms": this information now exists in our mathematical object, at a high level of abstraction, ie the cosine angles for all terms against the query vector. This is expressed as a vector corresponding to the sought column and therefore the document we are after. All we need do is present this to the user, right? Well....

The issue is that a search engine needs to generate a linear index, ie convert the vectors corresponding to the smallest angles (the highest cosine values) into a human-readable format, and until such time as someone thinks of a better way to do it, all engines output lists. Like your shopping list, it has a start, a middle and an end. Therein lies the problem: how to order the list!

The hypothesis seems simple: order information that might look chaotic at first, using the fact that closely associated documents tend to be relevant to similar requests. However, the Internet (being a scale-free network) is so vast that it is not possible to present a chosen feature space that represents the x closest documents to the convergence point in a given cluster by the common Euclidean distance, which is what should then be presented to the user in a more intelligible (semantic) display.

The engines could just present the returns as produced by the matching algorithms after decomposition. A grouping generated using probabilistic/fuzzy patterns directly from the cluster might belong to more than one class, but the strength (degree of membership) value, measured as a probability on the [0,1] interval, is quite adequate.

The reason singular value decomposition works for ordering is that when the co-occurrence of two terms (say tomato and potato) is very high, this is reflected in the term-by-document matrix, which shows that only x of the n terms are used very frequently.

The idea is that since a term such as pepper is used/mentioned very little, its axis/dimension does not much affect the search space, making it flat and relevant only in the other two dimensions.

However, the engines' demonic creators can't do this because they are still essentially using an inverted file structure, yet they still want absolute correctness in their indexes and returned results. This means trouble, because it assumes your index is perfect, incapable of being manipulated, and that you can somehow order the returns in a meaningful way!

So the returned results can't generally represent the documents that match semantically; we now need to account for some subjective quantities that cannot be derived directly from the corpora. They attempt to deal with this by a cocktail of criteria that rank the returns in such a way that it is more likely that the "better" results are closer to the top of the list.

There are many ways of doing this. The current trend is to use inference about the quality of web sites where possible, because such quantities are beyond the direct control of the content creators and the webmasters.

PageRank provides a more sophisticated way of citation counting, but this is embodied in the concept of link analysis, using a relative value of importance for a page measured based on the average number of citations per reference item.

PageRank is currently one of the main ways to determine who gets to the top of the listings, but soon this will all become irrelevant when the engines stop using inverted file structures, because they can just use the grouping generated using probabilistic/fuzzy patterns resulting from the convergence point in a given cluster from the common Euclidean distance.

When the changeover from inverted file structures occurs, there will be two direct consequences:

The corpora will be capable of holding vastly more representative and more detailed data than is currently possible.

The corpora will no longer be indexed as is currently done; they will embody semantic meaning and value, where some subjective quantities can be derived directly from the corpora without the need for cocktails or totally artificial rules.

The effect is that corpora will be more accurate and incapable of manipulation, thus variations of SEO that involve indirect manipulation of the index will become pointless overnight.

It is worth noting that the search providers are becoming increasingly pessimistic about website promotion in all forms. They currently penalise many things that can affect the results, such as duplicated content (which can be perfectly legitimate) and satellite sites, ie one webmaster interlinking seemingly separate but highly relevant websites.

They may well start penalising webmasters that promote their websites through articles they submit for third-party distribution, as they do for people that post their site's information to bulletin boards!

Being banned from the top search engines can effectively destroy your business, if not directly through loss of visibility then indirectly, in that people tend to judge you on whether you are organised enough to be listed!

The criteria are continually changing as the amoral SEO boys attempt to pervert the results. These "laws" are not always clear and there are no appeals; we are all subject to the providers upending a drum and then dispensing swift and hard "judgements" that can doom us at any time!

The part that irks the most is that as the indexes converge (Google's index is used directly by 2 of the 3 top engines, and 5 others indirectly use it for their rankings), a ban by any one of these engines is enforced by them all.
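
As a footnote to the matching step described in this article (a term-by-document matrix, a query vector with the same dimensions, and the cosine of the angle between them), here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any real engine works: the function names (tokenise, build_term_document_matrix, cosine, rank) and the three sample documents are invented for the example, raw term counts stand in for whatever weighting an engine might use, and stemming is left out entirely.

```python
import math
from collections import Counter


def tokenise(text):
    """Lower-case the text and split it into purely alphabetic tokens."""
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    return cleaned.split()


def build_term_document_matrix(docs):
    """Return (vocabulary, vectors): one raw term-count vector per document."""
    counts = [Counter(tokenise(doc)) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, vectors


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return (score, document index) pairs, best match first.

    Query words that never occur in the corpus are simply ignored.
    Ties are broken by document index, which is an arbitrary choice.
    """
    vocabulary, vectors = build_term_document_matrix(docs)
    query_counts = Counter(tokenise(query))
    query_vector = [query_counts[term] for term in vocabulary]
    scores = [(cosine(query_vector, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical three-document corpus, made up for this illustration.
docs = [
    "tomato and potato soup recipe",
    "potato farming",
    "search engine ranking and link analysis",
]

for score, index in rank("tomato potato", docs):
    print(f"{score:.3f}  {docs[index]}")
```

With this toy corpus the soup document comes first, the farming document second and the unrelated document last. Even here the output has to be flattened into an ordered list, and the tie-breaking rule is arbitrary, which is the ordering problem in miniature.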