A Method for Enhancing Search Using Transliteration of Mandarin Chinese
Vijay John
March 2006
Although search engines are increasingly capable of answering queries, they do not yet look for all possible transliterations of the same word. This paper describes a new approach to enhance search engine performance on terms originally from Mandarin Chinese. We propose an algorithm that can expand searches to include relevant transliterations of Mandarin Chinese search terms in addition to the original Hànyŭ Pīnyīn (unaccented) search terms. The algorithm improves searches by extracting parts of search terms using a right-to-left search method. This algorithm, Xiăozhĭ, also implements a program specifically designed for Mandarin Chinese search terms written in Pīnyīn. The program includes a list of transliteration replacement sets, within which the elements are possible transliterations of the same sound. The first element in each set is the unaccented Pīnyīn transliteration. Xiăozhĭ, however, may be extended to other languages using a different list of sets.
A large proportion of web pages are in English. When dealing with foreign terms, such pages generally contain transliterations of foreign terms. There is little standardization of transliteration, thus it is often difficult to find relevant results when search queries involve transliterated terms originally from non-Romanized languages. This paper describes a new approach that enhances search engine performance on terms originally from Mandarin Chinese.
The purpose of Xiăozhĭ and associated computer programs proposed in this paper is, ultimately, to facilitate searches consisting of terms of Chinese origin using several transliterations of the same word and to produce as many proper results as possible.
Generally, the user of a search engine could transliterate words from any language in several different ways. However, the original search term may not necessarily be the same as the most commonly accepted transliteration of the word in Roman letters.¹
An English-speaker, for example, might transcribe the word that is usually spelled pot as "pawt." In certain cases, Google® will search for the entered word and a few syntactic variations. However, it will not search for transliteration variations. A Google search of "pawt," for example, may produce results that use abbreviations, acronyms, etc. that are spelled with those four letters. Perhaps, it may also include slight variations such as "pawts" and "pawting." Only a few of these results may contain any references to pot.
In a language such as Mandarin, this problem is even more
complicated. Mandarin's official script is either Simplified or Traditional Hànzĭ
("Chinese characters").²
There are not one but several official and reputed transliteration systems that
exist for the language, depending on the particular country. Other well-known
entities (e.g. Yale University, Herbert Giles's A Chinese-English Dictionary)
devise phonetic transliteration systems and thus introduce even more
variations. In addition, of course, searchers from different cultural and
social backgrounds often misspell words and transcribe words in very different
ways.
Furthermore, even accepted transliteration systems for Mandarin Chinese may have significantly different spellings not only of single phonemes (as in English) but also of entire syllables that are between 1-4 Roman characters long. For example, Mandarin syllables that are spelled zhun in Hànyŭ Pīnyīn proper would be written as chun in Wade-Giles (a widely used transliteration system in Taiwan), jwun in Yale Transcription, and juen in Gwoyeu Romatzyh (formerly the official transliteration system in Taiwan).
Clearly, detailed tables of replacement patterns are needed to search for alternate transliterations to terms entered by a searcher. Replacement algorithms, in combination with other strategies such as eliminating unlikely search terms, can then be used to search effectively for alternate transliterations of search phrases. This paper explains one such algorithm (Xiăozhĭ) that has the potential to improve search engines.
Many researchers³ have written about various other aspects of Chinese transliteration, in particular:
Still others have focused on similar transliteration
problems in other languages, e.g. Arabic and Japanese. Their techniques are
more similar to the algorithm described below than to Xiăozhĭ. [Stalls and Knight] mention that:
(Arbabi et al., 1994) developed an algorithm at
IBM for the automatic forward transliteration of Arabic personal names into the
Roman alphabet. Using a hybrid neural network and knowledge-based system
approach, this program first inserts the appropriate missing vowels into the
Arabic name, then converts the name into a phonetic representation, and maps
this representation into one or more possible Roman spellings of the name. The
Roman spellings may also vary across languages (Sharif in English
corresponds with Chérife in French).
They then extend
this previous algorithm in order to create a program transliterating the Arabic
versions of the Western names back into Roman script. Their program deals with
an aspect unique to languages using the Hebrew and Arabic scripts: most texts
in Arabic (as well as in Hebrew, Persian, Yiddish, Urdu, Pashtu, and many other
languages) substantially neglect to write vowels to represent their spoken
equivalents.
[Stalls and Knight] also mention a method by Knight and
Graehl that probabilistically constructs alternate terms, finally picking
optimal search terms through graph algorithms.
Two papers have used similar algorithms in order to solve such problems in Japanese and Chinese, i.e. to use accepted transliterations of names common among those who speak the particular language and convert the Romanizations into kanji and hànzĭ, respectively. Yet a third paper [Mettler] uses a syllable-by-syllable transliteration method in order to transliterate foreign names from Roman letters into katakana, a writing system normally used in order to write foreign words in Japanese (as well as short sounds like pii ("beep") and, occasionally, other short particles such as naa).
The algorithm introduced in this paper is what we may call a "right-to-left" algorithm. This means that a string is scanned from right to left in order to find transliterated phonemes. Most of the algorithms for transliteration process strings from left to right.
[Kuo] describes how to find transliterated pairs, i.e. terms
in languages other than Mandarin and their transliterations into traditional hànzĭ,
from the Web. The target of the transliteration is Chinese. It mentions various
transliteration systems (Wade-Giles, Tōngyòng Pīnyīn, and
Hànyŭ Pīnyīn). Terms are segmented from left to right.
[Wan] introduces another left-to-right algorithm focused on transliterating
English terms, especially English names, into Chinese. [Gao] creates Chinese
terms out of English terms, such as names. It tries to find matching Chinese
characters for English phonemes.
Initially, we decided that the most convenient method for transliterating Mandarin Chinese terms into Roman script would be to create a program that includes a long list of transliterated sets. A transliterated set is a set of strings. For example, {zhun, chun, jwen, juen} is a transliterated set. The sets were created according to the initials and finals in the Pīnyīn alphabet.
New sets have also been created that include other sounds, of four letters maximum, that are not formally listed in the alphabet but may have unusual or special transliteration concerns. For example: zhun appears to be considered a syllable rather than an "initial" or "final." (Examples of "initials" and "finals" in Pīnyīn include c, ch, zhi, uang). These new sets have been created in an effort to simplify the process of transliteration for the computer program in a relevant manner.
The program was originally designed to operate in the following manner: it was assumed that the original search term and the first element in each set were written in Pīnyīn without any tones. The other elements in each set reflected both relatively accepted systems of transliteration and various cultural backgrounds; "iao" is also written as "yau" (Yale Transcription) and "iaor" (to acknowledge the "retroflexed r" in Beijing dialect, i.e. the tendency of Mandarin-speakers from Beijing to add "r" to the end of many words, such as diànyĭngr for "movie theater"). All sets may be altered at any time if desired; thus, the program can be perpetually improved as more investigation reveals more facts about transliteration.
This computer program was written in Java. In a Command window, the searcher could type in a term in Pīnyīn and not bother to use tone marks. The program would, in turn, search Google for all results of the original term and of a modified term. However, a major flaw was soon found with this approach: the program would make several unnecessary changes before searching for the second term. The term ying was, according to the list, to be transliterated only as ying and yingr. Instead, the program searched in the following manner:
As a result,
Google ultimately searched only for results containing ying and those concerning the Yarmuk River that flows through Turkey and Iraq.
The program searched in this manner because of the following reasons:
In summary, the
problem was the tendency to search for only two terms in addition to a
significant lack of proper organization dictating the necessary transformations
of the original search term. Clearly, the program needed to be restructured so
that it would at least search somewhat sensibly instead of merely making as
many changes as possible to the search term, starting at the end,
transliterating letter by letter right-to-left, and reinitiating the process.
The problem would
have been relatively less grave, I found, if the program had used the original
Pīnyīn search term to
search the list for individual sets rather than individual letters. The program
was changing each result letter by letter. Furthermore, the conventional
left-to-right scanning technique used in the program resulted in poor matches.
I found that the
following algorithm improves the replacement process:
This approach is
compatible even with most multisyllable words, although the program does not
yet include a system indicating the division of syllables. For example, when
the word pengren (pēngrèn
in Mandarin Chinese means "cooking" but is composed of two syllables
with distinct meanings: pēng-rèn) is entered into the system, the
results should be determined by looking first for "pengren," then
"pengr," etc. until it finds the true components: p, eng, r, en. Replacements can then be done by
replacing each of these components by another string in its set. The sets in
this example are:
...
{"p", "ph", "bh", "p'",
"bp", "pch"}...
{"r", "j", "zh", "l"}...
{"eng", "engr", "uu", "u"}...
{"en", "in", "enr", "on"}...
Here p could be replaced with ph, bh, etc. Similarly, r could
be replaced with j, zh, etc. Resultantly, several search
patterns can be obtained from the original search term pengren (e.g. phuujin).
The method
described here depends on knowing the relationships between different
transliteration schemes such as Wade-Giles, Yale transcription, etc. This
information is stored in simple tables in the current version. The program
simply uses tables of replacement patterns within the algorithm described in
the last section.
The information
about alternate transliterations can be simply given in an array as above. A
match on a term inside a particular sub-array, such as {"p",
"ph", "bh", "p'", "bp",
"pch"}, means that
alternate transliterations within the same array may be used to replace that
particular matched part.
Wade-Giles and
Yale transliterations have been included into the program mainly based on a
table provided in Janey Chen's A Practical English-Chinese Pronouncing
Dictionary [Chen]. These tables consist of all syllables (without tones)
that exist in Mandarin Chinese. They are organized alphabetically using the
Yale syllabary.
The specific information in some parts of the table has been included in separate sets. Thus, all information pertaining to the unaccented Pīnyīn syllable quan has been included under set quan, i.e. the set beginning with quan. Generally, however, information pertaining to the transliteration of one syllable has been decomposed into elements of the sets created in the program. For example, transliterations for Pīnyīn syllable bo include po (Wade-Giles) and bwo (Yale). For this reason, p has been entered as an element in set b, and wo has been entered as an element in set o.
Information concerning other accepted transliteration systems (e.g. Gwoyeu Romatzyh) is in accordance with various online sources. 5 In addition to these systems, we have also included some unofficial transcriptions that could be, or have been, used in transliterating Mandarin Chinese terms. For instance, Mandarin-speakers may pronounce Pīnyīn ch, sh, zh, chi, shi, zhi as c, s, z, ci, si, zi. Therefore, some may enter a misspelled query, e.g. with c instead of ch.
Since the Beijing accent of Mandarin often includes the letter r at the end of each syllable, the addition of r at the end of a set's first element has been reflected where possible (e.g. er, ianr, iar have been included in sets e and ia).
Each Chinese
dialect has its own pronounciation of a Mandarin character. Also, many other
languages have borrowed vocabulary from Chinese dialects and adjusted the
pronounciation to suit the particularities of their own language.
Transliterations based on such pronounciations of Mandarin words have been
extensively included in the implementation of Xiăozhĭ.
For example,
under set shi, the element "shek" has been included because the Mandarin
syllable shi could be transliterated as shek, based on its
Cantonese equivalent. (In particular, the name that is written in Pīnyīn
as Jiăng
Jièshí is commonly transliterated as "Chiang Kai-shek" in English
texts. The transliteration "Kai-shek" is based on the Cantonese
pronunciation of Mandarin Jièshí.) Similarly, "you" has been
included in set iao.
The Mandarin word for "cuisine," which Japanese has borrowed from a
Chinese dialect, is spelled in Pīnyīn as liàolĭ.
This word is pronounced in Japanese as [rjo:ri] and often transliterated as ryouri
in Roman script.
The
implementation of Xiăozhĭ is given in Appendix A.
Naturally, there
are still a great number of important problems; this is merely a first small
step toward proper transliteration of Chinese in search engines. Evidently,
searching for "dino," "dim," "din,"
"chino," etc. would be unlikely to help the searcher find anything
related to Pīnyīn "ding," even if they are common results. (The situation is
similar for any search term, including the other example pengren).
Sometimes,
particularly in cases of multisyllabic entries, it is important for
transliteration purposes to recognize what constitutes a syllable in the language
in question. Consider, for example, the Mandarin word written in Pīnyīn
as zhúnián, which means "year after year." One of the sets in the list
begins with the element zhun.
However, searching for zhun
and ian is inappropriate
here, although that is what the program is bound to do in its present state. In
reality, the word zhúnián is pronounced zhú-nián.
Indubitably,
there are various other significant flaws to be corrected in the
transliterating procedure as it now remains. However, these do not appear to
devalue the algorithm by any means. In fact, the algorithm is merely the
foundation for many more improvements.
The
transliteration knowledge described in Section 6 is only an introduction of the
sources and methods used to construct transliteration tables. The construction
of more extensive transliteration tables is a separate topic that needs further
investigation.
It is possible to
generalize the algorithm to other non-English languages. A different set of
component replacements would be needed for each language. The right-to-left
replacement method can also have other applications such as stemming and
matching. We hope to consider these later.
This algorithm proposes the use of a program that transliterates individual Mandarin Chinese words in as many different ways as possible. The program first searches for an entire term in a list of sets containing Pīnyīn sounds followed by other possible transliterations of the same sounds. If the entire term is not included in the list, the program subtracts one letter at a time from the end of the term and searches the list again. After finding a part of the term that is included in the list, the program repeats the process with the part of the term for which it has not yet found a corresponding set. Finally, the program uses the sets to create all possible combinations, searches, and keeps those combinations that are most common. This could help search engines improve their facilities for those searching for any information regarding a Chinese word, phrase, city name, etc.
¹Note
that, unlike a standardized transliteration, "the most commonly
accepted transliteration" does not necessarily have to be accepted in all
countries. For example, in the case of Mandarin, Hànyŭ Pīnyīn
is the official transliteration system of the People's Republic of China, which
is the most populated country in the world. Therefore, we will consider Hànyŭ
Pīnyīn to be "the most commonly accepted
transliteration." However, Taiwan officially uses Tōngyòng
Pīnyīn, a system not yet implemented in Xiăozhĭ.
²Simplified
characters are officially used in the People's Republic of China and in
Singapore whereas traditional characters are the official script in the
Republic of China (Taiwan). Other languages that use hànzĭ to a
slightly limited extent, such as Japanese (which calls them kanji), Korean
(which uses the term hanja), and other Chinese languages (Cantonese, Wú,
Mín languages, Gan, etc.) sometimes use a combination of both. However, they
tend to adopt traditional characters, because in such areas, hànzĭ
has often been used for hundreds of years. (The simplified characters have only
been in existence since the 20th century.)
³More
specifically: Wu et al., 2005; Wei, 2004; Kuo and Yan, 2004; Meng et al., 2000;
Wan and Verspoor.
4The
program determines how many results are "significant" by comparing
the number of results for all combinations. For example, if "ding"
yields five billion results and "tino" yields only two results,
"tino" has not produced a "significant" number of results.
Therefore, "tino" is eliminated.
5For
example, you could find some of this information by searching for "Gwoyeu
Romatzyh" on Google.
[Chen] Chen, Janey. A Practical English-Chinese Pronouncing Dictionary. Boston: Tuttle, 1991.
[Gao] Gao Wei, Phoneme-Based Statistical Transliteration of Foreign Names for OOV Problem, Master's Thesis, The Chinese University of Hong Kong, June 2004.
[Kuo] Jin-Shea Kuo and Ying-Kuei Yan, "Generating Paired Transliterated-Cognates Using Multiple Pronunciation Characteristics from Web Corpora," PACLIC 18, December 8th-10th, 2004, Waseda University, Tokyo, pp. 275-282.
[Meng] Helen Meng, Berlin Chen, Erika Grams, Sanjeev Khudanpur, Wai-Kit Lo, Gina-Anne Levow, Douglas Oard, Patrick Schone, Karen Tang, Hsin-Min Wang, and Jian Qiang Wang Mandarin-English Information (MEI): Investigating Translingual Speech Retrieval, University of Pennsylvania, October 2000.
[Mettler] Matt Mettler, TRW Japanese Fast Data Finder, TRW Systems Development Division, Redondo Beach, California.
[Stalls and Knight] BG Stalls, K Knight, Translating Names and Technical Terms in Arabic Text, COLING/ACL Workshop on Computational Approaches to Semitic Languages, Montreal, Quebéc, 1998
[Wan] Stephen Wan and Cornelia Maria Verspoor, "Automatic English-Chinese Name Transliteration for Development of Multilingual Resources," Microsoft Research Institute, Macquarie University, Sydney, Australia.
[Wu] Jian-Cheng Wu, Tracy Lin, and Jason S. Chang, "Learning Source-Target Surface Patterns for Web-Based Terminology Translation," Proceedings of the ACL Interactive Poster and Demonstration Sessions, pp. 37-40, Ann Arbor, June 2005.
There are three aspects to implementing the Xiăozhĭ, algorithm. The first part is creating a list of transliteration sets. The second part is implementing the algorithm from section 5 and the third part is utilizing an Application Programming Interface (API) from Google.
The following rule sets were created from a variety of sources as described in section 6. This information is stored in a table of patterns. The algorithm scans strings from right to left, to find parts that match (for now) the first term in each row. Currently the algorithm only tries to replace Hànyŭ Pīnyīn terms, but this can be expanded to include other terms. After finding these matching locations, the algorithm replaces each found pattern in the search term with all the alternates available from the list in the row that starts with the pattern.
1.
ya,
ja, ia, eea, yea, yeah, yar, iar, jar, eear, year
2.
yi,
i, yir, ir, yat, yat-
4.
z,
ts, dz, j, zh
5.
ci,
tz'u, tsz, tzu, ts'u, ts'uh, tsu, tsuh
6.
si,
szu, sz, se
7.
zi,
tzu, dz, ja, ji
8.
ju,
chu, jyu, zhu, jü, jew, jur, jyur, jür, jewr
9.
qu,
ch'u, chyu, chu, chue, chü, chee, chi, ch'ue, ch'ü, ch'u, ch'ee, ch'i, qur,
chyur, chur, chuer, chür
10.
xu,
hsü, hsu, syu, xur, syur
11.
a, o,
u, ar, ah, aa
12.
b, p,
bp, bh, pp
13.
c,
ts, ts'
14.
d, t,
ch
15.
e, o,
ih, i, uh, u, a, er, ir, ø, ö, oe, ör, oer, ør, yr
16.
f,
ph, h, ff
17.
g, k,
j
18.
h,
h', kh, r, k, x, ch
19.
i,
ih, r, e, uh, u, a, er, ir, ø, ö, oe, ör, oer, ør, yr, y
20.
j,
ch, t, ch', t', k,c, dz, cs, qu
21.
k,
k', c
22.
l, n,
r
23.
m, n,
mh
24.
n, m,
l
25.
o,
wo, a, or, wo, aw, å
26.
p,
ph, bh, p', bp, pch
27.
q,
ts, ch, ch', c, chh, kv, sh
28.
r, j,
zh, l
29.
t, d,
t'
30.
u,
yu, ur, yur, o, oe
31.
wu,
u, wur, ur, vu
32.
ü, u,
yu, ür, ur, yur, ue, uer
33.
x,
hs, sh, s, sy, sj, h, k, g, hsz
34.
ong,
ung, ongr, ungr, oeng, oung, unas
35.
jun,
zhun, chun, jwun, jyun, junr, zhunr, jwunr, jyunr
36.
zhun,
chun, jwun, zhunr, jwunr
37.
qun,
ch'un, chyun, qunr, chyunr
38.
xun,
hsun, syun, xunr, syunr
39.
yu,
yü, yur
40.
ün,
ünr, uen, uenr
41.
yun,
yün, van, vân, yunr
42.
jie,
kai-, kai, kaj
43.
hong,
hung, ang, hongr
44.
lang,
ookami, oukami, okami, langr, lar
45.
ing,
ingr, ino, im, in
46.
ang,
angr, ou
47.
eng, engr,
uu, u
48.
juan,
chuan, jywan, zhuan, juanr, jywanr, zhuanr, juar, zhuar
49.
quan,
ch'uan, chywan, chuan, chiuan, quanr, chywanr, chiuanr, quar, chiuar
50.
xuan,
hsuan, sywan, xuanr, sywanr, xuar
51.
jue,
chueh, jywe, juer, jywer
52.
que,
ch'ueh, chueh, chywe, quer, chywer
53.
xue,
hsueh, sywe, xuer, sywer
54.
zh,
ch, j, d, cs
55.
sh,
s, si, x, sch
56.
ch,
c, chh, ch', cch, chch
57.
zhi,
chih, jr, der, dir, sa
58.
chi,
chih, chr
59.
shi,
shih, shr, shy, shek, sek, chek, shin
60.
ri,
jih, r, ni, zh
61.
ian,
ien, yan, ianr, iar, en
62.
yan,
yen, yanr, yar, jen, jün
63.
uan,
wan, uanr, wanr
64.
wan,
wanr, van
65.
un,
wun, uen, un-, unr, wunr, uenr, yunr
66.
ai,
air, a'i, a-i, ei, eir, aj, ay
67.
ei,
eir, e, er, ee
68.
ui,
uei, wei, uir, weir, way, wayr, ware, wair, wear
69.
an,
anr, anu
70.
en,
enr, on
71.
in,
inr, on
72.
ie,
ieh, ye, ier, yer
73.
ye,
yeh, yer
74.
iu,
you, iou, iur, your, iour
75.
you,
yu, iou, your, iour
76.
ao,
au, ow, aor, o, aur
77.
ua,
wa, uar, war
78.
wa,
war, ua, uar
79.
ou,
our, ov
80.
ue,
uer, ak
81.
wei,
weir, we, way, wayr, ware, wair, wear
82.
uo,
o, wo, uor, wor
83.
wo,
wor, ga, a
84.
er,
erh, el, ar
85.
iao,
yau, iaor, yaur, yo, you
86.
yao,
yau, yaor, yaur
88.
yong,
yung, yongr, yungr
89.
uang,
wang, uangr, wangr
90.
wang,
wangr, ou
91.
üan,
üanr
92.
yuan,
ywan, yuanr, ywanr, yuar, ywar
93.
iang,
ang, yang, iangr, yangr, an
94.
yang,
yangr
95.
yin,
in, yinr, inr
96.
ying,
yingr
97.
üe, üeh,
ywe, ue, ueh, üer, ywer, uer
98.
yue,
yueh, ywe, yuer, ywer
99.
uai,
wai, uair, wire
100.
wai,
wair, wire
The web location http://www.google.com/apis/ contains information about the API. This API requires the use of a key that is specific to each user. Users have to obtain the key from Google. There are some limitations to the API (such as returning only 10 results per call, and allowing only 1000 calls per day) but these are not a serious limitation to research work using this API.
The following references explain the API.
1. Google Web API Reference http://www.google.com/apis/reference.html
2. API FAQ http://www.google.com/apis/api_faq.html
3. Web API discussion group http://groups.google.com/group/google.public.web-apis
4. Java API reference tree file:///c:/googleapi/javadoc/index.html
(Note that the actual location of 4 is the location where you have installed the Web API.)
The Google API is used in implementing a search function for the Xiăozhĭ algorithm. This implementation is as follows:
import com.google.soap.search.*;
import java.io.*;
import java.util.*;
public class search {
static
GoogleSearch gsearch;
// note:
the clientKey should be changed to a valid user key
static String
clientKey = "3gFfnZ9QFABJ59F0QmVBnJb12UAQQIHQa";
static
String directive = "search";
static
final int cutoff = 10000;
public
search () {
gsearch =
new GoogleSearch();
gsearch.setKey(clientKey);
}
public
Vector best (Vector terms) {
Vector
best = new Vector ();
for (int
i=0; i<terms.size() && i<20; i++) {
String
term = (String)terms.elementAt (i);
gsearch.setQueryString (term);
try {
GoogleSearchResult r = gsearch.doSearch();
int
count = r.getEstimatedTotalResultsCount();
System.out.println (""+i+":"+term+"
"+count);
if
(count > cutoff) best.add (term);
}
catch
(Exception e) {
System.out.println ("Error searching for "+term);
}
}
return
best;
}
};