2009年12月12日星期六
2009年8月12日星期三
Grammer checking project
1. open office project:
From: http://lingucomponent.openoffice.org/grammar.html
start:
Lingucomponent Sub-Project: Grammar Checking
One of the goals of the Lingucomponent project is to design, develop, and implement a Grammar checker for English and other supported languages.
News
* October 2008: A grammar checking API is now part of OpenOffice.org 3.0. LanguageTool now makes use of this new API which allows on-the-fly checking, i.e. checking text while you type. More information about the API is available at the following sources:
o Grammar Checking page in Wiki
o Specification (.odt)
o All grammar checker issues
If you have any interest in helping to implement a grammar checker for the OpenOffice.org project, please subscribe to the mailing list dev@lingucomponent.openoffice.org and introduce yourself, your skills, and your willingness to help this project.
Links to Open Source grammar checkers
* An Gramadóir, a grammar checker for the Irish language
* CoGrOO, a grammar checker for Portuguese
* GRAC, corpus-based grammar checker written in Python
* graviax, grammar rules and grammar checker for the English language
* Higgins, a prototype English-language parser
* LanguageTool, a style and grammar checker with OpenOffice.org integration, for English, German, Polish, Dutch, and other languages
* Link Grammar, not really a grammar checker but a parser for the English language, also see the link grammar page at AbiWord
Links to commercial grammar checkers
* Cysgliad, a Welsh grammar checker
Created: 2001 June. Last Modified: $Date: 2008/11/08 21:32:59 $, $Revision: 1.25 $
2. Requirement (not totally true)
From: http://wiki.services.openoffice.org/wiki/Grammar_Checking
Grammar checking of mixed language text
It is believed that even for sentences that uses several languages there is only a single language the whole sentence is in. (How that language is identified is a completely different matter and probably a complex task though!) And thus that sentence should only be grammar checked in that single language. For example:
The German word for television is Fernseher.
This sentence should be grammar checked in English and not German
If possible though (for example if language attributes are set correctly) it should be noted that Fernseher is not in English and thus at the very least no spelling error should for English should be reported for that word. And probably it is also impossible to report any grammar error that involves embedded foreign words. Thus the best to hope for probably is for the foreign word to be recognized as correct by the respective spell checker.
Even with completely embedded sentence like
In Gallica Caesar said 'Alea iacta est.' and continued his battle.
the above text is in a single language English and not Latin. If an existing grammar checker is smart enough to cope with embedded sentences of a different language I don't know. To keep it simple for the time being the whole text should be grammar checked as one sentence in English and in only that language.
Grammar checking and spell checking at the same time
Should spell checking have an iterator of it's own with a thread of it's own? Or should spell checking be handled by the GrammarCheckingIterator as well?
Other Questions / problems:
* checking is limited to paragraphs (unless the implementation of XFlatParagraph chooses to hide sth. more behind it which is unlikely). Though one could think of enumerations as a possible application for this behavior.
* in the case of several grammar checkers for one languages, what do we do if they report different end-of-sentence positions? We really can't handle each checker individually here.
* does a grammar checker that requires knowledge of the previous text in this paragraph need to have those text presented even if it is in a language it does not know?
* How to achieve consistency of usage (e.g. spelling) when having grammar checkers in multiple languages? E.g. e-mail vs. email? Or does it need to be consistent on a per language base only?
* How to determine the language of a sentence? Use the language of the first word, or language guessing, or the language with the most words,... ?
* Problems related to a specific UI, namely the grammar checking dialog still to be defined, not yet covered.
* The troublesome case of having for example three grammar checkers for one language and two of them wanting to use their own dialog while the third will go with the office internal one is left out. Because if all of them report errors in the same sentence and like to use their own dialog as well we will have to cope with switching between three dialogs just to edit a single sentence. That's just plain awful to even think about. And I doubt there will be even one user to appreciate such a scenario.
* Should the document (e.g. XFlatParagraph) be in charge to determine the language for checking or should it be the GrammarCheckingIterator? Probably the latter...
From: http://lingucomponent.openoffice.org/grammar.html
start:
Lingucomponent Sub-Project: Grammar Checking
One of the goals of the Lingucomponent project is to design, develop, and implement a Grammar checker for English and other supported languages.
News
* October 2008: A grammar checking API is now part of OpenOffice.org 3.0. LanguageTool now makes use of this new API which allows on-the-fly checking, i.e. checking text while you type. More information about the API is available at the following sources:
o Grammar Checking page in Wiki
o Specification (.odt)
o All grammar checker issues
If you have any interest in helping to implement a grammar checker for the OpenOffice.org project, please subscribe to the mailing list dev@lingucomponent.openoffice.org and introduce yourself, your skills, and your willingness to help this project.
Links to Open Source grammar checkers
* An Gramadóir, a grammar checker for the Irish language
* CoGrOO, a grammar checker for Portuguese
* GRAC, corpus-based grammar checker written in Python
* graviax, grammar rules and grammar checker for the English language
* Higgins, a prototype English-language parser
* LanguageTool, a style and grammar checker with OpenOffice.org integration, for English, German, Polish, Dutch, and other languages
* Link Grammar, not really a grammar checker but a parser for the English language, also see the link grammar page at AbiWord
Links to commercial grammar checkers
* Cysgliad, a Welsh grammar checker
Created: 2001 June. Last Modified: $Date: 2008/11/08 21:32:59 $, $Revision: 1.25 $
2. Requirement (not totally true)
From: http://wiki.services.openoffice.org/wiki/Grammar_Checking
Grammar checking of mixed language text
It is believed that even for sentences that uses several languages there is only a single language the whole sentence is in. (How that language is identified is a completely different matter and probably a complex task though!) And thus that sentence should only be grammar checked in that single language. For example:
The German word for television is Fernseher.
This sentence should be grammar checked in English and not German
If possible though (for example if language attributes are set correctly) it should be noted that Fernseher is not in English and thus at the very least no spelling error should for English should be reported for that word. And probably it is also impossible to report any grammar error that involves embedded foreign words. Thus the best to hope for probably is for the foreign word to be recognized as correct by the respective spell checker.
Even with completely embedded sentence like
In Gallica Caesar said 'Alea iacta est.' and continued his battle.
the above text is in a single language English and not Latin. If an existing grammar checker is smart enough to cope with embedded sentences of a different language I don't know. To keep it simple for the time being the whole text should be grammar checked as one sentence in English and in only that language.
Grammar checking and spell checking at the same time
Should spell checking have an iterator of it's own with a thread of it's own? Or should spell checking be handled by the GrammarCheckingIterator as well?
Other Questions / problems:
* checking is limited to paragraphs (unless the implementation of XFlatParagraph chooses to hide sth. more behind it which is unlikely). Though one could think of enumerations as a possible application for this behavior.
* in the case of several grammar checkers for one languages, what do we do if they report different end-of-sentence positions? We really can't handle each checker individually here.
* does a grammar checker that requires knowledge of the previous text in this paragraph need to have those text presented even if it is in a language it does not know?
* How to achieve consistency of usage (e.g. spelling) when having grammar checkers in multiple languages? E.g. e-mail vs. email? Or does it need to be consistent on a per language base only?
* How to determine the language of a sentence? Use the language of the first word, or language guessing, or the language with the most words,... ?
* Problems related to a specific UI, namely the grammar checking dialog still to be defined, not yet covered.
* The troublesome case of having for example three grammar checkers for one language and two of them wanting to use their own dialog while the third will go with the office internal one is left out. Because if all of them report errors in the same sentence and like to use their own dialog as well we will have to cope with switching between three dialogs just to edit a single sentence. That's just plain awful to even think about. And I doubt there will be even one user to appreciate such a scenario.
* Should the document (e.g. XFlatParagraph) be in charge to determine the language for checking or should it be the GrammarCheckingIterator? Probably the latter...
2009年1月19日星期一
Top coder
假设有这样一种字符串,它们的长度不大于26,而且若一个这样的字符串其长度为
m,则这个字符串必定由a, b, c, . . . , z 中的前m 个字母构成,同时我们保证每个字母出
现且仅出现一次。比方说某个字符串长度为5,那么它一定是由a, b, c, d, e 这5 个字母
构成,不会多一个也不会少一个。嗯嗯,这样一来,一旦长度确定,这个字符串中有哪些
字母也就确定了,唯一的区别就是这些字母的前后顺序而已。
现在我们用一个由大写字母A 和B 构成的序列来描述这类字符串里各个字母的前
后顺序:
1. 如果字母b 在字母a 的后面,那么序列的第一个字母就是A(After),否则序列的
第一个字母就是B(Before);
2. 如果字母c 在字母b 的后面,那么序列的第二个字母就是A,否则就是B;
3. 如果字母d 在字母c 的后面,那么??不用多说了吧?直到这个字符串的结束。
这规则甚是简单,不过有个问题就是同一个AB 序列,可能有多个字符串都与之相
符,比方说序列"ABA",就有"acdb"、"cadb" 等等好几种可能性。说的专业一点,这一
个序列实际上对应了一个字符串集合。那么现在问题来了:给你一个这样的AB 序列,
问你究竟有多少个不同的字符串能够与之相符?或者说这个序列对应的字符串集合有多
大?注意,只要求个数,不要求枚举所有的字符串。
注:如果结果大于10 亿就返回-1。
我把题目总结为:给定一个字符串"ABAB..."(以下称为模式串),计算出符合要求的所有"abcd..."字符串(以下称为结果串)的个数.
要解答这个问题,首先需要明确两点(为什么就不要说了吧,呵呵):
1. 如果模式串的长度为n,那么结果串的长度为n+1
2. 一个模式串至少对应一个结果串,一个结果串对应而且只能对应一个模式串
下面我把我的思考过程写出来:
一. 从一个给定的模式串递推出所有的结果串,假设给定的模式串是"ABAB":
1.对于第一个字母"A",那么结果串只能是"ab"
2.对于第二个字母"B",说明c在b之前,而b之前有2个空位可以插入,所以可能的结果串是"cab","acb"
3.对于第三个字母"A",说明d在c之后.在"cab"中,d可以插到3个空位中,可能的结果串是"cdab","cadb","cabd";在"acb"中,d可以插到2个空位中----"acdb","acbd".所以总的可能数是5
4.对于第四个字母"B",同样的道理可以算出总的可能的结果串有16个.
二. 总结一下刚才的递推过程,可以发现每一次递推只和前一次递推的结果和模式串相应位置的字母有关,而模式串是给定的,所以确切的说,只和前一次递推结果中最大字母(最大字母指的是按字母排序的大小,如z>y>x>...>c>b>a)的位置有关.
现在有点事没时间写了,明天接着写,不过规律已经给出了,就是根据前一次结果的最大字母的位置来计算.大家可以先自己想想.
接着写.
既然找到了问题的本质,那么就构造一个数据结构存储每次递推的最大字母的位置即可.本题可以用一个简单数组记录便可.如下所示,用一个整型数组num记录.(数组的具体大小视情况而定,在此不做细究,题目中说不会大于26).num[a] = b的含义是,在位置a上(第a位上)最大字母出现的次数是b次.
const int maxNum = 30;
int num[maxNum];
用string moduleString;记录模式串.
程序的思想是:
1. 输入模式串的长度为n时,从模式串的第一个字母开始递推.
2. 数组num的第一位即num[0]做边界处理,真正的记录从num[1]开始.
3. 在递推时,如果当前字母是”A”,那么只关心”这次最大字母”在”前一次递推的结果中最大字母”右边的情况有几种.如果是”B”,那么只关心”这次最大字母”在”前一次递推的结果中最大字母”左边的情况有几种
下面举个具体的例子来分析:以输出”ABAB”为例,每一行表示每次递推之后num数组的变化
模式串/位置 第零位 第一位 第二位 第三位 第四位 第五位
A 0 0 1
AB 0 1 1 0
ABA 0 0 1 2 2
ABAB 0 5 5 4 2 0
1) A: 此时最大字母为b,显然它出现在第2位(记住第零位永远是0,真正的记录从第一位开始)
2) AB:此时最大字母为c,它要求出现在b之前,分别是cab,acb.所以c在第一位,第二位上各出现一次.
3) ABA: 此时最大字母为d,它要求出现在c之后.显然d出现在第一位是不可能的,所以在第一位出现的次数是0.因为c在前一次结果的第一位出现过一次,所以d在那种情况下可以在第二位出现一次.下面来看d在第三位可能出现的次数,因为前一次结果中,在第三位之前c出现过2次,那么d可以在那两种情况下出现在第三位,所以d在第三位出现的次数为2.最后看d在第四位出现的次数,还是因为前一次结果中c在第四位之前出现的次数是2次,所以d在第四位出现的次数是2
4) ABAB: 此时最大字母是e,计算它出现的次数的方法和前面的一样,只是它要求是在前一次最大字母d的左边,只要反过来从右往左遍历就行了,在此不再累述.
最终的程序为:
int main(int argc, char* argv[])
{
const int maxNum = 30;
int num[maxNum]; //num[0] = 0;
string moduleString;
int length;
cout<<"Input Module String"< cin>>moduleString;
length = moduleString.length();
if(moduleString[0] == 'A')
{
num[1] = 0;
num[2] = 1;
}
else
{
num[1] = 1;
num[2] = 0;
}
num[0] = 0;
int preNum0 = 0;
int preNum1 = 0;
for(int i = 1; i < length; i++)
{
if(moduleString[i] == 'A')
{
preNum0 = num[0];
preNum1 = num[1];
for(int j = 1; j <= i+2; j++)
{
num[j] = num[j-1] + preNum0;
preNum0 = preNum1;
preNum1 = num[j+1];
}
}
else
{
num[i+2] = 0;
for(int j = i+1; j > 0; j--)
{
num[j] = num[j+1] + num[j];
}
}
}
int totalNum = 0;
for(int i = 0;i < length + 2;i++)
{
totalNum += num[i];
}
cout< < totalNum < < endl;
return 0;
}
对程序的几点说明:
1. 该程序缺少头部信息.(,)
2. 该程序已调试通过,为可用版本
3. 该程序的输入为一个形如”ABA..”的模式串,输出为符合要求的结果串的个数.
4. 该程序的时间复杂度为n^2
5. 该程序只为展示如何解决此类问题,输入输出可能与原题不符,相信读者可根据此程序作少许修改以通过Top coder的测试.
m,则这个字符串必定由a, b, c, . . . , z 中的前m 个字母构成,同时我们保证每个字母出
现且仅出现一次。比方说某个字符串长度为5,那么它一定是由a, b, c, d, e 这5 个字母
构成,不会多一个也不会少一个。嗯嗯,这样一来,一旦长度确定,这个字符串中有哪些
字母也就确定了,唯一的区别就是这些字母的前后顺序而已。
现在我们用一个由大写字母A 和B 构成的序列来描述这类字符串里各个字母的前
后顺序:
1. 如果字母b 在字母a 的后面,那么序列的第一个字母就是A(After),否则序列的
第一个字母就是B(Before);
2. 如果字母c 在字母b 的后面,那么序列的第二个字母就是A,否则就是B;
3. 如果字母d 在字母c 的后面,那么??不用多说了吧?直到这个字符串的结束。
这规则甚是简单,不过有个问题就是同一个AB 序列,可能有多个字符串都与之相
符,比方说序列"ABA",就有"acdb"、"cadb" 等等好几种可能性。说的专业一点,这一
个序列实际上对应了一个字符串集合。那么现在问题来了:给你一个这样的AB 序列,
问你究竟有多少个不同的字符串能够与之相符?或者说这个序列对应的字符串集合有多
大?注意,只要求个数,不要求枚举所有的字符串。
注:如果结果大于10 亿就返回-1。
我把题目总结为:给定一个字符串"ABAB..."(以下称为模式串),计算出符合要求的所有"abcd..."字符串(以下称为结果串)的个数.
要解答这个问题,首先需要明确两点(为什么就不要说了吧,呵呵):
1. 如果模式串的长度为n,那么结果串的长度为n+1
2. 一个模式串至少对应一个结果串,一个结果串对应而且只能对应一个模式串
下面我把我的思考过程写出来:
一. 从一个给定的模式串递推出所有的结果串,假设给定的模式串是"ABAB":
1.对于第一个字母"A",那么结果串只能是"ab"
2.对于第二个字母"B",说明c在b之前,而b之前有2个空位可以插入,所以可能的结果串是"cab","acb"
3.对于第三个字母"A",说明d在c之后.在"cab"中,d可以插到3个空位中,可能的结果串是"cdab","cadb","cabd";在"acb"中,d可以插到2个空位中----"acdb","acbd".所以总的可能数是5
4.对于第四个字母"B",同样的道理可以算出总的可能的结果串有16个.
二. 总结一下刚才的递推过程,可以发现每一次递推只和前一次递推的结果和模式串相应位置的字母有关,而模式串是给定的,所以确切的说,只和前一次递推结果中最大字母(最大字母指的是按字母排序的大小,如z>y>x>...>c>b>a)的位置有关.
现在有点事没时间写了,明天接着写,不过规律已经给出了,就是根据前一次结果的最大字母的位置来计算.大家可以先自己想想.
接着写.
既然找到了问题的本质,那么就构造一个数据结构存储每次递推的最大字母的位置即可.本题可以用一个简单数组记录便可.如下所示,用一个整型数组num记录.(数组的具体大小视情况而定,在此不做细究,题目中说不会大于26).num[a] = b的含义是,在位置a上(第a位上)最大字母出现的次数是b次.
const int maxNum = 30;
int num[maxNum];
用string moduleString;记录模式串.
程序的思想是:
1. 输入模式串的长度为n时,从模式串的第一个字母开始递推.
2. 数组num的第一位即num[0]做边界处理,真正的记录从num[1]开始.
3. 在递推时,如果当前字母是”A”,那么只关心”这次最大字母”在”前一次递推的结果中最大字母”右边的情况有几种.如果是”B”,那么只关心”这次最大字母”在”前一次递推的结果中最大字母”左边的情况有几种
下面举个具体的例子来分析:以输出”ABAB”为例,每一行表示每次递推之后num数组的变化
模式串/位置 第零位 第一位 第二位 第三位 第四位 第五位
A 0 0 1
AB 0 1 1 0
ABA 0 0 1 2 2
ABAB 0 5 5 4 2 0
1) A: 此时最大字母为b,显然它出现在第2位(记住第零位永远是0,真正的记录从第一位开始)
2) AB:此时最大字母为c,它要求出现在b之前,分别是cab,acb.所以c在第一位,第二位上各出现一次.
3) ABA: 此时最大字母为d,它要求出现在c之后.显然d出现在第一位是不可能的,所以在第一位出现的次数是0.因为c在前一次结果的第一位出现过一次,所以d在那种情况下可以在第二位出现一次.下面来看d在第三位可能出现的次数,因为前一次结果中,在第三位之前c出现过2次,那么d可以在那两种情况下出现在第三位,所以d在第三位出现的次数为2.最后看d在第四位出现的次数,还是因为前一次结果中c在第四位之前出现的次数是2次,所以d在第四位出现的次数是2
4) ABAB: 此时最大字母是e,计算它出现的次数的方法和前面的一样,只是它要求是在前一次最大字母d的左边,只要反过来从右往左遍历就行了,在此不再累述.
最终的程序为:
int main(int argc, char* argv[])
{
const int maxNum = 30;
int num[maxNum]; //num[0] = 0;
string moduleString;
int length;
cout<<"Input Module String"<
length = moduleString.length();
if(moduleString[0] == 'A')
{
num[1] = 0;
num[2] = 1;
}
else
{
num[1] = 1;
num[2] = 0;
}
num[0] = 0;
int preNum0 = 0;
int preNum1 = 0;
for(int i = 1; i < length; i++)
{
if(moduleString[i] == 'A')
{
preNum0 = num[0];
preNum1 = num[1];
for(int j = 1; j <= i+2; j++)
{
num[j] = num[j-1] + preNum0;
preNum0 = preNum1;
preNum1 = num[j+1];
}
}
else
{
num[i+2] = 0;
for(int j = i+1; j > 0; j--)
{
num[j] = num[j+1] + num[j];
}
}
}
int totalNum = 0;
for(int i = 0;i < length + 2;i++)
{
totalNum += num[i];
}
cout< < totalNum < < endl;
return 0;
}
对程序的几点说明:
1. 该程序缺少头部信息.(
2. 该程序已调试通过,为可用版本
3. 该程序的输入为一个形如”ABA..”的模式串,输出为符合要求的结果串的个数.
4. 该程序的时间复杂度为n^2
5. 该程序只为展示如何解决此类问题,输入输出可能与原题不符,相信读者可根据此程序作少许修改以通过Top coder的测试.
订阅:
博文 (Atom)