I'm having trouble indexing Chinese/Japanese text in Solr 3.4. I import the data with DIH (DataImportHandler), and the connection block is:
<dataSource type="JdbcDataSource"
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost/db_development?useUnicode=true&amp;characterEncoding=UTF-8&amp;characterSetResults=UTF-8"
user="user"
useUnicode="true"
characterEncoding="UTF-8"
encoding="UTF-8"
password="password"
zeroDateTimeBehavior="convertToNull"
name="app" />
The field type for this field is defined as:
<fieldType name="text_commongrams" class="solr.TextField">
<analyzer>
<charFilter class="solr.HTMLStripCharFilterFactory" />
<tokenizer class="solr.ICUTokenizerFactory" />
<filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.ICUNormalizer2FilterFactory" name="nfkc_cf" mode="compose"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
<filter class="solr.TrimFilterFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
<filter class="solr.SynonymFilterFactory"
synonyms="synonyms.txt"
ignoreCase="true"
expand="true" />
<filter class="solr.CommonGramsFilterFactory"
words="stopwords_en.txt"
ignoreCase="true" />
<filter class="solr.StopFilterFactory"
words="stopwords_en.txt"
ignoreCase="true" />
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1"
splitOnNumerics="0"
generateNumberParts="1"
catenateWords="1"
catenateNumbers="1"
catenateAll="0"
preserveOriginal="1" />
</analyzer>
</fieldType>
The MySQL character-set settings are as follows:
+--------------------------+-----------------------------------------+
| Variable_name            | Value                                   |
+--------------------------+-----------------------------------------+
| character_set_client     | latin1                                  |
| character_set_connection | latin1                                  |
| character_set_database   | latin1                                  |
| character_set_filesystem | binary                                  |
| character_set_results    | latin1                                  |
| character_set_server     | utf8                                    |
| character_set_system     | utf8                                    |
| character_sets_dir       | /opt/local/share/mysql5/mysql/charsets/ |
+--------------------------+-----------------------------------------+
I start Solr with the Java parameter -Dfile.encoding=UTF-8.
The input text is: JavaOne Tokyo 2012での発表スライド
When I import it into Solr and query the document by ID, the text comes back as: JavaOne Tokyo 2012ã§ã®ç™ºè¡¨ã‚¹ãƒ©ã‚¤ãƒ‰
Can anyone tell me what is going wrong?
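The garbled output above looks like what you get when the UTF-8 bytes of the Japanese text are read one byte at a time as latin1. The Python sketch below reproduces that effect. It assumes MySQL's latin1 behaves like cp1252, with the few bytes cp1252 leaves undefined passed through as control characters. The helper names are made up for illustration and are not part of Solr or MySQL.

```python
def decode_mysql_latin1(raw: bytes) -> str:
    """Decode bytes roughly the way a MySQL latin1 column is assumed to."""
    chars = []
    for b in raw:
        try:
            chars.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            # Bytes such as 0x81 are undefined in cp1252; keep them as
            # the control character with the same code point.
            chars.append(chr(b))
    return "".join(chars)


def encode_mysql_latin1(text: str) -> bytes:
    """Inverse of decode_mysql_latin1, used to recover the original bytes."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            out.append(ord(ch))  # control-character fallback
    return bytes(out)


original = "JavaOne Tokyo 2012での発表スライド"

# UTF-8 bytes misread as latin1: every byte becomes its own character.
garbled = decode_mysql_latin1(original.encode("utf-8"))
print(garbled)

# Reversing the misreading restores the text, so the bytes are intact.
repaired = encode_mysql_latin1(garbled).decode("utf-8")
print(repaired == original)
```

If the round trip restores the text, the stored bytes are probably fine and only the character-set label applied to them is wrong.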
1 Answer
So in the end I had to alter my MySQL tables to store the strings as UTF8. For details on how to convert an existing table from latin1 to utf8, see here.
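The linked instructions are not reproduced here, so the following is only a sketch of one common approach, not necessarily the one the link describes. When a latin1 column already contains UTF-8 bytes, the MySQL manual suggests changing the column to BLOB and then to a utf8 text type, which relabels the bytes without converting them. The table name `slides`, the column name `title`, and the helper function are hypothetical.

```python
from typing import List


def latin1_to_utf8_statements(table: str, column: str,
                              text_type: str = "TEXT") -> List[str]:
    """Build two ALTER TABLE statements that relabel a column as utf8.

    Going through BLOB leaves the stored bytes untouched, so UTF-8 bytes
    that were written into a latin1 column are reinterpreted as utf8.
    """
    return [
        f"ALTER TABLE `{table}` MODIFY `{column}` BLOB;",
        f"ALTER TABLE `{table}` MODIFY `{column}` {text_type} CHARACTER SET utf8;",
    ]


# Hypothetical table and column names, for illustration only.
for statement in latin1_to_utf8_statements("slides", "title"):
    print(statement)
```

Back up the table before running anything like this. MODIFY replaces the whole column definition, so attributes such as NOT NULL have to be restated. If the column holds genuine latin1 text rather than mislabeled UTF-8, `ALTER TABLE ... CONVERT TO CHARACTER SET utf8` is the more usual choice.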