
Encoding Chinese characters in Python, and TypeError: a bytes-like object is required, not 'str'


Hi everyone. I have written a web scraper in Python that looks up the words from my GRE word list on a dictionary website, grabs sample sentences and so on, and puts them into CSV files. The scraped content includes Chinese characters.

The only problem with my script is that when I try to write the scraped content to the CSV file, I get the error

UnicodeEncodeError: 'ascii' codec can't encode characters in position 13-15: ordinal not in range(128)

or

TypeError: a bytes-like object is required, not 'str'

Here is my full code:

#!/usr/bin/python
# -*- coding: <encoding name> -*-

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

# make a word list (grabbed from the wordlist pdf, converted to Excel and extracted)

wordList = '''Day One
abandon
abate
abbreviate
Day Two
abate
abbreviate
Day Three
abandon
abate
Day Four
abandon
abate
abandon
abate
Day Five
abandon
abate
Day Six
abandon
abate
Day Seven
abandon
abate'''

wordList = [y for y in (x.strip() for x in wordList.splitlines()) if y]

dayIndex = 0
dayArray = ['Day One', 'Day Two', 'Day Three', 'Day Four', 'Day Five', 'Day Six', 'Day Seven']

for item in wordList:
        if item == dayArray[dayIndex]:
                if dayIndex == 0:
                        fileName = "Word " + dayArray[dayIndex] + ".csv"
                        f = open(fileName, 'w')
                        headers = "word, separater, detail, lineSep\n"
                        f.write(headers)
                        dayIndex += 1
                elif dayIndex == 6:
                        f.close()
                else:
                        f.close()
                        fileName = "Word " + dayArray[dayIndex] + ".csv"
                        f = open(fileName, 'w')
                        headers = "word, separater, detail, lineSep\n"
                        f.write(headers)
                        dayIndex += 1
        else:
                # construct url for each word
                myUrl = 'http://gre.kmf.com/vocab/detail/' + item

                # opening up the connection, grabbing the page
                uClient = uReq(myUrl)
                page_html = uClient.read()
                uClient.close()

                # html parsing
                pageSoup = soup(page_html, "html.parser")

                # grab word container
                container = pageSoup.findAll("div", {"class", "word-d-maintile"})
                contain = container[0]# actually only 1 item in the container array

                # grab the word(should be the same as item)
                word = contain.span.text

                # grab word detail
                wordDetail_container = contain.findAll("div", {"class": "word-g-translate"})
                wordDetail = wordDetail_container[0].text.strip()# again should be only 1 item in the array.strip() the extra spaces and useless indentation

                # manipulate the string wordDetail(string is immutable but you know what I mean)
                detailArray = []
                for letter in wordDetail:
                        if letter != '【' and letter != '例' and letter != '近' and letter != '反':
                            detailArray.append(letter)
                        elif letter == '【':
                            detailArray.append("\n\n\n" + letter)
                        else:
                            detailArray.append("\n\n" + '[' + letter + ']' + ' ')
                        newWordDetail = ''.join(detailArray)
                #print("CUT\n") debug
                #print(word + '\n') debug
                #print(newWordDetail) debug
                f.write(word +',' + '&' + ',' + newWordDetail.replace(',', 'douhao') + ',' + '$')

The problem is in the last line. When the first error occurred, I added a ".encode('gb2312')" after newWordDetail to try to encode the Chinese characters, but after I did that I got the second error instead. I searched online but had a hard time finding a solution that fits my situation.
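For reference, both errors can be reproduced outside the scraper (a minimal sketch with made-up file names, assuming the interpreter's default locale encoding is ASCII):

# each error triggered on its own; this is not part of my scraper
f = open('demo1.csv', 'w')             # text mode, no explicit encoding -> locale codec (ASCII here)
f.write('放纵')                         # UnicodeEncodeError: 'ascii' codec can't encode characters

f = open('demo2.csv', 'w')
f.write('放纵'.encode('gb2312'))        # TypeError: a bytes-like object is required, not 'str'
                                        # (bytes handed to a file opened in text mode)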

Thank you guys for saving my life!

1 Answer


    Your code is written as spaghetti code; as a result, there are cases where the file has already been closed and can no longer be written to:

    f.write(word + ',' + '&' + ',' + newWordDetail.replace(',', 'douhao') + ',' + '$')

    Sometimes you are writing to a file that has already been closed, and that is what goes wrong. The code below is correct; running it, I get the right content (a more compact variant using the csv module is sketched after the output below).

    #!/usr/bin/env python
    # coding:utf-8
    '''黄哥Python'''
    
    
    from urllib.request import urlopen as uReq
    from bs4 import BeautifulSoup as soup
    
    # make a word list (grabbed from the wordlist pdf, converted to Excel and
    # extracted)
    
    wordList = '''Day One
    abandon
    abate
    abbreviate
    Day Two
    abate
    abbreviate
    Day Three
    abandon
    abate
    Day Four
    abandon
    abate
    abandon
    abate
    Day Five
    abandon
    abate
    Day Six
    abandon
    abate
    Day Seven
    abandon
    abate'''
    
    wordList = [y for y in (x.strip() for x in wordList.splitlines()) if y]
    
    dayIndex = 0
    dayArray = ['Day One', 'Day Two', 'Day Three',
                'Day Four', 'Day Five', 'Day Six', 'Day Seven']
    
    for item in wordList:
        if item == dayArray[dayIndex]:
            if dayIndex == 0:
                fileName = "Word " + dayArray[dayIndex] + ".csv"
                f = open(fileName, 'w')
                headers = "word, separater, detail, lineSep\n"
                f.write(headers)
                dayIndex += 1
            elif dayIndex == 6:
                f.close()
            else:
                f.close()
                fileName = "Word " + dayArray[dayIndex] + ".csv"
                f = open(fileName, 'w')
                headers = "word, separater, detail, lineSep\n"
                f.write(headers)
                dayIndex += 1
        else:
            # construct url for each word
            myUrl = 'http://gre.kmf.com/vocab/detail/' + item
    
            # opening up the connection, grabbing the page
            uClient = uReq(myUrl)
            page_html = uClient.read()
            uClient.close()
    
            # html parsing
            pageSoup = soup(page_html, "html.parser", )
    
            # grab word container
            container = pageSoup.findAll("div", {"class", "word-d-maintile"})
            contain = container[0]  # actually only 1 item in the container array
    
            # grab the word(should be the same as item)
            word = contain.span.text
    
            # grab word detail
            wordDetail_container = contain.findAll(
                "div", {"class": "word-g-translate"})
            # again should be only 1 item in the array.strip() the extra spaces and
            # useless indentation
            wordDetail = wordDetail_container[0].text.strip()
    
            # manipulate the string wordDetail(string is immutable but you know
            # what I mean)
            detailArray = []
            for letter in wordDetail:
                if letter != '【' and letter != '例' and letter != '近' and letter != '反':
                    detailArray.append(letter)
                elif letter == '【':
                    detailArray.append("\n\n\n" + letter)
                else:
                    detailArray.append("\n\n" + '[' + letter + ']' + ' ')
                newWordDetail = ''.join(detailArray)
            # print("CUT\n") debug
            # print(word + '\n') debug
            # print(newWordDetail) debug
            # print(f)
            try:
                f.write(word + ',' + '&' + ',' +newWordDetail.replace(',', 'douhao') + ',' + '$')
            except Exception as e:
                # NOTE: this swallows every error for the current word, including
                # writes to an already-closed file, so such failures pass silently
                pass
    

    Output: the content of one of the files looks like this.

    word, separater, detail, lineSep
    abandon,&,

    【考法1】n. 放纵:carefreedouhao不受约束

    完全放弃肆无忌惮地向炖菜里面加调料

    [近] unconstraintdouhao uninhibitednessdouhao unrestraint

    【考法2】v. 放纵:无拘无束地给予(自己)

    放弃自己的情感感情用事‖放弃自己完全无所事事的生活她放纵自己过着闲散的生活

    [近] indulgedouhao投降

    【考法3】v. 放弃:经常在面临危险或侵占时退出

    [弃]放弃船/家弃船;离家

    [反]救助救援

    【考法4】v. 停止做某事:结束(有计划或事先同意的事情)

    The bad weather forced NASA to abandon the launch. 坏天气迫使NASA停止了发射.

    [近] abortdouhao dropdouhao repealdouhao rescinddouhao revokedouhao call offdouhao放弃

    [反] keepdouhao continuedouhao maintaindouhao继续继续,$

    abate,&,

    【考法1】v. 减轻(程度或者强度):减少程度或强度

    减轻他的愤怒/痛苦平息他的愤怒/减轻他的痛苦

    [近] moderatedouhao recededouhao subsidedouhao remitdouhao wanedouhao die(off or down or out)douhao let updouhao phase downdouhao taper off

    [反]加强加强,加剧

    【考法2】v. 减少(数量),降低(价值):减少数量或价值

    减税减税

    [近] de-escalatedouhao depletedouhao underscaledouhao dwindledouhao ratchet(down)

    [反] augmentdouhao促进增加

    【考法3】v. 停止,撤销:杜绝

    减轻了令人讨厌的停止伤害

    [近] abrogatedouhao annuldouhao invalidatedouhao nullifydouhao rescinddouhao vacate,$
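    As a side note (not part of the original answer, just a sketch built on the same wordList, dayArray and URL pattern): opening each day's file with an explicit encoding and letting the csv module quote the fields sidesteps the UnicodeEncodeError, the 'douhao' comma workaround, and the write-to-closed-file problem at once. The newWordDetail formatting loop is left out for brevity.

    import csv
    from urllib.request import urlopen as uReq
    from bs4 import BeautifulSoup as soup

    # Assumes wordList and dayArray are defined exactly as in the answer above.
    current_file = None
    writer = None
    for item in wordList:
        if item in dayArray:
            # a new day starts: close the previous file (if any) and open a fresh one
            if current_file is not None:
                current_file.close()
            current_file = open("Word " + item + ".csv", "w", encoding="utf-8", newline="")
            writer = csv.writer(current_file)
            writer.writerow(["word", "separater", "detail", "lineSep"])
        else:
            with uReq("http://gre.kmf.com/vocab/detail/" + item) as uClient:
                page_html = uClient.read()
            contain = soup(page_html, "html.parser").find("div", {"class": "word-d-maintile"})
            word = contain.span.text
            detail = contain.find("div", {"class": "word-g-translate"}).text.strip()
            # csv.writer quotes embedded commas and newlines itself, so no
            # replace(',', 'douhao') and no .encode() call is needed
            writer.writerow([word, "&", detail, "$"])
    if current_file is not None:
        current_file.close()

    Opening with encoding='utf-8' keeps the Chinese characters intact; if the files are meant to be opened in Excel, encoding='utf-8-sig' writes a BOM so Excel detects the encoding correctly.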
