
Removing Unicode characters from text in pandas


For a single string, the following code removes the Unicode characters and the newlines/carriage returns:

t = "We've\xe5\xcabeen invited to attend TEDxTeen, an independently organized TED event focused on encouraging youth to find \x89\xdb\xcfsimply irresistible\x89\xdb\x9d solutions to the complex issues we face every day.,"

t2 = t.decode('unicode_escape').encode('ascii', 'ignore').strip()
import sys
sys.stdout.write(t2.strip('\n\r'))
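
The snippet above relies on Python 2, where a plain string literal is a byte string and therefore has a `decode` method. A rough Python 3 equivalent might look like the sketch below. It assumes the text is already a `str`, so no `unicode_escape` step is needed; the helper name `strip_non_ascii` and the shortened sample string are made up for illustration.

```python
def strip_non_ascii(text):
    """Drop every non-ASCII character, then trim surrounding whitespace."""
    # encode(..., 'ignore') silently discards characters outside ASCII;
    # decode('ascii') turns the remaining bytes back into a str.
    ascii_only = text.encode('ascii', 'ignore').decode('ascii')
    # strip() with no arguments also removes leading/trailing \n and \r.
    return ascii_only.strip()


# Shortened, hypothetical sample in the spirit of the string above.
sample = "We've\xe5\xcabeen invited to find \x89\xdb\xcfsimply irresistible\x89\xdb\x9d solutions.\r\n"
print(strip_non_ascii(sample))
```

Note that this removes the characters outright rather than replacing them with a space, so words on either side of a removed character end up joined, just as in the original snippet.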

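But when I write a function in pandas to apply this to every cell of a column, it either fails with an AttributeError, or I get a warning that a value is trying to be set on a copy of a slice from a DataFrame.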

def clean_text(row):
    row= row["text"].decode('unicode_escape').encode('ascii', 'ignore')#.strip()
    import sys
    sys.stdout.write(row.strip('\n\r'))
    return row

Applied to my DataFrame:

df["text"] = df.apply(clean_text, axis=1)

How can I apply this code to each element of a Series?

3 Answers

  • 1

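    The problem seems to be that you access and modify row['text'] and then return the row itself from the applied function. When you call apply on a DataFrame with the default axis, the function is applied to each column as a Series, so changing it as follows should help: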

    import pandas as pd
    
    df = pd.DataFrame([t for _ in range(5)], columns=['text'])
    
    df 
                                                    text
    0  We've������been invited to attend TEDxTeen, an ind...
    1  We've������been invited to attend TEDxTeen, an ind...
    2  We've������been invited to attend TEDxTeen, an ind...
    3  We've������been invited to attend TEDxTeen, an ind...
    4  We've������been invited to attend TEDxTeen, an ind...
    

    def clean_text(row):
        # Return a list of the decoded cells in the Series instead
        return [r.decode('unicode_escape').encode('ascii', 'ignore') for r in row]
    
    df['text'] = df.apply(clean_text)
    
    df
                                                    text
    0  We'vebeen invited to attend TEDxTeen, an indep...
    1  We'vebeen invited to attend TEDxTeen, an indep...
    2  We'vebeen invited to attend TEDxTeen, an indep...
    3  We'vebeen invited to attend TEDxTeen, an indep...
    4  We'vebeen invited to attend TEDxTeen, an indep...
    

    Alternatively, you can use a lambda as shown below and apply it directly to the text column only:

    df['text'] = df['text'].apply(lambda x: x.decode('unicode_escape').\
                                              encode('ascii', 'ignore').\
                                              strip())
    
  • 8

    I can't actually reproduce your error: the following code runs for me without errors or warnings.

    df = pd.DataFrame([t,t,t],columns = ['text'])
    df["text"] = df.apply(clean_text, axis=1)
    

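    If it helps, I think a more canonical way to approach this type of problem might be to use a regular expression with one of the Series.str methods, for example: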

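    # regex=True is required on recent pandas, where str.replace treats the pattern literally by default
    df["text"] = df["text"].str.replace(r'[^\x00-\x7F]', '', regex=True)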
    
  • 5

    Something like this, where column_to_convert is the column you want to convert:

    series = df['column_to_convert']
    df["text"] =  [s.encode('ascii', 'ignore').strip()
                   for s in series.str.decode('unicode_escape')]
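
The answers above were written for Python 2. As a minimal sketch of the same two ideas on Python 3, assuming the column holds ordinary `str` values (possibly with missing entries), the vectorized `Series.str` methods can do the work without a row-wise `apply`. The small DataFrame and the names `cleaned` and `cleaned_re` are invented for illustration.

```python
import pandas as pd

# Hypothetical data: one dirty string, one clean string, one missing value.
df = pd.DataFrame({"text": ["We've\xe5\xcabeen invited\r\n", "plain ASCII", None]})

# Option 1: round-trip through ASCII, ignoring characters that do not fit,
# then trim surrounding whitespace (including \r and \n).
cleaned = (
    df["text"]
    .str.encode("ascii", errors="ignore")
    .str.decode("ascii")
    .str.strip()
)

# Option 2: remove everything outside the ASCII range with a regular expression.
cleaned_re = df["text"].str.replace(r"[^\x00-\x7F]", "", regex=True).str.strip()

# Assign the whole column at once instead of modifying rows inside apply.
df["text"] = cleaned
print(df)
```

Both options leave missing values as missing instead of raising, because the `.str` methods skip them. Assigning the result to the whole column in one step also avoids writing into a row copy.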
    
