当前位置:  开发笔记 > 编程语言 > 正文

如何有效地将pos_tag_sents()应用于pandas数据帧

如何解决《如何有效地将pos_tag_sents()应用于pandas数据帧》经验,为你挑选了1个好方法。

如果您希望POS标记存储在pandas数据帧中的文本列,每行1个句子,则SO上的大多数实现都使用apply方法

dfData['POSTags']= dfData['SourceText'].apply(
                 lamda row: [pos_tag(word_tokenize(row) for item in row])

NLTK文档建议使用pos_tag_sents()来有效标记多个句子.

这是否适用于这个例子中,如果是将代码那样改变简单pso_tagpos_tag_sents或不NLTK意味着段落的文本来源

正如评论中所提到的那样,pos_tag_sents()目的是每次都减少负载的负载,但问题是如何做到这一点并仍然在pandas数据帧中产生一个列?

链接到示例数据集20kRows



1> alvas..:

输入

$ cat test.csv 
ID,Task,label,Text
1,Collect Information,no response,cozily married practical athletics Mr. Brown flat
2,New Credit,no response,active married expensive soccer Mr. Chang flat
3,Collect Information,response,healthy single expensive badminton Mrs. Green flat
4,Collect Information,response,cozily married practical soccer Mr. Brown hierachical
5,Collect Information,response,cozily single practical badminton Mr. Brown flat

TL; DR

>>> from nltk import word_tokenize, pos_tag, pos_tag_sents
>>> import pandas as pd
>>> df = pd.read_csv('test.csv', sep=',')
>>> df['Text']
0    cozily married practical athletics Mr. Brown flat
1       active married expensive soccer Mr. Chang flat
2    healthy single expensive badminton Mrs. Green ...
3    cozily married practical soccer Mr. Brown hier...
4     cozily single practical badminton Mr. Brown flat
Name: Text, dtype: object
>>> texts = df['Text'].tolist()
>>> tagged_texts = pos_tag_sents(map(word_tokenize, texts))
>>> tagged_texts
[[('cozily', 'RB'), ('married', 'JJ'), ('practical', 'JJ'), ('athletics', 'NNS'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('flat', 'JJ')], [('active', 'JJ'), ('married', 'VBD'), ('expensive', 'JJ'), ('soccer', 'NN'), ('Mr.', 'NNP'), ('Chang', 'NNP'), ('flat', 'JJ')], [('healthy', 'JJ'), ('single', 'JJ'), ('expensive', 'JJ'), ('badminton', 'NN'), ('Mrs.', 'NNP'), ('Green', 'NNP'), ('flat', 'JJ')], [('cozily', 'RB'), ('married', 'JJ'), ('practical', 'JJ'), ('soccer', 'NN'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('hierachical', 'JJ')], [('cozily', 'RB'), ('single', 'JJ'), ('practical', 'JJ'), ('badminton', 'NN'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('flat', 'JJ')]]

>>> df['POS'] = tagged_texts
>>> df
   ID                 Task        label  \
0   1  Collect Information  no response   
1   2           New Credit  no response   
2   3  Collect Information     response   
3   4  Collect Information     response   
4   5  Collect Information     response   

                                                Text  \
0  cozily married practical athletics Mr. Brown flat   
1     active married expensive soccer Mr. Chang flat   
2  healthy single expensive badminton Mrs. Green ...   
3  cozily married practical soccer Mr. Brown hier...   
4   cozily single practical badminton Mr. Brown flat   

                                                 POS  
0  [(cozily, RB), (married, JJ), (practical, JJ),...  
1  [(active, JJ), (married, VBD), (expensive, JJ)...  
2  [(healthy, JJ), (single, JJ), (expensive, JJ),...  
3  [(cozily, RB), (married, JJ), (practical, JJ),...  
4  [(cozily, RB), (single, JJ), (practical, JJ), ... 

在龙:

首先,您可以将Text列提取到字符串列表:

texts = df['Text'].tolist()

然后你可以应用这个word_tokenize功能:

map(word_tokenize, texts)

请注意,@ Boud的建议几乎相同,使用df.apply:

df['Text'].apply(word_tokenize)

然后将标记化的文本转储到字符串列表的列表中:

df['Text'].apply(word_tokenize).tolist()

然后你可以使用pos_tag_sents:

pos_tag_sents( df['Text'].apply(word_tokenize).tolist() )

然后将列添加回DataFrame:

df['POS'] = pos_tag_sents( df['Text'].apply(word_tokenize).tolist() )


您的“ TL; DR”比“ In Long”版本长:)
推荐阅读
手机用户2402851155
这个屌丝很懒,什么也没留下!
DevBox开发工具箱 | 专业的在线开发工具网站    京公网安备 11010802040832号  |  京ICP备19059560号-6
Copyright © 1998 - 2020 DevBox.CN. All Rights Reserved devBox.cn 开发工具箱 版权所有