If you want to POS-tag a column of text stored in a pandas DataFrame, with one sentence per row, most implementations on SO use the apply method:
dfData['POSTags'] = dfData['SourceText'].apply(lambda text: pos_tag(word_tokenize(text)))
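The NLTK documentation recommends using pos_tag_sents() for efficiently tagging more than one sentence.
Does that apply to this example, and if so, is the change as simple as replacing pos_tag with pos_tag_sents? Or does NLTK mean text sources made up of paragraphs?
As mentioned in the comments, pos_tag_sents() aims to avoid the cost of loading the tagger on every call, but the question is how to do that and still end up with a column in a pandas DataFrame.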
Link to sample dataset (20k rows)
Input
$ cat test.csv
ID,Task,label,Text
1,Collect Information,no response,cozily married practical athletics Mr. Brown flat
2,New Credit,no response,active married expensive soccer Mr. Chang flat
3,Collect Information,response,healthy single expensive badminton Mrs. Green flat
4,Collect Information,response,cozily married practical soccer Mr. Brown hierachical
5,Collect Information,response,cozily single practical badminton Mr. Brown flat
TL;DR
>>> from nltk import word_tokenize, pos_tag, pos_tag_sents
>>> import pandas as pd
>>> df = pd.read_csv('test.csv', sep=',')
>>> df['Text']
0    cozily married practical athletics Mr. Brown flat
1       active married expensive soccer Mr. Chang flat
2    healthy single expensive badminton Mrs. Green ...
3    cozily married practical soccer Mr. Brown hier...
4     cozily single practical badminton Mr. Brown flat
Name: Text, dtype: object
>>> texts = df['Text'].tolist()
>>> tagged_texts = pos_tag_sents(map(word_tokenize, texts))
>>> tagged_texts
[[('cozily', 'RB'), ('married', 'JJ'), ('practical', 'JJ'), ('athletics', 'NNS'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('flat', 'JJ')],
 [('active', 'JJ'), ('married', 'VBD'), ('expensive', 'JJ'), ('soccer', 'NN'), ('Mr.', 'NNP'), ('Chang', 'NNP'), ('flat', 'JJ')],
 [('healthy', 'JJ'), ('single', 'JJ'), ('expensive', 'JJ'), ('badminton', 'NN'), ('Mrs.', 'NNP'), ('Green', 'NNP'), ('flat', 'JJ')],
 [('cozily', 'RB'), ('married', 'JJ'), ('practical', 'JJ'), ('soccer', 'NN'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('hierachical', 'JJ')],
 [('cozily', 'RB'), ('single', 'JJ'), ('practical', 'JJ'), ('badminton', 'NN'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('flat', 'JJ')]]
>>> df['POS'] = tagged_texts
>>> df
   ID                 Task        label  \
0   1  Collect Information  no response
1   2           New Credit  no response
2   3  Collect Information     response
3   4  Collect Information     response
4   5  Collect Information     response

                                                Text  \
0  cozily married practical athletics Mr. Brown flat
1     active married expensive soccer Mr. Chang flat
2  healthy single expensive badminton Mrs. Green ...
3  cozily married practical soccer Mr. Brown hier...
4   cozily single practical badminton Mr. Brown flat

                                                 POS
0  [(cozily, RB), (married, JJ), (practical, JJ),...
1  [(active, JJ), (married, VBD), (expensive, JJ)...
2  [(healthy, JJ), (single, JJ), (expensive, JJ),...
3  [(cozily, RB), (married, JJ), (practical, JJ),...
4  [(cozily, RB), (single, JJ), (practical, JJ), ...
In long:
First, you can extract the Text column into a list of strings:
texts = df['Text'].tolist()
Then you can apply the word_tokenize function to each string:
map(word_tokenize, texts)
Note that @Boud's suggestion is almost the same, using df.apply:
df['Text'].apply(word_tokenize)
Then you dump the tokenized text into a list of lists of strings:
df['Text'].apply(word_tokenize).tolist()
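As a quick check that the two tokenization routes above (map over a list, and apply on the column) give the same list of lists, here is a minimal sketch. It assumes a stand-in tokenizer, str.split, so that it runs without any NLTK data; with NLTK installed you would pass word_tokenize instead. The two-row DataFrame is made up for illustration.

```python
import pandas as pd

# Stand-in tokenizer (assumption): str.split is used so the sketch runs
# without NLTK data; the answer above uses nltk.word_tokenize here.
tokenize = str.split

# Tiny illustrative frame, not the real dataset.
df = pd.DataFrame({"Text": ["active married expensive soccer",
                            "healthy single expensive badminton"]})

# Route 1: pull the column out as a plain list, then map the tokenizer over it.
via_map = list(map(tokenize, df["Text"].tolist()))

# Route 2: apply the tokenizer on the column, then convert to a plain list.
via_apply = df["Text"].apply(tokenize).tolist()

same = via_map == via_apply  # both are lists of lists of strings
```

Either route works as input to a batch tagger; the apply version just keeps everything in one pandas expression.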
Then you can use pos_tag_sents on the whole list at once:
pos_tag_sents( df['Text'].apply(word_tokenize).tolist() )
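To illustrate why tagging in one batch can be cheaper than tagging row by row, here is a toy sketch. ToyTagger, tag_one and tag_many are hypothetical stand-ins, not NLTK code; they only mimic the pattern where a per-sentence call sets up the tagger every time while a batched call sets it up once. The assumption, taken from the comments on the question, is that loading the tagger is the expensive part.

```python
class ToyTagger:
    """Hypothetical stand-in for a tagger that is expensive to construct."""
    loads = 0  # counts how many times a tagger has been constructed

    def __init__(self):
        ToyTagger.loads += 1  # pretend a large model is read from disk here

    def tag(self, tokens):
        # Dummy rule for illustration only: every token is tagged "NN".
        return [(tok, "NN") for tok in tokens]


def tag_one(tokens):
    """Per-sentence pattern: a new tagger is built on every call."""
    return ToyTagger().tag(tokens)


def tag_many(sentences):
    """Batched pattern: one tagger is built and reused for all sentences."""
    tagger = ToyTagger()
    return [tagger.tag(tokens) for tokens in sentences]


sentences = [s.split() for s in ["active married soccer",
                                 "healthy single badminton"]]

ToyTagger.loads = 0
per_row = [tag_one(tokens) for tokens in sentences]
loads_per_row = ToyTagger.loads   # one construction per sentence

ToyTagger.loads = 0
batched = tag_many(sentences)
loads_batched = ToyTagger.loads   # a single construction for the batch
```

Both patterns return the same tags; only the number of tagger constructions differs, which is what matters on a 20k-row column.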
Then you add the column back to the DataFrame:
df['POS'] = pos_tag_sents( df['Text'].apply(word_tokenize).tolist() )
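If you do this in more than one place, the steps above can be wrapped in a small helper. This is only a sketch: add_pos_column and its parameters are hypothetical names, not part of pandas or NLTK. The tokenizer and batch tagger can be passed in, and default to NLTK's word_tokenize and pos_tag_sents (which need the NLTK tokenizer and tagger data to be downloaded). The demo at the bottom uses stand-ins so it runs without that data.

```python
import pandas as pd


def add_pos_column(df, text_col="Text", pos_col="POS",
                   tokenize=None, tag_sents=None):
    """Return a copy of df with a column of (token, tag) lists, tagged in one batch."""
    if tokenize is None or tag_sents is None:
        # Imported lazily so the helper can be tried with stand-ins
        # when NLTK or its data is not available.
        from nltk import word_tokenize, pos_tag_sents
        tokenize = tokenize or word_tokenize
        tag_sents = tag_sents or pos_tag_sents

    # Tokenize every row, then tag all rows with a single call.
    tokenized = df[text_col].apply(tokenize).tolist()
    out = df.copy()
    out[pos_col] = list(tag_sents(tokenized))
    return out


# Demo with stand-ins (assumptions): whitespace tokenizer and a dummy tagger.
def dummy_tag_sents(sentences):
    return [[(tok, "X") for tok in sent] for sent in sentences]


df = pd.DataFrame({"Text": ["active married soccer", "healthy single"]})
demo = add_pos_column(df, tokenize=str.split, tag_sents=dummy_tag_sents)
```

With NLTK and its data installed, add_pos_column(df) on the frame read from test.csv should produce the same POS column as the one-liner above. Returning a copy rather than modifying df in place is a design choice of this sketch, not something the original answer does.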