How to do keyword mapping in pandas

Time: 2017-10-06 08:24:13

Tags: python pandas word2vec

I have these keywords:

India
Japan
United States
Germany
China

Here is a sample data frame:

id    Address 
1     Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, Japan
2     Arcisstraße 21, 80333 München, Germany
3     Liberty Street, Manhattan, New York, United States
4     30 Shuangqing Rd, Haidian Qu, Beijing Shi, China
5     Vaishnavi Summit,80feet Road,3rd Block,Bangalore, Karnataka, India

My goal is to produce:

id    Address                                                   India  Japan  United States  Germany  China
1     Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, Japan        0      1      0              0        0
2     Arcisstraße 21, 80333 München, Germany                    0      0      0              1        0
3     Liberty Street, Manhattan, New York, USA                  0      0      1              0        0
4     30 Shuangqing Rd, Haidian Qu, Beijing Shi, China          0      0      0              0        1
5     Vaishnavi Summit,80feet Road,Bangalore, Karnataka, India  1      0      0              0        0

The basic idea is to create a keyword detector. I would like to use str.contains, but I can't work out the logic.
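The answers below all operate on a DataFrame named df. To try them out, the sample data from the question can be rebuilt roughly as follows; the variable name keywords is only an illustrative choice and does not appear in the original post.

```python
import pandas as pd

# Sample addresses copied from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "Address": [
        "Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, Japan",
        "Arcisstraße 21, 80333 München, Germany",
        "Liberty Street, Manhattan, New York, United States",
        "30 Shuangqing Rd, Haidian Qu, Beijing Shi, China",
        "Vaishnavi Summit,80feet Road,3rd Block,Bangalore, Karnataka, India",
    ],
})

# Keywords to detect (hypothetical variable name, used for illustration only).
keywords = ["India", "Japan", "United States", "Germany", "China"]
```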

3 answers:

Answer 0 (score: 3)

In [58]: df = df.join(df.Address.str.extract(r'.*,(.*)', expand=False).str.get_dummies())

In [59]: df
Out[59]:
   id                                            Address   China   Germany   India   Japan   United States
0   1  Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, J...       0         0       0       1               0
1   2             Arcisstraße 21, 80333 München, Germany       0         1       0       0               0
2   3  Liberty Street, Manhattan, New York, United St...       0         0       0       0               1
3   4   30 Shuangqing Rd, Haidian Qu, Beijing Shi, China       1         0       0       0               0
4   5  Vaishnavi Summit,80feet Road,3rd Block,Bangalo...       0         0       1       0               0

Note: this method will not work if the country is not the last element of the Address column, or if the country name itself contains a comma (,).
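One more detail of the regex approach: the captured group keeps the space that follows the last comma, so the resulting column names start with a blank (" Japan" rather than "Japan"). A minimal sketch of a variant that strips that whitespace is shown here. It uses str.rsplit instead of the regex from the answer and still assumes the country is the last comma-separated element.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "Address": [
        "Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, Japan",
        "Arcisstraße 21, 80333 München, Germany",
    ],
})

# Take the text after the last comma and strip surrounding whitespace,
# so the indicator columns are named "Japan", not " Japan".
last_part = df["Address"].str.rsplit(",", n=1).str[-1].str.strip()

# One 0/1 indicator column per distinct country found.
result = df.join(last_part.str.get_dummies())
```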

Answer 1 (score: 3)

Use pd.get_dummies():

countries = df.Address.str.extract('(India|Japan|United States|Germany|China)', expand = False)
dummies = pd.get_dummies(countries)
pd.concat([df,dummies],axis = 1)

Alternatively, the most straightforward approach is to put the countries in a list and use a for loop, for example:

countries = ['India','Japan','United States','Germany','China']
for c in countries:
    df[c] = df.Address.str.contains(c) * 1

However, this can be slow if you have a lot of data and many countries.
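Two caveats of the extract-based variant are worth illustrating: str.extract only returns the first match in each address, and a keyword that never occurs gets no column at all. The sketch below builds the pattern from the list with re.escape and reindexes the dummies so that every keyword gets a column. The third address is made up to show the no-match case.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "Address": [
        "Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, Japan",
        "Arcisstraße 21, 80333 München, Germany",
        "Somewhere without a known country",  # made-up row: no keyword present
    ],
})
countries = ["India", "Japan", "United States", "Germany", "China"]

# Build the alternation from the list; re.escape guards against
# keywords that contain regex metacharacters.
pattern = "(" + "|".join(re.escape(c) for c in countries) + ")"

# First matching keyword per row (NaN when nothing matches).
found = df["Address"].str.extract(pattern, expand=False)

# Rows with NaN become all zeros; reindex adds columns for unseen keywords.
dummies = pd.get_dummies(found, dtype=int).reindex(columns=countries, fill_value=0)
result = pd.concat([df, dummies], axis=1)
```

If an address may contain several keywords at once, the for loop with str.contains is the safer choice, since it tests each keyword independently.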

Answer 2 (score: 2)

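import numpy as np
import pandas as pd

kw = 'India|Japan|United States|Germany|China'.split('|')
# Column vector of addresses, so it broadcasts against the keyword list
a = df.Address.values.astype(str)[:, None]

df.join(
    pd.DataFrame(
        # np.char.find (public name of numpy.core.defchararray.find)
        # returns -1 where the keyword is absent
        np.char.find(a, kw) >= 0,
        df.index, kw,
        dtype=int
    )
)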

   id                        Address  India  Japan  United States  Germany  China
0   1  Chome-2-8 Shibakoen, Minat...      0      1              0        0      0
1   2  Arcisstraße 21, 80333 Münc...      0      0              0        1      0
2   3  Liberty Street, Manhattan,...      0      0              1        0      0
3   4  30 Shuangqing Rd, Haidian ...      0      0              0        0      1
4   5  Vaishnavi Summit,80feet Ro...      1      0              0        0      0